Under which conditions does the DCU prefetcher start prefetching?
The DCU prefetcher does not prefetch lines in a deterministic manner. It appears to have a confidence value associated with each potential prefetch request, and the prefetch is triggered only if the confidence exceeds some threshold. Moreover, it seems that if both L1 prefetchers are enabled, only one of them can issue a prefetch request in the same cycle. Perhaps the prefetch from the one with higher confidence is accepted. The answer below does not take these observations into consideration. (A lot more experimentation work needs to be done. I will rewrite it in the future.)
The Intel manual tells us a few things about the DCU prefetcher. Section 2.4.5.4 and Section 2.5.4.2 of the optimization manual both say the following:
Data cache unit (DCU) prefetcher -- This prefetcher, also known as the streaming prefetcher, is triggered by an ascending access to very recently loaded data. The processor assumes that this access is part of a streaming algorithm and automatically fetches the next line.
Note that Section 2.4.5.4 is part of the section on Sandy Bridge and Section 2.5.4.2 is part of the section on Intel Core. The DCU prefetcher was first supported on the Intel Core microarchitecture and it's also supported on all later microarchitectures. As far as I know, there is no indication that the DCU prefetcher has changed over time. So I think it works exactly the same on all microarchitectures up to Skylake at least.
That quote doesn't really say much. The "ascending access" part suggests that the prefetcher is triggered by multiple accesses with increasing offsets. The "recently loaded data" part is vague. It may refer to one or more lines that immediately precede the line to be prefetched in the address space. It's also not clear whether that refers to virtual or physical addresses. The "fetches the next line" part suggests that it fetches only a single line every time it's triggered and that line is the line that succeeds the line(s) that triggered the prefetch.
I've conducted some experiments on Haswell with all prefetchers disabled except for the DCU prefetcher. I've also disabled hyperthreading. This enables me to study the DCU prefetcher in isolation. The results show the following:
- The DCU prefetcher tracks accesses for up to 4 different 4KB (probably physical) pages.
- The DCU prefetcher gets triggered when there are three or more accesses to one or more lines within the same cache set. The accesses must be either demand loads or software prefetches (any prefetch instruction including prefetchnta) or a combination of both. The accesses can be either hits or misses in the L1D or a combination of both. When it's triggered, for the 4 pages that are currently being tracked, it will prefetch the immediate next line within each of the respective pages. For example, consider the following three demand load misses: 0xF1000, 0xF2008, and 0xF3004. Assume that the 4 pages being tracked are 0xF1000, 0xF2000, 0xF3000, and 0xF4000. Then the DCU prefetcher will prefetch the following lines: 0xF1040, 0xF2040, 0xF3040, and 0xF4040.
- The DCU prefetcher gets triggered when there are three or more accesses to one or more lines within two consecutive cache sets. Just like before, the accesses must be either demand loads or software prefetches. The accesses can be either hits or misses in the L1D. When it's triggered, for the 4 pages that are currently being tracked, it will prefetch the immediate next line within each of the respective pages with respect to the accessed cache set that has a smaller physical address. For example, consider the following three demand load misses: 0xF1040, 0xF2048, and 0xF3004. Assume that the 4 pages being tracked are 0xF1000, 0xF2000, 0xF3000, and 0xF4000. Then the DCU prefetcher will prefetch the following lines: 0xF3040 and 0xF4040. There is no need to prefetch 0xF1040 or 0xF2040 because there are already requests for them.
- The prefetcher will not prefetch into the next 4KB page. So if the three accesses are to the last line in the page, the prefetcher will not be triggered.
- The pages to be tracked are selected as follows. Whenever a demand load or a software prefetch accesses a page, that page will be tracked and it will replace one of the 4 pages currently being tracked. I've not investigated further the algorithm used to decide which of the 4 pages to replace. It's probably simple though.
- When a new page gets tracked because of an access of the type mentioned in the previous bullet point, at least two more accesses are required to the same page and same line to trigger the prefetcher to prefetch the next line. Otherwise, a subsequent access to the next line will miss in the L1 if the line was not already there. After that, either way, the DCU prefetcher behaves as described in the second and third bullet points. For example, consider the following three demand load misses: 0xF1040, 0xF2048, and 0xF3004. There are two accesses to the same line and the third one is to the same cache set but different line. These accesses will make the DCU prefetcher track the two pages, but it will not trigger it just yet. When the prefetcher sees another three accesses to any line in the same cache set, it will prefetch the next line for those pages that are currently being tracked. As another example, consider the following three demand load misses: 0xF1040, 0xF2048, and 0xF3030. These accesses are all to the same line so they will not only make the prefetcher track the page but also trigger a next line prefetch for that page and any other pages that are already being tracked.
- It seems to me that the prefetcher receives the dirty flag from the page table entry of the page being accessed (from the TLB). The flag indicates whether the page is dirty or not. If it's dirty, the prefetcher will not track the page, and accesses to the page will not be counted towards the three accesses needed to satisfy the triggering condition. So it seems that the DCU prefetcher simply ignores dirty pages. That said, the page doesn't have to be read-only to be supported by the prefetcher. However, a more thorough investigation is required to understand more accurately how stores may interact with the DCU prefetcher.
So the accesses that trigger the prefetcher don't have to be "ascending" or follow any order. The cache line offset itself seems to be ignored by the prefetcher. Only the physical page number matters.
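The address arithmetic behind the examples above can be sketched as follows. This is only an illustration: the helper names `decompose` and `next_line_in_page` are made up for this sketch, and the 64-set figure assumes Haswell's 32KB, 8-way L1D with 64-byte lines. Because the set index comes from bits 6 to 11, which lie inside the 4KB page offset, it is the same for virtual and physical addresses.

```python
LINE_SIZE = 64    # bytes per cache line
PAGE_SIZE = 4096  # bytes per 4KB page
NUM_SETS = 64     # assumption: Haswell L1D, 32KB / (8 ways * 64 bytes)


def decompose(addr):
    """Split an address into (page number, L1D set index, offset within the line)."""
    page = addr // PAGE_SIZE
    set_index = (addr // LINE_SIZE) % NUM_SETS
    offset = addr % LINE_SIZE
    return page, set_index, offset


def next_line_in_page(addr):
    """Return the address of the line after addr's line, or None if that
    line would fall into the next 4KB page (no prefetch in that case)."""
    line = addr - addr % LINE_SIZE
    nxt = line + LINE_SIZE
    if nxt // PAGE_SIZE != addr // PAGE_SIZE:
        return None
    return nxt


# The three misses from the first example all map to set 0 of different pages.
for a in (0xF1000, 0xF2008, 0xF3004):
    page, set_index, offset = decompose(a)
    print(hex(a), hex(page), set_index, offset, hex(next_line_in_page(a)))
```

For an access to the last line of a page, such as 0xF1FC0, `next_line_in_page` returns None, which matches the observation that the prefetcher does not cross into the next 4KB page.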
I think the DCU prefetcher has a fully associative buffer that contains 4 entries. Each entry is tagged with the (probably physical) page number and has a valid bit to indicate whether the entry contains a valid page number. In addition, each cache set of the L1D is associated with a 2-bit saturating counter that is incremented whenever a demand load or a software prefetch request accesses the corresponding cache set and the dirty flag of the accessed page is not set. When the counter reaches a value of 3, the prefetcher is triggered. The prefetcher already has the physical page numbers from which it needs to prefetch; it can obtain them from the buffer entry that corresponds to the counter. So it can immediately issue prefetch requests to the next cache lines for each of the pages being tracked by the buffer. However, if a fill buffer is not available for a triggered prefetch request, the prefetch will be dropped. Then the counter will be reset to zero. Page tables might be modified though. It's possible that the prefetcher flushes its buffer whenever the TLB is flushed.
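To make this hypothesis concrete, here is a toy software model of it. It is a sketch of my guess above, not of the real hardware, and several details are assumptions: the class name `DcuModel` is made up, the replacement policy for the 4-entry buffer is taken to be FIFO (I haven't investigated the real one), the counter is reset after every trigger, and fill buffer availability, TLB flushes, and the "two consecutive cache sets" case are not modeled.

```python
LINE_SIZE = 64     # bytes per cache line
PAGE_SIZE = 4096   # bytes per 4KB page
NUM_SETS = 64      # assumption: Haswell L1D, 32KB / (8 ways * 64 bytes)
TRACKED_PAGES = 4  # entries in the hypothesized page buffer
TRIGGER = 3        # saturation value of the 2-bit counter


class DcuModel:
    """Toy model of the hypothesized DCU prefetcher structure."""

    def __init__(self):
        self.pages = []                 # tracked page numbers, oldest first
        self.counters = [0] * NUM_SETS  # one 2-bit saturating counter per set

    def access(self, addr, dirty=False):
        """Model a demand load or software prefetch.

        Returns the list of line addresses the model would prefetch.
        """
        if dirty:
            return []  # accesses to dirty pages are ignored entirely
        page = addr // PAGE_SIZE
        if page not in self.pages:
            if len(self.pages) == TRACKED_PAGES:
                self.pages.pop(0)  # FIFO replacement is an assumption
            self.pages.append(page)
        s = (addr // LINE_SIZE) % NUM_SETS
        self.counters[s] = min(self.counters[s] + 1, TRIGGER)
        if self.counters[s] < TRIGGER:
            return []
        self.counters[s] = 0  # assumption: reset after every trigger
        if s == NUM_SETS - 1:
            return []  # the next line would be in the next 4KB page
        # Prefetch the next line in every tracked page.
        return [p * PAGE_SIZE + (s + 1) * LINE_SIZE for p in self.pages]


model = DcuModel()
model.access(0xF4800)  # makes page 0xF4000 tracked via a different set
for a in (0xF1000, 0xF2008, 0xF3004):
    prefetched = model.access(a)
print([hex(x) for x in sorted(prefetched)])
```

With page 0xF4000 already tracked, the three accesses to set 0 make the model issue prefetches for 0xF1040, 0xF2040, 0xF3040, and 0xF4040, as in the first example in the list of results.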
It could be the case that there are two DCU prefetchers, one for each logical core. When hyperthreading is disabled, one of the prefetchers would be disabled too. It could also be the case that the 4 buffer entries that contain the page numbers are statically partitioned between the two logical cores and combined when hyperthreading is disabled. I don't know for sure, but such a design makes sense to me. Another possible design would be for each prefetcher to have a dedicated 4-entry buffer. It's not hard to determine how the DCU prefetcher works when hyperthreading is enabled. I just didn't spend the effort to study it.
All in all, the DCU prefetcher is by far the simplest among the 4 data prefetchers that are available in modern high-performance Intel processors. It seems that it's only effective when sequentially, but slowly, accessing small chunks of read-only data (such as read-only files and statically initialized global arrays) or accessing multiple read-only objects at the same time that may contain many small fields and span a few consecutive cache lines within the same page.
Section 2.4.5.4 also provides additional information on L1D prefetching in general, so it applies to the DCU prefetcher.
Data prefetching is triggered by load operations when the following conditions are met:
- Load is from writeback memory type.
This means that the DCU prefetcher will not track accesses to the WP and WT cacheable memory types.
- The prefetched data is within the same 4K byte page as the load instruction that triggered it.
This has been verified experimentally.
- No fence is in progress in the pipeline.
I don't know what this means. See: https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/805373.
- Not many other load misses are in progress.
There are only 10 fill buffers that can hold requests that missed the L1D. This raises a question, though: if there were only a single available fill buffer, would the hardware prefetcher use it or leave it for anticipated demand accesses? I don't know.
- There is not a continuous stream of stores.
This suggests that if there is a stream of a large number of stores intertwined with few loads, the L1 prefetcher will ignore the loads and basically temporarily switch off until the stores become a minority. However, my experimental results show that even a single store to a page will turn the prefetcher off for that page.
All Intel Atom microarchitectures have the DCU prefetcher, although the prefetcher might track fewer than 4 pages in these microarchitectures.
All Xeon Phi microarchitectures up to and including Knights Landing don't have the DCU prefetcher. I don't know about later Xeon Phi microarchitectures.
AFAIK, Intel CPUs don't have an L1 adjacent-line prefetcher.
They have one in L2, though, which tries to complete a 128-byte aligned pair of 64-byte cache lines. (So it's not necessarily the next line; it could be the previous line if the demand miss or other prefetch that caused one line to be cached was for the high half of a pair.)
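As a small illustration of which line completes the pair, the other half of a 128-byte aligned pair can be found by flipping bit 6 of the line address. The function name `buddy_line` is made up for this sketch; it only shows the address arithmetic, not when the L2 prefetcher actually decides to fetch.

```python
LINE_SIZE = 64  # bytes per cache line


def buddy_line(addr):
    """Return the address of the other 64-byte line in the same
    128-byte aligned pair as the line containing addr."""
    line = addr & ~(LINE_SIZE - 1)  # align down to the start of the line
    return line ^ LINE_SIZE         # flip bit 6 to select the other half


print(hex(buddy_line(0x1000)))  # low half of a pair: buddy is the next line
print(hex(buddy_line(0x1048)))  # high half of a pair: buddy is the previous line
```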
See also https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/714832, and the many "related" links here on SO, e.g. prefetching data at L1 and L2. I'm not sure if either of those has any more details than the prefetch section of Intel's optimization manual, though: https://software.intel.com/en-us/articles/intel-sdm#optimization
I'm not sure if it has any heuristic to avoid wasting bandwidth and cache footprint when only one of a pair of lines is needed, other than not prefetching when there are enough demand misses outstanding.