Is L2 HW prefetcher really helpful?
Yes, the L2 streamer is really helpful a lot of the time.
memcpy doesn't have any computational latency to hide, so I guess it can afford to let OoO exec resources (ROB size) handle the extra load latency you get from more L2 misses. At least that's true in this case, where the medium-size working set (1MiB) fits in L3, so every access is at worst an L3 hit and no prefetching is needed to make that happen.
And the only instructions are load/store (and loop overhead), so the OoO window includes demand loads for pretty far ahead.
IDK if the L2 spatial prefetcher and L1d prefetcher are helping any here.
Prediction to test this hypothesis: make your array bigger so you get L3 misses and you'll probably see a difference in overall time once OoO exec isn't enough to hide the load latency of going all the way to DRAM. HW prefetch triggering farther ahead can help some.
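If you want something concrete to try, a bare-bones version of that experiment could look like the sketch below (my own toy harness, not code from the question: the buffer size, repeat count, and clock_gettime timing are arbitrary choices). Run it once with a buffer that fits in L3 and once with one that doesn't, with the L2 streamer enabled and then disabled, and compare the MB/s.

```c
/* Minimal sketch of the kind of test I mean -- not the OP's benchmark.
 * Vary BUF_MIB from something that fits in L3 (e.g. 1) to something that
 * doesn't (e.g. 64) and compare times with the L2 streamer on vs. off.
 * Build: gcc -O2 copybench.c -o copybench
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_MIB 1      /* working-set size in MiB: 1 fits in L3, 64 won't */
#define REPS    2000   /* arbitrary repeat count, just to get a stable time */

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    size_t bytes = (size_t)BUF_MIB << 20;
    char *src = malloc(bytes), *dst = malloc(bytes);
    memset(src, 1, bytes);   /* touch the pages so we time copies, not page faults */
    memset(dst, 2, bytes);

    double t0 = now_sec();
    for (int i = 0; i < REPS; i++)
        memcpy(dst, src, bytes);
    double dt = now_sec() - t0;

    /* read dst so the copies can't be optimized away */
    printf("%d MiB buffers: %.1f MB/s copied (dst[0]=%d)\n",
           BUF_MIB, (double)bytes * REPS / dt / 1e6, dst[0]);
    return 0;
}
```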
The other big benefits of HW prefetching come when it can keep up with your computation, so you get L2 hits. (In a loop that has computation with a medium-length but not loop-carried dependency chain.)
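For a concrete illustration (a toy loop of mine, not from the question): each iteration below has a short chain of dependent FP operations, but iterations are independent of each other, so if the streamer has already pulled the upcoming lines into L2, those loads come back fast enough to hide under the arithmetic.

```c
#include <stddef.h>

/* Toy example of a loop where L2 hits from HW prefetch can pay off:
 * a few dependent FP ops per element, but no dependency carried through
 * the data from one iteration to the next. */
double poly_sum(const double *a, size_t n) {
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        double x = a[i];
        /* a few serially dependent multiply-adds per element (Horner's rule) */
        double y = ((x * 1.5 + 2.0) * x + 3.0) * x + 4.0;
        total += y;   /* the only loop-carried dependency is this one add */
    }
    return total;
}
```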
Demand loads and OoO exec can do a lot as far as using the available (single threaded) memory bandwidth, when there isn't other pressure on ROB capacity.
Also note that on Intel CPUs, every cache miss can cost a back-end replay (from the RS/scheduler) of dependent uops, one each for L1d and L2 misses when the data is expected to arrive. And after that, apparently the core optimistically spams uops while waiting for data to arrive from L3.
(See https://chat.stackoverflow.com/rooms/206639/discussion-on-question-by-beeonrope-are-load-ops-deallocated-from-the-rs-when-th and the question "Are load ops deallocated from the RS when they dispatch, complete or some other time?")
(The replay is of uops dependent on the load, not the cache-miss load itself; for memcpy those would be the store instructions, more specifically the store-data uop for port 4. That doesn't matter here: with 32-byte stores and a bottleneck on L3 bandwidth, we're nowhere near 1 port-4 uop per clock.)
Yes, the L2 HW prefetcher is very helpful!
For example, below are results from my machine (i7-6700HQ) running tinymembench. The first column of results is with all prefetchers on; the second column is with the L2 streamer off (but all other prefetchers still on).
This test uses 32 MiB source and destination buffers, which are much larger than the L3 on my machine, so it will be testing mostly misses to DRAM.
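For anyone wanting to reproduce the on/off comparison: the usual way to turn just the L2 streamer off on these CPUs is to set bit 0 of MSR 0x1A4 on every core (e.g. wrmsr -a 0x1a4 1 from msr-tools, and wrmsr -a 0x1a4 0 to restore it). Below is a rough C sketch of the same thing via the Linux msr driver; the bit assignment is my reading of Intel's hardware-prefetcher-control documentation, so verify it for your own CPU before relying on it.

```c
/* Sketch: toggle the L2 streamer by flipping bit 0 of MSR 0x1A4 on every
 * core, via the Linux msr driver.  Needs root and `modprobe msr`.
 * The MSR layout (bit 0 = L2 streamer disable) is an assumption taken from
 * Intel's prefetcher-control disclosure, not something verified here. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_PREFETCH_CONTROL    0x1a4          /* per-core prefetcher control MSR */
#define L2_STREAMER_DISABLE_BIT (1ULL << 0)

static int toggle_cpu(int cpu, int disable) {
    char path[64];
    uint64_t val;
    snprintf(path, sizeof path, "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_RDWR);
    if (fd < 0) return -1;                      /* no such CPU, or no permission */
    if (pread(fd, &val, sizeof val, MSR_PREFETCH_CONTROL) != sizeof val) {
        close(fd);
        return -1;
    }
    if (disable) val |=  L2_STREAMER_DISABLE_BIT;
    else         val &= ~L2_STREAMER_DISABLE_BIT;
    pwrite(fd, &val, sizeof val, MSR_PREFETCH_CONTROL);
    close(fd);
    return 0;
}

int main(int argc, char **argv) {
    int disable = (argc > 1 && argv[1][0] == '1');   /* "1" = streamer off, "0" = on */
    int cpu = 0;
    while (toggle_cpu(cpu, disable) == 0) {
        printf("cpu %d: L2 streamer %s\n", cpu, disable ? "off" : "on");
        cpu++;
    }
    if (cpu == 0)
        perror("open /dev/cpu/0/msr (need root and `modprobe msr`)");
    return 0;
}
```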
```
==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be         ==
==         copied per second (adding together read and writen          ==
==         bytes would have provided twice higher numbers)             ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the  ==
==         destination (source -> L1 cache, L1 cache -> destination)   ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in   ==
==         brackets                                                    ==
==========================================================================

 L2 streamer                                                ON          OFF
 C copy backwards                                 :   7962.4 MB/s    4430.5 MB/s
 C copy backwards (32 byte blocks)                :   7993.5 MB/s    4467.0 MB/s
 C copy backwards (64 byte blocks)                :   7989.9 MB/s    4438.0 MB/s
 C copy                                           :   8503.1 MB/s    4466.6 MB/s
 C copy prefetched (32 bytes step)                :   8729.2 MB/s    4958.4 MB/s
 C copy prefetched (64 bytes step)                :   8730.7 MB/s    4958.4 MB/s
 C 2-pass copy                                    :   6171.2 MB/s    3368.7 MB/s
 C 2-pass copy prefetched (32 bytes step)         :   6193.1 MB/s    4104.2 MB/s
 C 2-pass copy prefetched (64 bytes step)         :   6198.8 MB/s    4101.6 MB/s
 C fill                                           :  13372.4 MB/s   10610.5 MB/s
 C fill (shuffle within 16 byte blocks)           :  13379.4 MB/s   10547.5 MB/s
 C fill (shuffle within 32 byte blocks)           :  13365.8 MB/s   10636.9 MB/s
 C fill (shuffle within 64 byte blocks)           :  13588.7 MB/s   10588.3 MB/s
 ---
 standard memcpy                                  :  11550.7 MB/s    8216.3 MB/s
 standard memset                                  :  23188.7 MB/s   22686.8 MB/s
 ---
 MOVSB copy                                       :   9458.4 MB/s    6523.7 MB/s
 MOVSD copy                                       :   9474.5 MB/s    6510.7 MB/s
 STOSB fill                                       :  23329.0 MB/s   22901.5 MB/s
 SSE2 copy                                        :   9073.1 MB/s    4970.3 MB/s
 SSE2 nontemporal copy                            :  12647.1 MB/s    7492.5 MB/s
 SSE2 copy prefetched (32 bytes step)             :   9106.0 MB/s    5069.8 MB/s
 SSE2 copy prefetched (64 bytes step)             :   9113.5 MB/s    5063.1 MB/s
 SSE2 nontemporal copy prefetched (32 bytes step) :  11770.8 MB/s    7453.4 MB/s
 SSE2 nontemporal copy prefetched (64 bytes step) :  11937.1 MB/s    7712.1 MB/s
 SSE2 2-pass copy                                 :   7092.8 MB/s    4355.2 MB/s
 SSE2 2-pass copy prefetched (32 bytes step)      :   7001.4 MB/s    4585.1 MB/s
 SSE2 2-pass copy prefetched (64 bytes step)      :   7055.1 MB/s    4557.9 MB/s
 SSE2 2-pass nontemporal copy                     :   5043.2 MB/s    3263.3 MB/s
 SSE2 fill                                        :  14087.3 MB/s   10947.1 MB/s
 SSE2 nontemporal fill                            :  33134.5 MB/s   32774.3 MB/s
```
In these tests having the L2 streamer is never slower and is often nearly twice as fast.
In general, you might notice the following patterns in the results:
- Copies generally seem to be more affected than fills.
- The standard memset and STOSB fill results (these boil down to the same thing on this platform) are the least affected, with the prefetched result being only a few % faster than without.
- Standard memcpy is probably the only copy here that uses 32-byte AVX instructions, and it is among the least affected of the copies - but prefetching on is still ~40% faster than without.
I also tried turning the other three prefetchers on and off, but they generally had almost no measurable effect for this benchmark.