Why does using the same cache-line from multiple threads not cause serious slowdown?
What you are seeing is basically the effect of the store buffer combined with store-to-load forwarding allowing each core to work mostly independently, despite sharing a cache line. As we will see below, it is truly a weird case where more contention is bad, up to a point, then even more contention suddenly makes things really fast!
Now with the conventional view of contention your code seems like something that will be high contention and therefore much slower than ideal. What happens, however, is that as soon as each core gets a single pending write in its write buffer, all later reads can be satisfied from the write buffer (store forwarding), and later writes just go into the buffer as well even after the core has lost ownership of the cache line. This turns most of the work into a totally local operation. The cache line is still bouncing around between the cores, but it's decoupled from the core execution path and is only needed to actually commit the stores now and then1.
The std::atomic
version can't use this magic at all since it has to use lock
ed operations to maintain atomicity and defeat the store buffer, so you see both the full cost of contention and the cost of the long-latency atomic operations2.
Let's try to actually collect some evidence that this is what's occurring. All of the discussion below deals with the non-atomic
version of the benchmark that uses volatile
to force reads and writes from buffer
.
Let's first check the assembly, to make sure it's what we expect:
0000000000400c00 <fn(unsigned char volatile*)>:
400c00: ba 00 65 cd 1d mov edx,0x1dcd6500
400c05: 0f 1f 00 nop DWORD PTR [rax]
400c08: 0f b6 07 movzx eax,BYTE PTR [rdi]
400c0b: 83 c0 01 add eax,0x1
400c0e: 83 ea 01 sub edx,0x1
400c11: 88 07 mov BYTE PTR [rdi],al
400c13: 75 f3 jne 400c08 <fn(unsigned char volatile*)+0x8>
400c15: f3 c3 repz ret
It's straightforward: a five instruction loop with a byte load, an increment of the loaded byte, a byte store, and finally the loop increment and conditional jump back to the top. Here, gcc has missed an optimization by breaking up the sub
and jne
, inhibiting macro-fusion, but overall it's OK and the store-forwarding latency is going to limit the loop in any case.
Next, let's take a look at the number of L1D misses. Every time a core needs to write into the line that has been stolen away, it will suffer an L1D miss, which we can measure with perf
. First, the single threaded (N=1
) case:
$ perf stat -e task-clock,cycles,instructions,L1-dcache-loads,L1-dcache-load-misses ./cache-line-increment
Performance counter stats for './cache-line-increment':
1070.188749 task-clock (msec) # 0.998 CPUs utilized
2,775,874,257 cycles # 2.594 GHz
2,504,256,018 instructions # 0.90 insn per cycle
501,139,187 L1-dcache-loads # 468.272 M/sec
69,351 L1-dcache-load-misses # 0.01% of all L1-dcache hits
1.072119673 seconds time elapsed
It is about what we expect: essentially zero L1D misses (0.01% of the total, probably mostly from interrupts and other code outside the loop), and just over 500,000,000 hits (matching almost exactly the number of loop iterations). Note also that we can easily calculate the cycles per iteration: about 5.55. This primarily reflects the cost of store-to-load forwarding, plus one cycle for the increment, which is a carried dependency chain as the same location is repeatedly updated (and volatile
means it can't be hoisted into a register).
Let's take a look at the N=4
case:
$ perf stat -e task-clock,cycles,instructions,L1-dcache-loads,L1-dcache-load-misses ./cache-line-increment
Performance counter stats for './cache-line-increment':
5920.758885 task-clock (msec) # 3.773 CPUs utilized
15,356,014,570 cycles # 2.594 GHz
10,012,249,418 instructions # 0.65 insn per cycle
2,003,487,964 L1-dcache-loads # 338.384 M/sec
61,450,818 L1-dcache-load-misses # 3.07% of all L1-dcache hits
1.569040529 seconds time elapsed
As expected the L1 loads jumps from 500 million to 2 billion, since there are 4 threads each doing the 500 million loads. The number of L1D misses also jumped by about a factor of 1,000, to about 60 million. Still, that number is not a lot compared to the 2 billion loads (and 2 billion stores - not shown, but we know they are there). That's ~33 loads and ~33 stores for every miss. It also means 250 cycles between each miss.
That doesn't really fit the model of the cache line bouncing around erratically between the cores, where as soon as a core gets the line, another core demands it. We know that lines bounce around between cores sharing an L2 in perhaps 20-50 cycles, so the ratio of one miss every 250 cycles seems way to low.
Two Hypotheses
A couple ideas spring to mind for the above described behavior:
Perhaps the MESI protocol variant used in this chip is "smart" and recognizes that one line is hot among several cores, but only a small amount of work is being done each time a core gets the lock and the line spends more time moving between L1 and L2 than actually satisfying loads and stores for some core. In light of this, some smart component in the coherence protocol decides to enforce some kind of minimum "ownership time" for each line: after a core gets the line, it will keep it for N cycles, even if demanded by another core (the other cores just have to wait).
This would help balance out the overhead of cache line ping-pong with real work, at the cost of "fairness" and responsiveness of the other cores, kind of like the trade-off between unfair and fair locks, and counteracting the effect described here, where the faster & fairer the coherency protocol is, the worse some (usually synthetic) loops may perform.
Now I've never heard of anything like that (and the immediately previous link shows that at least in the Sandy-Bridge era things were moving in the opposite direction), but it's certainly possible!
The store-buffer effect described is actually occurring, so most operations can complete almost locally.
Some Tests
Let's try to distinguish two cases with some modifications.
Reading and Writing Distinct Bytes
The obvious approach is to change the fn()
work function so that the threads still contend on the same cache line, but where store-forwarding can't kick in.
How about we just read from location x
and then write to location x + 1
? We'll give each thread two consecutive locations (i.e., thr[i] = std::thread(&fn, &buffer[i*2])
) so each thread is operating on two private bytes. The modified fn()
looks like:
for (int i=0; i<500000000; i++)
unsigned char temp = p[0];
p[1] = temp + 1;
}
The core loop is pretty much identical to earlier:
400d78: 0f b6 07 movzx eax,BYTE PTR [rdi]
400d7b: 83 c0 01 add eax,0x1
400d7e: 83 ea 01 sub edx,0x1
400d81: 88 47 01 mov BYTE PTR [rdi+0x1],al
400d84: 75 f2 jne 400d78
The only thing that's changed is that we write to [rdi+0x1]
rather than [rdi]
.
Now as I mentioned above, the original (same location) loop is actually running fairly slowly at about 5.5 cycles per iteration even in the best-case single-threaded case, because of the loop-carried load->add->store->load...
dependency. This new code breaks that chain! The load no longer depends on the store so we can execute everything pretty much in parallel and I expect this loop to run at about 1.25 cycles per iteration (5 instructions / CPU width of 4).
Here's the single threaded case:
$ perf stat -e task-clock,cycles,instructions,L1-dcache-loads,L1-dcache-load-misses ./cache-line-increment
Performance counter stats for './cache-line-increment':
318.722631 task-clock (msec) # 0.989 CPUs utilized
826,349,333 cycles # 2.593 GHz
2,503,706,989 instructions # 3.03 insn per cycle
500,973,018 L1-dcache-loads # 1571.815 M/sec
63,507 L1-dcache-load-misses # 0.01% of all L1-dcache hits
0.322146774 seconds time elapsed
So about 1.65 cycles per iteration3, about about three times faster versus incrementing the same location.
How about 4 threads?
$ perf stat -e task-clock,cycles,instructions,L1-dcache-loads,L1-dcache-load-misses ./cache-line-increment
Performance counter stats for './cache-line-increment':
22299.699256 task-clock (msec) # 3.469 CPUs utilized
57,834,005,721 cycles # 2.593 GHz
10,038,366,836 instructions # 0.17 insn per cycle
2,011,160,602 L1-dcache-loads # 90.188 M/sec
237,664,926 L1-dcache-load-misses # 11.82% of all L1-dcache hits
6.428730614 seconds time elapsed
So it's about 4 times slower than the same location case. Now rather than being just a bit slower than the single-threaded case it is about 20 times slower. This is the contention you've been looking for! Now also that the number of L1D misses has increased by a factor of 4 as well, nicely explaining the performance degradation and consistent with the idea that when store-to-load forwarding can't hide the contention, misses will increase by a lot.
Increasing the Distance Between Stores
Another approach would be to increase the distance in time/instructions between the store and the subsequent load. We can do this by incrementing SPAN
consecutive locations in the fn()
method, rather than always the same location. E.g, if SPAN
is 4, increment consecutively 4 locations like:
for (long i=0; i<500000000 / 4; i++) {
p[0]++;
p[1]++;
p[2]++;
p[3]++;
}
Note that we are still incrementing 500 million locations in total, just spreading out the increments among 4 bytes. Intuitively you would expect overall performance to increase since you now have SPAN
parallel dependency with length 1/SPAN
, so in the case above you might expect performance to improve by a factor of 4, since the 4 parallel chains can proceed at about 4 times the total throughput.
Here's what we actually get for time (measured in cycles) for the 1 thread and 3 thread4, for SPAN
values from 1 to 20:
Initially you see performance increase substantially in both single and multi-threaded cases; the increase from a SPAN
of one to two and three is close to the theoretical expected in the case of perfect parallelism for both cases.
The single-threaded case reaches an asymptote of about 4.25x faster than the single-location write: at this point the store-forwarding latency isn't the bottleneck and other bottlenecks have taken over (max IPC and store port contention, mostly).
The multi-threaded case is very different, however! Once you hit a SPAN
of about 7, the performance rapidly gets worse, leveling out at about 2.5 times worse than the SPAN=1
case and almost 10x worse compared to the best performance at SPAN=5
. What happens is that store-to-load forwarding stops occurring because the store and subsequent load are far enough apart in time/cycles that the store has retired to L1, so the load actually has to get the line and participate in MESI.
Also plotted is the L1D misses, which as mentioned above is indicative of "cache line transfers" between cores. The single-threaded case has essentially zero, and they are uncorrelated with the performance. The performance of the multi-threaded case, however, pretty much tracks exactly the cache misses. With SPAN
values in the 2 to 6 range, where store-forwarding is still working, there are proportionally fewer misses. Evidently the core is able to "buffer up" more stores between each cache line transfer since the core loop is faster.
Another way to think of it is that in the contended case L1D misses are basically constant per unit-time (which makes sense, since they are basically tied to the L1->L2->L1 latency, plus some coherency protocol overhead), so the more work you can do in between the cache line transfers, the better.
Here's the code for the multi-span case:
void fn(Type *p) {
for (long i=0; i<500000000 / SPAN; i++) {
for (int j = 0; j < SPAN; j++) {
p[j]++;
}
}
}
The bash script to run perf
for all SPAN
value from 1 to 20:
PERF_ARGS=${1:--x, -r10}
for span in {1..20}; do
g++ -std=c++11 -g -O2 -march=native -DSPAN=$span cache-line-increment.cpp -lpthread -o cache-line-increment
perf stat ${PERF_ARGS} -e cycles,L1-dcache-loads,L1-dcache-load-misses,machine_clears.count,machine_clears.memory_ordering ./cache-line-increment
done
Finally, "transpose" the results into proper CSV:
FILE=result1.csv; for metric in cycles L1-dcache-loads L1-dcache-load-misses; do { echo $metric; grep $metric $FILE | cut -f1 -d,; } > ${metric}.tmp; done && paste -d, *.tmp
A Final Test
There's a final test that you can do to show that each core is effectively doing most of its work in private: use the version of the benchmark where the threads work on the same location (which doesn't change the performance characteristics) examine the sum of the final counter values (you'd need int
counters rather than char
). If everything was atomic, you'd have a sum of 2 billion, and in the non-atomic case how close the total is to that value is a rough measure of how frequently the cores were passing around the lines. If the cores are working almost totally privately, the value would be closer to 500 million than 2 billion, and I guess that's what you'll find (a value fairly close to 500 million).
With some more clever incrementing, you can even have each thread track how often the value they incremented came from their last increment rather than another threads increment (e.g., by using a few bits of the value to stash a thread identifier). With an even more clever test you could practically reconstruct thee way the cache line moved around between the cores (is there a pattern, e.g., does core A prefer to hand off to core B?) and which cores contributed most to the final value, etc.
That's all left as an exercise :).
1 On top of that, if Intel has a coalescing store buffer where later stores that fully overlap earlier ones kill the earlier stores, it would only have to commit one value to L1 (the latest store) every time it gets the line.
2 You can't really separate the two effects here, but we will do it later by defeating store-to-load forwarding.
3 A bit more than I expected, perhaps bad scheduling leading to port pressure. If gcc
would just all the sub
and jne
to fuse, it runs at 1.1 cycles per iteration (still worse than the 1.0 I'd expect). It will do that I use -march=haswell
instead of -march=native
but I'm not going to go back and change all the numbers.
4 The results hold with 4 threads as well: but I only have 4 cores and I'm running stuff like Firefox in the background, so using 1 less core makes the measurements a lot less noisy. Measuring time in cycles helps a lot too.
5 On this CPU architecture, store forwarding where the load arrives before the store data is ready seem to alternate between 4 and 5 cycles, for an average of 4.5 cycles.
The atomic version has to ensure that some other thread will be able to read the result in a sequentially consistent fashion. So there are fences for each write.
The volatile version does not make any relationships visible to the other cores, so does not try and synchronize the memory so it is visible on other cores. For a multi-threaded system using C++11 or newer, volatile is not a mechanism for communicating between threads.