How do memory_order_seq_cst and memory_order_acq_rel differ?
http://en.cppreference.com/w/cpp/atomic/memory_order has a good example at the bottom that only works with memory_order_seq_cst. Essentially, memory_order_acq_rel provides read and write orderings relative to the atomic variable, while memory_order_seq_cst provides read and write ordering globally. That is, the sequentially consistent operations are visible in the same order across all threads.
The example boils down to this:
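#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> x{false}, y{false};
std::atomic<int> z{0};
void a() { x.store(true); }  // store()/load() default to memory_order_seq_cst
void b() { y.store(true); }
void c() { while (!x.load()) {} if (y.load()) ++z; }
void d() { while (!y.load()) {} if (x.load()) ++z; }
int main() {
    std::thread ta(a), tb(b), tc(c), td(d);  // kick off a, b, c, d
    ta.join(); tb.join(); tc.join(); td.join();  // join all threads
    assert(z.load() != 0);  // guaranteed only when every operation is seq_cst
}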
Operations on z are guarded by two atomic variables, not one, so you can't use acquire-release semantics to enforce that z is always incremented.
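For contrast, here is a minimal sketch of the case that acquire-release does handle: one flag guarding one piece of data. The names payload, ready, and run_handoff are made up for illustration and do not come from the cppreference example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: hands one value from a producer to a consumer
// through a single atomic flag, using only release/acquire.
int run_handoff() {
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};  // the single guarding variable
    int seen = -1;

    std::thread producer([&] {
        payload = 42;                                  // write the data first
        ready.store(true, std::memory_order_release);  // then publish it
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) {}  // spin until published
        seen = payload;  // ordered after the producer's write via `ready`
        assert(seen == 42);
    });
    producer.join();
    consumer.join();
    return seen;
}
```

Because everything is ordered through the one variable ready, release/acquire is enough here. The four-thread example has no such single variable, which is why it needs the global order that seq_cst provides.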
On ISAs like x86 where atomics map to barriers, and the actual machine model includes a store buffer:
seq_cst stores require flushing the store buffer, so this thread's later reads are delayed until after the store is globally visible.
acquire or release do not have to flush the store buffer. Normal x86 loads and stores have essentially acq and rel semantics. (seq_cst plus a store buffer with store forwarding.)
But x86 atomic RMW operations always get promoted to seq_cst because the x86 asm lock prefix is a full memory barrier. Other ISAs can do relaxed or acq_rel RMWs in asm, with the store side being able to do limited reordering with later stores. (But not in ways that would make the RMW appear non-atomic: see "For purposes of ordering, is atomic read-modify-write one operation or two?")
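As a rough illustration of an RMW that asks only for acq_rel, here is a sketch of a shared counter. The function name count_with_rmw and its parameters are hypothetical. Per the paragraph above, on x86 the fetch_add would be expected to end up as strong as seq_cst anyway, while other ISAs may take advantage of the weaker ordering.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: several threads bump one counter with an acq_rel RMW.
long count_with_rmw(int num_threads, int increments_per_thread) {
    assert(num_threads > 0 && increments_per_thread >= 0);
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < increments_per_thread; ++i) {
                // The requested ordering only affects how this RMW orders
                // with surrounding operations; the increment stays atomic.
                counter.fetch_add(1, std::memory_order_acq_rel);
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter.load(std::memory_order_relaxed);  // all writers have joined
}
```

Whatever ordering is requested, no increments are lost, so count_with_rmw(4, 10000) should return 40000 on any conforming implementation.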
https://preshing.com/20120515/memory-reordering-caught-in-the-act is an instructive example of the difference between a seq_cst store and a plain release store. (It's actually mov + mfence vs. plain mov in x86 asm. In practice xchg is a more efficient way to do a seq_cst store on most x86 CPUs, but GCC does use mov + mfence.)
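The store-then-load pattern from that article can be sketched in C++ roughly as follows. This is an assumption-laden harness, not the article's code: count_both_zero and the variable names are hypothetical, and spawning fresh threads per iteration gives far less overlap than the article's setup.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: returns how many of `iterations` runs ended with
// both threads reading 0 from the other thread's variable.
int count_both_zero(int iterations) {
    assert(iterations >= 0);
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;
        std::thread t1([&] {
            x.store(1, std::memory_order_seq_cst);   // store to x ...
            r1 = y.load(std::memory_order_seq_cst);  // ... then load y
        });
        std::thread t2([&] {
            y.store(1, std::memory_order_seq_cst);   // store to y ...
            r2 = x.load(std::memory_order_seq_cst);  // ... then load x
        });
        t1.join();
        t2.join();
        if (r1 == 0 && r2 == 0) ++both_zero;  // forbidden under seq_cst
    }
    return both_zero;
}
```

With seq_cst on all four operations, the single total order rules out r1 == 0 && r2 == 0, so the count must be zero. Weakening the stores to memory_order_release (a plain mov on x86) would permit that outcome, although a harness this simple is unlikely to observe it often.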
Fun fact: AArch64's LDAR acquire-load instruction is actually a sequential-acquire, having a special interaction with STLR. Not until ARMv8.3 LDAPR can arm64 do plain acquire operations that can reorder with earlier release and seq_cst stores (STLR). (seq_cst loads still use LDAR because they need that interaction with STLR to recover sequential consistency; seq_cst and release stores both use STLR.)
With STLR / LDAR you get sequential consistency, but only having to drain the store buffer before the next LDAR, not right away after each seq_cst store before other operations. I think real AArch64 HW does implement it this way, rather than simply draining the store buffer before committing an STLR.
Strengthening rel or acq_rel to seq_cst by using LDAR / STLR doesn't need to be expensive, unless you seq_cst store something, and then seq_cst load something else. Then it's just as bad as x86.
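To make the AArch64 mapping described above concrete, here is a small sketch. The instruction names in the comments simply restate the text; what a particular compiler emits depends on its version and target flags, so treat them as an assumption to check against real compiler output. shared_value and the function names are hypothetical.

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> shared_value{0};  // hypothetical shared variable

// Per the text: release and seq_cst stores both use STLR.
void store_release(int v) { shared_value.store(v, std::memory_order_release); }
void store_seq_cst(int v) { shared_value.store(v, std::memory_order_seq_cst); }

// Per the text: acquire loads use LDAR, or LDAPR where ARMv8.3 is available.
int load_acquire() { return shared_value.load(std::memory_order_acquire); }

// Per the text: seq_cst loads keep using LDAR for its interaction with STLR.
int load_seq_cst() { return shared_value.load(std::memory_order_seq_cst); }

// Single-threaded sanity check that the wrappers read back what was stored.
int round_trip(int v) {
    store_seq_cst(v);
    int r = load_seq_cst();
    assert(r == v);
    return r;
}
```

The C++ source differs only in the memory_order argument, so moving from release to seq_cst on a store changes nothing in the emitted instruction under this mapping. The cost shows up only when an LDAR follows an STLR.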
Some other ISAs (like PowerPC) have more choices of barriers and can strengthen up to mo_rel or mo_acq_rel more cheaply than mo_seq_cst, but their seq_cst can't be as cheap as AArch64; seq_cst stores need a full barrier.
So AArch64 is an exception to the rule that seq_cst stores drain the store buffer on the spot, either with a special instruction or a barrier instruction after. It's not a coincidence that ARMv8 was designed after C++11 / Java / etc. basically settled on seq_cst being the default for lockless atomic operations, so making them efficient was important. It also came after CPU architects had had a few years to think about alternatives to providing barrier instructions or just acquire/release vs. relaxed load/store instructions.