Acquire/release semantics with 4 threads

You are thinking in terms of sequential consistency, the strongest (and default) memory order. If this memory order is used, all accesses to atomic variables constitute a total order, and the assertion indeed cannot be triggered.

However, in this program, a weaker memory order is used (release stores and acquire loads). This means, by definition that you cannot assume a total order of operations. In particular, you cannot assume that changes become visible to other threads in the same order. (Only a total order on each individual variable is guaranteed for any atomic memory order, including memory_order_relaxed.)

The stores to x and y occur on different threads, with no synchronization between them. The loads of x and y occur on different threads, with no synchronization between them. This means it is entirely allowed that thread c sees x && ! y and thread d sees y && ! x. (I'm just abbreviating the acquire-loads here, don't take this syntax to mean sequentially consistent loads.)

Bottom line: Once you use a weaker memory order than sequentially consistent, you can kiss your notion of a global state of all atomics, that is consistent between all threads, goodbye. Which is exactly why so many people recommend sticking with sequential consistency unless you need the performance (BTW, remember to measure if it's even faster!) and are certain of what you are doing. Also, get a second opinion.

Now, whether you will get burned by this, is a different question. The standard simply allows a scenario where the assertion fails, based on the abstract machine that is used to describe the standard requirements. However, your compiler and/or CPU may not exploit this allowance for one reason or another. So it is possible that for a given compiler and CPU, you may never see that the assertion is triggered, in practice. Keep in mind that a compiler or CPU may always use a stricter memory order than the one you asked for, because this can never introduce violations of the minimum requirements from the standard. It may only cost you some performance – but that is not covered by the standard anyway.

UPDATE in response to comment: The standard defines no hard upper limit on how long it takes for one thread to see changes to an atomic by another thread. There is a recommendation to implementers that values should become visible eventually.

There are sequencing guarantees, but the ones pertinent to your example do not prevent the assertion from firing. The basic acquire-release guarantee is that if:

  • Thread e performs a release-store to an atomic variable x
  • Thread f performs an acquire-load from the same atomic variable
  • Then if the value read by f is the one that was stored by e, the store in e synchronizes-with the load in f. This means that any (atomic and non-atomic) store in e that was, in this thread, sequenced before the given store to x, is visible to any operation in f that is, in this thread, sequenced after the given load. [Note that there are no guarantees given regarding threads other than these two!]

So, there is no guarantee that f will read the value stored by e, as opposed to e.g. some older value of x. If it doesn't read the updated value, then also the load does not synchronize with the store, and there are no sequencing guarantees for any of the dependent operations mentioned above.

I liken atomics with lesser memory order than sequentially consistent to the Theory of Relativity, where there is no global notion of simultaneousness.

PS: That said, an atomic load cannot just read an arbitrary older value. For example, if one thread performs periodic increments (e.g. with release order) of an atomic<unsigned> variable, initialized to 0, and another thread periodically loads from this variable (e.g. with acquire order), then, except for eventual wrapping, the values seen by the latter thread must be monotonically increasing. But this follows from the given sequencing rules: Once the latter thread reads a 5, anything that happened before the increment from 4 to 5 is in the relative past of anything that follows the read of 5. In fact, a decrease other than wrapping is not even allowed for memory_order_relaxed, but this memory order does not make any promises to the relative sequencing (if any) of accesses to other variables.


The release-acquire synchronization has (at least) this guarantee: side-effects before a release on a memory location are visible after an acquire on this memory location.

There is no such guarantee if the memory location is not the same. More importantly, there's no total (think global) ordering guarantee.

Looking at the example, thread A makes thread C come out of its loop, and thread B makes thread D come out of its loop.

However, the way a release may "publish" to an acquire (or the way an acquire may "observe" a release) on the same memory location doesn't require total ordering. It's possible for thread C to observe A's release and thread D to observe B's release, and only somewhere in the future for C to observe B's release and for D to observe A's release.


The example has 4 threads because that's the minimum example you can force such non-intuitive behavior. If any of the atomic operations were done in the same thread, there would be an ordering you couldn't violate.

For instance, if write_x and write_y happened on the same thread, it would require that whatever thread observed a change in y would have to observe a change in x.

Similarly, if read_x_then_y and read_y_then_x happened on the same thread, you would observe both changed in x and y at least in read_y_then_x.

Having write_x and read_x_then_y in the same thread would be pointless for the exercise, as it would become obvious it's not synchronizing correctly, as would be having write_x and read_y_then_x, which would always read the latest x.


EDIT:

The way, I am reasoning about this is that if thread a (write_x) stores to x then all the work it has done so far is synced with any other thread that reads x with acquire ordering.

(...) I can't think of any 'run' or memory ordering where z is never incremented. Can someone explain where my reasoning is flawed?

Also, I know The loop read will always be before the if statement read because the acquire prevents this reordering.

That's sequentially consistent order, which imposes a total order. That is, it imposes that write_x and write_y both be visible to all threads one after the other; either x then y or y then x, but the same order for all threads.

With release-acquire, there is no total order. The effects of a release are only guaranteed to be visible to a corresponding acquire on the same memory location. With release-acquire, the effects of write_x are guaranteed to be visible to whoever notices x has changed.

This noticing something changed is very important. If you don't notice a change, you're not synchronizing. As such, thread C is not synchronizing on y and thread D is not synchronizing on x.

Essentially, it's way easier to think of release-acquire as a change notification system that only works if you synchronize properly. If you don't synchronize, you may or may not observe side-effects.

Strong memory model hardware architectures with cache coherence even in NUMA, or languages/frameworks that synchronize in terms of total order, make it difficult to think in these terms, because it's practically impossible to observe this effect.