Can atomic loads be merged in the C++ memory model?

Yes, because we can not observe the difference!

An implementation is allowed to turn your snippet into the following (pseudo-implementation).

int __loaded_foo = foo;

int x = __loaded_foo;
int y = __loaded_foo;

The reason is that there is no way for you to observe the difference between the above, and two separate loads of foo given the guarantees of sequential-consistency.

Note: It is not just the compiler that can make such an optimization, the processor can simply reason that there is no way in which you can observe the difference and load the value of foo once — even though the compiler might have asked it to do it twice.





Explanation

Given a thread that keeps on updating foo in an incremental fashion, what you are guaranteed is that y will have either the same, or a later written value, when compared to the contents of x.

// thread 1 - The Writer
while (true) {
  foo += 1;
}
// thread 2 - The Reader
while (true) {
  int x = foo;
  int y = foo;

  assert (y >= x); // will never fire, unless UB (foo has reached max value)
}                  

Imagine the writing thread for some reason pauses its execution on every iteration (because of a context-switch or other implementation defined reason); there is no way in which you can prove that this is what is causing both x and y to have the same value, or if it is because of a "merge optimization".


In other words, we have to potential outcomes given the code in this section:

  1. No new value is written to foo between the two reads (x == y).
  2. A new value is written to foo between the two reads (x < y).

Since any of the two can happen, an implementation is free to narrow down the scope to simply always execute one of them; we can in no way observe the difference.





What does the Standard say?

An implementation can make whatever changes it wants as long as we cannot observe any difference between the behavior which we expressed, and the behavior during execution.

This is covered in [intro.execution]p1:

The semantic descriptions in this International Standard define a parameterized nondeterministic abstract machine. This International Standard places no requirement on the structure of conforming implementations. In particular, they need not copy or emulate the structure of the abstract machine. Rather, conforming implementations are required to emulate (only) the observable behavior of the abstract machine as explained below.

Another section which makes it even more clear [intro.execution]p5:

A conforming implementation executing a well-formed program shall produce the same observable behavior as one of the possible executions of the corresponding instance of the abstract machine with the same program and the same input.

Further Reading:





What about polling in a loop?

// initial state
std::atomic<int> foo = 0;
// thread 1
while (true) {
  if (foo)
    break;
}
// thread 2
foo = 1

Question: Given the reasoning in the previous sections, could an implementation simply read foo once in thread 1, and then never break out of the loop even if thread 2 writes to foo?

The answer; No.

In a sequentially-consistent environment we are guaranteed that a write to foo in thread 2 will become visible in thread 1; this means that when that write has happened, thread 1 must observe this change of state.

Note: An implementation can turn two reads into a single one because we cannot observe the difference (one fence is just as effective as two), but it cannot completely disregard a read that exists by itself.

Note: The contents of this section is guaranteed by [atomics.order]p3-4.





What if I really want to prevent this form of "optimization"?

If you would like to force the implementation to actually read the value of some variable at every point where you have written it you should look into usage of volatile (note that this in no way enhances thread-safety).

But in practice compilers don't optimize atomics, and the standards group has recommended against using volatile atomic for this kind of reason until the dust settles on this issue. See

  • http://wg21.link/n4455
  • http://wg21.link/p0062
  • Why don't compilers merge redundant std::atomic writes?
  • and a duplicate of this question, Can and does the compiler optimize out two atomic loads?