MCU programming - C++ O2 optimization breaks while loop
The code optimizer has analyzed the code and from what it can see the value of choice
will never change. And since it will never change, there's no point in checking it in the first place.
The fix is to declare the variable volatile
so that the compiler is forced to emit code that checks its value regardless of the optimization level used.
(Cross site duplicate on SO about the thread case, rather than interrupt/signal-handler case). Also related: When to use volatile with multi threading?
A data race on a non-atomic
variable1 is Undefined Behaviour in C++112. i.e. potentially-concurrent read+write or write+write without any synchronization to provide a happens-before relationship, e.g. a mutex or release/acquire synchronization.
The compiler is allowed to assume that no other thread has modified choice
between two reads of it (because that would be data-race UB (Undefined Behaviour)), so it can CSE and hoist the check out of the loop.
This is in fact what gcc does (and most other compilers too):
while(!choice){}
optimizes into asm that looks like this:
if(!choice) // conditional branch outside the loop to skip it
while(1){} // infinite loop, like ARM .L2: b .L2
This happens in the target-independent part of gcc, so it applies to all architectures.
You want the compiler to be able to do this kind of optimization, because real code contains stuff like for (int i=0 ; i < global_size ; i++ ) { ... }
. You want the compiler to be able to load the global outside the loop, not keep re-loading it every loop iteration, or for every access later in a function. Data has to be in registers for the CPU to work with it, not memory.
The compiler could even assume the code is never reached with choice == 0
, because an infinite loop with no side effects is Undefined Behaviour. (Reads / writes of non-volatile
variables don't count as side effects). Stuff like printf
is a side-effect, but calling a non-inline function would also stop the compiler from optimizing away the re-reads of choice
, unless it was static int choice
. (Then the compiler would know that printf
couldn't modify it, unless something in this compilation unit passed &choice
to a non-inline function. i.e. escape analysis might allow the compiler to prove that static int choice
couldn't be modified by a call to an "unknown" non-inline function.)
In practice real compilers don't optimize away simple infinite loops, they assume (as a quality-of-implementation issue or something) that you meant to write while(42){}
. But an example in https://en.cppreference.com/w/cpp/language/ub shows that clang will optimize away an infinite loop if there was code with no side effects in it which it optimized away.
Officially supported 100% portable / legal C++11 ways to do this:
You don't really have multiple threads, you have an interrupt handler. In C++11 terms, that's exactly like a signal handler: it can run asynchronously with your main program, but on the same core.
C and C++ have had a solution for that for a long time: volatile sig_atomic_t
is guaranteed to be ok to write in a signal handler and read in your main program
An integer type which can be accessed as an atomic entity even in the presence of asynchronous interrupts made by signals.
void reader() {
volatile sig_atomic_t shared_choice;
auto handler = a lambda that sets shared_choice;
... register lambda as interrupt handler
sig_atomic_t choice; // non-volatile local to read it into
while((choice=shared_choice) == 0){
// if your CPU has any kind of power-saving instruction like x86 pause, do it here.
// or a sleep-until-next-interrupt like x86 hlt
}
... unregister it.
switch(choice) {
case 1: goto constant;
...
case 0: // you could build the loop around this switch instead of a separate spinloop
// but it doesn't matter much
}
}
Other volatile
types are not guaranteed by the standard to be atomic (although in practice they are up to at least pointer width on normal architectures like x86 and ARM, because locals will be naturally aligned. uint8_t
is a single byte, and modern ISAs can atomically store a byte without a read/modify/write of the surrounding word, despite any misinformation you may have heard about word-oriented CPUs).
What you'd really like is a way to make a specific access volatile, instead of needing a separate variable. You might be able to do that with *(volatile sig_atomic_t*)&choice
, like the Linux kernel's ACCESS_ONCE
macro, but Linux compiles with strict-aliasing disabled to make that kind of thing safe. I think in practice that would work on gcc/clang, but I think it's not strictly legal C++.
With std::atomic<T>
for lock-free T
(with std::memory_order_relaxed
to get efficient asm without barrier instructions, like you can get from volatile
)
C++11 introduce a standard mechanism to handle the case where one thread reads a variable while another thread (or signal handler) writes it.
It provides control over memory-ordering, with sequential-consistency by default, which is expensive and not needed for your case. std::memory_order_relaxed
atomic loads/stores will compile to the same asm (for your K60 ARM Cortex-M4 CPU) as volatile uint8_t
, with the advantage of letting you use a uint8_t
instead of whatever width sig_atomic_t
is, while still avoiding even a hint of C++11 data race UB.
(Of course it's only portable to platforms where atomic<T>
is lock-free for your T; otherwise async access from the main program and an interrupt handler can deadlock. C++ implementations aren't allowed to invent writes to surrounding objects, so if they have uint8_t
at all, it should be lock-free atomic. Or just use unsigned char
. But for types too wide to be naturally atomic, atomic<T>
will use a hidden lock. With regular code unable to ever wake up and release a lock while the only CPU core is stuck in an interrupt handler, you're screwed if a signal/interrupt arrives while that lock is held.)
#include <atomic>
#include <stdint.h>
volatile uint8_t v;
std::atomic<uint8_t> a;
void a_reader() {
while (a.load(std::memory_order_relaxed) == 0) {}
// std::atomic_signal_fence(std::memory_order_acquire); // optional
}
void v_reader() {
while (v == 0) {}
}
Both compile to the same asm, with gcc7.2 -O3 for ARM, on the Godbolt compiler explorer
a_reader():
ldr r2, .L7 @ load the address of the global
.L2: @ do {
ldrb r3, [r2] @ zero_extendqisi2
cmp r3, #0
beq .L2 @ }while(choice eq 0)
bx lr
.L7:
.word .LANCHOR0
void v_writer() {
v = 1;
}
void a_writer() {
// a = 1; // seq_cst needs a DMB, or x86 xchg or mfence
a.store(1, std::memory_order_relaxed);
}
ARM asm for both:
ldr r3, .L15
movs r2, #1
strb r2, [r3, #1]
bx lr
So in this case for this implementation, volatile
can do the same thing as std::atomic
. On some platforms, volatile
might imply using special instructions necessary for accessing memory-mapped I/O registers. (I'm not aware of any platforms like that, and it's not the case on ARM. But that's one feature of volatile
you definitely don't want).
With atomic
, you can even block compile-time reordering with respect to non-atomic variables, with no extra runtime cost if you're careful.
Don't use .load(mo_acquire)
, that will make asm that's safe with respect to other threads running on other cores at the same time. Instead, use relaxed loads/stores and use atomic_signal_fence
(not thread_fence) after a relaxed load, or before a relaxed store, to get acquire or release ordering.
A possible use-case would be an interrupt handler that writes a small buffer and then sets an atomic flag to indicate that it's ready. Or an atomic index to specify which of a set of buffers.
Note that if the interrupt handler can run again while the main code is still reading the buffer, you have data race UB (and an actual bug on real hardware) In pure C++ where there are no timing restrictions or guarantees, you might have theoretical potential UB (which the compiler should assume never happens).
But it's only UB if it actually happens at runtime; If your embedded system has realtime guarantees then you may be able to guarantee that the reader can always finish checking the flag and reading the non-atomic data before the interrupt can fire again, even in the worst-case where some other interrupt comes in and delays things. You might need some kind of memory barrier to make sure the compiler doesn't optimize by continuing to reference the buffer, instead of whatever other object you read the buffer into. The compiler doesn't understand that UB-avoidance requires reading the buffer once right away, unless you tell it that somehow. (Something like GNU C asm("":::"memory")
should do the trick, or even asm(""::"m"(shared_buffer[0]):"memory")
).
Of course, read/modify/write operations like a++
will compile differently from v++
, to a thread-safe atomic RMW, using an LL/SC retry loop, or an x86 lock add [mem], 1
. The volatile
version will compile to a load, then a separate store. You can express this with atomics like:
uint8_t non_atomic_inc() {
auto tmp = a.load(std::memory_order_relaxed);
uint8_t old_val = tmp;
tmp++;
a.store(tmp, std::memory_order_relaxed);
return old_val;
}
If you actually want to increment choice
in memory ever, you might consider volatile
to avoid syntax pain if that's what you want instead of actual atomic increments. But remember that every access to a volatile
or atomic
is an extra load or store, so you should really just choose when to read it into a non-atomic / non-volatile local.
Compilers don't currently optimize atomics, but the standard allows it in cases that are safe unless you use volatile atomic<uint8_t> choice
.
Again what we're really like is atomic
access while the interrupt handler is registered, then normal access.
C++20 provides this with std::atomic_ref<>
But neither gcc nor clang actually support this in their standard library yet (libstdc++ or libc++). no member named 'atomic_ref' in namespace 'std'
, with gcc and clang -std=gnu++2a
. There shouldn't be a problem actually implementing it, though; GNU C builtins like __atomic_load
work on regular objects, so atomicity is on a per-access basis rather than a per-object basis.
void reader(){
uint8_t choice;
{ // limited scope for the atomic reference
std::atomic_ref<uint8_t> atomic_choice(choice);
auto choice_setter = [&atomic_choice] (int x) { atomic_choice = x; };
ui::Context::addEventListener(ui::EventType::JOYSTICK_DOWN, &choice_setter);
while(!atomic_choice) {}
ui::Context::removeEventListener(ui::EventType::JOYSTICK_DOWN, &choice_setter);
}
switch(choice) { // then it's a normal non-atomic / non-volatile variable
}
}
You probably end up with one extra load of the variable vs. while(!(choice = shared_choice)) ;
, but if you're calling a function between the spinloop and when you use it, it's probably easier not to force the compiler to record the last read result in another local (which it may have to spill). Or I guess after the deregister you could do a final choice = shared_choice;
to make it possible for the compiler to keep choice
in a register only, and re-read the atomic or volatile.
Footnote 1: volatile
Even data-races on volatile
are technically UB, but in that case the behaviour you get in practice on real implementations is useful, and normally identical to atomic
with memory_order_relaxed
, if you avoid atomic read-modify-write operations.
When to use volatile with multi threading? explains in more detail for the multi-core case: basically never, use std::atomic
instead (with memory_order relaxed).
Compiler-generated code that loads or stores uint8_t
is atomic on your ARM CPU. Read/modify/write like choice++
would not be an atomic RMW on volatile uint8_t choice
, just an atomic load, then a later atomic store which could step on other atomic stores.
Footnote 2: C++03:
Before C++11 the ISO C++ standard didn't say anything about threads, but older compilers worked the same way; C++11 basically just made it official that the way compilers already work is correct, applying the as-if rule to preserve the behaviour of a single thread only unless you use special language features.