What exactly is std::atomic?

Each instantiation and full specialization of std::atomic<> represents a type that different threads can simultaneously operate on (their instances), without raising undefined behavior:

Objects of atomic types are the only C++ objects that are free from data races; that is, if one thread writes to an atomic object while another thread reads from it, the behavior is well-defined.

In addition, accesses to atomic objects may establish inter-thread synchronization and order non-atomic memory accesses as specified by std::memory_order.

std::atomic<> wraps operations that, in pre-C++ 11 times, had to be performed using (for example) interlocked functions with MSVC or atomic bultins in case of GCC.

Also, std::atomic<> gives you more control by allowing various memory orders that specify synchronization and ordering constraints. If you want to read more about C++ 11 atomics and memory model, these links may be useful:

  • C++ atomics and memory ordering
  • Comparison: Lockless programming with atomics in C++ 11 vs. mutex and RW-locks
  • C++11 introduced a standardized memory model. What does it mean? And how is it going to affect C++ programming?
  • Concurrency in C++11

Note that, for typical use cases, you would probably use overloaded arithmetic operators or another set of them:

std::atomic<long> value(0);
value++; //This is an atomic op
value += 5; //And so is this

Because operator syntax does not allow you to specify the memory order, these operations will be performed with std::memory_order_seq_cst, as this is the default order for all atomic operations in C++ 11. It guarantees sequential consistency (total global ordering) between all atomic operations.

In some cases, however, this may not be required (and nothing comes for free), so you may want to use more explicit form:

std::atomic<long> value {0};
value.fetch_add(1, std::memory_order_relaxed); // Atomic, but there are no synchronization or ordering constraints
value.fetch_add(5, std::memory_order_release); // Atomic, performs 'release' operation

Now, your example:

a = a + 12;

will not evaluate to a single atomic op: it will result in a.load() (which is atomic itself), then addition between this value and 12 and a.store() (also atomic) of final result. As I noted earlier, std::memory_order_seq_cst will be used here.

However, if you write a += 12, it will be an atomic operation (as I noted before) and is roughly equivalent to a.fetch_add(12, std::memory_order_seq_cst).

As for your comment:

A regular int has atomic loads and stores. Whats the point of wrapping it with atomic<>?

Your statement is only true for architectures that provide such guarantee of atomicity for stores and/or loads. There are architectures that do not do this. Also, it is usually required that operations must be performed on word-/dword-aligned address to be atomic std::atomic<> is something that is guaranteed to be atomic on every platform, without additional requirements. Moreover, it allows you to write code like this:

void* sharedData = nullptr;
std::atomic<int> ready_flag = 0;

// Thread 1
void produce()
{
    sharedData = generateData();
    ready_flag.store(1, std::memory_order_release);
}

// Thread 2
void consume()
{
    while (ready_flag.load(std::memory_order_acquire) == 0)
    {
        std::this_thread::yield();
    }

    assert(sharedData != nullptr); // will never trigger
    processData(sharedData);
}

Note that assertion condition will always be true (and thus, will never trigger), so you can always be sure that data is ready after while loop exits. That is because:

  • store() to the flag is performed after sharedData is set (we assume that generateData() always returns something useful, in particular, never returns NULL) and uses std::memory_order_release order:

memory_order_release

A store operation with this memory order performs the release operation: no reads or writes in the current thread can be reordered after this store. All writes in the current thread are visible in other threads that acquire the same atomic variable

  • sharedData is used after while loop exits, and thus after load() from flag will return a non-zero value. load() uses std::memory_order_acquire order:

std::memory_order_acquire

A load operation with this memory order performs the acquire operation on the affected memory location: no reads or writes in the current thread can be reordered before this load. All writes in other threads that release the same atomic variable are visible in the current thread.

This gives you precise control over the synchronization and allows you to explicitly specify how your code may/may not/will/will not behave. This would not be possible if only guarantee was the atomicity itself. Especially when it comes to very interesting sync models like the release-consume ordering.


I understand that std::atomic<> makes an object atomic.

That's a matter of perspective... you can't apply it to arbitrary objects and have their operations become atomic, but the provided specialisations for (most) integral types and pointers can be used.

a = a + 12;

std::atomic<> does not (use template expressions to) simplify this to a single atomic operation, instead the operator T() const volatile noexcept member does an atomic load() of a, then twelve is added, and operator=(T t) noexcept does a store(t).


std::atomic exists because many ISAs have direct hardware support for it

What the C++ standard says about std::atomic has been analyzed in other answers.

So now let's see what std::atomic compiles to to get a different kind of insight.

The main takeaway from this experiment is that modern CPUs have direct support for atomic integer operations, for example the LOCK prefix in x86, and std::atomic basically exists as a portable interface to those intructions: What does the "lock" instruction mean in x86 assembly? In aarch64, LDADD would be used.

This support allows for faster alternatives to more general methods such as std::mutex, which can make more complex multi-instruction sections atomic, at the cost of being slower than std::atomic because std::mutex it makes futex system calls in Linux, which is way slower than the userland instructions emitted by std::atomic, see also: Does std::mutex create a fence?

Let's consider the following multi-threaded program which increments a global variable across multiple threads, with different synchronization mechanisms depending on which preprocessor define is used.

main.cpp

#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

size_t niters;

#if STD_ATOMIC
std::atomic_ulong global(0);
#else
uint64_t global = 0;
#endif

void threadMain() {
    for (size_t i = 0; i < niters; ++i) {
#if LOCK
        __asm__ __volatile__ (
            "lock incq %0;"
            : "+m" (global),
              "+g" (i) // to prevent loop unrolling
            :
            :
        );
#else
        __asm__ __volatile__ (
            ""
            : "+g" (i) // to prevent he loop from being optimized to a single add
            : "g" (global)
            :
        );
        global++;
#endif
    }
}

int main(int argc, char **argv) {
    size_t nthreads;
    if (argc > 1) {
        nthreads = std::stoull(argv[1], NULL, 0);
    } else {
        nthreads = 2;
    }
    if (argc > 2) {
        niters = std::stoull(argv[2], NULL, 0);
    } else {
        niters = 10;
    }
    std::vector<std::thread> threads(nthreads);
    for (size_t i = 0; i < nthreads; ++i)
        threads[i] = std::thread(threadMain);
    for (size_t i = 0; i < nthreads; ++i)
        threads[i].join();
    uint64_t expect = nthreads * niters;
    std::cout << "expect " << expect << std::endl;
    std::cout << "global " << global << std::endl;
}

GitHub upstream.

Compile, run and disassemble:

comon="-ggdb3 -O3 -std=c++11 -Wall -Wextra -pedantic main.cpp -pthread"
g++ -o main_fail.out                    $common
g++ -o main_std_atomic.out -DSTD_ATOMIC $common
g++ -o main_lock.out       -DLOCK       $common

./main_fail.out       4 100000
./main_std_atomic.out 4 100000
./main_lock.out       4 100000

gdb -batch -ex "disassemble threadMain" main_fail.out
gdb -batch -ex "disassemble threadMain" main_std_atomic.out
gdb -batch -ex "disassemble threadMain" main_lock.out

Extremely likely "wrong" race condition output for main_fail.out:

expect 400000
global 100000

and deterministic "right" output of the others:

expect 400000
global 400000

Disassembly of main_fail.out:

   0x0000000000002780 <+0>:     endbr64 
   0x0000000000002784 <+4>:     mov    0x29b5(%rip),%rcx        # 0x5140 <niters>
   0x000000000000278b <+11>:    test   %rcx,%rcx
   0x000000000000278e <+14>:    je     0x27b4 <threadMain()+52>
   0x0000000000002790 <+16>:    mov    0x29a1(%rip),%rdx        # 0x5138 <global>
   0x0000000000002797 <+23>:    xor    %eax,%eax
   0x0000000000002799 <+25>:    nopl   0x0(%rax)
   0x00000000000027a0 <+32>:    add    $0x1,%rax
   0x00000000000027a4 <+36>:    add    $0x1,%rdx
   0x00000000000027a8 <+40>:    cmp    %rcx,%rax
   0x00000000000027ab <+43>:    jb     0x27a0 <threadMain()+32>
   0x00000000000027ad <+45>:    mov    %rdx,0x2984(%rip)        # 0x5138 <global>
   0x00000000000027b4 <+52>:    retq

Disassembly of main_std_atomic.out:

   0x0000000000002780 <+0>:     endbr64 
   0x0000000000002784 <+4>:     cmpq   $0x0,0x29b4(%rip)        # 0x5140 <niters>
   0x000000000000278c <+12>:    je     0x27a6 <threadMain()+38>
   0x000000000000278e <+14>:    xor    %eax,%eax
   0x0000000000002790 <+16>:    lock addq $0x1,0x299f(%rip)        # 0x5138 <global>
   0x0000000000002799 <+25>:    add    $0x1,%rax
   0x000000000000279d <+29>:    cmp    %rax,0x299c(%rip)        # 0x5140 <niters>
   0x00000000000027a4 <+36>:    ja     0x2790 <threadMain()+16>
   0x00000000000027a6 <+38>:    retq   

Disassembly of main_lock.out:

Dump of assembler code for function threadMain():
   0x0000000000002780 <+0>:     endbr64 
   0x0000000000002784 <+4>:     cmpq   $0x0,0x29b4(%rip)        # 0x5140 <niters>
   0x000000000000278c <+12>:    je     0x27a5 <threadMain()+37>
   0x000000000000278e <+14>:    xor    %eax,%eax
   0x0000000000002790 <+16>:    lock incq 0x29a0(%rip)        # 0x5138 <global>
   0x0000000000002798 <+24>:    add    $0x1,%rax
   0x000000000000279c <+28>:    cmp    %rax,0x299d(%rip)        # 0x5140 <niters>
   0x00000000000027a3 <+35>:    ja     0x2790 <threadMain()+16>
   0x00000000000027a5 <+37>:    retq

Conclusions:

  • the non-atomic version saves the global to a register, and increments the register.

    Therefore, at the end, very likely four writes happen back to global with the same "wrong" value of 100000.

  • std::atomic compiles to lock addq. The LOCK prefix makes the following inc fetch, modify and update memory atomically.

  • our explicit inline assembly LOCK prefix compiles to almost the same thing as std::atomic, except that our inc is used instead of add. Not sure why GCC chose add, considering that our INC generated a decoding 1 byte smaller.

ARMv8 could use either LDAXR + STLXR or LDADD in newer CPUs: How do I start threads in plain C?

Tested in Ubuntu 19.10 AMD64, GCC 9.2.1, Lenovo ThinkPad P51.