GCC memory barrier __sync_synchronize vs asm volatile("": : :"memory")
There's a significant difference. The inline asm option, asm volatile("" ::: "memory"), actually does nothing at runtime: no instruction is emitted, and the CPU never sees it. It acts only at compile time, telling the compiler not to move loads or stores past that point (in either direction) as part of its optimizations. It's called a SW (software) barrier.
The builtin __sync_synchronize(), on the other hand, translates into a HW barrier, probably a fence instruction (mfence on x86) or its equivalent on other architectures. The CPU also performs various optimizations at runtime; the most important one is executing memory operations out of order. This instruction tells it to make sure that loads and stores can't pass this point and must be observed on the correct side of the sync point.
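For example, here's a minimal publish/consume sketch (the names data, ready, publish and consume are illustrative, not from the question) showing where the full barrier matters:

int data;
int ready;

/* Producer core: the HW barrier orders the two stores for other observers. */
void publish(void) {
    data = 42;
    __sync_synchronize();   /* full barrier: mfence on x86 */
    ready = 1;
}

/* Consumer core: the SW barrier forces ready to be reloaded each iteration;
   the HW barrier keeps the read of data from being satisfied before ready
   was seen set. */
int consume(void) {
    while (!ready)
        asm volatile("" ::: "memory");
    __sync_synchronize();
    return data;
}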
Here's another good explanation:
Types of Memory Barriers
As mentioned above, both compilers and processors can optimize the execution of instructions in a way that necessitates the use of a memory barrier. A memory barrier that affects both the compiler and the processor is a hardware memory barrier, and a memory barrier that only affects the compiler is a software memory barrier.
In addition to hardware and software memory barriers, a memory barrier can be restricted to memory reads, memory writes, or both. A memory barrier that affects both reads and writes is a full memory barrier.
There is also a class of memory barrier that is specific to multi-processor environments. The names of these memory barriers are prefixed with "smp". On a multi-processor system, these barriers are hardware memory barriers and on uni-processor systems, they are software memory barriers.
The barrier() macro is the only software memory barrier, and it is a full memory barrier. All other memory barriers in the Linux kernel are hardware barriers. A hardware memory barrier is an implied software barrier.
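For reference, the kernel's barrier() is essentially the same inline asm statement discussed above (as defined for gcc in the kernel headers; the exact spelling varies across kernel versions):

#define barrier() __asm__ __volatile__("" : : : "memory")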
An example of when the SW barrier is useful: consider the following code -
for (i = 0; i < N; ++i) {
a[i]++;
}
This simple loop, compiled with optimizations, would most likely be unrolled and vectorized. Here's the assembly code gcc 4.8.0 generated with -O3; note the packed (vector) operations:
400420: 66 0f 6f 00 movdqa (%rax),%xmm0
400424: 48 83 c0 10 add $0x10,%rax
400428: 66 0f fe c1 paddd %xmm1,%xmm0
40042c: 66 0f 7f 40 f0 movdqa %xmm0,-0x10(%rax)
400431: 48 39 d0 cmp %rdx,%rax
400434: 75 ea jne 400420 <main+0x30>
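For reference, the barrier-per-iteration version that produces the scalar code below looks like this:

for (i = 0; i < N; ++i) {
    a[i]++;
    asm volatile("" ::: "memory");  /* the compiler may not move the a[i] accesses across this */
}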
With the inline assembly on each iteration, gcc is not permitted to move the memory operations past the barrier, so it can't group iterations into vector operations, and the assembly becomes the scalar version of the loop:
400418: 83 00 01 addl $0x1,(%rax)
40041b: 48 83 c0 04 add $0x4,%rax
40041f: 48 39 d0 cmp %rdx,%rax
400422: 75 f4 jne 400418 <main+0x28>
However, when the CPU performs this code, it's permitted to reorder the operations "under the hood" as long as it does not break the memory ordering model. This means it can execute the operations out of order (if the CPU supports that, as most do these days). A HW fence would have prevented that.
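For comparison, a sketch with the HW barrier instead (a HW barrier implies the SW barrier, so this also stays scalar, and on x86 each iteration additionally pays for an mfence):

for (i = 0; i < N; ++i) {
    a[i]++;
    __sync_synchronize();  /* HW fence: the CPU may not reorder memory operations across it */
}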
A comment on the usefulness of SW-only barriers:
On some micro-controllers and other embedded platforms, you may have multitasking but no cache system or cache latency, and hence no HW barrier instructions. So you need to implement things like SW spin-locks. The SW barrier prevents compiler optimizations (read/write combining and reordering) in these algorithms, as in the sketch below.
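A minimal sketch of that, assuming a single core where flag is set by an interrupt handler (the names are illustrative):

int flag = 0;   /* set to 1 by an ISR on the same core */

void wait_for_flag(void) {
    while (!flag)
        asm volatile("" ::: "memory");  /* forces the compiler to reload flag each
                                           iteration instead of hoisting the load */
}

In modern C you'd reach for volatile or C11 atomics here; the point is what the bare SW barrier alone buys you: no fence instruction is emitted, yet the load hoisting and read/write combining are suppressed.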