Relative performance of swap vs compare-and-swap locks on x86

I found this Intel document, stating that there is no difference in practice:

http://software.intel.com/en-us/articles/implementing-scalable-atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures/

One common myth is that the lock utilizing a cmpxchg instruction is cheaper than a lock utilizing an xchg instruction. This is used because cmpxchg will not attempt to get the lock in exclusive mode since the cmp will go through first. Figure 9 shows that the cmpxchg is just as expensive as the xchg instruction.


I assume atomic_swap(lockaddr, 1) gets translated to a xchg reg,mem instruction and atomic_compare_and_swap(lockaddr, 0, val) gets translated to a cmpxchg[8b|16b].

Some linux kernel developers think cmpxchg ist faster, because the lock prefix isn't implied as with xchg. So if you are on a uniprocessor, multithread or can otherwise make sure the lock isn't needed, you are probably better of with cmpxchg.

But chances are your compiler will translate it to a "lock cmpxchg" and in that case it doesn't really matter. Also note that while latencies for this instructions are low (1 cycle without lock and about 20 with lock), if you happen to use are common sync variable between two threads, which is quite usual, some additional bus cycles will be enforced, which last forever compared to the instruction latencies. These will most likely completly be hidden by a 200 or 500 cpu cycles long cache snoop/sync/mem access/bus lock/whatever.


On x86, any instruction with a LOCK prefix does all memory operations as read-modify-write cycles. This means that XCHG (with its implicit LOCK) and LOCK CMPXCHG (in all cases, even if the comparison fails) always get an exclusive lock on the cache line. The result is that there is basically no difference in performance.

Note that many CPUs all spinning on the same lock can cause a lot of bus overhead in this model. This is one reason that spin-lock loops should contain PAUSE instructions. Some other architectures have better operations for this.


Are you sure you didn't mean

 if (!atomic_load(lockaddr)) {
       if (!atomic_swap(lockaddr, val)) /* got the lock */

for the second one?

Test and test and set locks (see Wikipedia https://en.wikipedia.org/wiki/Test_and_test-and-set ) are a quite common optimization for many platforms.

Depending on how compare and exchange is implemented it could be faster or slower than a test and test and set.

As x86 is a relatively stronger ordered platform HW optimizations that may make test and test and set locks faster may be less possible.

Figure 8 from the document that Bo Persson found http://software.intel.com/en-us/articles/implementing-scalable-atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures/ shows that Test and Test and Set locks are superior in performance.