How does x86 pause instruction work in spinlock *and* can it be used in other scenarios?
PAUSE notifies the CPU that this is a spin-wait loop, so memory and cache accesses may be optimized. See also the question "pause instruction in x86" for more details about avoiding memory-order mis-speculation when leaving the spin loop.
PAUSE may actually stall the CPU for some time to save power. Older CPUs decode it as REP NOP, so you don't have to check whether it is supported: they simply do nothing (a NOP) as fast as possible.
See also https://software.intel.com/en-us/articles/benefitting-power-and-performance-sleep-loops
Update: I don't think it's a good idea to use PAUSE for queue checking unless you are going to make your queue spinlock-like (and there is no obvious way to do that).
Spinning for a very long time is still very bad, even with PAUSE.
A processor suffers a severe performance penalty when exiting the loop because it detects a possible memory order violation. The PAUSE instruction provides a hint to the processor that the code sequence is a spin-wait loop. The processor uses this hint to avoid the memory order violation in most situations, which greatly improves processor performance. For this reason, it is recommended that a PAUSE instruction be placed in all spin-wait loops. An additional function of the PAUSE instruction is to reduce the power consumed by Intel processors.
[source: Intel manual]
Pause-based spin-wait loops
As I understood from your questions, the waits in your case are known in advance to be very long. In that case, spin-wait loops are not recommended at all. But if you do use a spin loop that keeps checking a value from memory (e.g. a byte-sized synchronization variable), use PAUSE. See Section 11.4.2 "Synchronization for Short Periods" of the Intel 64 and IA-32 Architectures Optimization Reference Manual.
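As a rough illustration of such a short spin-wait loop (this sketch is not from the original answer; the variable name and the use of the _mm_pause intrinsic are assumptions), a C++ version could look like this:

```cpp
#include <atomic>
#include <cstdint>
#include <immintrin.h>   // _mm_pause

// Hypothetical byte-sized synchronization variable; 0 = not ready, 1 = ready.
std::atomic<uint8_t> sync_var{0};

// Spin until the variable becomes non-zero, issuing PAUSE on every iteration
// to hint to the CPU that this is a spin-wait loop.
void spin_wait_short()
{
    while (sync_var.load(std::memory_order_acquire) == 0)
        _mm_pause();     // compiles to the PAUSE instruction
}
```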
You wrote that you have a "thread which keeps scanning some places (e.g. a queue) to retrieve new nodes".
In such a case (i.e. a long wait), Intel recommends using the synchronization API functions of your operating system. For example, you can create an event object and wait for it with WaitForSingleObject(Handle, INFINITE); the queue will signal this event whenever a new node appears.
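A minimal sketch of this event-based approach on Windows (not from the original answer; the handle name and the producer/consumer split are illustrative):

```cpp
#include <windows.h>

// Auto-reset event signaled by the producer whenever a new node is queued.
HANDLE g_newNodeEvent = CreateEventW(nullptr, FALSE, FALSE, nullptr);

// Producer side: enqueue a node, then wake the waiting thread.
void producer_push(/* Node* node */)
{
    // ... push the node onto the queue ...
    SetEvent(g_newNodeEvent);
}

// Consumer side: block in the kernel instead of spinning on the queue.
void consumer_loop()
{
    for (;;)
    {
        WaitForSingleObject(g_newNodeEvent, INFINITE);  // sleeps until signaled
        // ... drain the queue ...
    }
}
```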
According to the Intel Optimization Reference Manual, Section 2.3.4 "Pause Latency in Skylake Client Microarchitecture":
The PAUSE instruction is typically used with software threads executing on two logical processors located in the same processor core, waiting for a lock to be released. Such short wait loops tend to last between tens and a few hundreds of cycles, so performance-wise it is better to wait while occupying the CPU than yielding to the OS.
By "tens and a few hundreds of cycles" of the above quote I understand from 20 to 500 CPU cycles.
500 CPU cycles on a 4500 MHz Intel Core i7-7700K processor (released in January 2017, based on the Kaby Lake-S microarchitecture) take about 0.11 microseconds (500 / 4.5e9 s): the CPU can run roughly 9 million such 500-cycle loops per second.
This 500-cycle limit recommended by Intel is theoretical; everything depends on the particular use case, i.e. on the logic of the code that needs to be synchronized by spin-wait loops. Some scenarios, like the FastMM4-AVX memory manager for Delphi, work better with a value of 5000, according to its benchmarks. However, such benchmarks do not always reflect real-world scenarios, and real application use cases should be measured.
As you can see, the PAUSE-based spin-wait loop is meant for really short periods of time.
On the other hand, each call to an API function like Sleep() incurs the expensive cost of a context switch, which can be 10000+ cycles; it also suffers the cost of a ring 3 to ring 0 transition, which can be 1000+ cycles.
If there are more threads than available processor cores (multiplied by the hyper-threading factor, if present), and a thread gets switched out in the middle of a critical section, another thread waiting for that critical section may really have to wait for a long time, at least 10000+ cycles, so a PAUSE-based spin-wait loop will be futile.
In addition to the relevant chapters of the Intel Optimization Reference Manual, please see the following articles for more information:
- https://software.intel.com/en-us/articles/long-duration-spin-wait-loops-on-hyper-threading-technology-enabled-intel-processors
- https://software.intel.com/en-us/articles/benefitting-power-and-performance-sleep-loops
When the wait loop is expected to last for thousands of cycles or more, it is preferable to yield to the operating system by calling one of the OS synchronization API functions, such as WaitForSingleObject or SwitchToThread on Windows.
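To illustrate that recommendation (this sketch is not part of the original answer; the flag name and the spin count of 500 are assumptions), a wait could spin briefly with PAUSE and then yield:

```cpp
#include <atomic>
#include <immintrin.h>   // _mm_pause
#include <windows.h>     // SwitchToThread

std::atomic<bool> g_ready{false};   // illustrative flag set by another thread

// Spin briefly with PAUSE, then hand the CPU to another thread if the wait drags on.
void wait_for_flag()
{
    for (;;)
    {
        // Short spin phase: covers the "tens to a few hundreds of cycles" case.
        for (int i = 0; i < 500; ++i)
        {
            if (g_ready.load(std::memory_order_acquire))
                return;
            _mm_pause();
        }
        // The wait is turning out to be long: yield to the OS scheduler.
        SwitchToThread();
    }
}
```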
In conclusion: in your scenario, a PAUSE-based spin-wait loop won't be the best choice, since your waiting times are long while the spin-wait loop is intended for very short waits.
The PAUSE instruction takes about 140 CPU cycles on processors based on the Skylake microarchitecture or later. For example, that is 35.10 ns on an Intel Core i7-6700K CPU (4 GHz) released in August 2015, or 49.47 ns on an Intel Core i7-1165G7 CPU for mobile devices released in September 2020. On earlier processors (prior to Skylake), such as those based on the Haswell microarchitecture, it takes about 9 cycles, which is 2.81 ns on an Intel Core i5-4430 (3 GHz) released in June 2013. So, for long waits, it's better to relinquish control to other threads using the OS synchronization API functions than to occupy the CPU with a PAUSE loop, regardless of the microarchitecture.
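One way to get a rough idea of the PAUSE latency on your own CPU (this measurement sketch is not from the original answer) is to time a long run of back-to-back PAUSE instructions with the time-stamp counter; RDTSC counts reference cycles rather than core cycles, so treat the result as a ballpark figure only:

```cpp
#include <cstdint>
#include <cstdio>
#include <x86intrin.h>   // __rdtsc, _mm_pause (with MSVC, include <intrin.h>)

int main()
{
    const int kIters = 1000000;

    uint64_t start = __rdtsc();
    for (int i = 0; i < kIters; ++i)
        _mm_pause();
    uint64_t end = __rdtsc();

    // Loop overhead and the TSC-to-core-clock ratio are ignored here,
    // so this only gives an order-of-magnitude estimate.
    printf("~%.1f reference cycles per PAUSE\n",
           double(end - start) / kIters);
    return 0;
}
```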
Test, Test-and-Set
Please note that spin-wait loops also have to be implemented properly. Intel recommends the so-called "test, test-and-set" technique (see Section 11.4.3 "Optimization with Spin-Locks" of the Intel 64 and IA-32 Architectures Optimization Reference Manual) to determine the availability of the synchronization variable. With this technique, the first "test" is done via a normal (non-locking) memory load to prevent excessive bus locking during the spin-wait loop; if the variable appears available on that non-locking load, you proceed to the second step ("test-and-set"), which is done via the bus-locking atomic xchg instruction.
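A minimal "test, test-and-set" spinlock sketch in C++ (not from the original answer; the variable and function names are illustrative, and std::atomic::exchange typically compiles to the locked xchg mentioned above):

```cpp
#include <atomic>
#include <immintrin.h>   // _mm_pause

std::atomic<int> lock_word{0};   // 0 = free, 1 = taken

void ttas_lock()
{
    for (;;)
    {
        // Step 1, "test": plain (non-locking) load; spins in the local cache
        // without generating bus-locked transactions.
        while (lock_word.load(std::memory_order_relaxed) != 0)
            _mm_pause();

        // Step 2, "test-and-set": bus-locking atomic exchange.
        if (lock_word.exchange(1, std::memory_order_acquire) == 0)
            return;   // lock acquired
    }
}

void ttas_unlock()
{
    lock_word.store(0, std::memory_order_release);
}
```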
But be aware that this two-step approach of using "test" before "test-and-set" can increase the cost for the uncontended case compared to a single-step "test-and-set". The initial read-only access may only get the cache line into the Shared state, so an atomic operation like test-and-set (xchg) or compare-and-swap (cmpxchg) still needs a "Read For Ownership" (RFO) operation to get exclusive ownership of the cache line. An RFO is issued by a processor trying to write into a cache line that is in the Shared state.
For more details, please also see these related questions:
- Does cmpxchg write destination cache line on failure? If not, is it better than xchg for spinlock?
- atomic operation cost