How do you use the pause assembly instruction in 64-bit C++ code?
Wow, this was a very hard problem to track down, but in case anybody else needs the x86-64 pause
instruction:
The YieldProcessor()
macro from windows.h
expands to the undocumented _mm_pause
intrinsic, which ultimately expands to the pause
instruction in 32-bit and 64-bit code.
This is completely undocumented, by the way, with partial (and incorrect for VC++ 2010 documentation) for YieldProcessor() appearing in MSDN.
Here is an example of what a block of YieldProcessor() macros compiles into:
19: ::YieldProcessor();
000000013FDB18A0 F3 90 pause
20: ::YieldProcessor();
000000013FDB18A2 F3 90 pause
21: ::YieldProcessor();
000000013FDB18A4 F3 90 pause
22: ::YieldProcessor();
000000013FDB18A6 F3 90 pause
23: ::YieldProcessor();
000000013FDB18A8 F3 90 pause
By the way, each pause instruction seems to produce about a 9 cycle delay on the Nehalem architecture, on the average (i.e., 3 ns on a 3.3 GHz CPU).
The _mm_pause()
intrinsic is fully documented by Intel and supported by all the major x86 compilers portably across OSes. IDK if MS's docs were lacking in the past, or if you just missed it ~7 years go.
#include <immintrin.h>
and use it. (Or for ancient compilers #include <emmintrin.h>
for SSE2).
#include <immintrin.h>
void test() {
_mm_pause();
_mm_pause();
}
compiles to this asm on all 4 of gcc/clang/ICC/MSVC (on the Godbolt compiler explorer):
test(): # @test()
pause
pause
ret
On CPUs without SSE2, it decodes as rep nop
which is just a nop
. Cross-platform implementation of the x86 pause instruction
Gcc even knows this, and still accepts _mm_pause()
when compiling with -mno-sse
. (Normally gcc and clang reject intriniscs for instructions that aren't enabled, unlike MSVC.) Amusingly, gcc even emits rep nop
in its asm output, while the other three emit pause
. They assemble to same machine code, of course.
Pause idles the front-end of that hyperthread for about 5 cycles on Sandybridge-family until Skylake. On Skylake, Intel increased it to ~100 cycles to save more power in spin-wait loops and increase overall throughput at the possible expense of latency, especially on Hyperthreaded cores.
On all CPUs it also avoids memory-order mis-speculation when leaving a spin-loop. So it does reduce latency right when it finally matters again.
See also What is the purpose of the "PAUSE" instruction in x86?.