Replacing a 32-bit loop counter with 64-bit introduces crazy performance deviations with _mm_popcnt_u64 on Intel CPUs
Culprit: False Data Dependency (and the compiler isn't even aware of it)
On Sandy/Ivy Bridge and Haswell processors, the instruction:
popcnt src, dest
appears to have a false dependency on the destination register dest. Even though the instruction only writes to it, the instruction will wait until dest is ready before executing. This false dependency is (now) documented by Intel as erratum HSD146 (Haswell) and SKL029 (Skylake).
- Skylake fixed this for lzcnt and tzcnt.
- Cannon Lake (and Ice Lake) fixed this for popcnt.
- bsf/bsr have a true output dependency: output unmodified for input=0. (But no way to take advantage of that with intrinsics - only AMD documents it and compilers don't expose it.)
(Yes, these instructions all run on the same execution unit).
This dependency doesn't just hold up the 4 popcnts from a single loop iteration. It can carry across loop iterations, making it impossible for the processor to parallelize different loop iterations.
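For reference, the kind of loop being discussed looks roughly like the sketch below. It is not the asker's exact code: count_bits_unrolled and popcount64 are names I made up, and __builtin_popcountll (GCC/Clang) stands in for _mm_popcnt_u64 so that it builds without the x86 headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Portable stand-in for _mm_popcnt_u64. With -mpopcnt or -march=native on
// x86-64, GCC and Clang normally lower this builtin to a popcnt instruction.
static inline std::uint64_t popcount64(std::uint64_t x) {
    return static_cast<std::uint64_t>(__builtin_popcountll(x));
}

// Counts the set bits in words[0..n). n must be a multiple of 4.
std::uint64_t count_bits_unrolled(const std::uint64_t* words, std::size_t n) {
    assert(n % 4 == 0);
    std::uint64_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;
    for (std::size_t i = 0; i < n; i += 4) {
        // At the source level these four counts are independent. Whether the
        // CPU can overlap them depends on which destination registers the
        // compiler assigns to the popcnt results.
        c0 += popcount64(words[i + 0]);
        c1 += popcount64(words[i + 1]);
        c2 += popcount64(words[i + 2]);
        c3 += popcount64(words[i + 3]);
    }
    return c0 + c1 + c2 + c3;
}
```

Nothing in this source code expresses the false dependency. It only shows up once the compiler has picked registers, which is why seemingly unrelated tweaks change the speed.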
The unsigned vs. uint64_t and other tweaks don't directly affect the problem. But they influence the register allocator, which assigns the registers to the variables.
In your case, the speeds are a direct result of what is stuck to the (false) dependency chain depending on what the register allocator decided to do.
- 13 GB/s has a chain: popcnt-add-popcnt-popcnt → next iteration
- 15 GB/s has a chain: popcnt-add-popcnt-add → next iteration
- 20 GB/s has a chain: popcnt-popcnt → next iteration
- 26 GB/s has a chain: popcnt-popcnt → next iteration
The difference between 20 GB/s and 26 GB/s seems to be a minor artifact of the indirect addressing. Either way, the processor starts to hit other bottlenecks once you reach this speed.
To test this, I used inline assembly to bypass the compiler and get exactly the assembly I want. I also split up the count variable to break all other dependencies that might mess with the benchmarks.
Here are the results:
Sandy Bridge Xeon @ 3.5 GHz: (full test code can be found at the bottom)
- GCC 4.6.3: g++ popcnt.cpp -std=c++0x -O3 -save-temps -march=native
- Ubuntu 12
Different Registers: 18.6195 GB/s
.L4:
movq (%rbx,%rax,8), %r8
movq 8(%rbx,%rax,8), %r9
movq 16(%rbx,%rax,8), %r10
movq 24(%rbx,%rax,8), %r11
addq $4, %rax
popcnt %r8, %r8
add %r8, %rdx
popcnt %r9, %r9
add %r9, %rcx
popcnt %r10, %r10
add %r10, %rdi
popcnt %r11, %r11
add %r11, %rsi
cmpq $131072, %rax
jne .L4
Same Register: 8.49272 GB/s
.L9:
movq (%rbx,%rdx,8), %r9
movq 8(%rbx,%rdx,8), %r10
movq 16(%rbx,%rdx,8), %r11
movq 24(%rbx,%rdx,8), %rbp
addq $4, %rdx
# This time reuse "rax" for all the popcnts.
popcnt %r9, %rax
add %rax, %rcx
popcnt %r10, %rax
add %rax, %rsi
popcnt %r11, %rax
add %rax, %r8
popcnt %rbp, %rax
add %rax, %rdi
cmpq $131072, %rdx
jne .L9
Same Register with broken chain: 17.8869 GB/s
.L14:
movq (%rbx,%rdx,8), %r9
movq 8(%rbx,%rdx,8), %r10
movq 16(%rbx,%rdx,8), %r11
movq 24(%rbx,%rdx,8), %rbp
addq $4, %rdx
# Reuse "rax" for all the popcnts.
xor %rax, %rax # Break the cross-iteration dependency by zeroing "rax".
popcnt %r9, %rax
add %rax, %rcx
popcnt %r10, %rax
add %rax, %rsi
popcnt %r11, %rax
add %rax, %r8
popcnt %rbp, %rax
add %rax, %rdi
cmpq $131072, %rdx
jne .L14
So what went wrong with the compiler?
It seems that neither GCC nor Visual Studio is aware that popcnt has such a false dependency. Nevertheless, these false dependencies aren't uncommon. It's just a matter of whether the compiler is aware of it.
popcnt isn't exactly the most used instruction. So it's not really a surprise that a major compiler could miss something like this. There also appears to be no documentation anywhere that mentions this problem. If Intel doesn't disclose it, then nobody outside will know until someone runs into it by chance.
(Update: As of version 4.9.2, GCC is aware of this false dependency and generates code to compensate for it when optimizations are enabled. Major compilers from other vendors, including Clang, MSVC, and even Intel's own ICC, are not yet aware of this microarchitectural erratum and will not emit code that compensates for it.)
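If your compiler doesn't compensate, the xor trick from the "broken chain" benchmark can be wrapped in a small helper. This is only a sketch, assuming GCC-style inline assembly on x86-64 (GCC or Clang). popcnt_nodep is a hypothetical name, not something from a library, and I haven't benchmarked this exact helper.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: popcount that zeroes its destination register first,
// so the popcnt does not wait on whatever last wrote that register.
static inline std::uint64_t popcnt_nodep(std::uint64_t x) {
#if defined(__x86_64__) && defined(__GNUC__)
    std::uint64_t result;
    __asm__("xor %0, %0\n\t"   // zeroing idiom: breaks the dependency on %0
            "popcnt %1, %0"    // AT&T operand order: source, destination
            : "=&r"(result)    // early-clobber: written before x is read
            : "r"(x)
            : "cc");           // both instructions modify the flags
    assert(result <= 64);      // sanity check, compiled out under NDEBUG
    return result;
#else
    // Fallback for other targets: let the compiler pick the instruction.
    return static_cast<std::uint64_t>(__builtin_popcountll(x));
#endif
}
```

The early-clobber constraint matters: without it the compiler could assign the input and the output to the same register, and the xor would wipe the input before popcnt reads it. The cost is one extra (cheap) instruction per popcnt, so this only makes sense on the affected microarchitectures.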
Why does the CPU have such a false dependency?
We can speculate: it runs on the same execution unit as bsf / bsr, which do have an output dependency. (How is POPCNT implemented in hardware?). For those instructions, Intel documents the integer result for input=0 as "undefined" (with ZF=1), but Intel hardware actually gives a stronger guarantee to avoid breaking old software: output unmodified. AMD documents this behaviour.
Presumably it was somehow inconvenient to make some uops for this execution unit dependent on the output but others not.
AMD processors do not appear to have this false dependency.
The full test code is below for reference:
#include <iostream>
#include <chrono>
#include <cstdint>   // uint64_t
#include <cstdlib>   // rand
#include <x86intrin.h>
int main(int argc, char* argv[]) {
using namespace std;
uint64_t size=1<<20;
uint64_t* buffer = new uint64_t[size/8];
char* charbuffer=reinterpret_cast<char*>(buffer);
for (unsigned i=0;i<size;++i) charbuffer[i]=rand()%256;
uint64_t count,duration;
chrono::time_point<chrono::system_clock> startP,endP;
{
uint64_t c0 = 0;
uint64_t c1 = 0;
uint64_t c2 = 0;
uint64_t c3 = 0;
startP = chrono::system_clock::now();
for( unsigned k = 0; k < 10000; k++){
for (uint64_t i=0;i<size/8;i+=4) {
uint64_t r0 = buffer[i + 0];
uint64_t r1 = buffer[i + 1];
uint64_t r2 = buffer[i + 2];
uint64_t r3 = buffer[i + 3];
__asm__(
"popcnt %4, %4 \n\t"
"add %4, %0 \n\t"
"popcnt %5, %5 \n\t"
"add %5, %1 \n\t"
"popcnt %6, %6 \n\t"
"add %6, %2 \n\t"
"popcnt %7, %7 \n\t"
"add %7, %3 \n\t"
: "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
: "r" (r0), "r" (r1), "r" (r2), "r" (r3)
);
}
}
count = c0 + c1 + c2 + c3;
endP = chrono::system_clock::now();
duration=chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
cout << "No Chain\t" << count << '\t' << (duration/1.0E9) << " sec \t"
<< (10000.0*size)/(duration) << " GB/s" << endl;
}
{
uint64_t c0 = 0;
uint64_t c1 = 0;
uint64_t c2 = 0;
uint64_t c3 = 0;
startP = chrono::system_clock::now();
for( unsigned k = 0; k < 10000; k++){
for (uint64_t i=0;i<size/8;i+=4) {
uint64_t r0 = buffer[i + 0];
uint64_t r1 = buffer[i + 1];
uint64_t r2 = buffer[i + 2];
uint64_t r3 = buffer[i + 3];
__asm__(
"popcnt %4, %%rax \n\t"
"add %%rax, %0 \n\t"
"popcnt %5, %%rax \n\t"
"add %%rax, %1 \n\t"
"popcnt %6, %%rax \n\t"
"add %%rax, %2 \n\t"
"popcnt %7, %%rax \n\t"
"add %%rax, %3 \n\t"
: "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
: "r" (r0), "r" (r1), "r" (r2), "r" (r3)
: "rax"
);
}
}
count = c0 + c1 + c2 + c3;
endP = chrono::system_clock::now();
duration=chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
cout << "Chain 4 \t" << count << '\t' << (duration/1.0E9) << " sec \t"
<< (10000.0*size)/(duration) << " GB/s" << endl;
}
{
uint64_t c0 = 0;
uint64_t c1 = 0;
uint64_t c2 = 0;
uint64_t c3 = 0;
startP = chrono::system_clock::now();
for( unsigned k = 0; k < 10000; k++){
for (uint64_t i=0;i<size/8;i+=4) {
uint64_t r0 = buffer[i + 0];
uint64_t r1 = buffer[i + 1];
uint64_t r2 = buffer[i + 2];
uint64_t r3 = buffer[i + 3];
__asm__(
"xor %%rax, %%rax \n\t" // <--- Break the chain.
"popcnt %4, %%rax \n\t"
"add %%rax, %0 \n\t"
"popcnt %5, %%rax \n\t"
"add %%rax, %1 \n\t"
"popcnt %6, %%rax \n\t"
"add %%rax, %2 \n\t"
"popcnt %7, %%rax \n\t"
"add %%rax, %3 \n\t"
: "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
: "r" (r0), "r" (r1), "r" (r2), "r" (r3)
: "rax"
);
}
}
count = c0 + c1 + c2 + c3;
endP = chrono::system_clock::now();
duration=chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
cout << "Broken Chain\t" << count << '\t' << (duration/1.0E9) << " sec \t"
<< (10000.0*size)/(duration) << " GB/s" << endl;
}
delete[] buffer; // allocated with new[], so release it with delete[] rather than free()
}
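A note on the GB/s expression in the test code: it divides a byte count by a nanosecond count, and bytes per nanosecond is numerically the same as GB/s (with 1 GB = 10^9 bytes). A minimal sketch of that arithmetic follows; gb_per_sec is a name I picked for illustration and is not part of the test code.

```cpp
#include <cassert>
#include <cstdint>

// bytes / ns == (bytes / 1e9) / (ns / 1e9) == GB / s, so no extra scaling
// factor is needed when the duration is already in nanoseconds.
double gb_per_sec(std::uint64_t bytes, std::uint64_t nanoseconds) {
    assert(nanoseconds > 0);
    return static_cast<double>(bytes) / static_cast<double>(nanoseconds);
}
```

For example, 10000 passes over the 1 MiB buffer is 10485760000 bytes. A run that takes 0.57748 s (577480000 ns) works out to roughly 18.16 GB/s.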
An equally interesting benchmark can be found here: http://pastebin.com/kbzgL8si
This benchmark varies the number of popcnts that are in the (false) dependency chain.
False Chain 0: 41959360000 0.57748 sec 18.1578 GB/s
False Chain 1: 41959360000 0.585398 sec 17.9122 GB/s
False Chain 2: 41959360000 0.645483 sec 16.2448 GB/s
False Chain 3: 41959360000 0.929718 sec 11.2784 GB/s
False Chain 4: 41959360000 1.23572 sec 8.48557 GB/s
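These numbers can be sanity-checked with a back-of-the-envelope latency bound. The sketch below rests on assumptions that are mine: a popcnt latency of 3 cycles (the figure commonly listed in instruction tables for these cores), the 3.5 GHz clock of the Sandy Bridge Xeon above with no turbo, and that this benchmark ran on that same machine. The 32 bytes per iteration comes from the loop in the test code (4 loads of 8 bytes). chain_bound_gbps is a made-up name.

```cpp
#include <cassert>

// Upper bound on throughput when chain_len popcnts per iteration are
// serialized by the false dependency and each one takes
// popcnt_latency_cycles to produce its result.
double chain_bound_gbps(int chain_len, double clock_ghz,
                        int popcnt_latency_cycles, int bytes_per_iter) {
    assert(chain_len > 0 && popcnt_latency_cycles > 0);
    double cycles_per_iter =
        static_cast<double>(chain_len) * popcnt_latency_cycles;
    // (bytes / cycle) * (cycles / ns) = bytes / ns = GB/s
    return bytes_per_iter / cycles_per_iter * clock_ghz;
}
```

Under those assumptions a chain of 4 is bounded at about 9.3 GB/s and a chain of 3 at about 12.4 GB/s, which sits just above the measured 8.49 and 11.28 GB/s. For shorter chains the bound rises above the roughly 18 GB/s plateau, so some other bottleneck takes over there.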
I coded up an equivalent C program to experiment, and I can confirm this strange behaviour. What's more, gcc believes the 64-bit integer (which should probably be a size_t anyway...) to be better, as using uint_fast32_t causes gcc to use a 64-bit uint.
I did a bit of mucking around with the assembly:
Simply take the 32-bit version, replace all 32-bit instructions/registers with the 64-bit version in the inner popcount-loop of the program. Observation: the code is just as fast as the 32-bit version!
This is obviously a hack, as the size of the variable isn't really 64 bit, as other parts of the program still use the 32-bit version, but as long as the inner popcount-loop dominates performance, this is a good start.
I then copied the inner loop code from the 32-bit version of the program, hacked it up to be 64 bit, fiddled with the registers to make it a replacement for the inner loop of the 64-bit version. This code also runs as fast as the 32-bit version.
My conclusion is that this is bad instruction scheduling by the compiler, not actual speed/latency advantage of 32-bit instructions.
(Caveat: I hacked up assembly, could have broken something without noticing. I don't think so.)
This is not an answer, but it's hard to read if I put the results in a comment.
I get these results with a Mac Pro (Westmere 6-core Xeon 3.33 GHz). I compiled it with clang -O3 -msse4 -lstdc++ a.cpp -o a (-O2 gets the same result).
clang with uint64_t size=atol(argv[1])<<20;
unsigned 41950110000 0.811198 sec 12.9263 GB/s
uint64_t 41950110000 0.622884 sec 16.8342 GB/s
clang with uint64_t size=1<<20;
unsigned 41950110000 0.623406 sec 16.8201 GB/s
uint64_t 41950110000 0.623685 sec 16.8126 GB/s
I also tried to:
- Reverse the test order; the result is the same, so it rules out the cache factor.
- Have the for statement in reverse: for (uint64_t i=size/8;i>0;i-=4). This gives the same result and proves the compiler is smart enough to not divide size by 8 every iteration (as expected).
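To make the reversed variant concrete, here is a small sketch of a forward and a reversed traversal that visit the same elements. The function names are made up, and the buf[i - 4 + j] indexing in the reversed loop is my assumption, since only the loop header is quoted above. __builtin_popcountll (GCC/Clang) is used in place of _mm_popcnt_u64 to keep it portable.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Forward traversal, four words per step. n must be a multiple of 4.
std::uint64_t count_forward(const std::uint64_t* buf, std::size_t n) {
    assert(n % 4 == 0);
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < n; i += 4)
        for (std::size_t j = 0; j < 4; ++j)
            total += static_cast<std::uint64_t>(__builtin_popcountll(buf[i + j]));
    return total;
}

// Reversed traversal with the same kind of loop header as the reversed test.
// i runs n, n-4, ..., 4, so the body reads buf[i-4] .. buf[i-1] to stay
// inside the array (assumed; the original body is not shown).
std::uint64_t count_reverse(const std::uint64_t* buf, std::size_t n) {
    assert(n % 4 == 0);
    std::uint64_t total = 0;
    for (std::size_t i = n; i > 0; i -= 4)
        for (std::size_t j = 0; j < 4; ++j)
            total += static_cast<std::uint64_t>(__builtin_popcountll(buf[i - 4 + j]));
    return total;
}
```

Both functions return the same total for any buffer whose length is a multiple of 4, so any timing difference between them would come from code generation or memory access order, not from the amount of work.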
Here is my wild guess:
The speed factor comes in three parts:
- Code cache: the uint64_t version has larger code size, but this does not have an effect on my Xeon CPU. This makes the 64-bit version slower.
- Instructions used. Note not only the loop count, but the buffer is accessed with a 32-bit and 64-bit index on the two versions. Accessing a pointer with a 64-bit offset requests a dedicated 64-bit register and addressing, while you can use immediate for a 32-bit offset. This may make the 32-bit version faster.
- Instructions are only emitted on the 64-bit compile (that is, prefetch). This makes 64-bit faster.
The three factors together match with the observed seemingly conflicting results.