Count leading zeros in __m256i word
If your input values are uniformly distributed, almost all of the time the highest set bit will be in the top 64 bits of the vector (only about 1 time in 2^64 will it not be). A branch on this condition will predict very well. @Nejc's answer is good for that case.
But many problems where lzcnt is part of the solution have a uniformly distributed output (or similar), so a branchless version has an advantage. Not strictly uniform, but anything where it's common for the highest set bit to be somewhere other than the highest 64 bits.
Wim's idea of lzcnt on a compare bitmap to find the right element is a very good approach.
However, runtime-variable indexing of the vector with a store/reload is probably better than a shuffle. Store-forwarding latency is low (maybe 5 to 7 cycles on Skylake), and that latency is in parallel with the index generation (compare / movemask / lzcnt). The movd/vpermd/movd lane-crossing shuffle strategy takes 5 cycles after the index is known, to get the right element into an integer register. (See http://agner.org/optimize/)
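For reference, here is a minimal sketch of that movd/vpermd/movd strategy (my illustration, not code from either answer; the helper name is made up):

/* Extract the 32-bit element at runtime index idx (0..7) via the shuffle
   strategy: broadcast the index (vmovd + vpbroadcastd), vpermd, then vmovd
   the low element back out. */
static inline uint32_t extract_epi32_var(__m256i v, unsigned idx)
{
    __m256i vidx = _mm256_set1_epi32((int)idx);
    __m256i low  = _mm256_permutevar8x32_epi32(v, vidx);   /* vpermd */
    return (uint32_t)_mm256_cvtsi256_si32(low);            /* vmovd  */
}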
I think this version should be better latency on Haswell/Skylake (and Ryzen), and also better throughput. (vpermd is quite slow on Ryzen, so the load/store version should be very good there.) The address calculation for the load should have similar latency to the store-forwarding, so it's a toss-up which one is actually the critical path.
Aligning the stack by 32 to avoid cache-line splits on a 32-byte store takes extra instructions, so this is best if it can inline into a function that uses it multiple times, or already needs that much alignment for some other __m256i.
#include <stdint.h>
#include <immintrin.h>
#ifndef _MSC_VER
#include <stdalign.h> //MSVC is missing this?
#else
#include <intrin.h>
#pragma intrinsic(_BitScanReverse) // https://msdn.microsoft.com/en-us/library/fbxyd7zd.aspx suggests this
#endif
// undefined result for mask=0, like BSR
uint32_t bsr_nonzero(uint32_t mask)
{
// on Intel, bsr has a minor advantage for the first step
// for AMD, BSR is slow so you should use 31-LZCNT.
//return 31 - _lzcnt_u32(mask);
// Intel's docs say there should be a _bit_scan_reverse(x), maybe try that with ICC
#ifdef _MSC_VER
unsigned long tmp;
_BitScanReverse(&tmp, mask);
return tmp;
#else
return 31 - __builtin_clz(mask);
#endif
}
And the interesting part:
int mm256_lzcnt_si256(__m256i vec)
{
__m256i nonzero_elem = _mm256_cmpeq_epi8(vec, _mm256_setzero_si256()); // 0xFF in each all-zero byte
unsigned mask = ~_mm256_movemask_epi8(nonzero_elem); // a set bit for every non-zero byte
if (mask == 0)
return 256; // if this is rare, branching is probably good.
alignas(32) // gcc chooses to align elems anyway, with its clunky code
uint8_t elems[32];
_mm256_storeu_si256((__m256i*)elems, vec);
// unsigned lz_msk = _lzcnt_u32(mask);
// unsigned idx = 31 - lz_msk; // can use bsr to get the 31-x, because mask is known to be non-zero.
// This takes the 31-x latency off the critical path, in parallel with final lzcnt
unsigned idx = bsr_nonzero(mask);
unsigned lz_msk = 31 - idx;
unsigned highest_nonzero_byte = elems[idx];
return lz_msk * 8 + _lzcnt_u32(highest_nonzero_byte) - 24;
// lzcnt(byte)-24, because we don't want to count the leading 24 bits of padding.
}
On Godbolt with gcc7.3 -O3 -march=haswell, we get asm like this to count ymm1 into esi.
vpxor xmm0, xmm0, xmm0
mov esi, 256
vpcmpeqd ymm0, ymm1, ymm0
vpmovmskb eax, ymm0
xor eax, -1 # ~mask and set flags, unlike NOT
je .L35
bsr eax, eax
vmovdqa YMMWORD PTR [rbp-48], ymm1 # note no dependency on anything earlier; OoO exec can run it early
mov ecx, 31
mov edx, eax # this is redundant, gcc should just use rax later. But it's zero-latency on HSW/SKL and Ryzen.
sub ecx, eax
movzx edx, BYTE PTR [rbp-48+rdx] # has to wait for the index in edx
lzcnt edx, edx
lea esi, [rdx-24+rcx*8] # lzcnt(byte) + lzcnt(vectormask) * 8
.L35:
For finding the highest non-zero element (the 31 - lzcnt(~movemask)), we use bsr to directly get the bit (and thus byte) index, and take a subtract off the critical path. This is safe as long as we branch on the mask being zero. (A branchless version would need to initialize the register to avoid an out-of-bounds index.)
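As an illustration, one way to write a branchless C version (my sketch, not from the answer) is to OR a low bit into the mask so the BSR input is never zero. For an all-zero vector this selects byte 0, which is itself zero, and 31*8 + lzcnt(0) - 24 still comes out to 256, so no special case is needed. Whether the compiler actually keeps this branchless is up to the compiler.

/* Branchless sketch, reusing bsr_nonzero() from above. */
int mm256_lzcnt_si256_branchless(__m256i vec)
{
    __m256i  zero_bytes = _mm256_cmpeq_epi8(vec, _mm256_setzero_si256());
    unsigned mask = ~_mm256_movemask_epi8(zero_bytes);

    alignas(32) uint8_t elems[32];
    _mm256_storeu_si256((__m256i*)elems, vec);

    unsigned idx    = bsr_nonzero(mask | 1);   /* index stays in-bounds even when mask == 0 */
    unsigned lz_msk = 31 - idx;
    return lz_msk * 8 + _lzcnt_u32(elems[idx]) - 24;
}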
On AMD CPUs, bsr is significantly slower than lzcnt. On Intel CPUs, they're the same performance, except for minor variations in output-dependency details.
bsr with an input of zero leaves the destination register unmodified, but GCC doesn't provide a way to take advantage of that. (Intel only documents the output as undefined, but AMD documents the actual behaviour of Intel / AMD CPUs as producing the old value in the destination register.)

bsr sets ZF if the input was zero, rather than based on the output like most instructions. (This and the output dependency may be why it's slow on AMD.) Branching on the BSR flags is not particularly better than branching on ZF as set by xor eax,-1 to invert the mask, which is what gcc does. Anyway, Intel does document a _BitScanReverse(&idx, mask) intrinsic that returns a bool, but gcc doesn't support it (not even with x86intrin.h). The GNU C builtin doesn't return a boolean that would let you use the flag result, but gcc might make smart asm using the flag output of bsr if you check the input C variable for being non-zero.
Using a dword (uint32_t) array and vmovmskps would let the 2nd lzcnt use a memory source operand instead of needing a movzx to zero-extend a single byte. But lzcnt has a false dependency on Intel CPUs before Skylake, so compilers might tend to load separately and use lzcnt same,same as a workaround anyway. (I didn't check.)
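A sketch of that dword variant (my code, not from the answer; I haven't checked what compilers actually emit for it), reusing bsr_nonzero() from above:

/* Dword-granularity version: 8-bit mask from vmovmskps, bsr for the element
   index, lzcnt directly on a 32-bit element (no -24 adjustment needed). */
int mm256_lzcnt_si256_dword(__m256i vec)
{
    __m256i  zero_dwords = _mm256_cmpeq_epi32(vec, _mm256_setzero_si256());
    unsigned mask = 0xFFu & ~(unsigned)_mm256_movemask_ps(_mm256_castsi256_ps(zero_dwords));
    if (mask == 0)
        return 256;

    alignas(32) uint32_t elems[8];
    _mm256_store_si256((__m256i*)elems, vec);

    unsigned idx = bsr_nonzero(mask);   /* index of the highest non-zero dword */
    return (7 - idx) * 32 + _lzcnt_u32(elems[idx]);
}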
Wim's version needs lz_msk-24 because the high 24 bits are always zero with an 8-bit mask, but a 32-bit mask fills a 32-bit register. This version, with 8-bit elements and a 32-bit mask, is the reverse: we need to lzcnt the selected byte, not including the 24 leading zero bits in the register. So our -24 moves to a different spot, not part of the critical path for indexing the array.
gcc chooses to do it as part of a single 3-component LEA (reg + reg*scale - const), which is great for throughput, but puts it on the critical path after the final lzcnt. (It's not free because a 3-component LEA has extra latency vs. reg + reg*scale on Intel CPUs. See Agner Fog's instruction tables.) A multiply by 8 can be done as part of an lea, but a multiply by 32 would need a shift (or be folded into two separate LEAs).
Intel's optimization manual says (Table 2-24) even Sandybridge can forward from a 256-bit store to single-byte loads without a problem, so I think it's fine on AVX2 CPUs, the same as forwarding to 32-bit loads of 4-byte-aligned chunks of the store.
(Update: new answer, added 2019-01-31)
Three alternatives are:
1. Peter Cordes' excellent answer. Fast. This solution is not branchless, which should not be a problem unless the input is frequently zero with an irregular pattern of occurrences.
2. My previous answer, which is now in the edit history of this answer. Less efficient than Peter Cordes' answer, but branchless.
3. This answer. Very fast if the data from the 2 tiny lookup tables is in L1 cache. The L1 cache footprint is 128 bytes. Branchless. It may suffer from cache misses when it isn't called often enough to keep the tables hot.
In this answer, the input epi64 vector is compared with zero, which produces a mask. This mask is converted to a 4-bit index i_mask (by _mm256_movemask_pd). With index i_mask, two values are read from the two lookup tables: 1. the index of the first (from the left) nonzero 64-bit element, and 2. the number of leading zeros contributed by the preceding (from left to right) zero elements. Finally, the _lzcnt_u64 of the first nonzero 64-bit element is computed and added to the lookup table value. Function mm256_lzcnt_si256 implements this method:
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>
#include <stdalign.h>
/* gcc -Wall -m64 -O3 -march=haswell clz_avx256_upd.c */
int mm256_lzcnt_si256(__m256i input)
{
/* Version with lookup tables and scratch array included in the function */
/* Two tiny lookup tables (64 bytes each, less space is possible with uint8_t or uint16_t arrays instead of uint32_t): */
/* i_mask (input==0) 0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111 */
/* ~i_mask (input!=0) 1111 1110 1101 1100 1011 1010 1001 1000 0111 0110 0101 0100 0011 0010 0001 0000 */
static const uint32_t indx[16] = { 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 1, 1, 0, 0};
static const uint32_t lz_msk[16] = { 0, 0, 0, 0, 0, 0, 0, 0, 64, 64, 64, 64, 128, 128, 192, 192};
alignas(32) uint64_t tmp[4] = { 0, 0, 0, 0}; /* tmp is a scratch array of 32 bytes, preferably 32 byte aligned */
_mm256_storeu_si256((__m256i*)&tmp[0], input); /* Store input in the scratch array */
__m256i mask = _mm256_cmpeq_epi64(input, _mm256_setzero_si256()); /* Check which 64-bit elements are zero */
uint32_t i_mask = _mm256_movemask_pd(_mm256_castsi256_pd(mask)); /* Move vector mask to integer mask */
uint64_t input_i = tmp[indx[i_mask]]; /* Load the first (from the left) non-zero 64 bit element input_i */
int32_t lz_input_i = _lzcnt_u64(input_i); /* Count the number of leading zeros in input_i */
int32_t lz = lz_msk[i_mask] + lz_input_i; /* Add the number of leading zeros of the preceding 64 bit elements */
return lz;
}
int mm256_lzcnt_si256_v2(__m256i input, uint64_t* restrict tmp, const uint32_t* indx, const uint32_t* lz_msk)
{
/* Version that compiles to nice assembly, although, after inlining there won't be any difference between the different versions. */
_mm256_storeu_si256((__m256i*)&tmp[0], input); /* Store input in the scratch array */
__m256i mask = _mm256_cmpeq_epi64(input, _mm256_setzero_si256()); /* Check which 64-bit elements are zero */
uint32_t i_mask = _mm256_movemask_pd(_mm256_castsi256_pd(mask)); /* Move vector mask to integer mask */
uint64_t input_i = tmp[indx[i_mask]]; /* Load the first (from the left) non-zero 64 bit element input_i */
int32_t lz_input_i = _lzcnt_u64(input_i); /* Count the number of leading zeros in input_i */
int32_t lz = lz_msk[i_mask] + lz_input_i; /* Add the number of leading zeros of the preceding 64 bit elements */
return lz;
}
__m256i bit_mask_avx2_lsb(unsigned int n)
{
/* Build a mask with the n least significant bits of the 256-bit word set (n = 0..256). */
__m256i ones = _mm256_set1_epi32(-1);
__m256i cnst32_256 = _mm256_set_epi32(256,224,192,160, 128,96,64,32);
__m256i shift = _mm256_set1_epi32(n);
shift = _mm256_subs_epu16(cnst32_256,shift); /* Saturating 16-bit subtract is fine here: all values fit in 16 bits */
return _mm256_srlv_epi32(ones,shift); /* Shift counts >= 32 zero out the element */
}
int print_avx2_hex(__m256i ymm)
{
long unsigned int x[4];
_mm256_storeu_si256((__m256i*)x,ymm);
printf("%016lX %016lX %016lX %016lX ", x[3],x[2],x[1],x[0]);
return 0;
}
int main()
{
unsigned int i;
__m256i x;
printf("mm256_lzcnt_si256\n");
for (i = 0; i < 257; i++){
printf("x=");
x = bit_mask_avx2_lsb(i);
print_avx2_hex(x);
printf("lzcnt(x)=%i \n", mm256_lzcnt_si256(x));
}
printf("\n");
x = _mm256_set_epi32(0,0,0,0, 0,15,1,0);
printf("x=");print_avx2_hex(x);printf("lzcnt(x)=%i \n", mm256_lzcnt_si256(x));
x = _mm256_set_epi32(0,0,0,8, 0,0,0,256);
printf("x=");print_avx2_hex(x);printf("lzcnt(x)=%i \n", mm256_lzcnt_si256(x));
x = _mm256_set_epi32(0,0x100,0,8, 0,192,0,0);
printf("x=");print_avx2_hex(x);printf("lzcnt(x)=%i \n", mm256_lzcnt_si256(x));
x = _mm256_set_epi32(-1,0x100,0,8, 0,0,32,0);
printf("x=");print_avx2_hex(x);printf("lzcnt(x)=%i \n", mm256_lzcnt_si256(x));
/* Set arrays for mm256_lzcnt_si256_v2: */
alignas(32) static const uint32_t indx[16] = { 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 1, 1, 0, 0};
alignas(32) static const uint32_t lz_msk[16] = { 0, 0, 0, 0, 0, 0, 0, 0, 64, 64, 64, 64, 128, 128, 192, 192};
alignas(32) uint64_t tmp[4] = { 0, 0, 0, 0};
printf("\nmm256_lzcnt_si256_v2\n");
for (i = 0; i < 257; i++){
printf("x=");
x = bit_mask_avx2_lsb(i);
print_avx2_hex(x);
printf("lzcnt(x)=%i \n", mm256_lzcnt_si256_v2(x, tmp, indx, lz_msk));
}
printf("\n");
x = _mm256_set_epi32(0,0,0,0, 0,15,1,0);
printf("x=");print_avx2_hex(x);printf("lzcnt(x)=%i \n", mm256_lzcnt_si256_v2(x, tmp, indx, lz_msk));
x = _mm256_set_epi32(0,0,0,8, 0,0,0,256);
printf("x=");print_avx2_hex(x);printf("lzcnt(x)=%i \n", mm256_lzcnt_si256_v2(x, tmp, indx, lz_msk));
x = _mm256_set_epi32(0,0x100,0,8, 0,192,0,0);
printf("x=");print_avx2_hex(x);printf("lzcnt(x)=%i \n", mm256_lzcnt_si256_v2(x, tmp, indx, lz_msk));
x = _mm256_set_epi32(-1,0x100,0,8, 0,0,32,0);
printf("x=");print_avx2_hex(x);printf("lzcnt(x)=%i \n", mm256_lzcnt_si256_v2(x, tmp, indx, lz_msk));
return 0;
}
The output suggests that the code is correct:
$ ./a.out
mm256_lzcnt_si256
x=0000000000000000 0000000000000000 0000000000000000 0000000000000000 lzcnt(x)=256
x=0000000000000000 0000000000000000 0000000000000000 0000000000000001 lzcnt(x)=255
...
x=0000000000000000 0000000000000000 7FFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF lzcnt(x)=129
x=0000000000000000 0000000000000000 FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF lzcnt(x)=128
x=0000000000000000 0000000000000001 FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF lzcnt(x)=127
...
x=7FFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF lzcnt(x)=1
x=FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF lzcnt(x)=0
x=0000000000000000 0000000000000000 000000000000000F 0000000100000000 lzcnt(x)=188
x=0000000000000000 0000000000000008 0000000000000000 0000000000000100 lzcnt(x)=124
x=0000000000000100 0000000000000008 00000000000000C0 0000000000000000 lzcnt(x)=55
x=FFFFFFFF00000100 0000000000000008 0000000000000000 0000002000000000 lzcnt(x)=0
Function mm256_lzcnt_si256_v2 is an alternative version of the same function, but now the pointers to the lookup tables and the scratch array are passed with the function call. This leads to clean assembly code (no stack operations), and gives an impression of which instructions are needed after inlining mm256_lzcnt_si256 in a loop. With gcc 8.2 and options -m64 -O3 -march=skylake:
mm256_lzcnt_si256_v2:
vpxor xmm1, xmm1, xmm1
vmovdqu YMMWORD PTR [rdi], ymm0
vpcmpeqq ymm0, ymm0, ymm1
vmovmskpd ecx, ymm0
mov eax, DWORD PTR [rsi+rcx*4]
lzcnt rax, QWORD PTR [rdi+rax*8]
add eax, DWORD PTR [rdx+rcx*4]
vzeroupper
ret
In a loop context, and with inlining, vpxor is likely hoisted outside the loop.
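For example, a hypothetical caller like the one below (my sketch, not from the answer; it assumes 32-byte-aligned data) lets the compiler inline the function, so the vpxor zeroing and the table base addresses can stay loop-invariant:

/* Sum the lzcnt of every vector in an array, reusing the tables and scratch
   array defined above. Each call overwrites tmp, which is fine here. */
static int64_t sum_lzcnt(const __m256i *data, size_t n, uint64_t *restrict tmp,
                         const uint32_t *indx, const uint32_t *lz_msk)
{
    int64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += mm256_lzcnt_si256_v2(data[i], tmp, indx, lz_msk);
    return total;
}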
Since you are also asking for a more elegant (i.e. simpler) way to do this: on my computer, your code runs as fast as the one below. In both cases it took 45 milliseconds to compute the result for 10 million 256-bit words.
Since I was filling AVX registers with (four) randomly generated, uniformly distributed 64-bit integers (and not uniformly distributed 256-bit integers), the order of iteration through the array had no impact on the results of my benchmark test. Also, even though this is almost needless to say, the compiler was smart enough to unroll the loop.
uint32_t countLeadZeros(__m256i const& reg)
{
alignas(32) uint64_t v[4];
_mm256_store_si256((__m256i*)&v[0], reg);
for (int i = 3; i >= 0; --i)
if (v[i]) return _lzcnt_u64(v[i]) + (3 - i)*64;
return 256;
}
EDIT: as can be seen in the discussion below my answer and in my edit history, I initially took an approach similar to the one of @PeterCordes (but he provided a better optimized solution). I changed my approach once I started doing benchmarks, because I completely overlooked the fact that practically all of my inputs had the most significant bit located within the top 64 bits of the AVX word.
After I realized the mistake I had made, I decided to do the benchmarks more properly. I will present two results below. I searched through the edit history of my post and from there I copy-pasted the function I submitted (but later edited out) before I changed my approach and went for the branched version. That function is presented below. I compared the performance of my "branched" function, my "branchless" function and the branchless function that was independently developed by @PeterCordes. His version is superior to mine in terms of performance - see his excellently written post, which contains lots of useful details.
int countLeadZeros(__m256i const& reg){
__m256i zero = _mm256_setzero_si256();
__m256i cmp = _mm256_cmpeq_epi64(reg, zero);
int mask = _mm256_movemask_epi8(cmp);
if (mask == 0xffffffff) return 256;
int first_nonzero_idx = 3 - (_lzcnt_u32(~mask) >> 3);
alignas(32) uint64_t stored[4]; // edit: added alignas(32)
_mm256_store_si256((__m256i*)stored, reg);
int lead_zero_count = _lzcnt_u64(stored[first_nonzero_idx]);
return (3 - first_nonzero_idx) * 64 + lead_zero_count;
}
Benchmark number 1
I will present the test code in pseudocode to make this short. I actually used an AVX implementation of a random number generator that generates random numbers blazingly fast. First, let's do the test on inputs that make branch prediction really hard:
tick()
for(int i = 0; i < N; ++i)
{
// "xoroshiro128+"-based random generator was actually used
__m256i in = _mm256_set_epi64x(rand()%2, rand()%2, rand()%2, rand()%2);
res = countLeadZeros(in);
}
tock();
For 10 million repetitions, the function from the top of my post takes 200ms. The implementation that I initially developed requires only 65ms to do the same job. But the function provided by @PeterCordes takes the cake by consuming only 60ms.
Benchmark number 2
Now let's turn to the test that I originally used. Again, pseudocode:
tick()
for(int i = 0; i < N; ++i)
{
// "rand()" represents random 64-bit int; xoroshiro128+ waw actually used here
__m256i in = _mm256_set_epi64x(rand(), rand(), rand(), rand());
res = countLeadZeros(in);
}
tock();
In this case, the version with branches is faster; 45ms are required to compute 10 million results. The function by @PeterCordes takes 50ms to complete and my "branchless" implementation requires 55ms to do the same job.
I don't think I dare to draw any general conclusions from this. It seems to me that the branchless approach is better as it offers a more stable computation time, but whether you need that stability or not probably depends on the use case.
EDIT: the random generator.
This is an extended reply to a comment by @PeterCordes. As I stated above, the benchmark test code is just pseudocode. If anyone is interested in how I actually generated the numbers, here is a quick description.
I used the xoroshiro128+ algorithm, which was released into the public domain and which is available at this website. It is quite simple to rewrite the algorithm with AVX instructions so that four numbers are generated in parallel. I wrote a class that accepts the so-called initial seed (128 bits) as a parameter. I obtain the seeds (states) for each of the four parallel generators by first copying the initial seed four times; after that I use the jump function on the i-th parallel generator i times; i = {0, 1, 2, 3}. Each jump advances the internal state J = 2^64 steps forward. This means I can generate 4*J numbers (more than enough for all everyday purposes), four at a time, before any parallel generator starts to repeat a sequence of numbers already produced by any other generator in the current session. I control the range of the produced numbers with the _mm256_srli_epi64 instruction; I use a shift of 63 for the first test and no shift for the second one.
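For readers who want to see roughly what that looks like in code, here is a minimal sketch of the vectorized generator (my reconstruction, not the author's class). It uses the original 2016 xoroshiro128+ rotation/shift constants (55, 14, 36); the author may have used a different variant, and the scalar jump()-based seeding of the four lanes is omitted.

#include <stdint.h>
#include <immintrin.h>

/* Four independent xoroshiro128+ streams, one per 64-bit lane. The two state
   vectors hold s[0] and s[1] of each stream; seed them by copying a 128-bit
   seed and applying the scalar jump() i times for lane i (not shown). */
typedef struct { __m256i s0, s1; } xoroshiro128p_x4;

static inline __m256i rotl64x4(__m256i x, int k)
{
    return _mm256_or_si256(_mm256_slli_epi64(x, k), _mm256_srli_epi64(x, 64 - k));
}

/* Generate four 64-bit random numbers, one per lane. */
static inline __m256i xoroshiro128p_x4_next(xoroshiro128p_x4 *g)
{
    __m256i a = g->s0, b = g->s1;
    __m256i result = _mm256_add_epi64(a, b);
    b = _mm256_xor_si256(b, a);
    g->s0 = _mm256_xor_si256(_mm256_xor_si256(rotl64x4(a, 55), b),
                             _mm256_slli_epi64(b, 14));
    g->s1 = rotl64x4(b, 36);
    return result;
}

/* Range control as described above: shifting right by 63 leaves 0 or 1 per
   lane (benchmark 1); no shift gives full 64-bit values (benchmark 2). */
static inline __m256i xoroshiro128p_x4_next01(xoroshiro128p_x4 *g)
{
    return _mm256_srli_epi64(xoroshiro128p_x4_next(g), 63);
}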