Extract the low bit of each bool byte in a __m128i? bool array to packed bitmap
So your source data is contiguous? You should use _mm_load_si128
instead of messing around with scalar components of vector types.
Your real problem is packing an array of bool
(1 byte per element in the ABI used by g++ on x86) into a bitmap. You should do this with SIMD, not with scalar code to set 1 bit or byte at a time.
pmovmskb
(_mm_movemask_epi8
) is fantastic for extracting one bit per byte of input. You just need to arrange to get the bit you want into the high bit.
The obvious choice would be a shift, but vector shift instructions compete for the same execution port as pmovmskb
on Haswell (port 0). (http://agner.org/optimize/). Instead, adding 0x7F
will produce 0x80
(high bit set) for an input of 1
, but 0x7F
(high bit clear) for an input of 0
. (And a bool
in the x86-64 System V ABI must be stored in memory as an integer 0 or 1, not simply 0 vs. any non-zero value).
Why not pcmpeqb
against _mm_set1_epi8(1)
? Skylake runs pcmpeqb
on ports 0/1, but paddb
on all 3 vector ALU ports (0/1/5). It's very common to use pmovmskb
on the result of pcmpeqb/w/d/q
, though.
#include <immintrin.h>
#include <stdint.h>
// n is the number of uint16_t dst elements
// We access n*16 bool elements from src.
void pack_bools(uint16_t *dst, const bool *src, size_t n)
{
// you can later access dst with __m128i loads/stores
__m128i carry_to_highbit = _mm_set1_epi8(0x7F);
for (size_t i = 0 ; i < n ; i+=1) {
__m128i boolvec = _mm_loadu_si128( (__m128i*)&src[i*16] );
__m128i highbits = _mm_add_epi8(boolvec, carry_to_highbit);
dst[i] = _mm_movemask_epi8(highbits);
}
}
Because we want to use scalar stores when writing this bitmap, we want dst
to be in uint16_t
for strict-aliasing reasons. With AVX2, you'd want uint32_t
. (Or if you did combine = tmp1 << 16 | tmp
to combine two pmovmskb
results. But probably don't do that.)
This compiles into an asm loop like this (with gcc7.3 -O3, on the Godbolt compiler explorer)
.L3:
movdqu xmm0, XMMWORD PTR [rsi]
add rsi, 16
add rdi, 2
paddb xmm0, xmm1
pmovmskb eax, xmm0
mov WORD PTR [rdi-2], ax
cmp rdx, rsi
jne .L3
So it's not wonderful (7 fuse-domain uops -> front-end bottleneck at 16 bools per ~1.75 clock cycles). Clang unrolls by 2, and should manage 16 bools per 1.5 cycles.
Using a shift (pslld xmm0, 7
) would only run at one iteration per 2 cycles on Haswell, bottlenecked on port 0.