Using a variable to index a simd vector with _mm256_extract_epi32() intrinsic
Apparently GCC and Clang made different choices here.
IMHO GCC made the right choice by not implementing this for variable indices. The intrinsic _mm256_extract_epi32
doesn't translate to a single instruction, and with a variable index it can lead to inefficient code
if it is used in a performance-critical loop.
For example, Clang 3.8 needs 4 instructions to implement _mm256_extract_epi32
with a variable index.
GCC forces the programmer to think about more efficient code that avoids _mm256_extract_epi32
with variable indices.
Nevertheless, sometimes it is useful to have a portable (gcc, clang, icc) function which emulates _mm256_extract_epi32
with a variable index:
uint32_t mm256_extract_epi32_var_indx(const __m256i vec, const unsigned int i) {
    __m128i indx = _mm_cvtsi32_si128(i);
    __m256i val  = _mm256_permutevar8x32_epi32(vec, _mm256_castsi128_si256(indx));
    return _mm_cvtsi128_si32(_mm256_castsi256_si128(val));
}
This should compile to three instructions after inlining: two vmovds and a vpermd (gcc 8.2 with -m64 -march=skylake -O3):
mm256_extract_epi32_var_indx:
vmovd xmm1, edi
vpermd ymm0, ymm1, ymm0
vmovd eax, xmm0
vzeroupper
ret
Note that the intrinsics guide specifies that the result is 0 for indices >= 8 (an unusual case anyway). With Clang 3.8, mm256_extract_epi32_var_indx instead reduces the index modulo 8; in other words, only the 3 least significant bits of the index are used.
Note that Clang 5.0's round trip to memory isn't very efficient either; see this Godbolt link. Clang 7.0 fails to compile _mm256_extract_epi32 with variable indices.
As @Peter Cordes commented: with a fixed index 0, 1, 2, or 3, only a single pextrd instruction is needed to extract the integer from the xmm register. With a fixed index 4, 5, 6, or 7, two instructions are required. Unfortunately, a vpextrd instruction working on 256-bit ymm registers doesn't exist.
The next example illustrates my answer: a naive programmer starting with SIMD intrinsics might write the following code to sum the elements 0, 1, ..., j-1, with j < 8, from vec.
#include <stdlib.h>
#include <immintrin.h>
#include <stdint.h>

uint32_t foo(__m256i vec, int j)
{
    uint32_t sum = 0;
    for (int i = 0; i < j; i++){
        sum = sum + (uint32_t)_mm256_extract_epi32(vec, i);
    }
    return sum;
}
With Clang 3.8 this compiles to about 50 instructions, with branches and loops. GCC fails to compile this code, since it only accepts compile-time constant indices. Obviously, efficient code to sum these elements is likely based on:
- masking out the elements j, j+1, ..., 7, and
- computing the horizontal sum.
The __N it says must be a 1-bit immediate is not the 2nd arg to _mm256_extract_epi32; it's some function of that, used as an arg to __builtin_ia32_vextractf128_si256 (presumably the 3rd bit). Then later it wants an integer constant in the 0..3 range for vpextrd, giving you a total of 3 bits of index.
_mm256_extract_epi32 is a composite intrinsic, not directly defined in terms of a single builtin function.
vpextrd r32, ymm, imm8 doesn't exist; only the xmm version exists, so _mm256_extract_epi32 is a wrapper around vextracti/f128 / vpextrd. GCC chooses to only make it work for compile-time constants, so it always compiles to at most 2 instructions.
If you want runtime-variable vector indexing, you need to use different syntax; e.g. store to an array and load a scalar, and hope gcc optimizes that into a shuffle / extract. Or define a GNU C native vector type with the right element width, and use foo[i] to index it like an array.
typedef int v8si __attribute__ ((vector_size (32)));
v8si tmp = foo; // may need a cast to convert from __m256i
int element_i = tmp[i];
__m256i in gcc/clang is defined as a vector of long long elements, so if you index it directly with [], you'll get qword elements. (And your code won't compile with MSVC, which doesn't define __m256i that way at all.)
I haven't checked the asm for any of these recently: if you care about efficiency, you might want to manually design a shuffle using your runtime-variable index, like @Wim's answer suggests that clang does.