How to combine two __m128 values to __m256?
Even this one will work:
__m128 a = _mm_set_ps(1,2,3,4);
__m128 b = _mm_set_ps(5,6,7,8);
__m256 c = _mm256_insertf128_ps(c,a,0);
c = _mm256_insertf128_ps(c,b,1);
You will get a warning because c is used uninitialized, but you can ignore it, and if you're looking for performance this solution costs no more clock cycles than the other one.
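If you want the same approach without the uninitialized-variable warning, a minimal sketch (assuming <immintrin.h> is included and your compiler provides _mm256_undefined_ps(), which current compilers generally do) is:
__m128 a = _mm_set_ps(1,2,3,4);
__m128 b = _mm_set_ps(5,6,7,8);
__m256 c = _mm256_undefined_ps();      // explicitly "don't care" contents, no warning
c = _mm256_insertf128_ps(c, a, 0);     // a into the low 128 bits
c = _mm256_insertf128_ps(c, b, 1);     // b into the high 128 bits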
Intel documents __m256 _mm256_set_m128(__m128 hi, __m128 lo) and _mm256_setr_m128(lo, hi) as intrinsics for the vinsertf128 instruction, which is what you want.¹ (Of course there are also __m256d and __m256i versions, which use the same instruction. The __m256i version may use vinserti128 if AVX2 is available, otherwise it'll use the f128 form as well.)
These days, those intrinsics are supported by current versions of all 4 major x86 compilers (gcc, clang, MSVC, and ICC), but not by older versions; like some other helper intrinsics that Intel documents, widespread implementation has been slow. (Often GCC or clang is the last hold-out to not have something you wish you could use portably.)
Use it if you don't need portability to old GCC versions: it's the most readable way to express what you want, following the well-known _mm_set and _mm_setr patterns.
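For example, these two calls build the same vector (a small sketch; _mm256_set_m128 takes the high half first, _mm256_setr_m128 takes the low half first):
__m128 lo = _mm_setr_ps(0, 1, 2, 3);    // elements 0..3
__m128 hi = _mm_setr_ps(4, 5, 6, 7);    // elements 4..7
__m256 v1 = _mm256_set_m128(hi, lo);    // high half first
__m256 v2 = _mm256_setr_m128(lo, hi);   // low half first; same result as v1
The integer and double variants (_mm256_set_m128i, _mm256_set_m128d, and their setr forms) follow the same argument-order pattern.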
Performance-wise, it's of course just as efficient as manual cast + vinsertf128 intrinsics (@Mysticial's answer), and for gcc at least that's literally how the internal .h actually implements _mm256_set_m128.
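If you do have to support a compiler that predates these intrinsics, a fallback wrapper along these lines (the name my_set_m128 is made up for this sketch) should compile to the same single vinsertf128:
static inline __m256 my_set_m128(__m128 hi, __m128 lo) {
    // The cast is free: lo becomes the low 128 bits (upper half undefined),
    // then the insert fills the upper half with hi.
    return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
}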
Compiler version support for _mm256_set_m128 / _mm256_setr_m128:
- clang: 3.6 and newer. (Mainline, IDK about Apple)
- GCC: 8.x and newer, not present as recently as GCC7!
- ICC: since at least ICC13, the earliest on Godbolt.
- MSVC: since at least 19.14, and 19.10 (WINE), the earliest versions on Godbolt.
https://godbolt.org/z/1na1qr has test cases for all 4 compilers.
__m256 combine_testcase(__m128 hi, __m128 lo) {
return _mm256_set_m128(hi, lo);
}
They all compile this function to one vinsertf128, except MSVC where even the latest version wastes a vmovups xmm2, xmm1 copying a register. (I used -O2 -Gv -arch:AVX to use the vectorcall convention so args would be in registers, making an efficient non-inlined function definition possible for MSVC.) Presumably MSVC would be fine inlining into a larger function if it could write the result to a third register, instead of the calling convention forcing it to read xmm0 and write ymm0.
Footnote 1: vinsertf128 is very efficient on Zen 1, and as efficient as vperm2f128 on other CPUs with 256-bit-wide shuffle units. It can also take the high half from memory, in case the compiler spilled it or is folding a _mm_loadu_ps into it, instead of needing to separately do a 128-bit load into a register; vperm2f128's memory operand would be a 256-bit load, which you don't want.
https://uops.info/ / https://agner.org/optimize/
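To illustrate the memory-operand point, a sketch like the following (the function name and pointer parameter are made up for the example) lets the compiler fold the 128-bit load of the high half straight into the vinsertf128:
__m256 combine_high_from_memory(__m128 lo, const float *hi_ptr) {
    // The unaligned 128-bit load can fold into vinsertf128's memory operand,
    // avoiding a separate vmovups into a register.
    return _mm256_set_m128(_mm_loadu_ps(hi_ptr), lo);
}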
This should do what you want:
__m128 a = _mm_set_ps(1,2,3,4);
__m128 b = _mm_set_ps(5,6,7,8);
__m256 c = _mm256_castps128_ps256(a);
c = _mm256_insertf128_ps(c,b,1);
If the order is reversed from what you want, then just switch a and b.
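That is, a sketch of the reversed version (same intrinsics, operands swapped) would be:
__m256 c = _mm256_castps128_ps256(b);   // b in the low 128 bits
c = _mm256_insertf128_ps(c, a, 1);      // a in the high 128 bits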
The intrinsic of interest is _mm256_insertf128_ps, which will let you insert a 128-bit register into either the lower or upper half of a 256-bit AVX register:
http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/intref_cls/common/intref_avx_insertf128_ps.htm
The complete family of them is here:
_mm256_insertf128_pd()
_mm256_insertf128_ps()
_mm256_insertf128_si256()
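For example, a sketch of the same combine for integer vectors (using the si256 variant plus a cast for the low half; the variable names are just illustrative) might look like:
__m128i a = _mm_set_epi32(1, 2, 3, 4);
__m128i b = _mm_set_epi32(5, 6, 7, 8);
__m256i c = _mm256_castsi128_si256(a);   // a in the low 128 bits
c = _mm256_insertf128_si256(c, b, 1);    // b in the high 128 bits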