Computing 8 horizontal sums of eight AVX single-precision floating-point vectors
OK, I think I have found faster algorithm based on (usually slow) HADDs:
__m256 HorizontalSums(__m256 v0, __m256 v1, __m256 v2, __m256 v3, __m256 v4, __m256 v5, __m256 v6, __m256 v7)
{
const __m256 s01 = _mm256_hadd_ps(v0, v1);
const __m256 s23 = _mm256_hadd_ps(v2, v3);
const __m256 s45 = _mm256_hadd_ps(v4, v5);
const __m256 s67 = _mm256_hadd_ps(v6, v7);
const __m256 s0123 = _mm256_hadd_ps(s01, s23);
const __m256 s4556 = _mm256_hadd_ps(s45, s67);
// inter-lane shuffle
v0 = _mm256_blend_ps(s0123, s4556, 0xF0);
v1 = _mm256_permute2f128_ps(s0123, s4556, 0x21);
return _mm256_add_ps(v0, v1);
}
According to IACA, it's ~8 cycles faster on Haswell.
Witek902's solution should work well, but it may
suffer from high port 5 pressure, if HorizontalSums
is called very often by the surrounding code.
On Intel Haswell, or newer, the vhaddps
instruction decodes to 3 micro-ops: 2 port 5 (p5) micro-ops and
one micro-op for p1 or p01 (see Agner Fog's instruction tables).
Function sort_of_alternative_hadd_ps
also decodes to 3 micro-ops, but only one of them (the shuffle) executes necessarily on p5:
inline __m256 sort_of_alternative_hadd_ps(__m256 x, __m256 y)
{
__m256 y_hi_x_lo = _mm256_blend_ps(x, y, 0b11001100); /* y7 y6 x5 x4 y3 y2 x1 x0 */
__m256 y_lo_x_hi = _mm256_shuffle_ps(x, y, 0b01001110); /* y5 y4 x7 x6 y1 y0 x3 x2 */
return _mm256_add_ps(y_hi_x_lo, y_lo_x_hi);
}
It is possible to replace the first 4 _mm256_hadd_ps()
intrinsics in Witek902's
answer by the sort_of_alternative_hadd_ps
function. Altogether
8 extra instructions are needed to compute the horizontal sum:
__m256 HorizontalSums_less_p5_pressure(__m256 v0, __m256 v1, __m256 v2, __m256 v3, __m256 v4, __m256 v5, __m256 v6, __m256 v7)
{
__m256 s01 = sort_of_alternative_hadd_ps(v0, v1);
__m256 s23 = sort_of_alternative_hadd_ps(v2, v3);
__m256 s45 = sort_of_alternative_hadd_ps(v4, v5);
__m256 s67 = sort_of_alternative_hadd_ps(v6, v7);
__m256 s0123 = _mm256_hadd_ps(s01, s23);
__m256 s4556 = _mm256_hadd_ps(s45, s67);
v0 = _mm256_blend_ps(s0123, s4556, 0xF0);
v1 = _mm256_permute2f128_ps(s0123, s4556, 0x21);
return _mm256_add_ps(v0, v1);
}
This compiles to:
HorizontalSums_less_p5_pressure:
vblendps ymm8, ymm0, ymm1, 204
vblendps ymm10, ymm2, ymm3, 204
vshufps ymm0, ymm0, ymm1, 78
vblendps ymm9, ymm4, ymm5, 204
vblendps ymm1, ymm6, ymm7, 204
vshufps ymm2, ymm2, ymm3, 78
vshufps ymm4, ymm4, ymm5, 78
vshufps ymm6, ymm6, ymm7, 78
vaddps ymm0, ymm8, ymm0
vaddps ymm6, ymm6, ymm1
vaddps ymm2, ymm10, ymm2
vaddps ymm4, ymm9, ymm4
vhaddps ymm0, ymm0, ymm2
vhaddps ymm4, ymm4, ymm6
vblendps ymm1, ymm0, ymm4, 240
vperm2f128 ymm0, ymm0, ymm4, 33
vaddps ymm0, ymm1, ymm0
ret
Eventually both Witek902's HorizontalSums
and
HorizontalSums_less_p5_pressure
are decoded by the CPU into 21 micro-ops,
with respectively 13 p5 micro-ops and 9 p5 micro-ops.
Depending on the surrouding code and the actual microarchitecture, this reduced port 5 pressure may improve the performance.