FMA3 in GCC: how to enable
This answers only a very small part of the question. If you write _mm256_add_ps(_mm256_mul_ps(areg0,breg0), tmp0), gcc-4.9 handles it almost like inline asm and does not optimize it much. If you replace it with areg0*breg0+tmp0, a syntax that is supported by both gcc and clang, then gcc starts optimizing and may use FMA if available. I improved that for gcc-5: _mm256_add_ps, for instance, is now implemented as an inline function that simply uses +, so the code with intrinsics can be optimized as well.
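As a rough illustration of the operator syntax mentioned above, here is a minimal sketch using the GCC/Clang vector extension. The type name v8sf and the helper mul_add8 are hypothetical names chosen for this example. In GCC's headers __m256 is itself declared as a 32-byte float vector of this kind, which is why the same operators work on it. The generic typedef is used here only so that the snippet builds without any -m flags.

```c
#include <string.h>

/* Eight packed floats, declared with the GCC/Clang vector extension.
   (Hypothetical typedef name; same size as __m256.) */
typedef float v8sf __attribute__((vector_size(32)));

/* Hypothetical helper: out[i] = a[i] * b[i] + c[i] for i in 0..7. */
void mul_add8(const float *a, const float *b, const float *c, float *out)
{
    v8sf va, vb, vc;

    /* memcpy avoids any alignment requirement on the input pointers. */
    memcpy(&va, a, sizeof va);
    memcpy(&vb, b, sizeof vb);
    memcpy(&vc, c, sizeof vc);

    /* Plain operators: the compiler sees an ordinary multiply and add,
       so it is free to fuse them into an FMA when the target has one. */
    v8sf r = va * vb + vc;

    memcpy(out, &r, sizeof r);
}
```

Whether an FMA instruction is actually emitted depends on the compiler version and on the target options in effect; inspecting the generated assembly is the only way to be sure.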
The following compiler options are now sufficient to contract _mm256_add_ps(_mm256_mul_ps(a, b), c) to a single FMA instruction (e.g. vfmadd213ps):
GCC 5.3: -O2 -mavx2 -mfma
Clang 3.7: -O1 -mavx2 -mfma -ffp-contract=fast
ICC 13: -O1 -march=core-avx2
I tried /O2 /arch:AVX2 /fp:fast with MSVC, but it still does not contract (surprise, surprise). MSVC will contract scalar operations, though.
GCC has done this since at least GCC 5.1.
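For reference, a minimal sketch of the intrinsics pattern being discussed. The function name mul_add8_intrin is hypothetical, and the target attribute is an assumption of this example: it stands in for -mavx2 -mfma so that the file builds without those flags, and it can be dropped when the flags are given on the command line. The code must only be run on a CPU that supports AVX2 and FMA.

```c
#include <immintrin.h>

/* Hypothetical helper: out[i] = a[i] * b[i] + c[i] for i in 0..7.
   The target attribute enables AVX2 and FMA for this function only. */
__attribute__((target("avx2,fma")))
void mul_add8_intrin(const float *a, const float *b, const float *c,
                     float *out)
{
    /* Unaligned loads, so the caller's arrays need no special alignment. */
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_loadu_ps(c);

    /* Separate multiply and add intrinsics; a compiler that contracts
       them can emit a single fused multiply-add here instead. */
    __m256 r = _mm256_add_ps(_mm256_mul_ps(va, vb), vc);

    _mm256_storeu_ps(out, r);
}
```

To see whether contraction happened, compile to assembly (for example with gcc -O2 -mavx2 -mfma -S) and look for a vfmadd instruction in place of a vmulps/vaddps pair.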
Although -O1 is sufficient for this optimization with some compilers, always use at least -O2 for overall performance, preferably -O3 -march=native -flto together with profile-guided optimization. And if it's ok for your code, -ffast-math.