What is the instruction that gives branchless FP min and max on x86?
Warning: Beware of compilers treating _mm_min_ps
/ _mm_max_ps
(and _pd) intrinsics as commutative even in strict FP (not fast-math) mode; even though the asm instruction isn't. GCC specifically seems to have this bug: PR72867 which was fixed in GCC7 but may be back or never fixed for _mm_min_ss
etc. scalar intrinsics (_mm_max_ss has different behavior between clang and gcc, GCC bugzilla PR99497).
GCC knows how the asm instructions themselves work, and doesn't have this problem when using them to implement strict FP semantics in plain scalar code, only with the C/C++ intrinsics.
Unfortunately there isn't a single instruction that implements fmin(a,b)
(with guaranteed NaN propagation), so you have to choose between easy detection of problems vs. higher performance.
Most vector FP instructions have scalar equivalents. MINSS / MAXSS / MINSD / MAXSD are what you want. They handle +/-Infinity the way you'd expect.
MINSS a,b
exactly implements (a<b) ? a : b
according to IEEE rules, with everything that implies about signed-zero, NaN, and Infinities. (i.e. it keeps the source operand, b
, on unordered.) This means C++ compilers can use them for std::min(b,a)
and std::max(b,a)
, because those functions are based on the same expression. Note the b,a
operand order for the std::
functions, opposite Intel-syntax for x86 asm, but matching AT&T syntax.
MAXSS a,b
exactly implements (b<a) ? a : b
, again keeping the source operand (b
) on unordered. Like std::max(b,a)
.
Looping over an array with x = std::min(arr[i], x);
(i.e. minss
or maxss xmm0, [rsi]
) will take a NaN from memory if one is present, and then take whatever non-NaN element is next because that compare will be unordered. So you'll get the min or max of the elements following the last NaN. You normally don't want this, so it's only good for arrays that don't contain NaN. But it means you can start with float v = NAN;
outside a loop, instead of the first element or FLT_MAX or +Infinity, and might simplify handling possibly-empty lists. It's also convenient in asm, allowing init with pcmpeqd xmm0,xmm0
to generate an all-ones bit-pattern (a negative QNAN), but unfortunately GCC's NAN uses a different bit-pattern.
Demo/proof on the Godbolt compiler explorer, including showing that v = std::min(v, arr[i]);
(or max) ignores NaNs in the array, at the cost of having to load into a register and then minss into that register.
(Note that min of an array should use vectors, not scalar; preferably with multiple accumulators to hide FP latency. At the end, reduce to one vector then do horizontal min of it, just like summing an array or doing a dot product.)
Don't try to use _mm_min_ss
on scalar floats; the intrinsic is only available with __m128
operands, and Intel's intrinsics don't provide any way to get a scalar float into the low element of a __m128
without zeroing the high elements or somehow doing extra work. Most compilers will actually emit the useless instructions to do that even if the final result doesn't depend on anything in the upper elements. (Clang can often avoid it, though, applying the as-if rule to the contents of dead vector elements.) There's nothing like __m256 _mm256_castps128_ps256 (__m128 a)
to just cast a float to a __m128
with garbage in the upper elements. I consider this a design flaw. :/
But fortunately you don't need to do this manually, compilers know how to use SSE/SSE2 min/max for you. Just write your C such that they can. The function in your question is ideal: as shown below (Godbolt link):
// can and does inline to a single MINSD instruction, and can auto-vectorize easily
static inline double
dmnsn_min(double a, double b) {
return a < b ? a : b;
}
Note their asymmetric behaviour with NaN: if the operands are unordered, dest=src (i.e. it takes the second operand if either operand is NaN). This can be useful for SIMD conditional updates, see below.
(a
and b
are unordered if either of them is NaN. That means a<b
, a==b
, and a>b
are all false. See Bruce Dawson's series of articles on floating point for lots of FP gotchas.)
The corresponding _mm_min_ss
/ _mm_min_ps
intrinsics may or may not have this behaviour, depending on the compiler.
I think the intrinsics are supposed to have the same operand-order semantics as the asm instructions, but gcc has treated the operands to _mm_min_ps
as commutative even without -ffast-math
for a long time, gcc4.4 or maybe earlier. GCC 7 finally changed it to match ICC and clang.
Intel's online intrinsics finder doesn't document that behaviour for the function, but it's maybe not supposed to be exhaustive. The asm insn ref manual doesn't say the intrinsic doesn't have that property; it just lists _mm_min_ss
as the intrinsic for MINSS.
When I googled on "_mm_min_ps" NaN
, I found this real code and some other discussion of using the intrinsic to handle NaNs, so clearly many people expect the intrinsic to behave like the asm instruction. (This came up for some code I was writing yesterday, and I was already thinking of writing this up as a self-answered Q&A.)
Given the existence of this longstanding gcc bug, portable code that wants to take advantage of MINPS's NaN handling needs to take precautions. The standard gcc version on many existing Linux distros will mis-compile your code if it depends on the order of operands to _mm_min_ps
. So you probably need an #ifdef
to detect actual gcc (not clang etc), and an alternative. Or just do it differently in the first place :/ Perhaps with a _mm_cmplt_ps
and boolean AND/ANDNOT/OR.
Enabling -ffast-math
also makes _mm_min_ps
commutative on all compilers.
As usual, compilers know how to use the instruction set to implement C semantics correctly. MINSS and MAXSS are faster than anything you could do with a branch anyway, so just write code that can compile to one of those.
The commutative-_mm_min_ps
issue applies to only the intrinsic: gcc knows exactly how MINSS/MINPS work, and uses them to correctly implement strict FP semantics (when you don't use -ffast-math).
You don't usually need to do anything special to get decent scalar code out of a compiler. But if you are going to spend time caring about what instructions the compiler uses, you should probably start by manually vectorizing your code if the compiler isn't doing that.
(There may be rare cases where a branch is best, if the condition almost always goes one way and latency is more important than throughput. MINPS latency is ~3 cycles, but a perfectly predicted branch adds 0 cycles to the dependency chain of the critical path.)
In C++, use std::min
and std::max
, which are defined in terms of >
or <
, and don't have the same requirements on NaN behaviour that fmin
and fmax
do. Avoid fmin
and fmax
for performance unless you need their NaN behaviour.
In C, I think just write your own min
and max
functions (or macros if you do it safely).
C & asm on the Godbolt compiler explorer
float minfloat(float a, float b) {
return (a<b) ? a : b;
}
# any decent compiler (gcc, clang, icc), without any -ffast-math or anything:
minss xmm0, xmm1
ret
// C++
float minfloat_std(float a, float b) { return std::min(a,b); }
# This implementation of std::min uses (b<a) : b : a;
# So it can produce the result only in the register that b was in
# This isn't worse (when inlined), just opposite
minss xmm1, xmm0
movaps xmm0, xmm1
ret
float minfloat_fmin(float a, float b) { return fminf(a, b); }
# clang inlines fmin; other compilers just tailcall it.
minfloat_fmin(float, float):
movaps xmm2, xmm0
cmpunordss xmm2, xmm2
movaps xmm3, xmm2
andps xmm3, xmm1
minss xmm1, xmm0
andnps xmm2, xmm1
orps xmm2, xmm3
movaps xmm0, xmm2
ret
# Obviously you don't want this if you don't need it.
If you want to use _mm_min_ss
/ _mm_min_ps
yourself, write code that lets the compiler make good asm even without -ffast-math.
If you don't expect NaNs, or want to handle them specially, write stuff like
lowest = _mm_min_ps(lowest, some_loop_variable);
so the register holding lowest
can be updated in-place (even without AVX).
Taking advantage of MINPS's NaN behaviour:
Say your scalar code is something like
if(some condition)
lowest = min(lowest, x);
Assume the condition can be vectorized with CMPPS, so you have a vector of elements with the bits all set or all clear. (Or maybe you can get away with ANDPS/ORPS/XORPS on floats directly, if you just care about their sign and don't care about negative zero. This creates a truth value in the sign bit, with garbage elsewhere. BLENDVPS looks at only the sign bit, so this can be super useful. Or you can broadcast the sign bit with PSRAD xmm, 31
.)
The straight-forward way to implement this would be to blend x
with +Inf
based on the condition mask. Or do newval = min(lowest, x);
and blend newval into lowest
. (either BLENDVPS or AND/ANDNOT/OR).
But the trick is that all-one-bits is a NaN, and a bitwise OR will propagate it. So:
__m128 inverse_condition = _mm_cmplt_ps(foo, bar);
__m128 x = whatever;
x = _mm_or_ps(x, condition); // turn elements into NaN where the mask is all-ones
lowest = _mm_min_ps(x, lowest); // NaN elements in x mean no change in lowest
// REQUIRES NON-COMMUTATIVE _mm_min_ps: no -ffast-math
// AND DOESN'T WORK AT ALL WITH MOST GCC VERSIONS.
So with only SSE2, and we've done a conditional MINPS in two extra instructions (ORPS and MOVAPS, unless loop unrolling allows the MOVAPS to disappear).
The alternative without SSE4.1 BLENDVPS is ANDPS/ANDNPS/ORPS to blend, plus an extra MOVAPS. ORPS is more efficient than BLENDVPS anyway (it's 2 uops on most CPUs).
Peter Cordes's answer is great, I just figured I'd jump in with some shorter point-by-point answers:
- What is the scalar branchless minmax instruction on x86? Is it a sequence of instructions?
I was referring to minss
/minsd
. And even other architectures without such instructions should be able to do this branchlessly with conditional moves.
- Is it safe to assume it's going to be applied, or how do I call it?
gcc
and clang
will both optimize (a < b) ? a : b
to minss
/minsd
, so I don't bother using intrinsics. Can't speak to other compilers though.
- Does it make sense to bother about branchless-ness of min/max? From what I understand, for a raytracer and / or other viz software, given a ray - box intersection routine, there is no reliable pattern for the branch predictor to pick up, hence it does make sense to eliminate the branch. Am I right about this?
The individual a < b
tests are pretty much completely unpredictable, so it is very important to avoid branching for those. Tests like if (ray.dir.x != 0.0)
are very predictable, so avoiding those branches is less important, but it does shrink the code size and make it easier to vectorize. The most important part is probably removing the divisions though.
- Most importantly, the algorithm discussed is built around comparing against (+/-) INFINITY. Is this reliable w.r.t the (unknown) instruction we're discussing and the floating-point standard?
Yes, minss
/minsd
behave exactly like (a < b) ? a : b
, including their treatment of infinities and NaNs.
Also, I wrote a followup post to the one you referenced that talks about NaNs and min/max in more detail.