Efficient floating-point division with constant integer divisors
This question asks for a way to identify the values of the constant Y
that make it safe to transform x / Y
into a cheaper computation using FMA for all possible values of x
. Another approach is to use static analysis to determine an over-approximation of the values x
can take, so that the generally unsound transformation can be applied in the knowledge that the values for which the transformed code differs from the original division do not happen.
Using representations of sets of floating-point values that are well adapted to the problems of floating-point computations, even a forwards analysis starting from the beginning of the function can produce useful information. For instance:
float f(float z) {
float x = 1.0f + z;
float r = x / Y;
return r;
}
Assuming the default round-to-nearest mode(*), in the above function x
can only be NaN (if the input is NaN), +0.0f, or a number larger than 2-24 in magnitude, but not -0.0f or anything closer to zero than 2-24. This justifies the transformation into one of the two forms shown in the question for many values of the constant Y
.
(*) assumption without which many optimizations are impossible and that C compilers already make unless the program explicitly uses #pragma STDC FENV_ACCESS ON
A forwards static analysis that predicts the information for x
above can be based on a representation of sets of floating-point values an expression can take as a tuple of:
- a representation for the sets of possible NaN values (Since behaviors of NaN are underspecified, a choice is to use only a boolean, with
true
meaning some NaNs can be present, andfalse
indicating no NaN is present.), - four boolean flags indicating respectively the presence of +inf, -inf, +0.0, -0.0,
- an inclusive interval of negative finite floating-point values, and
- an inclusive interval of positive finite floating-point values.
In order to follow this approach, all the floating-point operations that can occur in a C program must be understood by the static analyzer. To illustrate, the addition betweens sets of values U and V, to be used to handle +
in the analyzed code, can be implemented as:
- If NaN is present in one of the operands, or if the operands can be infinities of opposite signs, NaN is present in the result.
- If 0 cannot be a result of the addition of a value of U and a value of V, use standard interval arithmetic. The upper bound of the result is obtained for the round-to-nearest addition of the largest value in U and the largest value in V, so these bounds should be computed with round-to-nearest.
- If 0 can be a result of the addition of a positive value of U and a negative value of V, then let M be the smallest positive value in U such that -M is present in V.
- if succ(M) is present in U, then this pair of values contributes succ(M) - M to the positive values of the result.
- if -succ(M) is present in V, then this pair of values contributes the negative value M - succ(M) to the negative values of the result.
- if pred(M) is present in U, then this pair of values contributes the negative value pred(M) - M to the negative values of the result.
- if -pred(M) is present in V, then this pair of values contributes the value M - pred(M) to the positive values of the result.
- Do the same work if 0 can be the result of the addition of a negative value of U and a positive value of V.
Acknowledgement: the above borrows ideas from “Improving the Floating Point Addition and Subtraction Constraints”, Bruno Marre & Claude Michel
Example: compilation of the function f
below:
float f(float z, float t) {
float x = 1.0f + z;
if (x + t == 0.0f) {
float r = x / 6.0f;
return r;
}
return 0.0f;
}
The approach in the question refuses to transform the division in function f
into an alternate form, because 6 is not one of the value for which the division can be unconditionally transformed. Instead, what I am suggesting is to apply a simple value analysis starting from the beginning of the function which, in this case, determines that x
is a finite float either +0.0f
or at least 2-24 in magnitude, and to use this information to apply Brisebarre et al's transformation, confident in the knowledge that x * C2
does not underflow.
To be explicit, I am suggesting to use an algorithm such as the one below to decide whether or not to transform the division into something simpler:
- Is
Y
one of the values that can be transformed using Brisebarre et al's method according to their algorithm? - Do C1 and C2 from their method have the same sign, or is it possible to exclude the possibility that the dividend is infinite?
- Do C1 and C2 from their method have the same sign, or can
x
take only one of the two representations of 0? If in the case where C1 and C2 have different signs andx
can only be one representation of zero, remember to fiddle(**) with the signs of the FMA-based computation to make it produce the correct zero whenx
is zero. - Can the magnitude of the dividend be guaranteed to be large enough to exclude the possibility that
x * C2
underflows?
If the answer to the four questions is “yes”, then the division can be transformed into a multiplication and an FMA in the context of the function being compiled. The static analysis described above serves to answer questions 2., 3. and 4.
(**) “fiddling with the signs” means using -FMA(-C1, x, (-C2)*x) in place of FMA(C1, x, C2*x) when this is necessary to make the result come out correctly when x can only be one of the two signed zeroes
Let me restart for the third time. We are trying to accelerate
q = x / y
where y
is an integer constant, and q
, x
, and y
are all IEEE 754-2008 binary32 floating-point values. Below, fmaf(a,b,c)
indicates a fused multiply-add a * b + c
using binary32 values.
The naive algorithm is via a precalculated reciprocal,
C = 1.0f / y
so that at runtime a (much faster) multiplication suffices:
q = x * C
The Brisebarre-Muller-Raina acceleration uses two precalculated constants,
zh = 1.0f / y
zl = -fmaf(zh, y, -1.0f) / y
so that at runtime, one multiplication and one fused multiply-add suffices:
q = fmaf(x, zh, x * zl)
The Markstein algorithm combines the naive approach with two fused multiply-adds that yields the correct result if the naive approach yields a result within 1 unit in the least significant place, by precalculating
C1 = 1.0f / y
C2 = -y
so that the divison can be approximated using
t1 = x * C1
t2 = fmaf(C1, t1, x)
q = fmaf(C2, t2, t1)
The naive approach works for all powers of two y
, but otherwise it is pretty bad. For example, for divisors 7, 14, 15, 28, and 30, it yields an incorrect result for more than half of all possible x
.
The Brisebarre-Muller-Raina approach similarly fails for almost all non-power of two y
, but much fewer x
yield the incorrect result (less than half a percent of all possible x
, varies depending on y
).
The Brisebarre-Muller-Raina article shows that the maximum error in the naive approach is ±1.5 ULPs.
The Markstein approach yields correct results for powers of two y
, and also for odd integer y
. (I have not found a failing odd integer divisor for the Markstein approach.)
For the Markstein approach, I have analysed divisors 1 - 19700 (raw data here).
Plotting the number of failure cases (divisor in the horizontal axis, the number of values of x
where Markstein approach fails for said divisor), we can see a simple pattern occur:
(source: nominal-animal.net)
Note that these plots have both horizontal and vertical axes logarithmic. There are no dots for odd divisors, as the approach yields correct results for all odd divisors I've tested.
If we change the x axis to the bit reverse (binary digits in reverse order, i.e. 0b11101101 → 0b10110111, data) of the divisors, we have a very clear pattern:
(source: nominal-animal.net)
If we draw a straight line through the center of the point sets, we get curve 4194304/x
. (Remember, the plot considers only half the possible floats, so when considering all possible floats, double it.)
8388608/x
and 2097152/x
bracket the entire error pattern completely.
Thus, if we use rev(y)
to compute the bit reverse of divisor y
, then 8388608/rev(y)
is a good first order approximation of the number of cases (out of all possible float) where the Markstein approach yields an incorrect result for an even, non-power-of-two divisor y
. (Or, 16777216/rev(x)
for the upper limit.)
Added 2016-02-28: I found an approximation for the number of error cases using the Markstein approach, given any integer (binary32) divisor. Here it is as pseudocode:
function markstein_failure_estimate(divisor):
if (divisor is zero)
return no estimate
if (divisor is not an integer)
return no estimate
if (divisor is negative)
negate divisor
# Consider, for avoiding underflow cases,
if (divisor is very large, say 1e+30 or larger)
return no estimate - do as division
while (divisor > 16777216)
divisor = divisor / 2
if (divisor is a power of two)
return 0
if (divisor is odd)
return 0
while (divisor is not odd)
divisor = divisor / 2
# Use return (1 + 83833608 / divisor) / 2
# if only nonnegative finite float divisors are counted!
return 1 + 8388608 / divisor
This yields a correct error estimate to within ±1 on the Markstein failure cases I have tested (but I have not yet adequately tested divisors larger than 8388608). The final division should be such that it reports no false zeroes, but I cannot guarantee it (yet). It does not take into account very large divisors (say 0x1p100, or 1e+30, and larger in magnitude) which have underflow issues -- I would definitely exclude such divisors from acceleration anyway.
In preliminary testing, the estimate seems uncannily accurate. I did not draw a plot comparing the estimates and the actual errors for divisors 1 to 20000, because the points all coincide exactly in the plots. (Within this range, the estimate is exact, or one too large.) Essentially, the estimates reproduce the first plot in this answer exactly.
The pattern of failures for the Markstein approach is regular, and very interesting. The approach works for all power of two divisors, and all odd integer divisors.
For divisors greater than 16777216, I consistently see the same errors as for a divisor that is divided by the smallest power of two to yield a value less than 16777216. For example, 0x1.3cdfa4p+23 and 0x1.3cdfa4p+41, 0x1.d8874p+23 and 0x1.d8874p+32, 0x1.cf84f8p+23 and 0x1.cf84f8p+34, 0x1.e4a7fp+23 and 0x1.e4a7fp+37. (Within each pair, the mantissa is the same, and only the power of two varies.)
Assuming my test bench is not in error, this means that the Markstein approach also works divisors larger than 16777216 in magnitude (but smaller than, say, 1e+30), if the divisor is such that when divided by the smallest power of two that yields a quotient of less than 16777216 in magnitude, and the quotient is odd.