Why is (1-x)(1+x) preferred to (1-x^2)?

The derivative of ArcTan(X, Sqrt(1-X*X)) with respect to X is 1/Sqrt(1-X*X). This goes to infinity as |X| goes to 1. Therefore, when X is near 1 or -1, any error in evaluation has a huge effect on the result. Thus, it is critical that evaluation minimize error in these cases.

When X is near 1, evaluation of 1-X has no error (in IEEE 754 or any good floating-point system, because the scale of the result is such that its least significant bit is at least as low as the least significant bit in 1 or X, so the exact mathematical result has no bits outside the available significand bits). Since 1-X is exact, consider the effect of the error in 1+X by considering the derivative of ArcTan(X, Sqrt((1-X)*(1+X+e)) with respect to e, where e is the error introduced in the 1+X operation. The derivative is, when X is near 1 and e is small, approximately -1/10. (Taking the derivative with Maple and substituting 1 for x yields -1/(sqrt(4+2e)*(5+2e). Then substituting 0 for e yields -1/10.) Thus, the error in 1+X is not critical.

Therefore, evaluating the expression as ArcTan(X, Sqrt((1-X)*(1+X)) is a good way to evaluate it.

The situation is symmetric for X near -1. (1+X has no error, and 1-X is not critical.)

Conversely, if we consider the error in X*X, the derivative of ArcTan(X, Sqrt(1-X*X+e)) with respect to e is, when X is near 1, approximately -1/(2sqrt(e)(1+e)), so it is large when e is small. So a small error in evaluating X*X will cause a large error in the result, when X is near 1.

Ask Pascal Cuoq points out, when evaluating a function f(x), we are generally interested in minimizing the relative error in the final result. And, as I have pointed out, the errors that occur during calculation are generally relative errors in intermediate results due to floating-point rounding. I was able to ignore this in the above because I was considering the function when X is near 1, so both the intermediate values under consideration (1+X and X*X) and the final value had magnitudes near 1, so dividing the values by those magnitudes would not change anything significantly.

However, for completeness, I examined the situation more closely. In Maple, I wrote g := arctan(x, sqrt((1-x*x*(1+e0))*(1+e1))*(1+e2)), thus allowing for relative errors e0, e1, and e2 in the calculations of x*x, 1-x*x, and the sqrt, respectively, and I wrote h:= arctan(x, sqrt((1-x)*(1+x)*(1+e0))*(1+e2)) for the alternative. Note that e0 in this case combines the three errors in 1-x, 1+x, and the multiplication of them; the full error term could be (1+ea)*(1+eb)*(1+ec), but this is effectively 1+e0 with a larger possible range for e0.

Then I examined the derivatives of these functions with respect to (one at a time) e0, e1, and e2 divided by abs(f(x)), where f was the ideal function, arctan(x, sqrt(1-x*x)). E.g., in Maple, I examined diff(g, e0) / abs(f(x)). I did not perform a full analytic evaluation of these; I examined the values for some values of x near 0 and near 1 and for values of e0, e1, and e2 at one of their limits, -2^-54.

For x near 0, the values were all of magnitude about 1 or less. That is, any relative error in calculation resulted in a similar relative error in the result, or less.

For x near 1, the values with the derivatives of e1 and e2 were tiny, about 10^-8 or less. However, the values with the derivatives of e0 were hugely different for the two methods. For the 1-x*x method, the value was about 2•10⁷ (using x = 1-2^-53). For the (1-x)*(1+x) method, the value was about 5•10^-9.

In summary, the two methods do not differ much near x = 0, but the (1-x)*(1+x) method is significantly better near x = 1.

I wrote the program below in order to obtain some empirical results for single-precision.

#include <float.h>
#include <math.h>
#include <stdio.h>

long double d1, d2, rel1, rel2;
float i1, i2;
int main() {
  float f;
  for (f = nextafterf(0, 2); f <= 1; f = nextafterf(f, 2))
    {
      long double o = 1.0L - ((long double)f * f);
      float r1 = (1 - f) * (1 + f);
      float r2 = 1 - f * f;
      long double c1 = fabsl(o - r1);
      long double c2 = fabsl(o - r2);
      if (c1 > d1) d1 = c1;
      if (c2 > d2) d2 = c2;
      if (c1 / o > rel1) rel1 = c1 / o, i1 = f;
      if (c2 / o > rel2) rel2 = c2 / o, i2 = f;
    }

  printf("(1-x)(1+x) abs:%Le  relative:%Le\n", d1, rel1);
  printf("1-x*x      abs:%Le  relative:%Le\n\n", d2, rel2);

  printf("input1: %a 1-x:%a 1+x:%a (1-x)(1+x):%a o:%a\n", i1, 1-i1, 1+i1, (1-i1)*(1+i1), (double)(1 - ((long double)i1 * i1)));
  printf("input2: %a x*x:%a 1-x*x:%a o:%a\n", i2, i2*i2, 1 - i2*i2, (double)(1 - ((long double)i2 * i2)));
}

A few remarks:

I used 80-bit long double to compute meta-data. This is not enough to precisely represent the error made for all values of x, but I fear the program would become prohibitively slow with higher precision.
The reference value o is computed as 1.0L - ((long double)f * f). This is always the nearest long double number to the real result, because (long double)f * f is exact (see, already it seems that the form 1 - x*x can sometimes be better :) ).

I obtained the results below:

(1-x)(1+x) abs:8.940394e-08  relative:9.447410e-08
1-x*x      abs:4.470348e-08  relative:8.631498e-05

input1: 0x1.6a046ep-1 1-x:0x1.2bf724p-2 1+x:0x1.b50238p+0 (1-x)(1+x):0x1.0007bep-1 o:0x1.0007bc6a305ep-1
input2: 0x1.ffe96p-1 x*x:0x1.ffd2cp-1 1-x*x:0x1.6ap-12 o:0x1.69f8007p-12

According to these results, 1 - x*x has better absolute accuracy and (1-x)*(1+x) has much better relative accuracy. Floating-point is all about relative accuracy (the entire system is designed to allow relatively accurate representation of small and large values), so the latter form is preferred.

EDIT: Computing the final error makes more sense, as illustrated in Eric's answer. A subexpression in an expression such as ArcTan(X, Sqrt(1 - X*X)) could have been chosen not because of its better accuracy overall but because it was accurate where it mattered most. Adding the lines below to the body of the loop:

  long double a = atan2l(f, sqrtl(o));
  float a1 = atan2f(f, sqrtf(r1));
  float a2 = atan2f(f, sqrtf(r2));
  long double e1 = fabsl(a - a1);
  long double e2 = fabsl(a - a2);
  if (e1 / a > ae1) ae1 = e1 / a, i1 = f;
  if (e2 / a > ae2) ae2 = e2 / a, i2 = f;

It might make as much sense to use atan2l(f, sqrtf(r1)) because I do not have the exact same function ArcTan as your system. Anyway, with these caveats, for the complete expression, the maximum relative error on the interval [-1 … 1] is 1.4e-07 for the (1-x)(1+x) version and 5.5e-7 for the 1-x² version.

Why is (1-x)(1+x) preferred to (1-x^2)?

Tags:

Floating Point

Related

Recent Posts