How is arctan implemented?
Implementations of the FPATAN instructions in x86 processors are usually proprietary. To compute arctan, or other (inverse) trigonometric functions, common algorithms follow a three-step process:
- argument reduction for mapping the full input domain to a narrow interval
- computation of the core approximation on the narrow interval (primary approximation interval)
- expansion of the intermediate result based on the argument reduction to produce the final result
The argument reduction is usually based on well-known trigonometric identities that can be looked up in various standard references such as MathWorld. For the computation of arctan, commonly used identities are
- arctan (-x) = -arctan(x)
- arctan (1/x) = 0.5 * pi - arctan(x) [x > 0]
- arctan (x) = arctan(c) + arctan((x - c) / (1 + x*c))
Note that the last identity lends itself to the construction of a table of values arctan(i/2n), i = 1...2n, which allows the use of an arbitrarily narrow primary approximation interval at the expense of additional table storage. This is a classical programming trade-off between space and time.
The approximation on the core interval is typically a minimax polynomial approximation of sufficient degree. Rational approximations are usually not competitive on modern hardware due to the high cost of floating-point division, and also suffer from additional numerical error, due to the computation of two polynomials plus the error contributed by the division.
The coefficients for minimax polynomial approximations are usually computed using the Remez algorithm. Tools like Maple and Mathematica have built-in facilities to compute such approximations. The accuracy of polynomial approximations can be improved by making sure all coefficients are exactly representable machine numbers. The only tool I am aware of that has a built-in facility for this is Sollya which offers an fpminimax()
function.
The evaluation of polynomials usually utilizes Horner's scheme which is efficient and accurate, or a mixture of Estrin's scheme and Horner's. Estrin's scheme allows one to make excellent use of instruction level parallelism provided by superscalar processors, with a minor impact on overall instruction count and often (but not always) benign impact on accuracy.
The use of FMA (fused-multiply add) enhances the accuracy and performance of either evaluation scheme due to the reduced number of rounding steps and by offering some protection against subtractive cancellation. FMA is found on many processors, including GPUs and recent ARM, x86, and Power CPUs. In standard C and standard C++, the FMA operation is exposed as the fma()
standard library function, however it needs to be emulated on platforms that do not offer hardware support, which makes it slow on those platforms.
From a programming viewpoint one would like to avoid the risk of conversion errors when translating the floating-point constants needed for the approximation and argument reduction from textual to machine representation. ASCII-to-floating-point conversion routine are notorious for containing tricky bugs. One mechanism offered by standard C (not C++ best I know, where it is available only as a proprietary extension) is to specify floating-point constants as hexadecimal literals that directly express the underlying bit-pattern, effectively avoiding complicated conversions.
Below is C code to compute double-precision arctan()
that demonstrates many of the design principles and techniques mentioned above. This quickly-constructed code lacks the sophistication of the implementations pointed to in other answers but provides an implementation with a maximum observed error of 1.65 ulps, which may be sufficient in various contexts. I created a custom minimax approximation with a simple implementation of the Remez algorithm that used 1024-bit floating-point arithmetic for all intermediate steps. I would expect the use of Sollya or similar tools to result in a numerically superior approximations. To balance performance and accuracy, the evaluation of the core approximation combines both second-order and ordinary Horner schemes.
/* Compute arctangent. Maximum observed error: 1.64991 ulps */
double my_atan (double x)
{
double a, z, p, r, q, s, t;
/* argument reduction:
arctan (-x) = -arctan(x);
arctan (1/x) = 1/2 * pi - arctan (x), when x > 0
*/
z = fabs (x);
a = (z > 1.0) ? (1.0 / z) : z;
s = a * a;
q = s * s;
/* core approximation: approximate atan(x) on [0,1] */
p = -2.0258553044340116e-5; // -0x1.53e1d2a258e3ap-16
t = 2.2302240345710764e-4; // 0x1.d3b63dbb6167ap-13
p = fma (p, q, -1.1640717779912220e-3); // -0x1.312788ddde71dp-10
t = fma (t, q, 3.8559749383656407e-3); // 0x1.f9690c824aaf1p-9
p = fma (p, q, -9.1845592187222193e-3); // -0x1.2cf5aabc7dbd2p-7
t = fma (t, q, 1.6978035834594660e-2); // 0x1.162b0b2a3bcdcp-6
p = fma (p, q, -2.5826796814492296e-2); // -0x1.a7256feb6f841p-6
t = fma (t, q, 3.4067811082715810e-2); // 0x1.171560ce4a4ecp-5
p = fma (p, q, -4.0926382420509999e-2); // -0x1.4f44d841450e8p-5
t = fma (t, q, 4.6739496199158334e-2); // 0x1.7ee3d3f36bbc6p-5
p = fma (p, q, -5.2392330054601366e-2); // -0x1.ad32ae04a9fd8p-5
t = fma (t, q, 5.8773077721790683e-2); // 0x1.e17813d669537p-5
p = fma (p, q, -6.6658603633512892e-2); // -0x1.11089ca9a5be4p-4
t = fma (t, q, 7.6922129305867892e-2); // 0x1.3b12b2db5173cp-4
p = fma (p, s, t);
p = fma (p, s, -9.0909012354005267e-2); // -0x1.745d022f8dc5fp-4
p = fma (p, s, 1.1111110678749421e-1); // 0x1.c71c709dfe925p-4
p = fma (p, s, -1.4285714271334810e-1); // -0x1.2492491fa1742p-3
p = fma (p, s, 1.9999999999755005e-1); // 0x1.99999999840cdp-3
p = fma (p, s, -3.3333333333331838e-1); // -0x1.5555555555448p-2
p = fma (p * s, a, a);
/* back substitution in accordance with argument reduction */
/* double-precision factorization of PI/2 courtesy of Tor Myklebust */
r = (z > 1.0) ? fma (0.93282184640716537, 1.6839188885261840, -p) : p;
return copysign (r, x);
}
Trigonometric functions do have pretty ugly implementations that are hacky and do lots of bit fiddling. I think it will be pretty hard to find someone here that is able to explain an algorithm that is actually used.
Here is an atan2 implementation: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/ieee754/dbl-64/e_atan2.c;h=a287ca6656b210c77367eec3c46d72f18476d61d;hb=HEAD
Edit: Actually I found this one: http://www.netlib.org/fdlibm/e_atan2.c which is a lot easier to follow, but probably slower because of that (?).
The FPU does all this in some circuits so the CPU doesn't have to do all this work.