What is the difference between float and double?
Huge difference.
As the name implies, a double has roughly twice the precision of a float [1]. In general, a double has about 15 decimal digits of precision, while a float has about 7.
Here's how the number of digits is calculated:
double has 52 mantissa bits + 1 hidden bit: log(2^53) ÷ log(10) = 15.95 digits
float has 23 mantissa bits + 1 hidden bit: log(2^24) ÷ log(10) = 7.22 digits
With less precision, larger rounding errors accumulate when a calculation is repeated many times, e.g.
float a = 1.f / 81;
float b = 0;
for (int i = 0; i < 729; ++i)
    b += a;
printf("%.7g\n", b); // prints 9.000023
while
double a = 1.0 / 81;
double b = 0;
for (int i = 0; i < 729; ++i)
    b += a;
printf("%.15g\n", b); // prints 8.99999999999996
Also, the maximum value of float is about 3.4e38, while that of double is about 1.8e308, so float can hit "infinity" (a special floating-point value) much more easily than double for something simple, e.g. computing the factorial of 60.
During testing, a few test cases may contain such huge numbers, which can make your program fail if you use floats.
Of course, sometimes even double isn't accurate enough, which is why we also have long double [1] (the above example gives 9.000000000000000066 on Mac). Still, all floating-point types suffer from round-off errors, so if precision is very important (e.g. money processing) you should use int or a fraction class.
Furthermore, don't use += to sum lots of floating-point numbers, as the errors accumulate quickly. If you're using Python, use math.fsum. Otherwise, try to implement the Kahan summation algorithm.
[1]: The C and C++ standards do not specify the representation of float, double and long double. It is possible that all three are implemented as IEEE double-precision. Nevertheless, with most compilers and architectures (gcc, MSVC; x86, x64, ARM) float is indeed an IEEE single-precision floating-point number (binary32), and double is an IEEE double-precision floating-point number (binary64).
Here is what the C99 (ISO/IEC 9899, 6.2.5 §10) and C++2003 (ISO/IEC 14882:2003, 3.9.1 §8) standards say:
There are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double.
The C++ standard adds:
The value representation of floating-point types is implementation-defined.
I would suggest having a look at the excellent What Every Computer Scientist Should Know About Floating-Point Arithmetic, which covers the IEEE floating-point standard in depth. You'll learn about the representation details and you'll realize there is a tradeoff between magnitude and precision. The absolute precision of the floating-point representation (the spacing between adjacent representable values) gets finer as the magnitude decreases, so the representable numbers are most densely packed between -1 and 1.