Difference in casting float to int, 32-bit C
In the “32-bit system,” the difference is caused by the fact that f1*10.0 uses full double precision, while f10 has only float precision, because that is its type. f1*10.0 uses double precision because 10.0 is a double constant. When f1*10.0 is assigned to f10, the value changes because it is implicitly converted to float, which has less precision.

If you use the float constant 10.0f instead, the differences vanish.
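As a quick way to see the types involved, here is a minimal sketch (not from the question; the variable name f1 is just illustrative):

#include <stdio.h>

int main(void)
{
    float f1 = 3.1f;
    /* 10.0 is a double constant, so f1*10.0 is computed in double;
       10.0f is a float constant, so f1*10.0f stays in float. */
    printf("%d\n", (int)sizeof(f1*10.0));   /* sizeof(double), commonly 8 */
    printf("%d\n", (int)sizeof(f1*10.0f));  /* sizeof(float), commonly 4 */
    return 0;
}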
Consider the first case, when i is 1. Then:

- In f1 = 3+i*0.1, 0.1 is a double constant, so the arithmetic is performed in double, and the result is 3.100000000000000088817841970012523233890533447265625. Then, to assign this to f1, it is converted to float, which produces 3.099999904632568359375.
- In f10 = f1*10.0, 10.0 is a double constant, so the arithmetic is again performed in double, and the result is 30.99999904632568359375. For assignment to f10, this is converted to float, and the result is 31.
- Later, when f10 and f1*10.0 are printed, we see the values given above, with nine digits after the decimal point: “31.000000000” for f10 and “30.999999046” for f1*10.0.
If you print f1*10.0f, with the float constant 10.0f instead of the double constant 10.0, the result will be “31.000000000” rather than “30.999999046”.
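All of the values above can be checked directly, assuming a printf that produces correctly rounded decimal expansions (glibc's does; older MSVC runtimes pad with zeros after 17 significant digits). This program is a sketch, not the question's code:

#include <stdio.h>

int main(void)
{
    int i = 1;
    double d = 3 + i*0.1;    /* the full double sum */
    float f1 = 3 + i*0.1;    /* the same sum rounded to float */
    float f10 = f1*10.0;     /* the double product rounded to float */

    printf("%.51f\n", d);         /* 3.10000000000000008881...5625 */
    printf("%.21f\n", f1);        /* 3.099999904632568359375 */
    printf("%.20f\n", f1*10.0);   /* 30.99999904632568359375 */
    printf("%.9f\n", f10);        /* 31.000000000 */
    printf("%.9f\n", f1*10.0f);   /* 31.000000000 when evaluated in float */
    return 0;
}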
(The above uses IEEE-754 basic 32-bit and 64-bit binary floating-point arithmetic.)
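Whether an implementation uses those formats can be checked from float.h; a minimal check (on IEEE-754 machines the significand widths are 24 and 53 bits):

#include <stdio.h>
#include <float.h>

int main(void)
{
    /* IEEE-754 binary32/binary64 significands have 24 and 53 bits. */
    printf("FLT_MANT_DIG = %d\n", FLT_MANT_DIG);
    printf("DBL_MANT_DIG = %d\n", DBL_MANT_DIG);
    return 0;
}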
In particular, note this: The difference between f1*10.0 and f10 arises when f1*10.0 is converted to float for assignment to f10. While C permits implementations to use extra precision in evaluating expressions, it requires implementations to discard this precision in assignments and casts. Therefore, in a standard-conforming compiler, the assignment to f10 must use float precision. This means that, even when the program is compiled for a “64-bit system,” the differences should occur. If they do not, the compiler does not conform to the C standard.
Furthermore, if float is changed to double, the conversion to float does not happen, and the value will not be changed. In this case, no differences between f1*10.0 and f10 should manifest.
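For instance, a sketch with double throughout:

#include <stdio.h>

int main(void)
{
    int i = 1;
    double f1 = 3 + i*0.1;   /* no narrowing to float anywhere */
    double f10 = f1*10.0;    /* with round-to-nearest this is exactly 31 */

    /* Both expressions have the same value, so the printed results and
       the int conversions agree. */
    printf("%.9f %.9f\n", f10, f1*10.0);          /* 31.000000000 31.000000000 */
    printf("%d %d\n", (int)f10, (int)(f1*10.0));  /* 31 31 */
    return 0;
}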
Given that the question reports the differences do not manifest with a “64-bit” compilation and do manifest with double, it is questionable whether the observations have been reported accurately. To clarify this, the exact code should be shown, and the observations should be reproduced by a third party.
With MS Visual C 2008 I was able to reproduce this.
Inspecting the assembler, the difference between the two is an intermediate store and fetch of a result with intermediate conversions:
f10 = f1*10.0; // double result f10 converted to float and stored
c1 = (int)f10; // float result f10 fetched and converted to double
c2 = (int)(f1*10.0); // no store/fetch/convert
The assembler generated pushes values onto the FPU stack, where they get converted to 64 bits and are then multiplied. For c1, the result is then converted back to float and stored, and is then retrieved again and placed on the FPU stack (and converted to double again) for a call to __ftol2_sse, a run-time function to convert a double to an int. For c2, the intermediate value is not converted to and from float but is passed immediately to the __ftol2_sse function. For this function, see also the answer at Convert double to int?.
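For reference, here is a self-contained reconstruction of the experiment (the question's full source is not shown, so the surrounding code, i, and the printout are assumptions):

#include <stdio.h>

int main(void)
{
    int i = 1;
    float f1 = 3 + i*0.1;   /* 3.099999904632568359375 */
    float f10;
    int c1, c2;

    f10 = f1*10.0;          /* double product rounded to float: 31 */
    c1 = (int)f10;          /* 31 */
    c2 = (int)(f1*10.0);    /* 30: the double product 30.99999904... is truncated */

    printf("c1 = %d, c2 = %d\n", c1, c2);
    return 0;
}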
Assembler:
f10 = f1*10;
    fld     dword ptr [f1]
    fmul    qword ptr [__real@4024000000000000 (496190h)]
    fstp    dword ptr [f10]
c2 = (int)(f1*10);
    fld     dword ptr [f1]
    fmul    qword ptr [__real@4024000000000000 (496190h)]
    call    __ftol2_sse
    mov     dword ptr [c2],eax
c1 = (int)f10;
    fld     dword ptr [f10]
    call    __ftol2_sse
    mov     dword ptr [c1],eax