Float32 to Float16
Here's the link to an article on IEEE 754, which gives the bit layouts and biases.
http://en.wikipedia.org/wiki/IEEE_754-2008
The exponents in your float32 and float16 representations are probably biased, and biased differently. You need to unbias the exponent you got from the float32 representation to get the actual exponent, and then bias it for the float16 representation.
Apart from this detail, I do think it's as simple as that, but I still get surprised by floating-point representations from time to time.
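For instance, here is a minimal sketch of that rebiasing step, assuming the usual biases of 127 (float32) and 15 (float16) and deliberately ignoring overflow and underflow; the helper name is mine, not standard:

#include <stdint.h>

// Sketch only: rebias an 8-bit float32 exponent field for the 5-bit float16
// field, assuming the standard biases of 127 and 15. No overflow/underflow
// handling here.
static inline uint16_t rebias_exponent(uint32_t exp32_field)
{
    int actual = (int)exp32_field - 127;  // unbias: recover the real exponent
    return (uint16_t)(actual + 15);       // rebias for the float16 field
}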
EDIT:
Check for overflow when rebiasing the exponent while you're at it.
Your algorithm truncates the last bits of the mantissa a little abruptly. That may be acceptable, but you may want to implement, say, round-to-nearest by looking at the bits that are about to be discarded (see the sketch below): "0..." -> round down, "100..001..." -> round up, "100..00" -> round to even.
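As an illustrative sketch (the function name is mine), here is one way to apply those rules to the 13 mantissa bits that get dropped:

#include <stdint.h>

// Sketch: round the 23-bit float32 mantissa to 10 bits, nearest-even,
// by inspecting the 13 bits that would otherwise be discarded.
static uint32_t round_mantissa(uint32_t mant23)
{
    uint32_t kept    = mant23 >> 13;     // the 10 bits we keep
    uint32_t dropped = mant23 & 0x1fff;  // the 13 bits we discard

    if (dropped > 0x1000)                // "100..001..": round up
        kept += 1;
    else if (dropped == 0x1000)          // "100..00": tie, round to even
        kept += kept & 1;
    // "0...": round down, i.e. leave as is

    return kept;  // note: a carry into the exponent is possible when kept reaches 0x400
}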
The exponent needs to be unbiased, clamped and rebiased. This is the fast code I use:
unsigned int fltInt32;    // raw bits of the float32 to convert
unsigned short fltInt16;  // resulting float16 bits
fltInt16 = (fltInt32 >> 31) << 5;              // sign bit into bit 5 (bit 15 after the final shift)
unsigned short tmp = (fltInt32 >> 23) & 0xff;  // 8-bit float32 exponent field
// Rebias the exponent from 127 to 15 (subtract 0x70); the mask is 0x1f when the
// rebiased exponent is positive and 0 on underflow, which zeroes the field.
tmp = (tmp - 0x70) & ((unsigned int)((int)(0x70 - tmp) >> 4) >> 27);
fltInt16 = (fltInt16 | tmp) << 10;             // pack sign and 5-bit exponent above the mantissa
fltInt16 |= (fltInt32 >> 13) & 0x3ff;          // truncate the mantissa from 23 to 10 bits
This code will be even faster with a lookup table for the exponent, but I use this one because it is easily adapted to a SIMD workflow.
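For completeness, here is a sketch of how the snippet might be wrapped into a standalone function; the memcpy is just one strict-aliasing-safe way to get at the float's bits and is not part of the original code:

#include <string.h>

// Sketch of a wrapper around the conversion above.
static unsigned short float_to_half(float f)
{
    unsigned int fltInt32;
    unsigned short fltInt16;

    memcpy(&fltInt32, &f, sizeof fltInt32);  // grab the raw float32 bits

    fltInt16 = (fltInt32 >> 31) << 5;
    unsigned short tmp = (fltInt32 >> 23) & 0xff;
    tmp = (tmp - 0x70) & ((unsigned int)((int)(0x70 - tmp) >> 4) >> 27);
    fltInt16 = (fltInt16 | tmp) << 10;
    fltInt16 |= (fltInt32 >> 13) & 0x3ff;

    return fltInt16;
}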
Limitations of the implementation:
- Overflowing values that cannot be represented in float16 will give undefined values.
- Underflowing values will return an undefined value between 2^-15 and 2^-14 instead of zero.
- Denormals will give undefined values.
Be careful with denormals. If your architecture uses them, they may slow down your program tremendously.
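If you happen to be on x86 with SSE (an assumption about your target; other architectures have their own controls), you can tell the hardware to flush denormals to zero like this:

#include <xmmintrin.h>
#include <pmmintrin.h>

// Assumes an x86 target with SSE/SSE3: flush denormal results to zero and
// treat denormal inputs as zero, avoiding the slow microcoded paths.
void disable_denormals(void)
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}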