Fast multiplication/division by 2 for floats and doubles (C/C++)

This is one of those highly application-specific things. It may help in some cases and not in others. (In the vast majority of cases, a straightforward multiplication is still best.)

The "intuitive" way of doing this is just to extract the bits into a 64-bit integer and add the shift value directly into the exponent. (this will work as long as you don't hit NAN or INF)

So something like this:

#include <stdint.h>

union {
    uint64_t i;
    double   f;
} u;

u.f = 123.;
u.i += 0x0010000000000000ull;   // add 1 to the biased exponent: u.f is now 246

// Check for zero. And if it matters, denormals as well.

Note that this code is not strictly portable (type punning through a union is allowed in C99/C11, but reading the inactive member is undefined behavior in C++), and is shown just to illustrate the idea. Any serious attempt to implement this should be done directly in assembly or with SSE intrinsics.
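For illustration, here's roughly what that looks like with SSE2 intrinsics (a minimal sketch; the function name is mine, and it assumes a normal, nonzero input with no NaN/Inf/denormal checking):

    #include <emmintrin.h>

    // Multiply a double by 2^shift by adding `shift` to the biased
    // exponent, without ever leaving the SSE register file.
    static double mul_pow2(double x, int shift) {
        __m128i bits = _mm_castpd_si128(_mm_set_sd(x));
        __m128i bump = _mm_set_epi64x(0, (unsigned long long)shift << 52);
        return _mm_cvtsd_f64(_mm_castsi128_pd(_mm_add_epi64(bits, bump)));
    }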

However, in most cases the overhead of moving the data from the FP unit to the integer unit (and back) will cost much more than just doing the multiplication outright. This was especially true in the pre-SSE era, where the value needed to be stored from the x87 FPU into memory and then read back into the integer registers.

In the SSE era, integer SSE and FP SSE use the same ISA registers (though they still have separate register files). According to Agner Fog, there's a 1 to 2 cycle penalty for moving data between the integer SSE and FP SSE execution units. So the cost is much better than in the x87 era, but it's still there.

All in all, it will depend on what else you have in your pipeline. But in most cases, multiplying will still be faster. I've run into this exact problem before, so I'm speaking from first-hand experience.

Now, with 256-bit AVX instructions that support only FP operations, there's even less incentive to play tricks like this.


You can pretty safely assume IEEE 754 formatting, the details of which can get pretty gnarly (especially when you get into subnormals). In the common cases, however, this should work:

const int DOUBLE_EXP_SHIFT = 52;
const unsigned long long DOUBLE_MANT_MASK = (1ull << DOUBLE_EXP_SHIFT) - 1ull;
const unsigned long long DOUBLE_EXP_MASK = ((1ull << 63) - 1) & ~DOUBLE_MANT_MASK;

void unsafe_shl(double* d, int shift) {
    unsigned long long* i = (unsigned long long*)d;
    // Normal number: the exponent field is neither all zeros nor all
    // ones, so we can add to it directly.
    if ((*i & DOUBLE_EXP_MASK) && ((*i & DOUBLE_EXP_MASK) != DOUBLE_EXP_MASK)) {
        *i += (unsigned long long)shift << DOUBLE_EXP_SHIFT;
    } else if (*i) {
        // Subnormal (or Inf/NaN): fall back to an ordinary multiply.
        *d *= (1 << shift);
    }
}
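For example, a quick sanity check of both paths (assuming the IEEE 754 layout above; the test values are mine):

    #include <stdio.h>

    int main() {
        double x = 3.0;
        unsafe_shl(&x, 4);        // normal path: exponent bump, 3 * 2^4
        double y = 1e-310;        // subnormal: takes the multiply fallback
        unsafe_shl(&y, 4);
        printf("%g %g\n", x, y);  // prints 48 and 1.6e-309
        return 0;
    }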

EDIT: After doing some timing, this method is oddly slower than the plain multiply on my compiler and machine, even stripped down to the minimum executed code:

    #include <stdio.h>
    #include <time.h>

    int main() {
        double ds[0x1000];
        for (int i = 0; i != 0x1000; i++)
            ds[i] = 1.2;

        clock_t t = clock();

        for (int j = 0; j != 1000000; j++)
            for (int i = 0; i != 0x1000; i++)
    #if DOUBLE_SHIFT
                ds[i] *= 1 << 4;
    #else
                // Add 4 to the exponent field in the high 32 bits (little-endian).
                ((unsigned int*)&ds[i])[1] += 4 << 20;
    #endif

        clock_t e = clock();

        printf("%g\n", (float)(e - t) / CLOCKS_PER_SEC);
        return 0;
    }

With DOUBLE_SHIFT defined, this completes in 1.6 seconds, with an inner loop of:

movupd xmm0,xmmword ptr [ecx]  
lea    ecx,[ecx+10h]  
mulpd  xmm0,xmm1  
movupd xmmword ptr [ecx-10h],xmm0

Versus 2.4 seconds otherwise, with an inner loop of:

add dword ptr [ecx],400000h
lea ecx, [ecx+8]  

Truly unexpected!

EDIT 2: Mystery solved! One of the changes in VC11 is that it now always vectorizes floating-point loops, effectively forcing /arch:SSE2. VC10, even with /arch:SSE2, is still worse, at 3.0 seconds with an inner loop of:

movsd xmm1,mmword ptr [esp+eax*8+38h]  
mulsd xmm1,xmm0  
movsd mmword ptr [esp+eax*8+38h],xmm1  
inc   eax

VC10 without /arch:SSE2 (even with /arch:SSE) takes 5.3 seconds... with 1/100th of the iterations! Inner loop:

fld         qword ptr [esp+eax*8+38h]  
inc         eax  
fmul        st,st(1)  
fstp        qword ptr [esp+eax*8+30h]

I knew the x87 FP stack was awful, but over 300 times worse (5.3 seconds for 1/100th of the work, versus 1.6) is kinda ridiculous. You probably won't see these kinds of speedups converting, e.g., matrix ops to SSE or int hacks, since this is the worst case: loading into the FP stack, doing one op, and storing from it. But it's a good example of why x87 is not the way to go for anything performance-related.


How about ldexp?

Any half-decent compiler will generate optimal code on your platform.
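For reference, this is all it takes (ldexp is the standard <math.h>/<cmath> function; the values here are just for illustration):

    #include <math.h>
    #include <stdio.h>

    int main() {
        double x = 1.5;
        // ldexp(x, p) computes x * 2^p, with correct handling of
        // overflow, underflow, and subnormals.
        printf("%g\n", ldexp(x, 3));   // prints 12
        return 0;
    }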

But as @Clinton points out, simply writing it in the "obvious" way should do just as well. Multiplying and dividing by powers of two is child's play for a modern compiler.

Directly munging the floating point representation, besides being non-portable, will almost certainly be no faster (and might well be slower).

And of course, you should not waste time even thinking about this question unless your profiling tool tells you to. But the kind of people who listen to this advice will never need it, and the ones who need it will never listen.

[update]

OK, so I just tried ldexp with g++ 4.5.2. The cmath header inlines it as a call to __builtin_ldexp, which in turn...

...emits a call to the libm ldexp function. I would have thought this builtin would be trivial to optimize, but I guess the GCC developers never got around to it.
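If you want to check what your own toolchain does, a minimal test (the file name is mine; -S just stops after emitting assembly) looks like:

    /* ldexp_test.c */
    #include <math.h>
    double f(double x) { return ldexp(x, 4); }

    /* Compile with:  gcc -O2 -S ldexp_test.c
       then look for "ldexp" in ldexp_test.s to see whether it stayed
       a call or was inlined. */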

So, multiplying by 1 << p is probably your best bet, as you have discovered.


The fastest way to do this is probably:

x *= (1 << p);

This sort of thing may simply compile down to a machine instruction that adds p to the exponent. Telling the compiler to instead extract some bits with a mask and fiddle with them manually will probably make things slower, not faster.

Remember, C/C++ is not assembly language. Using a bitshift operator does not necessarily compile to a bitshift instruction, nor does using multiplication necessarily compile to a multiply. There are all sorts of weird and wonderful things going on, like which registers are being used and which instructions can run simultaneously, which I'm not smart enough to understand. But your compiler, with many man-years of knowledge and experience and lots of computational power, is much better at making these judgements.

p.s. Keep in mind, if your doubles are in an array or some other flat data structure, your compiler might be really smart and use SSE to multiply two, or even four, doubles at the same time. However, doing a lot of bit shifting is probably going to confuse your compiler and prevent this optimisation.
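To make that concrete, here's a sketch of the kind of loop an auto-vectorizer handles well (the function name is mine):

    // Multiplying a flat array by 2^p: simple, aliasing-free code like
    // this is what a compiler can turn into mulpd over 2 (SSE) or
    // 4 (AVX) doubles per iteration.
    void scale_by_pow2(double* a, int n, int p) {
        const double f = (double)(1 << p);
        for (int i = 0; i < n; i++)
            a[i] *= f;
    }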