Efficiently implementing floored / euclidean integer division
Just a note: the x86 sar
instruction performs floored division when it comes to powers of two.
It could be more efficient to come up with something branch free to correct the result based on the sign, as branches are expensive.
See page 20ff of Chapter 2 in Hacker's Delight on how to access the sign.
I've written a test program to benchmark the ideas presented here:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <windows.h>
#define N 10000000
#define M 100
int dividends[N], divisors[N], results[N];
__forceinline int floordiv_signcheck(int a, int b)
{
return (a<0 ? a-(b-1) : a) / b;
}
__forceinline int floordiv_signcheck2(int a, int b)
{
return (a - (a<0 ? b-1 : 0)) / b;
}
__forceinline int floordiv_signmultiply(int a, int b)
{
return (a + (a>>(sizeof(a)*8-1))*(b-1)) / b;
}
__forceinline int floordiv_floatingpoint(int a, int b)
{
// I imagine that the call to floor can be replaced to a cast
// if you can get FPU rounding control to work (I couldn't).
return floor((double)a / b);
}
void main()
{
for (int i=0; i<N; i++)
{
dividends[i] = rand();
do
divisors[i] = rand();
while (divisors[i]==0);
}
LARGE_INTEGER t0, t1;
QueryPerformanceCounter(&t0);
for (int j=0; j<M; j++)
for (int i=0; i<N; i++)
results[i] = floordiv_signcheck(dividends[i], divisors[i]);
QueryPerformanceCounter(&t1);
printf("signcheck : %9llu\n", t1.QuadPart-t0.QuadPart);
QueryPerformanceCounter(&t0);
for (int j=0; j<M; j++)
for (int i=0; i<N; i++)
results[i] = floordiv_signcheck2(dividends[i], divisors[i]);
QueryPerformanceCounter(&t1);
printf("signcheck2 : %9llu\n", t1.QuadPart-t0.QuadPart);
QueryPerformanceCounter(&t0);
for (int j=0; j<M; j++)
for (int i=0; i<N; i++)
results[i] = floordiv_signmultiply(dividends[i], divisors[i]);
QueryPerformanceCounter(&t1);
printf("signmultiply : %9llu\n", t1.QuadPart-t0.QuadPart);
QueryPerformanceCounter(&t0);
for (int j=0; j<M; j++)
for (int i=0; i<N; i++)
results[i] = floordiv_floatingpoint(dividends[i], divisors[i]);
QueryPerformanceCounter(&t1);
printf("floatingpoint: %9llu\n", t1.QuadPart-t0.QuadPart);
}
Results:
signcheck : 61458768
signcheck2 : 61284370
signmultiply : 61625076
floatingpoint: 287315364
So, according to my results, checking the sign is the fastest:
(a - (a<0 ? b-1 : 0)) / b
I'm revisiting this question five years later, as this is relevant for me too. I did some performance measurements on two pure-C versions and two inline-assembly versions for x86-64, and the results may be interesting.
The tested variants of floored division are:
- The implementation I've been using for some time now;
- The slight variant on that presented above which only uses one division;
- The previous one, but hand-implemented in inline-assembly; and
- A
CMOV
version implemented in assembly.
The following is my benchmark program:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#ifndef VARIANT
#define VARIANT 3
#endif
#if VARIANT == 0
#define floordiv(a, b) (((a) < 0)?((((a) + 1) / (b)) - 1):((a) / (b)))
#elif VARIANT == 1
#define floordiv(a, b) ((((a) < 0)?((a) - ((b) - 1)):(a)) / (b))
#elif VARIANT == 2
#define floordiv(a, b) ({ \
int result; \
asm("test %%eax, %%eax; jns 1f; sub %1, %%eax;" \
"add $1, %%eax; 1: cltd; idivl %1;" \
: "=a" (result) \
: "r" (b), \
"0" (a) \
: "rdx"); \
result;})
#elif VARIANT == 3
#define floordiv(a, b) ({ \
int result; \
asm("mov %%eax, %%edx; sub %1, %%edx; add $1, %%edx;" \
"test %%eax, %%eax; cmovs %%edx, %%eax; cltd;" \
"idivl %1;" \
: "=a" (result) \
: "r" (b), \
"0" (a) \
: "rdx"); \
result;})
#endif
double ntime(void)
{
struct timeval tv;
gettimeofday(&tv, NULL);
return(tv.tv_sec + (((double)tv.tv_usec) / 1000000.0));
}
void timediv(int n, int *p, int *q, int *r)
{
int i;
for(i = 0; i < n; i++)
r[i] = floordiv(p[i], q[i]);
}
int main(int argc, char **argv)
{
int n, i, *q, *p, *r;
double st;
n = 10000000;
p = malloc(sizeof(*p) * n);
q = malloc(sizeof(*q) * n);
r = malloc(sizeof(*r) * n);
for(i = 0; i < n; i++) {
p[i] = (rand() % 1000000) - 500000;
q[i] = (rand() % 1000000) + 1;
}
st = ntime();
for(i = 0; i < 100; i++)
timediv(n, p, q, r);
printf("%g\n", ntime() - st);
return(0);
}
I compiled this with gcc -march=native -Ofast
using GCC 4.9.2, and the results, on my Core i5-2400, were as follows. The results are fairly reproducible from run to run -- they always land in the same order, at least.
- Variant 0: 7.21 seconds
- Variant 1: 7.26 seconds
- Variant 2: 6.73 seconds
- Variant 3: 4.32 seconds
So the CMOV
implementation blows the others out of the water, at least. What surprises me is that variant 2 out-does its pure-C version (variant 1) by a fairly wide margin. I'd have thought the compiler should be able to emit code at least as efficient as mine.
Here are some other platforms, for comparison:
AMD Athlon 64 X2 4200+, GCC 4.7.2:
- Variant 0: 26.33 seconds
- Variant 1: 25.38 seconds
- Variant 2: 25.19 seconds
- Variant 3: 22.39 seconds
Xeon E3-1271 v3, GCC 4.9.2:
- Variant 0: 5.95 seconds
- Variant 1: 5.62 seconds
- Variant 2: 5.40 seconds
- Variant 3: 3.44 seconds
As a final note, I should perhaps warn against taking the apparent performance advantage of the CMOV
version too seriously, because in the real world, the branch in the other versions will probably not be as completely random as in this benchmark, and if the branch predictor can do a reasonable job, the branching versions may turn out to be better. However, the realities of that will depend quite a bit on the data that are being used in practice, and so is probably pointless to try and do any generic benchmark of.