How is `min` of two integers just as fast as 'bit hacking'?
Well, the bit hacking trick might have been faster in the 90s, but it is slower on current machines by a factor of two. Compare for yourself:
// gcc -Wall -Wextra -std=c11 ./min.c -D_POSIX_SOURCE -Os
// ./a.out 42
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define COUNT (1 << 28)
static int array[COUNT];
int main(int argc, char **argv) {
(void) argc;
unsigned seed = atoi(argv[1]);
for (unsigned i = 0; i < COUNT; ++i) {
array[i] = rand_r(&seed);
}
clock_t begin = clock();
int x = array[0];
for (unsigned i = 1; i < COUNT; ++i) {
int y = array[i];
#if 1
x = x ^ ((y ^ x) & -(x > y));
# else
if (y < x) {
x = y;
}
#endif
}
clock_t end = clock();
double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
printf("Minimum: %d (%.3f seconds)\n", x, time_spent);
return 0;
}
On an average 0.277 seconds in the "naïve" implementation, but 0.442 seconds for the "optimized" implementation. Always have a grain of doubt in CS classes. At least since the CMOVxx instruction (added with Pentium Pro in 1995) there is no chance that the bit hacking solution could have been faster.
On an i5-750 (gcc (Debian 5.2.1-23) 5.2.1 20151028):
optimized naïve
O0 1.367 0.781
O1 0.530 0.274
O2 0.444 0.271
O3 0.442 0.144
Os 0.446 0.273
Afterthought: Compiler developers are very smart people, who spend their working days finding and implementing optimizations. If the bit hacking trick were faster, then your compiler would implement min()
this way. And you can safely assume that the compiler understands what you are doing inside the loop. But the people working for Intel, AMD and so on are smart, too, so they will optimize important operations such as min()
and max()
if they see that the compiler hackers do weird hacks because the obvious solution is slow.
For the extra-curious: This is the generated code for the "optimized" implementation with -O3:
mov $0x40600b00, %ebp # int *e = &array[COUNT];
mov 0x600b00, %ebx # int x = array[0];
mov $0x600b04, %edx # int *i = &array[1];
loop:
mov (%rdx), %eax # int y = *i;
xor %ecx, %ecx # int tmp = (
cmp %ebx, %eax # y < x
setl %cl # ? 1 : 0 );
xor %ebx, %eax # y ^= x;
add $0x4, %rdx # ++i;
neg %ecx # tmp = -tmp;
and %ecx, %eax # y &= tmp;
xor %eax, %ebx # x ^= y;
cmp %rdx, %rbp # if (i != e) {
jne loop # goto loop; }
And the naïve implementation with -Os (-O3 is huge and full of SSE instructions I would have to look up):
mov 600ac0, %ebx # int x = array[0];
mov $0x40600abc,%ecx # int *e = &array[COUNT];
mov $0x600ac0,%eax # int *i = &array[0];
loop:
mov 0x4(%rax),%edx # int y = *(i + 1);
cmp %edx,%ebx # if (x > y) {
cmovg %edx,%ebx # x = y; }
add $0x4,%rax # ++i;
cmp %rcx,%rax # if (i != e) {
jne loop # goto loop; }
This is likely due to how the min
function is implemented in python.
Many python builtins are actually implemented in low level languages such as C or assembly and use the python apis in order to be callable in python.
Your bit fiddling technique is likely very fast in C but in python the interpretation overhead of the statement will far exceed the overhead of calling even a complex function implemented in a low level language.
If you really want a fair test compare a C program, or C python extension implementing that technique to your python call of min
and see how it compares, I expect that will explain the result you see.
EDIT:
Thanks to @Two-BitAlchemist I can now give some more details onto additional reasons this bit twiddling will not work well in python. It appears that integers are not stored in the obvious way but are actually a fairly complex expanding object designed to store potentially very large numbers.
Some details on this are findable here (Thanks to Two-BitAlchemist) though it appears this is changed somewhat in newer python versions. Still the point remains that we are most certainly not manipulation a simple set of bits when we touch an integer in python, but a complex object where the bit manipulations are in fact virtual method calls with enormous overhead (compared to what they do).
Lets do a slightly deeper dive here to find out the real reason behind this weirdness (if any).
Lets create 3 methods and look at their python bytecode and runtimes...
import dis
def func1(x, y):
return min(x, y)
def func2(x, y):
if x < y:
return x
return y
def func3(x, y):
return x ^ ((y ^ x) & -(x > y))
print "*" * 80
dis.dis(func1)
print "*" * 80
dis.dis(func2)
print "*" * 80
dis.dis(func3)
The output from this program is...
*****************************************************
4 0 LOAD_GLOBAL 0 (min)
3 LOAD_FAST 0 (x)
6 LOAD_FAST 1 (y)
9 CALL_FUNCTION 2
12 RETURN_VALUE
*****************************************************
7 0 LOAD_FAST 0 (x)
3 LOAD_FAST 1 (y)
6 COMPARE_OP 0 (<)
9 POP_JUMP_IF_FALSE 16
8 12 LOAD_FAST 0 (x)
15 RETURN_VALUE
9 >> 16 LOAD_FAST 1 (y)
19 RETURN_VALUE
*****************************************************
12 0 LOAD_FAST 0 (x)
3 LOAD_FAST 1 (y)
6 LOAD_FAST 0 (x)
9 BINARY_XOR
10 LOAD_FAST 0 (x)
13 LOAD_FAST 1 (y)
16 COMPARE_OP 4 (>)
19 UNARY_NEGATIVE
20 BINARY_AND
21 BINARY_XOR
22 RETURN_VALUE
Here are the running times of each of these functions
%timeit func1(4343,434234)
1000000 loops, best of 3: 282 ns per loop
%timeit func2(23432, 3243424)
10000000 loops, best of 3: 137 ns per loop
%timeit func3(928473, 943294)
1000000 loops, best of 3: 246 ns per loop
func2 is the fastest because it has the least amount of work to do in the python interpreter. How?. Looking at the bytecode for func2, we see that in either case of x > y
or x < y
, the python interpreter will execute 6 instructions.
func3 will execute 11 instructions (and is thus almost twice as slow as func2... in fact, its extremely close to 137.0 * 11 / 6 = 251 ns).
func1 has just 5 python instructions, and by the logic in the previous 2 points, we might think that func1 should probably be the fastest. However, there is a CALL_FUNCTION
in there... and function calls have a lot of overhead in Python (because it creates a new eval frame for the function call - that's the thing that we see in the python stacktrace - a stack of eval frames).
More details : Because python is interpreted, each python bytecode instruction takes a lot longer than a single C/asm statement. In fact, you can take a look at the python interpreter source code to see that each instruction has an overhead of 30 or so C statements (this is from a very rough look at ceval.c python main interpreter loop). The for (;;)
loop executes one python instruction per loop cycle (ignoring optimizations).
https://github.com/python/cpython/blob/master/Python/ceval.c#L1221
So, with so much overhead for each instruction, there is no point in comparing 2 tiny C code snippets in python. One will take 34 and the other will take 32 cpu cycles, because the python interpreter adds 30 cycles overhead for each instruction.
In OP's C module, if we loop inside the C function to do the comparison a million times, that loop will not have the python interpreter's overhead for each instruction. It will probably run 30 to 40 times faster.
Tips for python optimization...
Profile your code to find hotspots, refactor hot code into its own function (write tests for hotspot before that to make sure refactor does not break stuff), avoid function calls from the hot code (inline functions if possible), use the dis
module on new function to find ways to reduce the number of python instructions (if x
is faster than if x is True
... surprised?), and lastly modify your algorithm. Finally, if python speedup is not enough, reimplement your new function in C.
ps : The explanation above is simplified to keep the answer within reasonable size. For example, not all python instructions take the same amount of time, and there are optimizations, so not every instruction has the same overhead... and lot more things. Please ignore such omissions for the sake of brevity.