why is c++ std::max_element so slow?
Before voting on this answer, please test (and verify) this on your machine and comment/add the results. Note that I used a vector size of 1000*1000*1000 for my tests. Currently, this answer has 19 upvotes but only one posted results, and these results did not show the effect described below (though obtained with a different test code, see comments).
There seems to be an optimizer bug/artifact. Compare the times of:
template<typename _ForwardIterator, typename _Compare>
_ForwardIterator
my_max_element_orig(_ForwardIterator __first, _ForwardIterator __last,
_Compare __comp)
{
if (__first == __last) return __first;
_ForwardIterator __result = __first;
while(++__first != __last)
if (__comp(__result, __first))
__result = __first;
return __result;
}
template<typename _ForwardIterator, typename _Compare>
_ForwardIterator
my_max_element_changed(_ForwardIterator __first, _ForwardIterator __last,
_Compare __comp)
{
if (__first == __last) return __first;
_ForwardIterator __result = __first;
++__first;
for(; __first != __last; ++__first)
if (__comp(__result, __first))
__result = __first;
return __result;
}
The first is the original libstdc++ implementation, the second one should be a transformation without any changes in behaviour or requirements. Clang++ produces very similar run times for those two functions, whereas g++4.8.2 is four times faster with the second version.
Following Maxim's proposal, changing the vector from int
to int64_t
, the changed version is not 4, but only 1.7 times faster than the original version (g++4.8.2).
The difference is in predictive commoning of *result
, that is, storing the value of the current max element so that it does not have to be reloaded from memory each time. This gives a far cleaner cache access pattern:
w/o commoning with commoning
* *
** *
** *
** *
* * *
* * *
* * *
Here's the asm for comparison (rdi
/rsi
contain the first/last iterators respectively):
With the while loop (2.88743 ms; gist):
movq %rdi, %rax
jmp .L49
.L51:
movl (%rdi), %edx
cmpl %edx, (%rax)
cmovl %rdi, %rax
.L49:
addq $4, %rdi
cmpq %rsi, %rdi
jne .L51
With the for loop (1235.55 μs):
leaq 4(%rdi), %rdx
movq %rdi, %rax
cmpq %rsi, %rdx
je .L53
movl (%rdi), %ecx
.L54:
movl (%rdx), %r8d
cmpl %r8d, %ecx
cmovl %rdx, %rax
cmovl %r8d, %ecx
addq $4, %rdx
cmpq %rdx, %rsi
jne .L54
.L53:
If I force commoning by explicitly storing *result
into a variable prev
at the start and whenever result
is updated, and using prev
instead of *result
in the comparison, I get an even faster loop (377.601 μs):
movl (%rdi), %ecx
movq %rdi, %rax
.L57:
addq $4, %rdi
cmpq %rsi, %rdi
je .L60
.L59:
movl (%rdi), %edx
cmpl %edx, %ecx
jge .L57
movq %rdi, %rax
addq $4, %rdi
movl %edx, %ecx
cmpq %rsi, %rdi
jne .L59
.L60:
The reason this is faster than the for
loop is that the conditional moves (cmovl
) in the above are a pessimisation as they are executed so rarely (Linus says that cmov is only a good idea if the branch is unpredictable). Note that for randomly distributed data the branch is expected to be taken Hn times, which is a negligible proportion (Hn grows logarithmically, so Hn/n rapidly approaches 0). The conditional-move code will only be better on pathological data e.g. [1, 0, 3, 2, 5, 4, ...].
You are probably running your test in 64-bit mode, where sizeof(int) == 4
, but sizeof(std::vector<>::iterator) == 8
, so that assignment in the loop to int
(what my_max_element
does) is faster than to std::vector<>::iterator
(this is what std::max_element
does).
If you change std::vector<int>
to std::vector<long>
results change in favour to std::max_element
:
MaxIter = 1000000012
MaxArray = 1000000012
Total CPU time iterator = 0.00429082
Total CPU time array = 0.00572205
iter/array ratio: = 0.749875
One important note: when benchmarking disable the CPU frequency scaling, so that the CPU does not switch gears in the middle of the benchmark.
But I think something else is at play here, since just changing the loop variable from int
to long
does not change the results...
It's a simple issue of cache. To wit, the first time you load memory, in this case the contents of the vector, it's always considerably slower than if it's been recently accessed. I copied and pasted your code with GCC 4.9.
When the functions are reversed, the ratio is 1. When they're in the original order, the ratio is 1.6.
This still seems like a fundamental misoptimization by GCC in the case of max_element to me. However, your function times are so low, they will be dominated by CPU noise like the above cache effects, instead of any meaningful comparison.
Reversed, Original