C++ performance std::array vs std::vector
GCC (and probably Clang) are optimizing out the Arrays, but not the Vectors
Your base assumption that arrays are necessarily slower than vectors is incorrect. Because vectors require their data to be stored in allocated memory (which with a default allocator uses dynamic memory), the values that need to be used have to be stored in heap memory and accessed repeatedly during the execution of this program. Conversely, the values used by the array can be optimized out entirely and simply directly referenced in the assembly of the program.
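For reference, the setup under discussion looks roughly like this. This is a reconstruction, not the exact code from the question: comb is assumed to be a plain addition, the values of v2/a2 are assumed to be all 1.0, and N is assumed to be 100, based on the constants and the 784-byte loop bound in the assembly below.

#include <array>
#include <vector>
#include <cstddef>

constexpr std::size_t N = 100;          // assumed from 784 = 8*N - 16 in the asm

using vec = std::vector<double>;
using arr = std::array<double, 3>;

std::vector<double> glob(N, 0.0);       // dynamically allocated storage

const vec v1 {1.0, -1.0, 1.0};
const vec v2 {1.0,  1.0, 1.0};          // assumed values
const arr a1 {1.0, -1.0, 1.0};
const arr a2 {1.0,  1.0, 1.0};          // assumed values

double comb(double x, double y) { return x + y; }   // assumed definition

void assemble_vec()
{
    for (std::size_t i = 0; i < N - 2; ++i)
    {
        glob[i]   += comb(v1[0], v2[0]);
        glob[i+1] += comb(v1[1], v2[1]);
        glob[i+2] += comb(v1[2], v2[2]);
    }
}

void assemble_arr()
{
    for (std::size_t i = 0; i < N - 2; ++i)
    {
        glob[i]   += comb(a1[0], a2[0]);
        glob[i+1] += comb(a1[1], a2[1]);
        glob[i+2] += comb(a1[2], a2[2]);
    }
}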
Below is what GCC spit out as assembly for the assemble_vec and assemble_arr functions once optimizations were turned on:
[-snip-]
//==============
//Vector Version
//==============
assemble_vec():
mov rax, QWORD PTR glob[rip]
mov rcx, QWORD PTR v2[rip]
mov rdx, QWORD PTR v1[rip]
movsd xmm1, QWORD PTR [rax+8]
movsd xmm0, QWORD PTR [rax]
lea rsi, [rax+784]
.L23:
movsd xmm2, QWORD PTR [rcx]
addsd xmm2, QWORD PTR [rdx]
add rax, 8
addsd xmm0, xmm2
movsd QWORD PTR [rax-8], xmm0
movsd xmm0, QWORD PTR [rcx+8]
addsd xmm0, QWORD PTR [rdx+8]
addsd xmm0, xmm1
movsd QWORD PTR [rax], xmm0
movsd xmm1, QWORD PTR [rcx+16]
addsd xmm1, QWORD PTR [rdx+16]
addsd xmm1, QWORD PTR [rax+8]
movsd QWORD PTR [rax+8], xmm1
cmp rax, rsi
jne .L23
ret
//=============
//Array Version
//=============
assemble_arr():
mov rax, QWORD PTR glob[rip]
movsd xmm2, QWORD PTR .LC1[rip]
movsd xmm3, QWORD PTR .LC2[rip]
movsd xmm1, QWORD PTR [rax+8]
movsd xmm0, QWORD PTR [rax]
lea rdx, [rax+784]
.L26:
addsd xmm1, xmm3
addsd xmm0, xmm2
add rax, 8
movsd QWORD PTR [rax-8], xmm0
movapd xmm0, xmm1
movsd QWORD PTR [rax], xmm1
movsd xmm1, QWORD PTR [rax+8]
addsd xmm1, xmm2
movsd QWORD PTR [rax+8], xmm1
cmp rax, rdx
jne .L26
ret
[-snip-]
There are several differences between these sections of code, but the critical difference is after the .L23 and .L26 labels respectively: in the vector version, the operand values are reloaded from v1 and v2 and re-added on every iteration, whereas the array version adds constants (.LC1 and .LC2) that were loaded into registers once before the loop. The vector version therefore performs several more memory lookups per iteration than the array version. These factors in combination are going to result in code that executes faster for the std::array version of the code than it will for the std::vector version.
C++ aliasing rules don't let the compiler prove that glob[i] += stuff doesn't modify one of the elements of const vec v1 {1.0,-1.0,1.0}; or v2.
const on a std::vector means the "control block" pointers can be assumed not to be modified after it's constructed, but the memory is still dynamically allocated, and all the compiler knows is that it effectively has a const double * in static storage.
Nothing in the std::vector implementation lets the compiler rule out some other non-const pointer pointing into that storage, for example the double *data in the control block of glob.
C++ doesn't provide a way for library implementers to give the compiler the information that the storage for different std::vectors doesn't overlap. They can't use __restrict (even on compilers that support that extension) because that could break programs that take the address of a vector element. See the C99 documentation for restrict.
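To illustrate the kind of program that rules __restrict out (a hypothetical sketch, not code from the question): taking the address of a vector element and writing through it is perfectly legal, so the library can't promise the compiler that a vector's storage is only ever reached through its own internal pointer.

#include <vector>

// Hypothetical example: writing into a vector's heap storage through an
// unrelated double* is well-defined, so std::vector's internals can't be
// __restrict-qualified without breaking programs like this one.
void add_to(double *p, double x) { *p += x; }

int main()
{
    std::vector<double> v {1.0, 2.0, 3.0};
    add_to(&v[1], 10.0);           // address of a vector element escapes here
    return v[1] == 12.0 ? 0 : 1;   // the write through p must be visible via v
}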
But with const arr a1 {1.0,-1.0,1.0}; and a2, the doubles themselves can go in read-only static storage, and the compiler knows this. Therefore it can evaluate comb(a1[0],a2[0]) and so on at compile time. In @Xirema's answer, you can see the asm output loads constants .LC1 and .LC2. (Only two constants because both a1[0]+a2[0] and a1[2]+a2[2] are 1.0+1.0. The loop body uses xmm2 as a source operand for addsd twice, and the other constant once.)
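A self-contained snippet (separate from the reconstruction above, and using constexpr rather than const so the folding is directly checkable) showing that std::array element values are visible to the compiler as constants:

#include <array>

// Standalone illustration: the element values are part of the array object,
// so with static storage (and constexpr here) the sums fold at compile time.
constexpr std::array<double, 3> a1 {1.0, -1.0, 1.0};
constexpr std::array<double, 3> a2 {1.0,  1.0, 1.0};          // assumed values

constexpr double comb(double x, double y) { return x + y; }   // assumed definition

static_assert(comb(a1[0], a2[0]) == 2.0, "evaluated entirely at compile time");
static_assert(comb(a1[2], a2[2]) == 2.0, "same value, so one constant can be reused");
static_assert(comb(a1[1], a2[1]) == 0.0, "the other constant");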
But couldn't the compiler still do the sums once outside the loop at runtime? No, again because of potential aliasing. It doesn't know that stores into glob[i+0..3] won't modify the contents of v1[0..2], so it reloads from v1 and v2 every time through the loop after the store into glob.
(It doesn't have to reload the vector<> control block pointers, though, because type-based strict aliasing rules let it assume that storing a double doesn't modify a double*.)
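One way to see that the reloads are purely an aliasing problem is to hoist the sums by hand. A sketch, reusing the assumed declarations from the reconstruction above: local doubles whose address never escapes can't alias glob's heap storage, so the compiler is free to keep them in registers across the stores, much like the array version.

// Sketch: manually hoisting the loop-invariant comb() results into locals.
void assemble_vec_hoisted()
{
    const double c0 = comb(v1[0], v2[0]);
    const double c1 = comb(v1[1], v2[1]);
    const double c2 = comb(v1[2], v2[2]);
    for (std::size_t i = 0; i < N - 2; ++i)
    {
        glob[i]   += c0;   // the stores can no longer force a reload of v1/v2
        glob[i+1] += c1;
        glob[i+2] += c2;
    }
}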
The compiler could have checked that glob.data() + 0 .. N-3 didn't overlap with either of v1/v2.data() + 0 .. 2, and made a different version of the loop for that case, hoisting the three comb() results out of the loop.
This is a useful optimization that some compilers do when auto-vectorizing if they can't prove lack of aliasing; it's clearly a missed optimization in your case that gcc doesn't check for overlap because it would make the function run much faster. But the question is whether the compiler could reasonably guess that it was worth emitting asm that checks at runtime for overlap, and has 2 different versions of the same loop. With profile-guided optimization, it would know the loop is hot (runs many iterations), and would be worth spending extra time on. But without that, the compiler might not want to risk bloating the code too much.
ICC19 (Intel's compiler) in fact does do something like that here, but it's weird: if you look at the beginning of assemble_vec (on the Godbolt compiler explorer), it loads the data pointer from glob, then adds 8 and subtracts the pointer again, producing a constant 8. Then it branches at runtime on 8 > 784 (not taken) and then -8 < 784 (taken). It looks like this was supposed to be an overlap check, but maybe it used the same pointer twice instead of v1 and v2? (784 = 8*100 - 16 = sizeof(double)*N - 16)
Anyway, it ends up running the ..B2.19 loop that hoists all 3 comb() calculations, and interestingly does 2 iterations of the loop at once, with 4 scalar loads and stores to glob[i+0..4] and 6 addsd (scalar double) add instructions.
Elsewhere in the function body, there's a vectorized version that uses 3x addpd (packed double), just storing / reloading 128-bit vectors that partially overlap. This will cause store-forwarding stalls, but out-of-order execution may be able to hide that. It's just really weird that it branches at runtime on a calculation that will produce the same result every time, and never uses that loop. Smells like a bug.
If glob[] had been a static array, you'd still have had a problem, because the compiler can't know that v1/v2.data() aren't pointing into that static array.
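(A hypothetical sketch of that variant, again reusing the assumed declarations from the reconstruction above; only glob's storage changes, and the aliasing problem remains.)

// Sketch: glob as a static array instead of a std::vector. v1.data() and
// v2.data() are still opaque pointers as far as the compiler can tell, so it
// must still assume the stores below might modify v1/v2 and reload them.
static double glob_static[N];

void assemble_vec_static_glob()
{
    for (std::size_t i = 0; i < N - 2; ++i)
    {
        glob_static[i]   += comb(v1[0], v2[0]);
        glob_static[i+1] += comb(v1[1], v2[1]);
        glob_static[i+2] += comb(v1[2], v2[2]);
    }
}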
I thought if you accessed it through double *__restrict g = &glob[0];, there wouldn't have been a problem at all. That will promise the compiler that g[i] += ... won't affect any values that you access through other pointers, like v1[0].
In practice, that does not enable hoisting of comb() for gcc, clang, or ICC -O3. But it does for MSVC. (I've read that MSVC doesn't do type-based strict aliasing optimizations, but it's not reloading glob.data() inside the loop, so it has somehow figured out that storing a double won't modify a pointer. But MSVC does define the behaviour of *(int*)my_float for type-punning, unlike other C++ implementations.)
For testing, I put this on Godbolt
//__attribute__((noinline))
void assemble_vec()
{
    double *__restrict g = &glob[0];   // Helps MSVC, but not gcc/clang/ICC
    // std::vector<double> &g = glob;  // actually hurts ICC it seems?
    // #define g glob                  // so use this as the alternative to __restrict

    for (size_t i=0; i<N-2; ++i)
    {
        g[i] += comb(v1[0],v2[0]);
        g[i+1] += comb(v1[1],v2[1]);
        g[i+2] += comb(v1[2],v2[2]);
    }
}
We get this from MSVC outside the loop
movsd xmm2, QWORD PTR [rcx] # v2[0]
movsd xmm3, QWORD PTR [rcx+8]
movsd xmm4, QWORD PTR [rcx+16]
addsd xmm2, QWORD PTR [rax] # += v1[0]
addsd xmm3, QWORD PTR [rax+8]
addsd xmm4, QWORD PTR [rax+16]
mov eax, 98 ; 00000062H
Then we get an efficient-looking loop.
So this is a missed optimization for gcc/clang/ICC.