Aligned and unaligned memory access with AVX/AVX2 intrinsics
There is no way to explicitly control folding of loads with intrinsics. I consider this a weakness of intrinsics. If you want to explicitly control the folding then you have to use assembly.
In previous version of GCC I was able to control the folding to some degree using an aligned or unaligned load. However, that no longer appears to be the case (GCC 4.9.2). I mean for example in the function AddDot4x4_vec_block_8wide
here the loads are folded
vmulps ymm9, ymm0, YMMWORD PTR [rax-256]
vaddps ymm8, ymm9, ymm8
However in a previous verison of GCC the loads were not folded:
vmovups ymm9, YMMWORD PTR [rax-256]
vmulps ymm9, ymm0, ymm9
vaddps ymm8, ymm8, ymm9
The correct solution is, obviously, to only used aligned loads when you know the data is aligned and if you really want to explicitly control the folding use assembly.
In addition to Z boson's answer I can tell that the problem can be caused by that the compiler assumes the memory region is aligned (because of __attribute__ ((aligned(32)))
marking the array). In runtime that attribute may not work for values on the stack because the stack is only 16-byte aligned (see this bug, which is still open at the time of this writing, though some fix have made it into gcc 4.6). The compiler is in its rights to choose the instructions to implement intrinsics, so it may or may not fold the memory load into the computational instruction, and it is also in its rights to use vmovaps
when the folding does not occur (because, as noted before, the memory region is supposed to be aligned).
You can try forcing the compiler to realign the stack to 32 bytes upon entry in main
by specifying -mstackrealign
and -mpreferred-stack-boundary=5
(see here) but it will incur a performance overhead.