How to instruct compiler to generate unaligned loads for __m128
In my opinion you should write your data structures using standard C++ constructs (which __m128i is not). When you want to use intrinsics, which are not standard C++, you "enter SSE world" through an intrinsic such as _mm_loadu_ps and you "leave SSE world" back to standard C++ with an intrinsic such as _mm_storeu_ps. Don't rely on implicit SSE loads and stores. I have seen too many mistakes on SO from doing this.
In this case you should use
struct Foobar {
float a[4];
float b[4];
int c;
};
then you can do
Foobar foo[16];
In this case foo[1] won't be 16-byte aligned, but when you want to use SSE and leave standard C++, do
__m128 a4 = _mm_loadu_ps(foo[1].a);
__m128 b4 = _mm_loadu_ps(foo[1].b);
__m128 max = _mm_max_ps(a4,b4);
_mm_storeu_ps(array, max); // array: any destination with room for 4 floats
then go back to standard C++.
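As a minimal sketch of that pattern, the snippet above can be wrapped in a small function. The name max_ab and the out parameter are hypothetical, chosen only for illustration; the intrinsics are the same ones used above.

```cpp
#include <xmmintrin.h>  // SSE: _mm_loadu_ps, _mm_max_ps, _mm_storeu_ps

struct Foobar {
    float a[4];
    float b[4];
    int c;
};

// Hypothetical helper (not from the question): writes max(f.a[i], f.b[i])
// for i = 0..3 into out[0..3]. Neither f nor out has to be 16-byte aligned,
// because only unaligned loads and stores are used.
inline void max_ab(Foobar const &f, float *out)
{
    __m128 a4 = _mm_loadu_ps(f.a);   // enter "SSE world"
    __m128 b4 = _mm_loadu_ps(f.b);
    __m128 m  = _mm_max_ps(a4, b4);  // element-wise maximum
    _mm_storeu_ps(out, m);           // leave "SSE world"
}
```

With Foobar foo[16], a call such as max_ab(foo[1], result) works even though foo[1] starts 36 bytes into the array (assuming 4-byte float and int).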
Another thing you can consider is this
struct Foobar {
float a[16];
float b[16];
int c[4];
};
then to get an array of 16 of the original struct do
Foobar foo[4];
In this case, as long as the first element is 16-byte aligned, so are all the other elements (with 4-byte float and int, sizeof(Foobar) is 144, a multiple of 16).
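A sketch of how this layout could be checked and used with aligned loads follows. The alignas(16) on the struct, the static_asserts, and the helper max_group are my additions for illustration and are not part of the original struct; they assume 4-byte float and int.

```cpp
#include <cstddef>      // offsetof
#include <xmmintrin.h>  // SSE: _mm_load_ps, _mm_max_ps, _mm_store_ps

// alignas(16) is an assumption added here so that the first element of any
// array of Foobar is guaranteed to be 16-byte aligned.
struct alignas(16) Foobar {
    float a[16];
    float b[16];
    int c[4];
};

// If these hold, every Foobar in an array and every group of four floats
// inside a and b starts on a 16-byte boundary.
static_assert(sizeof(Foobar) % 16 == 0, "array elements stay 16-byte aligned");
static_assert(offsetof(Foobar, b) % 16 == 0, "b starts on a 16-byte boundary");

// Hypothetical helper: max of a and b for the k-th group of four floats
// (k = 0..3). out must be 16-byte aligned because aligned stores are used.
inline void max_group(Foobar const &f, int k, float *out)
{
    __m128 a4 = _mm_load_ps(&f.a[4 * k]);
    __m128 b4 = _mm_load_ps(&f.b[4 * k]);
    _mm_store_ps(out, _mm_max_ps(a4, b4));
}
```

Element j of the original 16-element array corresponds to group j % 4 of foo[j / 4] here.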
If you want utility functions that act on SSE registers, then don't use explicit or implicit loads/stores in the utility function. Pass const references to __m128 and return __m128 if you need to.
//SSE utility function
static inline __m128 mulk_SSE(__m128 const &a, float k)
{
    return _mm_mul_ps(_mm_set1_ps(k), a);
}
//main function (assumes n is a multiple of 4)
void foo(float *x, float *y, int n)
{
    for(int i=0; i<n; i+=4) {
        __m128 t1 = _mm_loadu_ps(&x[i]);
        __m128 t2 = mulk_SSE(t1, 3.14159f);
        _mm_storeu_ps(&y[i], t2);
    }
}
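The loop above processes four floats per iteration, so it implicitly assumes that n is a multiple of 4. One possible way to handle an arbitrary n is to finish with a scalar loop. This is only a sketch: the function name scale and the parameter k are hypothetical and not part of the original code.

```cpp
#include <xmmintrin.h>  // SSE: _mm_loadu_ps, _mm_mul_ps, _mm_set1_ps, _mm_storeu_ps

// Same utility function as above: no loads or stores inside.
static inline __m128 mulk_SSE(__m128 const &a, float k)
{
    return _mm_mul_ps(_mm_set1_ps(k), a);
}

// Hypothetical variant of foo: computes y[i] = k * x[i] for any n >= 0.
void scale(float const *x, float *y, int n, float k)
{
    int i = 0;
    // Vector part: full groups of four floats, unaligned load and store.
    for (; i + 4 <= n; i += 4) {
        __m128 t1 = _mm_loadu_ps(&x[i]);
        __m128 t2 = mulk_SSE(t1, k);
        _mm_storeu_ps(&y[i], t2);
    }
    // Scalar tail: the remaining 0 to 3 elements.
    for (; i < n; ++i)
        y[i] = k * x[i];
}
```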
The reason to use a const reference is that MSVC cannot pass __m128 by value. Without a const reference you get the error
error C2719: formal parameter with __declspec(align('16')) won't be aligned.
__m128 for MSVC is really a union anyway.
typedef union __declspec(intrin_type) _CRT_ALIGN(16) __m128 {
float m128_f32[4];
unsigned __int64 m128_u64[2];
__int8 m128_i8[16];
__int16 m128_i16[8];
__int32 m128_i32[4];
__int64 m128_i64[2];
unsigned __int8 m128_u8[16];
unsigned __int16 m128_u16[8];
unsigned __int32 m128_u32[4];
} __m128;
Presumably MSVC does not have to load the union when the SSE utility functions are inlined.
Based on the OP's latest code update, here is what I would suggest
#include <x86intrin.h>
struct Vector4 {
__m128 data;
Vector4() {
}
Vector4(__m128 const &v) {
data = v;
}
Vector4 & load(float const *x) {
data = _mm_loadu_ps(x);
return *this;
}
void store(float *x) const {
_mm_storeu_ps(x, data);
}
operator __m128() const {
return data;
}
};
static inline Vector4 operator + (Vector4 const & a, Vector4 const & b) {
return _mm_add_ps(a, b);
}
static inline Vector4 operator - (Vector4 const & a, Vector4 const & b) {
return _mm_sub_ps(a, b);
}
struct Foobar {
float a[4];
float b[4];
int c;
};
int main(void)
{
Foobar myArray[10];
// note that myArray[0].a, myArray[0].b, and myArray[1].a should be initialized before doing the following
Vector4 a0 = Vector4().load(myArray[0].a);
Vector4 b0 = Vector4().load(myArray[0].b);
Vector4 a1 = Vector4().load(myArray[1].a);
(a0 + b0 - a1).store(myArray[1].b);
}
This code was based on ideas from Agner Fog's Vector Class Library.
Clang has -fmax-type-align. If you set -fmax-type-align=8 then no 16-byte aligned instruction will be generated.