SIMD and difference between packed and scalar double precision
In this context, "packed" means "several values of the same type put into one lump", so "packed single-precision floating point" means four 32-bit floating-point numbers stored as one 128-bit value.
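You either need to "pack" each value into the register yourself (for floating-point data this is done with shuffle/unpack instructions such as UNPCKLPS or SHUFPS, or with a set intrinsic such as _mm_set_ps; the PACK* instructions proper operate on integers), or have the data already "packed" in memory, e.g. an array of (a multiple of) 4 floating-point values that is suitably aligned.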
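As a rough sketch of the two routes described above (packing individual values versus loading data that is already packed in memory), something like the following could be used. The helper names are made up for illustration; only the _mm_* intrinsics come from the SSE headers.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: __m128 and the *_ps intrinsics */

/* Pack four separate floats into one 128-bit register.
   _mm_set_ps takes its arguments from the highest lane down to the lowest,
   so x0 ends up in the low 32 bits. */
static __m128 pack_from_values(float x0, float x1, float x2, float x3) {
    return _mm_set_ps(x3, x2, x1, x0);
}

/* Load four floats that already sit next to each other in memory.
   _mm_load_ps requires a 16-byte aligned address. */
static __m128 load_packed_aligned(const float *p) {
    assert(((uintptr_t)p % 16) == 0);
    return _mm_load_ps(p);
}

/* Same, but without the alignment requirement. */
static __m128 load_packed_unaligned(const float *p) {
    return _mm_loadu_ps(p);
}
```

All three helpers return the same register contents for the same four values; the difference is only in where the data comes from and whether alignment is required.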
Scalar means "one value" in the lower n bits of the register (e.g. a double would be the low 64 bits of a 128-bit SSE register).
In SSE, the 128-bit registers can be treated as 4 elements of 32 bits or 2 elements of 64 bits.
SSE defines two types of operations: scalar and packed. A scalar operation only operates on the least-significant data element (bits 0-31 or 0-63) and copies the remaining elements of the result unchanged from the first operand; a packed operation computes all elements in parallel.
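To make the scalar/packed split concrete, here is a minimal sketch that adds two arrays of doubles, using the packed add for pairs of elements and the scalar add for a leftover element. The function name add_doubles is hypothetical; the intrinsics are the standard SSE2 ones.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>  /* SSE2: __m128d and the *_pd / *_sd intrinsics */

/* out[i] = a[i] + b[i] for i in [0, n). Illustrative helper only. */
static void add_doubles(const double *a, const double *b, double *out, size_t n) {
    assert(a != NULL && b != NULL && out != NULL);
    size_t i = 0;

    /* Packed: each _mm_add_pd performs two 64-bit additions at once. */
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(out + i, _mm_add_pd(va, vb));
    }

    /* Scalar: _mm_add_sd adds only the low 64 bits, which is enough
       for a single leftover element when n is odd. */
    if (i < n) {
        __m128d va = _mm_load_sd(a + i);
        __m128d vb = _mm_load_sd(b + i);
        _mm_store_sd(out + i, _mm_add_sd(va, vb));
    }
}
```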
_mm_cmpeq_sd works with double-precision (64-bit) floating-point elements and compares only the least-significant element (the low 64 bits) of the two operands (scalar).
_mm_cmpeq_pd also works with double-precision (64-bit) floating-point elements, but compares both pairs of 64-bit elements in parallel (packed).
_mm_cmpeq_ss works with single-precision (32-bit) floating-point elements and compares only the least-significant element (the low 32 bits) of the two operands (scalar).
_mm_cmpeq_ps works with single-precision (32-bit) floating-point elements and compares all four pairs of 32-bit elements in parallel (packed).
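The difference between the scalar and packed double-precision compares can be seen by turning the result into a bit mask. This is only a sketch: the helper names are invented for illustration, and _mm_movemask_pd is used here simply as a convenient way to read the comparison result (each element of the result is all ones when equal, all zeros otherwise).

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 */

/* Packed compare: bit 0 is set if a[0] == b[0], bit 1 if a[1] == b[1]. */
static int eq_mask_packed(const double a[2], const double b[2]) {
    assert(a != 0 && b != 0);
    __m128d eq = _mm_cmpeq_pd(_mm_loadu_pd(a), _mm_loadu_pd(b));
    return _mm_movemask_pd(eq);
}

/* Scalar compare: only a[0] and b[0] are compared. The upper element of
   the result is just a copy of a[1], so it is masked off here. */
static int eq_low_scalar(const double a[2], const double b[2]) {
    assert(a != 0 && b != 0);
    __m128d eq = _mm_cmpeq_sd(_mm_loadu_pd(a), _mm_loadu_pd(b));
    return _mm_movemask_pd(eq) & 1;
}
```

With a = {1.0, 2.0} and b = {1.0, 3.0}, the packed version reports that only the low element matches, and the scalar version reports a match because it never looks at the upper element.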
If you're using 32-bit floats, you can pack them in groups of four to make use of the full 128 bits. That way, _mm_cmpeq_ps can make 4 comparisons in parallel.
If you're using 64-bit doubles, you can pack them in pairs to make use of the full 128 bits. That way, _mm_cmpeq_pd can make 2 comparisons in parallel.
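As one possible use of the 4-wide compare, the sketch below counts how many positions of two float arrays hold equal values, four comparisons per _mm_cmpeq_ps. The function count_equal and its requirement that n be a multiple of 4 are assumptions made to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE */

/* Count indices i in [0, n) with a[i] == b[i].
   Assumes n is a multiple of 4 so every iteration fills a whole register. */
static size_t count_equal(const float *a, const float *b, size_t n) {
    assert(n % 4 == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i += 4) {
        /* Four 32-bit comparisons in one instruction. */
        __m128 eq = _mm_cmpeq_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i));
        /* One bit per element: set where the pair compared equal. */
        int mask = _mm_movemask_ps(eq);
        count += (size_t)((mask & 1) + ((mask >> 1) & 1) +
                          ((mask >> 2) & 1) + ((mask >> 3) & 1));
    }
    return count;
}
```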
If you want to make only one comparison at a time, you can use _mm_cmpeq_sd to compare two 64-bit doubles or _mm_cmpeq_ss to compare two 32-bit floats.
Note that _mm_cmpeq_sd and _mm_cmpeq_pd are SSE2 intrinsics (declared in <emmintrin.h>), while _mm_cmpeq_ss and _mm_cmpeq_ps are SSE intrinsics (declared in <xmmintrin.h>).