128-bit values - from XMM registers to general-purpose registers
You cannot move the upper bits of an XMM register into a general-purpose register directly.
You have to follow a two-step process, which may or may not involve a round trip through memory or the destruction of a register.
in registers (SSE2)
movq rax,xmm0 ;lower 64 bits
movhlps xmm0,xmm0 ;move high 64 bits to low 64 bits.
movq rbx,xmm0 ;high 64 bits.
punpckhqdq xmm0,xmm0 is the SSE2 integer equivalent of movhlps xmm0,xmm0. Some CPUs may avoid a cycle or two of bypass latency if xmm0 was last written by an integer instruction, not an FP one.
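If you're doing this from C, the same register-only sequence maps directly onto SSE2 intrinsics. A minimal sketch (the helper name is just illustrative): _mm_cvtsi128_si64 is the movq, _mm_unpackhi_epi64 is the punpckhqdq.
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

/* low half via movq, high half via punpckhqdq + movq */
void extract_halves_sse2(__m128i v, uint64_t *lo, uint64_t *hi)
{
    *lo = (uint64_t)_mm_cvtsi128_si64(v);                        /* movq rax,xmm0 */
    *hi = (uint64_t)_mm_cvtsi128_si64(_mm_unpackhi_epi64(v, v)); /* punpckhqdq xmm0,xmm0 + movq */
}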
via memory (SSE2)
movdqu [mem],xmm0
mov rax,[mem]
mov rbx,[mem+8]
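An intrinsics sketch of the store/reload version (the scratch buffer and helper name are only for illustration; a compiler may well pick a different strategy on its own):
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

void extract_halves_via_mem(__m128i v, uint64_t *lo, uint64_t *hi)
{
    uint64_t buf[2];                     /* 16-byte scratch buffer */
    _mm_storeu_si128((__m128i *)buf, v); /* movdqu [mem],xmm0 */
    *lo = buf[0];                        /* mov rax,[mem]     */
    *hi = buf[1];                        /* mov rbx,[mem+8]   */
}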
slow, but does not destroy the xmm register (SSE4.1)
movq rax,xmm0 ;lower 64 bits
pextrq rbx,xmm0,1 ;3 cycle latency on Ryzen! (and 2 uops)
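With intrinsics (needs SSE4.1 enabled, e.g. -msse4.1; the helper name is illustrative), that is _mm_cvtsi128_si64 for the movq plus _mm_extract_epi64 for the pextrq:
#include <stdint.h>
#include <smmintrin.h>  /* SSE4.1 */

void extract_halves_sse41(__m128i v, uint64_t *lo, uint64_t *hi)
{
    *lo = (uint64_t)_mm_cvtsi128_si64(v);    /* movq rax,xmm0     */
    *hi = (uint64_t)_mm_extract_epi64(v, 1); /* pextrq rbx,xmm0,1 */
}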
A hybrid strategy is possible, e.g. store to memory, movd/movq the low element into e/rax so it's ready quickly, then reload the higher elements. (Store-forwarding latency is not much worse than ALU latency, though.) That gives you a balance of uops for different back-end execution units. Store/reload is especially good when you want lots of small elements: mov / movzx loads into 32-bit registers are cheap and have 2/clock throughput.
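A rough sketch of that hybrid in intrinsics (names are illustrative; the compiler has the final say on instruction selection):
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

void extract_hybrid(__m128i v, uint64_t *lo, uint64_t *hi)
{
    *lo = (uint64_t)_mm_cvtsi128_si64(v); /* low element via ALU (movq): ready quickly */
    uint64_t buf[2];
    _mm_storeu_si128((__m128i *)buf, v);  /* store the whole vector ... */
    *hi = buf[1];                         /* ... and reload only the higher element */
}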
For 32 bits, the code is similar:
in registers
movd eax,xmm0
psrldq xmm0,4 ;shift the whole register right by 4 bytes
movd ebx,xmm0
psrldq xmm0,4 ; pshufd could copy-and-shuffle the original reg instead,
movd ecx,xmm0 ; not destroying the XMM and maybe creating some ILP
psrldq xmm0,4
movd edx,xmm0
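Here's an intrinsics sketch of the pshufd copy-and-shuffle variant hinted at in the comments above; it leaves the source vector intact, and each shuffle reads the original vector so they are independent (the helper name is illustrative):
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

void extract4_pshufd(__m128i v, uint32_t out[4])
{
    out[0] = (uint32_t)_mm_cvtsi128_si32(v);                       /* movd eax,xmm0 */
    out[1] = (uint32_t)_mm_cvtsi128_si32(_mm_shuffle_epi32(v, 1)); /* pshufd + movd */
    out[2] = (uint32_t)_mm_cvtsi128_si32(_mm_shuffle_epi32(v, 2)); /* pshufd + movd */
    out[3] = (uint32_t)_mm_cvtsi128_si32(_mm_shuffle_epi32(v, 3)); /* pshufd + movd */
}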
via memory
movdqu [mem],xmm0
mov eax,[mem]
mov ebx,[mem+4]
mov ecx,[mem+8]
mov edx,[mem+12]
not destroying the xmm register (SSE4.1; slow, like the psrldq / pshufd version)
movd eax,xmm0
pextrd ebx,xmm0,1 ;3 cycle latency on Skylake!
pextrd ecx,xmm0,2 ;also 2 uops: like a shuffle(port5) + movd(port0)
pextrd edx,xmm0,3
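The intrinsics equivalent (SSE4.1 required; helper name illustrative):
#include <stdint.h>
#include <smmintrin.h>  /* SSE4.1 */

void extract4_sse41(__m128i v, uint32_t out[4])
{
    out[0] = (uint32_t)_mm_cvtsi128_si32(v);    /* movd eax,xmm0     */
    out[1] = (uint32_t)_mm_extract_epi32(v, 1); /* pextrd ebx,xmm0,1 */
    out[2] = (uint32_t)_mm_extract_epi32(v, 2); /* pextrd ecx,xmm0,2 */
    out[3] = (uint32_t)_mm_extract_epi32(v, 3); /* pextrd edx,xmm0,3 */
}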
The 64-bit in-register variant can run in 2 cycles, while the pextrq version takes 4 minimum. For the 32-bit variants, the numbers are 4 and 10 cycles, respectively.
On Intel SnB-family (including Skylake), shuffle + movq or movd has the same performance as pextrq/pextrd; that's not surprising, since pextrd/q decodes to a shuffle uop and a movd uop.
On AMD Ryzen, pextrq apparently has 1 cycle lower latency than shuffle + movq: pextrd/q is 3c latency, and so is movd/q, according to Agner Fog's tables. That's a neat result (if it's accurate), since pextrd/q decodes to 2 uops (vs. 1 for movq).
Since shuffles have non-zero latency, shuffle + movq is always strictly worse than pextrq on Ryzen (except for possible front-end decode / uop-cache effects).
The major downside to a pure ALU strategy for extracting all elements is throughput: it takes a lot of ALU uops, and most CPUs only have one execution unit / port that can move data from XMM to integer registers. Store/reload has higher latency for the first element, but better throughput (because modern CPUs can do 2 loads per cycle). If the surrounding code is bottlenecked by ALU throughput, a store/reload strategy could be good. Maybe do the low element with a movd or movq so out-of-order execution can get started on whatever uses it while the rest of the vector data is going through store forwarding.
Another option worth considering (besides what Johan mentioned) for extracting 32-bit elements to integer registers is to do some of the "shuffling" with integer shifts:
movq rax,xmm0
# use eax now, before destroying it
shr rax,32
pextrq rcx,xmm0,1
# use ecx now, before destroying it
shr rcx, 32
shr can run on p0 or p6 in Intel Haswell/Skylake. p6 has no vector ALUs, so this sequence is quite good if you want low latency but also low pressure on vector ALUs.
Or if you want to keep them around:
movq rax,xmm0
rorx rbx, rax, 32 # BMI2
# shld rbx, rax, 32 # alternative that has a false dep on rbx
# eax=xmm0[0], ebx=xmm0[1]
pextrq rdx,xmm0,1
mov ecx, edx # the "normal" way, if you don't want rorx or shld
shr rdx, 32
# ecx=xmm0[2], edx=xmm0[3]
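In C, the same trick falls out if you extract the two 64-bit halves and split them with plain integer shifts; the compiler can then pick shr, rorx, or shld for the splits. A minimal sketch (SSE4.1 for the high half; helper name illustrative):
#include <stdint.h>
#include <smmintrin.h>  /* SSE4.1 */

void extract4_via_shifts(__m128i v, uint32_t out[4])
{
    uint64_t lo = (uint64_t)_mm_cvtsi128_si64(v);    /* movq rax,xmm0     */
    uint64_t hi = (uint64_t)_mm_extract_epi64(v, 1); /* pextrq rdx,xmm0,1 */
    out[0] = (uint32_t)lo;                           /* eax = xmm0[0]       */
    out[1] = (uint32_t)(lo >> 32);                   /* shr/rorx: xmm0[1]   */
    out[2] = (uint32_t)hi;                           /* ecx = xmm0[2]       */
    out[3] = (uint32_t)(hi >> 32);                   /* shr/rorx: xmm0[3]   */
}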