Combining prefixes in SSE
I do not remember having seen any specification of what to expect when prefixes are combined arbitrarily, so I guess the CPU behaviour may be "undefined" and possibly CPU-specific. (Clearly, some cases are specified, e.g. in Intel's docs, but many aren't covered.) Some combinations may also be reserved for future use.
My naive assumption would generally have been that additional prefixes are no-ops, but there's no guarantee. The assumption seems reasonable given that e.g. some optimisation manuals recommend building multi-byte NOPs by prefixing the canonical NOP (90h) with 66h, e.g.:
db 66h, 90h; 2-byte NOP
db 66h, 66h, 90h; 3-byte NOP
db 66h, 66h, 66h, 90h; 4-byte NOP
However, I also know that the CS and DS segment override prefixes have acquired novel functions as SSE2 branch hint prefixes (predict branch taken = 3Eh = DS override; predict branch not taken = 2Eh = CS override) when applied to conditional jump instructions.
Anyway, I looked at your examples above, always setting XMM1 to all 0 and XMM7 to all 0FFh by
pxor xmm1, xmm1 ; xmm1 <- 0s
pcmpeqw xmm7, xmm7 ; xmm7 <- FFs
and then running the code in question with xmm1, xmm7 as arguments. What I observed (32-bit code on a Win64 system with an Intel T7300 Core 2 Duo) was:
1) No change observed for addsd when adding a 66h prefix:
db 66h
addsd xmm1, xmm7 ;total sequence = 66 F2 0F 58 CF
2) No change observed for addss when adding a 0F2h prefix:
db 0f2h
addss xmm1,xmm7 ;total sequence = F2 F3 0F 58 CF
3) However, I observed a change when prefixing addpd with 0F2h:
db 0f2h
addpd xmm1, xmm7 ;total sequence = F2 66 0F 58 CF
In this case, the result in XMM1 was 0000000000000000FFFFFFFFFFFFFFFFh instead of FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFh, i.e. only the low 64 bits were modified.
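One way to read the result in case 3 is to ask which ADDxx form would produce it from the same inputs. The following is only a rough sketch in Python, not an x86 emulator: the names `FORMS`, `expected_result` and `matching_forms` are made up for illustration, and it assumes that every lane an instruction writes ends up as the all-ones quiet NaN taken from XMM7 (0 + QNaN gives that QNaN), while lanes it does not write keep the zeros from XMM1.

```python
# Hypothetical helper names; a model of which lanes get written, nothing more.
ALL_ONES = (1 << 128) - 1   # xmm7 after pcmpeqw xmm7, xmm7
# xmm1 starts as zero after pxor xmm1, xmm1.

# (lane width in bits, number of lanes written, counted from the low end)
FORMS = {
    "addps": (32, 4),
    "addss": (32, 1),
    "addpd": (64, 2),
    "addsd": (64, 1),
}

def expected_result(form):
    """xmm1 after `form xmm1, xmm7`, starting from xmm1 = 0, xmm7 = all ones."""
    width, lanes = FORMS[form]
    written = (1 << (width * lanes)) - 1
    # Assumption: each written lane becomes the all-ones quiet NaN from xmm7;
    # lanes that are not written keep their original zero bits.
    return ALL_ONES & written

def matching_forms(observed):
    """List the ADDxx forms whose modelled result equals the observed value."""
    return [form for form in FORMS if expected_result(form) == observed]

observed = 0x0000000000000000FFFFFFFFFFFFFFFF  # case 3: F2 66 0F 58 CF
print(matching_forms(observed))  # ['addsd']
```

Under this model the value seen in case 3 matches the scalar-double form, which would suggest that the F2 prefix took precedence over 66 on that particular CPU. That is an interpretation of a single measurement, not a documented rule.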
So my conclusion is that one shouldn't make any assumptions and should expect "undefined" behaviour. I wouldn't be surprised, however, if you could find some clues in Agner Fog's manuals.