Combining prefixes in SSE

I do not remember having seen any specification on what you should expect in the case of wildly combining random prefixes, so I guess CPU behaviour may be "undefined" and possibly CPU-specific. (Clearly, some things are specified in e.g. Intel's docs, but many cases aren't covered). And some combinations may be reserved for future use.

My naive assumptions would generally have been that additional prefixes would be no-ops but there's no guarantee. That seems reasonable given that e.g. some optimising manuals recommend multi-byte NOP (canonically 90h) by prefixing with 66h, e.g.:

db 66h, 90h; 2-byte NOP
db 66h, 66h, 90h; 3-byte NOP
db 66h, 66h, 66h, 90h; 4-byte NOP

However, I also know that CS and DS segment override prefixes have aquired novel functions as SSE2 branch hint prefixes (predict branch taken = 3Eh = DS override; predict branch not taken = 2Eh = CS override) when applied to conditional jump instructions.

Anyway, I looked at your examples above, always setting XMM1 to all 0 and XMM7 to all 0FFh by

pxor xmm1, xmm1    ; xmm1 <- 0s
pcmpeqw xmm7, xmm7 ; xmm7 <- FFs 

and then the code in question, with xmm1, xmm7 arguments. What I observed (32bit code on Win64 system and Intel T7300 Core 2 Duo) was:

1) no change observed for addsd by adding 66h prefix

db 66h 
addsd xmm1, xmm7 ;total sequence = 66 F2 0F 58 CF     

2) no change observed for addss by adding 0F2h prefix

db 0f2h     
addss xmm1,xmm7 ;total sequence = F2 F3 0F 58 CF

3) However, I observed a change by prefixing addpd by 0F2h:

db 0f2h    
addpd xmm1, xmm7 ;total sequence = F2 66 0F 58 CF

In this case, the result in XMM1 was 0000000000000000FFFFFFFFFFFFFFFFh instead of FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFh.

So my conclusion is that one shouldn't make any assumptions and expect "undefined" behaviour. I wouldn't be surprised, however, if you could find some clues in Agner fog's manuals.