How to move 128-bit immediates to XMM registers
As one of the 10000 ways to do it, use SSE4.1 pinsrq:
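mov rax, first half ; low 64 bits of the constant
movq xmm0, rax ; better than pinsrq xmm0,rax,0 for performance and code-size
mov rax, second half ; high 64 bits of the constant
pinsrq xmm0, rax, 1 ; insert into the upper qword, lower qword is kept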
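To make "first half" and "second half" concrete, here is a minimal C sketch of how the two 64-bit immediates could be computed from four 32-bit lanes. It assumes the usual little-endian element order, where element 0 is the lowest 32 bits of the register. The names halves128 and split_lanes are hypothetical helpers made up for this illustration, not part of any library.

```c
#include <stdint.h>

/* Hypothetical helper type: the two immediates for the two mov rax, imm64 steps. */
typedef struct {
    uint64_t lo; /* goes in with movq xmm0, rax      */
    uint64_t hi; /* goes in with pinsrq xmm0, rax, 1 */
} halves128;

/* e0 is vector element 0 (lowest 32 bits), e3 is element 3 (highest 32 bits). */
static halves128 split_lanes(uint32_t e0, uint32_t e1, uint32_t e2, uint32_t e3)
{
    halves128 h;
    h.lo = (uint64_t)e0 | ((uint64_t)e1 << 32);
    h.hi = (uint64_t)e2 | ((uint64_t)e3 << 32);
    return h;
}
```

For the float vector {4.0, 3.0, 2.0, 1.0} (element 0 first), the lanes are 0x40800000, 0x40400000, 0x40000000 and 0x3F800000, so the two immediates come out as 0x4040000040800000 and 0x3F80000040000000.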
You can do it like this, with just one movaps instruction:
.section .rodata # put your constants in the read-only data section
.p2align 4 # align to 16 = 1<<4
LC0:
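.long 1082130432 # 0x40800000 = 4.0f, element 0 (lowest address)
.long 1077936128 # 0x40400000 = 3.0f
.long 1073741824 # 0x40000000 = 2.0f
.long 1065353216 # 0x3f800000 = 1.0f, element 3 (highest address)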
.text
foo:
movaps LC0(%rip), %xmm0
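The .long values in that listing are just the IEEE-754 single-precision bit patterns of 4.0, 3.0, 2.0 and 1.0, which is the form compilers tend to emit. As a rough sketch of how you could produce such values yourself, assuming float is 32-bit IEEE-754 on your platform (float_bits is a hypothetical helper name):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Return the raw 32-bit pattern of a float, i.e. the number to put after .long. */
static uint32_t float_bits(float f)
{
    uint32_t u;
    assert(sizeof f == sizeof u); /* assumption: 32-bit float */
    memcpy(&u, &f, sizeof u);     /* well-defined way to reinterpret the bits */
    return u;
}
```

Printing float_bits(x) with %u for each element, lowest element first, gives the lines to paste into the .rodata block.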
Loading the constant with a data load is usually preferable to embedding it in the instruction stream, especially given how many instructions the immediate approach takes. That's several extra uops for the CPU to execute, for an arbitrary constant that can't be generated from all-ones with a couple of shifts.
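For the constants that can be generated from all-ones, the idea is pcmpeqd xmm0, xmm0 followed by one or two shifts. Here is a small C model of what happens in each 32-bit lane, just to check the arithmetic; ones_shifted is a hypothetical name, and the sketch only handles shift counts below 32 (psrld / pslld themselves give zero for larger counts).

```c
#include <assert.h>
#include <stdint.h>

/* Per-lane model of:
 *   pcmpeqd xmm0, xmm0      ; every lane = 0xFFFFFFFF
 *   psrld   xmm0, right     ; logical right shift of each lane
 *   pslld   xmm0, left      ; left shift of each lane
 */
static uint32_t ones_shifted(unsigned right, unsigned left)
{
    assert(right < 32 && left < 32); /* larger counts are undefined in C */
    uint32_t lane = 0xFFFFFFFFu;
    lane >>= right;
    lane <<= left;
    return lane;
}
```

For example, ones_shifted(25, 23) is 0x3F800000, the bit pattern of 1.0f, so a vector of four 1.0f values needs no load at all. Something like {4.0, 3.0, 2.0, 1.0} has different bits in every lane and can't be built this way.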
If it's easier, you can put constants right before or after a function that you jit-compile, instead of in a separate section. But since CPUs have split L1d / L1i caches and TLBs, it's generally best to group constants together separate from instructions.
If both halves of your constant are the same, you can broadcast-load it with SSE3 movddup (m64), %xmm0.
Just wanted to add that one can read about generating various constants using assembly in Agner Fog's manual Optimizing subroutines in assembly language, Generating constants, section 13.8, page 124.