How to move 128-bit immediates to XMM registers

As one of the 10000 ways to do it, use SSE4.1 pinsrq

mov    rax, first half
movq   xmm0, rax      ; better than pinsrq xmm0,rax,0 for performance and code-size

mov    rax, second half
pinsrq xmm0, rax, 1

You can do it like this, with just one movaps instruction:

.section .rodata    # put your constants in the read-only data section
.p2align 4          # align to 16 = 1<<4
LC0:
        .long   1082130432
        .long   1077936128
        .long   1073741824
        .long   1065353216

.text
foo:
        movaps  LC0(%rip), %xmm0

Loading it with a data load is usually preferable to embedding it in the instruction stream, especially because of how many instructions it takes. That's several extra uops for the CPU to execute, for an arbitrary constant that can't be generated from all-ones with a couple shifts.

If it's easier, you can put constants right before or after a function that you jit-compile, instead of in a separate section. But since CPUs have split L1d / L1i caches and TLBs, it's generally best to group constants together separate from instructions.

If both halves of your constant are the same, you can broadcast-load it with SSE3
movddup (m64), %xmm0.


Just wanted to add that one can read about generating various constants using assembly in Agner Fog's manual Optimizing subroutines in assembly language, Generating constants, section 13.8, page 124.