Compiler generates costly MOVZX instruction
Thank you for the good question!
Clearing Registers and Dependency Breaking Idioms
A Quote from the Intel® 64 and IA-32 Architectures Optimization Reference Manual, Section 3.5.1.8:
Code sequences that modifies partial register can experience some delay in its dependency chain, but can be avoided by using dependency breaking idioms. In processors based on Intel Core microarchitecture, a number of instructions can help clear execution dependency when software uses these instructions to clear register content to zero. Break dependences on portions of registers between instructions by operating on 32-bit registers instead of partial registers. For moves, this can be accomplished with 32-bit moves or by using MOVZX.
Assembly/Compiler Coding Rule 37. (M impact, MH generality): Break dependences on portions of registers between instructions by operating on 32-bit registers instead of partial registers. For moves, this can be accomplished with 32-bit moves or by using MOVZX.
movzx vs mov
The compiler knows that movzx is not costly and uses it as often as possible. It may take more bytes to encode movzx than mov, but it is not expensive to execute.
Contrary to the logic, a program with movzx (that fills the entire registers) actually works faster than with just mov, which only sets lower parts of the registers.
Let me demonstrate this conclusion to you on the following code fragment. It is part of the code that implements CRC-32 calculation using the Slicing by-N algorithm. Here it is:
movzx ecx, bl
shr ebx, 8
mov eax, dword ptr [ecx * 4 + edi + 1024 * 3]
movzx ecx, bl
shr ebx, 8
xor eax, dword ptr [ecx * 4 + edi + 1024 * 2]
movzx ecx, bl
shr ebx, 8
xor eax, dword ptr [ecx * 4 + edi + 1024 * 1]
skipped 6 more similar triplets that do movzx, shr, xor.
dec <<<a counter register >>>>
jnz …… <<repeat the whole loop again>>>
Here is the second code fragment. We have cleared ecx in advance, and now just instead of “movzx ecx, bl” do “mov cl, bl”:
// ecx is already cleared here to 0
mov cl, bl
shr ebx, 8
mov eax, dword ptr [ecx * 4 + edi + 1024 * 3]
mov cl, bl
shr ebx, 8
xor eax, dword ptr [ecx * 4 + edi + 1024 * 2]
mov cl, bl
shr ebx, 8
xor eax, dword ptr [ecx * 4 + edi + 1024 * 1]
<<< and so on – as in the example #1>>>
Now guess which of the two above code fragments runs faster? Did you think previously that the speed is the same, or the movzx version is slower? In fact, the movzx code is faster because all the CPUs since Pentium Pro do Out-Of-Order execution of instructions and register renaming.
Register Renaming
Register renaming is a technique used internally by a CPU that eliminates the false data dependencies arising from the reuse of registers by successive instructions that do not have any real data dependencies between them.
Let me just take the first 4 instructions from the first code fragment:
-
movzx ecx, bl
-
shr ebx, 8
-
mov eax, dword ptr [ecx * 4 + edi + 1024 * 3]
-
movzx ecx, bl
As you see, instruction 4 depends on instruction 2. Instruction 4 does not rely on the result of instruction 3.
So the CPU could execute instructions 3 and 4 in parallel (together), but instruction 3 uses the register (read-only) modified by instruction 4, thus instruction 4 may only start executing after instruction 3 fully completes. Let us then rename the register ecx to edx after the first triplet to avoid this dependency:
movzx ecx, bl
shr ebx, 8
mov eax, dword ptr [ecx * 4 + edi + 1024 * 3]
movzx edx, bl
shr ebx, 8
xor eax, dword ptr [edx * 4 + edi + 1024 * 2]
movzx ecx, bl
shr ebx, 8
xor eax, dword ptr [ecx * 4 + edi + 1024 * 1]
Here is what we have now:
-
movzx ecx, bl
-
shr ebx, 8
-
mov eax, dword ptr [ecx * 4 + edi + 1024 * 3]
-
movzx edx, bl
Now instruction 4 in no way uses any register needed for instruction 3, and vice versa, so instructions 3 and 4 can execute simultaneously for sure!
This is what the CPU does for us. The CPU, when translating instructions to micro-operations (micro-ops) which the Out-of-order algorithm will execute, renames the registers internally to eliminate these dependencies, so the micro-ops deal with renamed, internal registers, rather than with the real ones as we know them. Thus we don't need to rename registers ourselves as I have just renamed in the above example – the CPU will automatically rename everything for us while translating instructions to micro-ops.
The micro-ops of instruction 3 and instruction 4 will be executed in parallel, since micro-ops of instruction 4 will deal with entirely different internal register (exposed to outside as ecx) than micro-ops of instruction 3, so we don't need to rename anything.
Let me revert the code to the initial version. Here it is:
-
movzx ecx, bl
-
shr ebx, 8
-
mov eax, dword ptr [ecx * 4 + edi + 1024 * 3]
-
movzx ecx, bl
(instructions 3 and 4 run in parallel because ecx of instruction 3 is not that ecx as of instruction 4, but a different, renamed register – the CPU has automatically allocated for instruction 4 micro-ops a new, fresh register from the pool of internally available registers).
Now let us go back to movxz vs mov.
Movzx clears a register entirely, so the CPU for sure knows that we do not depend on any previous value that remained in higher bits of the register. When the CPU sees the movxz instruction, it knows that it can safely rename the register internally and execute the instruction in parallel with previous instructions. Now take the first 4 instructions from our example #2, where we use mov rather than movzx:
-
mov cl, bl
-
shr ebx, 8
-
mov eax, dword ptr [ecx * 4 + edi + 1024 * 3]
-
mov cl, bl
In this case, instruction 4, by modifying cl, modifies bits 0-7 of the ecx, leaving bits 8-32 unchanged. Thus the CPU cannot just rename the register for instruction 4 and allocate another, fresh register, because instruction 4 depends on bits 8-32 left from previous instructions. The CPU has to preserve bits 8-32 before it can execute instruction 4. Thus it cannot just rename the register. It will wait until instruction 3 completes before executing instruction 4. Instruction 4 didn't become fully independent - it depends on the previous value of ECX and the previous value of bl. So it depends on two registers at once. If we had used movzx, it would have depended on just one register - bl. Consequently, instructions 3 and 4 would not run in parallel because of their interdependence. Sad but true.
That's why it is always faster to operate complete registers. Suppose we need only to modify a part of the register. In that case, it's always quicker to alter the entire register (for example, use movzx) – to let the CPU know for sure that the register no longer depends on its previous value. Modifying complete registers allows the CPU to rename the register and let the Out-of-order execution algorithm execute this instruction together with the other instructions, rather than execute them one-by-one.
The movzx
instruction zero extends a quantity into a register of larger size. In your case, a word (two bytes) is zero extended into a dword (four bytes). Zero extending itself is usually free, the slow part is loading the memory operand WORD PTR [rsi-2]
from RAM.
To speed this up, you can try to ensure that the datum you want to fetch from RAM is in the L1 cache at the time you need it. You can do this by placing strategic prefetch intrinsics into an appropriate place. For example, assuming that one cache line is 64 bytes, you could add a prefetch intrinsic to fetch array entry i + 32
every time you go through the loop.
You can also consider an algorithmic improvement such that less data needs to be fetched from memory, but that seems unlikely to be possible.