How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent
Other answers welcome to address Sandybridge and IvyBridge in more detail. I don't have access to that hardware.
I haven't found any partial-reg behaviour differences between HSW and SKL. On Haswell and Skylake, everything I've tested so far supports this model:
AL is never renamed separately from RAX (or r15b from r15). So if you never touch the high8 registers (AH/BH/CH/DH), everything behaves exactly like on a CPU with no partial-reg renaming (e.g. AMD).
Write-only access to AL merges into RAX, with a dependency on RAX. For loads into AL, this is a micro-fused ALU+load uop that executes on p0156, which is one of the strongest pieces of evidence that it's truly merging on every write, and not just doing some fancy double-bookkeeping as Agner speculated.
Agner (and Intel) say Sandybridge can require a merging uop for AL, so it probably is renamed separately from RAX. For SnB, Intel's optimization manual (section 3.5.2.4 Partial Register Stalls) says
SnB (not necessarily later uarches) inserts a merging uop in the following cases:
After a write to one of the registers AH, BH, CH or DH and before a following read of the 2-, 4- or 8-byte form of the same register. In these cases a merge micro-op is inserted. The insertion consumes a full allocation cycle in which other micro-ops cannot be allocated.
After a micro-op with a destination register of 1 or 2 bytes, which is not a source of the instruction (or the register's bigger form), and before a following read of a 2-,4- or 8-byte form of the same register. In these cases the merge micro-op is part of the flow.
I think they're saying that on SnB, `add al,bl` will RMW the full RAX instead of renaming it separately, because one of the source registers is (part of) RAX. My guess is that this doesn't apply for a load like `mov al, [rbx + rax]`; `rax` in an addressing mode probably doesn't count as a source.
I haven't tested whether high8 merging uops still have to issue/rename on their own on HSW/SKL. That would make the front-end impact equivalent to 4 uops (since that's the issue/rename pipeline width).
- There is no way to break a dependency involving AL without writing EAX/RAX. `xor al,al` doesn't help, and neither does `mov al, 0`.
- `movzx ebx, al` has zero latency (renamed), and needs no execution unit. (i.e. mov-elimination works on HSW and SKL). It triggers merging of AH if it's dirty, which I guess is necessary for it to work without an ALU. It's probably not a coincidence that Intel dropped low8 renaming in the same uarch that introduced mov-elimination. (Agner Fog's micro-arch guide has a mistake here, saying that zero-extended moves are not eliminated on HSW or SKL, only IvB.)
- `movzx eax, al` is not eliminated at rename. mov-elimination on Intel never works for same,same. `mov rax,rax` isn't eliminated either, even though it doesn't have to zero-extend anything. (Although there'd be no point to giving it special hardware support, because it's just a no-op, unlike `mov eax,eax`). Anyway, prefer moving between two separate architectural registers when zero-extending, whether it's with a 32-bit `mov` or an 8-bit `movzx`.
- `movzx eax, bx` is not eliminated at rename on HSW or SKL. It has 1c latency and uses an ALU uop. Intel's optimization manual only mentions zero-latency for 8-bit movzx (and points out that `movzx r32, high8` is never renamed).
High-8 regs can be renamed separately from the rest of the register, and do need merging uops.
- Write-only access to `ah` with `mov ah, reg8` or `mov ah, [mem8]` do rename AH, with no dependency on the old value. These are both instructions that wouldn't normally need an ALU uop for the 32-bit version. (But `mov ah, bl` is not eliminated; it does need a p0156 ALU uop so that might be a coincidence).
- a RMW of AH (like `inc ah`) dirties it.
- `setcc ah` depends on the old `ah`, but still dirties it. I think `mov ah, imm8` is the same, but haven't tested as many corner cases.
  (Unexplained: a loop involving `setcc ah` can sometimes run from the LSD, see the `rcr` loop at the end of this post. Maybe as long as `ah` is clean at the end of the loop, it can use the LSD?).
  If `ah` is dirty, `setcc ah` merges into the renamed `ah`, rather than forcing a merge into `rax`. e.g. `%rep 4` (`inc al` / `test ebx,ebx` / `setcc ah` / `inc al` / `inc ah`) generates no merging uops, and only runs in about 8.7c (latency of 8 `inc al` slowed down by resource conflicts from the uops for `ah`. Also the `inc ah` / `setcc ah` dep chain).
  I think what's going on here is that `setcc r8` is always implemented as a read-modify-write. Intel probably decided that it wasn't worth having a write-only `setcc` uop to optimize the `setcc ah` case, since it's very rare for compiler-generated code to `setcc ah`. (But see the godbolt link in the question: clang4.0 with `-m32` will do so.)
- reading AX, EAX, or RAX triggers a merge uop (which takes up front-end issue/rename bandwidth). Probably the RAT (Register Allocation Table) tracks the high-8-dirty state for the architectural R[ABCD]X, and even after a write to AH retires, the AH data is stored in a separate physical register from RAX. Even with 256 NOPs between writing AH and reading EAX, there is an extra merging uop. (ROB size=224 on SKL, so this guarantees that the `mov ah, 123` was retired). Detected with uops_issued/executed perf counters, which clearly show the difference.
- Read-modify-write of AL (e.g. `inc al`) merges for free, as part of the ALU uop. (Only tested with a few simple uops, like `add` / `inc`, not `div r8` or `mul r8`). Again, no merging uop is triggered even if AH is dirty.
- Write-only to EAX/RAX (like `lea eax, [rsi + rcx]` or `xor eax,eax`) clears the AH-dirty state (no merging uop).
- Write-only to AX (`mov ax, 1`) triggers a merge of AH first. I guess instead of special-casing this, it runs like any other RMW of AX/RAX. (TODO: test `mov ax, bx`, although that shouldn't be special because it's not renamed.)
- `xor ah,ah` has 1c latency, is not dep-breaking, and still needs an execution port.
- Read and/or write of AL does not force a merge, so AH can stay dirty (and be used independently in a separate dep chain). (e.g. `add ah, cl` / `add al, dl` can run at 1 per clock (bottlenecked on add latency)).
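The rules in this list can be condensed into a toy model. This is only a sketch of the behaviour described in this answer, not of the real RAT: the class name, the method names and the single `ah_dirty` flag are made up for illustration, it only tracks RAX/AH, and it only counts merging uops (no latency). It also assumes that a merge leaves AH clean afterwards, which seems likely but isn't something I measured directly.

```python
class HighByteModel:
    """Toy model of the AH-dirty rules listed above (hypothetical names)."""

    def __init__(self):
        self.ah_dirty = False   # is AH currently renamed separately from RAX?
        self.merge_uops = 0     # merging uops the front-end had to insert

    def write_ah(self):
        # mov ah, reg8 / mov ah, [mem8] / inc ah / setcc ah all leave AH dirty.
        self.ah_dirty = True

    def access_al(self):
        # Reading or writing AL never forces a merge; AH stays as it was.
        pass

    def read_full(self):
        # Reading AX/EAX/RAX while AH is dirty costs one merging uop.
        # Assumption: the merge leaves AH clean afterwards.
        if self.ah_dirty:
            self.merge_uops += 1
            self.ah_dirty = False

    def write_full(self):
        # Write-only EAX/RAX (e.g. xor eax,eax) clears AH-dirty with no merge.
        self.ah_dirty = False

    def write_ax(self):
        # Write-only AX runs like a RMW of RAX: merge AH first if it is dirty.
        self.read_full()


m = HighByteModel()
m.write_ah(); m.access_al(); m.read_full()   # e.g. mov ah, bl / inc al / mov ecx, eax
m.write_ah(); m.write_full()                 # e.g. mov ah, bl / xor eax, eax
print(m.merge_uops)                          # 1: only the read of EAX needed a merge
```

The second sequence costs nothing extra in this model, which matches the advice above: overwrite the full register instead of reading it if you want to get rid of a dirty AH.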
Making AH dirty prevents a loop from running from the LSD (the loop-buffer), even when there are no merging uops. The LSD is when the CPU recycles uops in the queue that feeds the issue/rename stage. (Called the IDQ).
Inserting merging uops is a bit like inserting stack-sync uops for the stack-engine. Intel's optimization manual says that SnB's LSD can't run loops with mismatched `push` / `pop`, which makes sense, but it implies that it can run loops with balanced `push` / `pop`. That's not what I'm seeing on SKL: even balanced `push` / `pop` prevents running from the LSD (e.g. `push rax` / `pop rdx` / `times 6 imul rax, rdx`). (There may be a real difference between SnB's LSD and HSW/SKL: SnB may just "lock down" the uops in the IDQ instead of repeating them multiple times, so a 5-uop loop takes 2 cycles to issue instead of 1.25.) Anyway, it appears that HSW/SKL can't use the LSD when a high-8 register is dirty, or when it contains stack-engine uops.
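The 2 vs. 1.25 cycle figure is just issue-width arithmetic. A small sketch, assuming a 4-wide issue/rename stage and ignoring every other bottleneck; the function names are made up for illustration.

```python
import math

ISSUE_WIDTH = 4  # uops per clock through issue/rename, as stated earlier

def cycles_locked_down(loop_uops, width=ISSUE_WIDTH):
    # Each iteration issues on its own, so the last issue group is partly empty.
    return math.ceil(loop_uops / width)

def cycles_unrolled(loop_uops, width=ISSUE_WIDTH):
    # Iterations packed back to back: average cost with no wasted issue slots.
    return loop_uops / width

print(cycles_locked_down(5), cycles_unrolled(5))  # 2 1.25
```

For loop sizes that are a multiple of 4 uops the two cases give the same number, so the difference would only show up in loops that don't fill their last issue group.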
This behaviour may be related to an erratum in SKL:
SKL150: Short Loops Which Use AH/BH/CH/DH Registers May Cause Unpredictable System Behaviour
Problem: Under complex micro-architectural conditions, short loops of less than 64 instructions that use AH, BH, CH, or DH registers as well as their corresponding wider registers (e.g. RAX, EAX, or AX for AH) may cause unpredictable system behaviour. This can only happen when both logical processors on the same physical processor are active.
This may also be related to Intel's optimization manual statement that SnB at least has to issue/rename an AH-merge uop in a cycle by itself. That's a weird difference for the front-end.
My Linux kernel log says `microcode: sig=0x506e3, pf=0x2, revision=0x84`.
Arch Linux's `intel-ucode` package just provides the update; you have to edit config files to actually have it loaded. So my Skylake testing was on an i7-6700k with microcode revision 0x84, which doesn't include the fix for SKL150. It matches the Haswell behaviour in every case I tested, IIRC. (e.g. both Haswell and my SKL can run the `setne ah` / `add ah,ah` / `rcr ebx,1` / `mov eax,ebx` loop from the LSD). I have HT enabled (which is a pre-condition for SKL150 to manifest), but I was testing on a mostly-idle system so my thread had the core to itself.
With updated microcode, the LSD is completely disabled for everything all the time, not just when partial registers are active. `lsd.uops` is always exactly zero, including for real programs not synthetic loops. Hardware bugs (rather than microcode bugs) often require disabling a whole feature to fix. This is why SKL-avx512 (SKX) is reported to not have a loopback buffer. Fortunately this is not a performance problem: SKL's increased uop-cache throughput over Broadwell can almost always keep up with issue/rename.
Extra AH/BH/CH/DH latency:
- Reading AH when it's not dirty (renamed separately) adds an extra cycle of latency for both operands. e.g. `add bl, ah` has a latency of 2c from input BL to output BL, so it can add latency to the critical path even if RAX and AH are not part of it. (I've seen this kind of extra latency for the other operand before, with vector latency on Skylake, where an int/float delay "pollutes" a register forever. TODO: write that up.)
  This means unpacking bytes with `movzx ecx, al` / `movzx edx, ah` has extra latency vs. `movzx` / `shr eax,8` / `movzx`, but still better throughput.
- Reading AH when it is dirty doesn't add any latency. (`add ah,ah` or `add ah,dh` / `add dh,ah` have 1c latency per add). I haven't done a lot of testing to confirm this in many corner-cases.
  Hypothesis: a dirty high8 value is stored in the bottom of a physical register. Reading a clean high8 requires a shift to extract bits [15:8], but reading a dirty high8 can just take bits [7:0] of a physical register like a normal 8-bit register read.
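To make the hypothesis concrete, here is the bit manipulation it implies, written as plain integer arithmetic. This only models values, not hardware, and the function names are hypothetical.

```python
def read_clean_high8(full_reg):
    # AH lives inside the full register: shift bits [15:8] down (the extra step).
    return (full_reg >> 8) & 0xFF

def read_dirty_high8(separate_reg):
    # AH was renamed to its own physical register: just take bits [7:0].
    return separate_reg & 0xFF

def merge_high8(full_reg, separate_reg):
    # What a merging uop has to produce: the full register with bits [15:8] replaced.
    return (full_reg & ~0xFF00) | ((separate_reg & 0xFF) << 8)

rax = 0x1122334455667788
print(hex(read_clean_high8(rax)))     # 0x77
print(hex(read_dirty_high8(0xAB)))    # 0xab
print(hex(merge_high8(rax, 0xAB)))    # 0x112233445566ab88
```

If the hypothesis is right, the shift in `read_clean_high8` is where the extra cycle would come from, and the dirty case skips it.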
Extra latency doesn't mean reduced throughput. This program can run at 1 iter per 2 clocks, even though all the `add` instructions have 2c latency (from reading DH, which is not modified.)
global _start
_start:
mov ebp, 100000000
.loop:
add ah, dh
add bh, dh
add ch, dh
add al, dh
add bl, dh
add cl, dh
add dl, dh
dec ebp
jnz .loop
xor edi,edi
mov eax,231 ; __NR_exit_group from /usr/include/asm/unistd_64.h
syscall ; sys_exit_group(0)
Performance counter stats for './testloop':
48.943652 task-clock (msec) # 0.997 CPUs utilized
1 context-switches # 0.020 K/sec
0 cpu-migrations # 0.000 K/sec
3 page-faults # 0.061 K/sec
200,314,806 cycles # 4.093 GHz
100,024,930 branches # 2043.675 M/sec
900,136,527 instructions # 4.49 insn per cycle
800,219,617 uops_issued_any # 16349.814 M/sec
800,219,014 uops_executed_thread # 16349.802 M/sec
1,903 lsd_uops # 0.039 M/sec
0.049107358 seconds time elapsed
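Dividing those counters by the iteration count gives the per-iteration figures. A quick check, using the numbers from this perf output and the loop count from `mov ebp, 100000000`; the variable names are mine.

```python
iterations = 100_000_000          # loop count from mov ebp, 100000000

cycles       = 200_314_806
instructions = 900_136_527
uops_issued  = 800_219_617

cycles_per_iter = round(cycles / iterations, 2)
insns_per_iter  = round(instructions / iterations, 2)
uops_per_iter   = round(uops_issued / iterations, 2)
uops_per_clock  = round(uops_issued / cycles, 2)

print(cycles_per_iter, insns_per_iter, uops_per_iter, uops_per_clock)
```

That comes out at about 2 cycles, 9 instructions and 8 fused-domain uops per iteration (`dec` / `jnz` macro-fuse into one uop), i.e. very close to 4 uops issued per clock.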
Some interesting test loop bodies:
%if 1
imul eax,eax
mov dh, al
inc dh
inc dh
inc dh
; add al, dl
mov cl,dl
movzx eax,cl
%endif
Runs at ~2.35c per iteration on both HSW and SKL. Reading `dl` has no dep on the `inc dh` result. But using `movzx eax, dl` instead of `mov cl,dl` / `movzx eax,cl` causes a partial-register merge, and creates a loop-carried dep chain (8c per iteration).
%if 1
imul eax, eax
imul eax, eax
imul eax, eax
imul eax, eax
imul eax, eax ; off the critical path unless there's a false dep
%if 1
test ebx, ebx ; independent of the imul results
;mov ah, 123 ; dependent on RAX
;mov eax,0 ; breaks the RAX dependency
setz ah ; dependent on RAX
%else
mov ah, bl ; dep-breaking
%endif
add ah, ah
;; ;inc eax
; sbb eax,eax
rcr ebx, 1 ; dep on add ah,ah via CF
mov eax,ebx ; clear AH-dirty
;; mov [rdi], ah
;; movzx eax, byte [rdi] ; clear AH-dirty, and remove dep on old value of RAX
;; add ebx, eax ; make the dep chain through AH loop-carried
%endif
The setcc version (with the `%if 1`) has 20c loop-carried latency, and runs from the LSD even though it has `setcc ah` and `add ah,ah`.
00000000004000e0 <_start.loop>:
4000e0: 0f af c0 imul eax,eax
4000e3: 0f af c0 imul eax,eax
4000e6: 0f af c0 imul eax,eax
4000e9: 0f af c0 imul eax,eax
4000ec: 0f af c0 imul eax,eax
4000ef: 85 db test ebx,ebx
4000f1: 0f 94 d4 sete ah
4000f4: 00 e4 add ah,ah
4000f6: d1 db rcr ebx,1
4000f8: 89 d8 mov eax,ebx
4000fa: ff cd dec ebp
4000fc: 75 e2 jne 4000e0 <_start.loop>
Performance counter stats for './testloop' (4 runs):
4565.851575 task-clock (msec) # 1.000 CPUs utilized ( +- 0.08% )
4 context-switches # 0.001 K/sec ( +- 5.88% )
0 cpu-migrations # 0.000 K/sec
3 page-faults # 0.001 K/sec
20,007,739,240 cycles # 4.382 GHz ( +- 0.00% )
1,001,181,788 branches # 219.276 M/sec ( +- 0.00% )
12,006,455,028 instructions # 0.60 insn per cycle ( +- 0.00% )
13,009,415,501 uops_issued_any # 2849.286 M/sec ( +- 0.00% )
12,009,592,328 uops_executed_thread # 2630.307 M/sec ( +- 0.00% )
13,055,852,774 lsd_uops # 2859.456 M/sec ( +- 0.29% )
4.565914158 seconds time elapsed ( +- 0.08% )
Unexplained: it runs from the LSD, even though it makes AH dirty. (At least I think it does. TODO: try adding some instructions that do something with `eax` before the `mov eax,ebx` clears it.)
But with `mov ah, bl`, it runs in 5.0c per iteration (`imul` throughput bottleneck) on both HSW/SKL. (The commented-out store/reload works, too, but SKL has faster store-forwarding than HSW, and it's variable-latency...)
# mov ah, bl version
5,009,785,393 cycles # 4.289 GHz ( +- 0.08% )
1,000,315,930 branches # 856.373 M/sec ( +- 0.00% )
11,001,728,338 instructions # 2.20 insn per cycle ( +- 0.00% )
12,003,003,708 uops_issued_any # 10275.807 M/sec ( +- 0.00% )
11,002,974,066 uops_executed_thread # 9419.678 M/sec ( +- 0.00% )
1,806 lsd_uops # 0.002 M/sec ( +- 3.88% )
1.168238322 seconds time elapsed ( +- 0.33% )
Notice that it doesn't run from the LSD anymore.
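Putting the two sets of counters side by side makes the difference easier to see. This is a rough sketch: the iteration count of 1e9 is an assumption inferred from the ~1.0e9 branch counts (the loop count for this test isn't shown), and using `lsd_uops / uops_issued_any` as a "ran from the LSD" indicator is my own shorthand, not an official metric.

```python
iterations = 1_000_000_000   # assumption: inferred from ~1.0e9 branches

runs = {
    "setcc ah":   dict(cycles=20_007_739_240, issued=13_009_415_501, lsd=13_055_852_774),
    "mov ah, bl": dict(cycles=5_009_785_393,  issued=12_003_003_708, lsd=1_806),
}

summary = {}
for name, c in runs.items():
    cyc_per_iter = round(c["cycles"] / iterations, 2)
    lsd_share = round(c["lsd"] / c["issued"], 2)   # ~1 means the loop ran from the LSD
    summary[name] = (cyc_per_iter, lsd_share)
    print(f"{name:11s} {cyc_per_iter:5.2f} c/iter, lsd/issued = {lsd_share:.2f}")
```

The `setcc ah` version comes out at about 20 cycles per iteration with essentially all uops delivered by the LSD, while the `mov ah, bl` version runs at about 5 cycles per iteration with essentially none.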
Update: Possible evidence that IvyBridge still renames low16 / low8 registers separately from the full register, like Sandybridge but unlike Haswell and later.
InstLatX64 results from SnB and IvB show 0.33c throughput for `movsx r16, r8` (as expected, `movsx` is never eliminated and there were only 3 ALUs before Haswell).
But apparently InstLat's `movsx r16, r8` test bottlenecks Haswell / Broadwell / Skylake at 1c throughput (see also this bug report on the instlat github). Probably by writing the same architectural register, creating a chain of merges.
(The actual throughput for that instruction with separate destination registers is 0.25c on my Skylake. Tested with 7 `movsx` instructions writing to eax..edi and r10w/r11w, all reading from `cl`. And a `dec ebp/jnz` as the loop branch to make an even 8 uop loop.)
If I'm guessing right about what created that 1c throughput result on CPUs after IvB, it's doing something like running a block of `movsx dx, al`. And that can only run at more than 1 IPC on CPUs that rename `dx` separately from RDX instead of merging. So we can conclude that IvB actually does still rename low8 / low16 registers separately from full registers, and it wasn't until Haswell that they dropped that. (But something is fishy here: if this explanation was right, we should see the same 1c throughput on AMD which doesn't rename partial registers. But we don't, see below.)
Results with ~0.33c throughput for the `movsx r16, r8` (and `movzx r16, r8`) tests:
- IvB with AIDA64 build: 4.0.568.0 May 24 2013
- IvB-E build: 4.3.764.0 Jul 10 2017
- SnB-EP with a 2013 build
- SnB with a 2018 build.
Haswell results with a mysterious 0.58c throughput for `movsx/zx r16, r8`:
- A Haswell result with the same 4.3.764.0 Jul 10 2017 build of AIDA64
- Haswell-E with a 2014 build
Other earlier and later Haswell (and CrystalWell) / Broadwell / Skylake results are all 1.0c throughput for those two tests.
- HSW with 4.1.570.0 Jun 5 2013, BDW with 4.3.15787.0 Oct 12 2018, BDW with 4.3.739.0 Mar 17 2017.
As I reported in the linked InstLat issue on github, the "latency" numbers for `movzx r32, r8` ignore mov-elimination, presumably testing like `movzx eax, al`.
Even worse, the newer versions of InstLatX64 with separate-registers versions of the test, like `MOVSX r1_32, r2_8`, show latency numbers below 1 cycle, like 0.3c for that MOVSX on Skylake. This is total nonsense; I tested just to be sure.
The `MOVSX r1_16, r2_8` test does show 1c latency, so apparently they're just measuring the latency of the output (false) dependency. (Which doesn't exist for 32-bit and wider outputs).
But that `MOVSX r1_16, r2_8` test measured 1c latency on Sandybridge as well! So maybe my theory was wrong about what the `movsx r16, r8` test is telling us.
On Ryzen (AIDA64 build 4.3.781.0 Feb 21 2018), which we know doesn't do any partial-register renaming at all, the results don't show the 1c throughput effect that we'd expect if the test was really writing the same 16-bit register repeatedly. I don't find it on any older AMD CPUs either, with older versions of InstLatX64, like K10 or Bulldozer-family.
## Instlat Zen tests of ... something?
43 X86 :MOVSX r16, r8 L: 0.28ns= 1.0c T: 0.11ns= 0.40c
44 X86 :MOVSX r32, r8 L: 0.28ns= 1.0c T: 0.07ns= 0.25c
45 AMD64 :MOVSX r64, r8 L: 0.28ns= 1.0c T: 0.12ns= 0.43c
46 X86 :MOVSX r32, r16 L: 0.28ns= 1.0c T: 0.12ns= 0.43c
47 AMD64 :MOVSX r64, r16 L: 0.28ns= 1.0c T: 0.13ns= 0.45c
48 AMD64 :MOVSXD r64, r32 L: 0.28ns= 1.0c T: 0.13ns= 0.45c
IDK why throughput isn't 0.25 for all of them; seems weird. This might be a version of the 0.58c Haswell throughput effect. MOVZX numbers are the same, with 0.25 throughput for the no-prefixes version that reads R8 and writes an R32. Maybe there's a bottleneck on fetch/decode for larger instructions? But `movsx r32, r16` is the same size as `movsx r32, r8`.
The separate-reg tests show the same pattern as on Intel, though, with 1c latency only for the one that has to merge. MOVZX is the same.
## Instlat Zen separate-reg tests
2252 X86 :MOVSX r1_16, r2_8 L: 0.28ns= 1.0c T: 0.08ns= 0.28c
2253 X86 :MOVSX r1_32, r2_8 L: 0.07ns= 0.3c T: 0.07ns= 0.25c
2254 AMD64 :MOVSX r1_64, r2_8 L: 0.07ns= 0.3c T: 0.07ns= 0.25c
2255 X86 :MOVSX r1_32, r2_16 L: 0.07ns= 0.3c T: 0.07ns= 0.25c
Excavator results are also pretty similar to this, but of course lower throughput.
https://www.uops.info/table.html confirms that Zen+ has the expected 0.25c throughput (and 1c latency) for `MOVSX_NOREX (R16, R8)`, same as Instlat found with their separate-reg tests.
Perhaps InstLat's throughput test for `MOVSX r16, r8` (not `MOVSX r1_16, r2_8`) only uses 2 or 3 dep chains, which isn't enough for modern CPUs? Or perhaps breaks the dep chain occasionally so OoO exec can overlap some?