Optimizing an incrementing ASCII decimal counter in video RAM on 7th gen Intel Core
If a counter ticks in the forest, does anyone see it?
Our requirements state that every single change of a number has to be visible on screen.
Your screen's refresh rate is probably 60 Hz, maybe as high as 144 Hz. Changing video RAM any faster than that will leave some counts unread by the hardware scan-out loop over the framebuffer [1], never sent to a physical screen, and never turned into a pattern of photons of visible light that a high-speed camera could record.
[1] Or the virtual equivalent if VGA text mode is emulated somehow on top of hardware that only knows how to draw pixels. Asked Does modern PC video hardware support VGA text mode in HW, or does the BIOS emulate it (with System Management Mode)? as a follow-up.
If we don't accept this limit of 1 increment per 16.66.. ms (60 Hz), we need to decide what we're willing to bottleneck on vs. what we can sidestep.
Certainly we need to do the actual work of calculating the ASCII digits, not just incrementing a binary counter and occasionally formatting it into a string in a timer or vertical-blanking interrupt (once per screen refresh). That wouldn't satisfy the spirit of the assignment.
Or what if we calculate the ASCII digits purely in registers and only do mov stores in a timer or vblank interrupt? That would sample the fast-incrementing counter asynchronously from its increments, so you'd visually see all the low digits changing. (Which is a pretty clear minimum requirement.)
Omitting stores from the actual loop still doesn't feel like it hits the spirit of the assignment. I think our loop should, if run on its own with no fancy hardware setup, truly get every count all the way to video RAM. That seems uncontroversial. That's what the original code does.
The CPU can be configured to do write-combining with MTRRs. Some desktops had a BIOS option to set the AGP GART as UC (UnCacheable) vs. WC (calling it "USWC = Uncacheable Speculative Write Combining"). This BIOS-tuning article has a section on it. It seems modern firmware leaves VGA memory UC, letting OSes / graphics drivers set up MTRRs / PAT.
Unfortunately, making VGA memory WC works too well: the stores never make it out of the CPU core's write-combining buffer (an LFB, since this is an Intel CPU). We can flush manually after every store with a memory barrier like mfence, or clflushopt with the address of the cache line. But then we're back where we started, because on the OP's Kaby Lake iGPU / firmware it seems that flushing a WC store costs about the same as just doing a UC store.
Of course we only have to flush when the whole counter is in sync, after updating all the digits if a carry rippled far. If we were storing each digit separately, this could speed us up by 11.111% (if I have my math right) vs. UC memory. Or if we were doing dword stores of 2 digits at once, by 1.0101%, because we only need an extra store every 100 counts, not every 10.
I think we can capture the spirit of the assignment while still letting the hardware optimize our stores by using a WC framebuffer and flushing in a timer or vblank interrupt.
This means we're incrementing a counter very fast (nearly 1 count per core clock cycle with a careful implementation). And we sample that counter by merely using a memory barrier or serializing instruction in an interrupt handler that runs right before the video hardware starts a new pass at the top left of the screen, scanning out a new frame. In fact iret is serializing, so merely returning from an empty interrupt handler will do the job. Holding down a key on the keyboard may even make the counter updates visible on screen (where they weren't otherwise) if you used the MTRR to make video RAM WC but didn't program a timer or vertical-blanking interrupt to fire periodically.
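A rough sketch of such a handler (my own illustration, not code from the question or either answer; it assumes a bare-metal setup where IVT entry 08h has already been pointed at timer_isr):

timer_isr:
    push ax
    ; optionally force the line all the way to VRAM here, e.g. clflushopt [es:di]
    mov  al, 20h
    out  20h, al          ; EOI to the master PIC so IRQ0 keeps firing
    pop  ax
    iret                  ; serializing, so pending WC-buffer data becomes visible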
Using clflush or mfence from an outer level of the loop wouldn't work well; that would be synchronous with the increments and would thus leave the low digits always zero. It would make the fact that we only sometimes flush explicit in the loop, instead of leaving the flushing as something that happens because of interrupts that are part of normal system operation. (Or at least they would be if this bootloader weren't literally the only thing running, e.g. if run under DOS you'd have a timer interrupt every few ms.)
If we insist on flushing to video RAM every count (either by leaving it UC, or manually with WC + explicit flushes in the loop), the only optimization that would matter is reducing the number of stores to video RAM. i.e. by not updating digits that aren't changing. The original code stores every digit every time, so fixing that should give very close to a 10x speedup.
Even just storing to uncacheable DRAM or making a PCIe transaction is much slower than anything you could optimize inside the loop, even a self-modifying-code machine clear. And if storing to a VGA text framebuffer triggers a System Management Mode Interrupt (SMI) to emulate text mode by updating a real pixel framebuffer, the cost of a store to the framebuffer is astronomical compared to anything else you could do in the loop. This might well be how firmware for our Skylake / Kaby Lake integrated GPUs works: Does modern PC video hardware support VGA text mode in HW, or does the BIOS emulate it (with System Management Mode)?
Allowing the hardware to do write-combining on our stores to VRAM is thus essential to making this optimization problem interesting beyond that one algorithmic tweak.
To do this, program the MTRR for the VGA framebuffer. https://wiki.osdev.org/MTRR documents the actual MSRs you can use with the wrmsr instruction. I think each MSR has a bit-field of 8 regions. The one you want is IA32_MTRR_FIX16K_A0000, in MSR[259h]: 8 regions of 16 KB each (128 KB total) which include the linear address block B8000 that holds VGA text-mode memory. Figure 11-8 in Intel's SDM vol. 3 documents the layout.
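A minimal sketch of that (assumptions mine: running in ring 0, MTRRs supported, and the full cache-disable/wbinvd sequence from SDM vol. 3 section 11.11.8 omitted for brevity):

    mov  ecx, 259h              ; IA32_MTRR_FIX16K_A0000: 8 type bytes in EDX:EAX,
    rdmsr                       ; one per 16 KB region from A0000 to BFFFF
    and  edx, 0xFF00FFFF        ; byte 6 (bits 16-23 of EDX) covers B8000-BBFFF
    or   edx, 0x00010000        ; memory type 01h = WC
    wrmsr
    mov  ecx, 2FFh              ; IA32_MTRR_DEF_TYPE
    rdmsr
    or   eax, (1<<10)|(1<<11)   ; FE = fixed-range MTRRs enabled, E = MTRRs enabled
    wrmsr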
Assuming WC video memory (or for updating WB cacheable memory)
There are lots of things to improve, but two critical things:
- Micro-architectural: Self-modifying-code pipeline nukes, aka machine clears, from count[] being in the same 64B cache line as your main loop (~50x performance with no other changes). Without changing this, it's hard to see any gains from any other micro-optimizations.
- Algorithmic: Don't blindly propagate carry all the way up through every digit every time: 90% of the increments don't carry at all, 99% carry only 1 place, etc. Nested loops to handle the low digits can run very efficiently, just incrementing their own digit counter and having the outer loop reset it to '0'; no need to explicitly propagate those carries with adc. Keeping those ASCII digits in registers also avoids the need to load/store them to counts[], just pure stores to video RAM, like mov [di-4], eax.

With very efficient inner loops for the low digits, performance of the upper 6 or 7 digits becomes nearly irrelevant. That part runs once per 10k or 1k increments so its cost is amortized. (~19x speedup for aggressively optimized inner loops vs. a micro-optimized version of your original loop that saves some uops and avoids some bottlenecks without changing the algorithm.)
Other micro-optimizations of your original (after fixing the SMC machine clears) gave a factor of ~1.5x speedup: making the carry branch normally not-taken, saving some uops, and avoiding some partial-register false dependencies from lodsb and from writing 16-bit partial registers.
With the optimized 4 levels of inner loops I rewrote from scratch, my version is about 29x faster on Skylake / Kaby Lake than the no-SMC-stall version of the original, or ~1500x faster than the true original. There's certainly a middle ground where you do adc carry propagation but take an early out when CF==0; I didn't try to implement that.
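For what it's worth, a rough sketch of that middle ground could look like the following (my guess, untested; shown with inc/cmp per digit rather than a literal adc chain, and assuming counter[] holds ten ASCII digits, most significant first, as in the original):

    mov   bx, 9                    ; offset of the least-significant digit
.ripple:                           ; do {
    mov   al, [counter+bx]
    inc   al                       ;   add 1 (or the carry from the digit below)
    cmp   al, '9'
    jbe   .store                   ;   no carry out: store this digit and stop
    mov   byte [counter+bx], '0'   ;   wrap this digit and keep propagating
    ; (also store '0' to this digit's cell in video RAM here)
    dec   bx
    jns   .ripple                  ; } while more-significant digits remain
    jmp   .done                    ; all ten digits just rolled over to zero
.store:
    mov   [counter+bx], al
    ; (also store AL to this digit's cell in video RAM here)
.done: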
Tested in 32-bit mode, but the same code assembled for 16-bit mode should execute the same way, including the SMC stalls in your original. (Assuming WC stores don't trigger an SMI until flushed, and that the WC buffer keeps the stores local inside the core so ~1 store / clock is possible just like with WB memory.)
SKL and KBL are clock-for-clock identical in perf (same microarchitecture), so my test results should be reproducible for you. I did assemble your code in 16-bit mode to see alignment: it looks like your loop will have some bytes of count[] in the same 64-byte cache line as the end of the loop, hence an SMC pipeline nuke per iteration for most digits.
I adapted your original code so I could run the same loop in 32-bit mode under Linux, making it possible to use perf to profile with HW performance counters. The first step in optimizing anything is to get a baseline measurement. Since you mention some micro-optimizations for micro-architectural reasons, we want perf counters, not just total time. We can't easily get that in a bootloader on bare metal. Possibly in a guest VM, but then you'd be storing to a virtual VGA device, not real hardware, so it's probably no different from using normal or NT stores on normal WB memory in user-space under Linux.
perf stat -I1000 to show counters for the amount of work done every second is a handy way to compare speed for tweaks that don't change the algorithm or number of branches. Look at the count of branches in 1 second to see the relative speed of the loop, or divide that by cycles.
I used movnti to try to simulate a store to WC video RAM (uncacheable speculative Write-Combining, instead of normal WB = write-back cacheable). I think normal stores to WC memory regions behave like movnt stores. movnt stores that don't complete a cache line can keep updating the same write-combining LFB without actually flushing to memory, so it's similar to a normal store to WB memory that can hit in L1d cache.
SMI trapping of framebuffer stores (if done at all) is done by hardware outside the CPU core, probably the System Agent, so it doesn't fire until the core flushes. Or if no SMI trap, then probably it just goes to DRAM on our iGPU systems. Or over a PCIe bus to get to video RAM on a separate card.
Versions timed under GNU/Linux kernel 5.5.10 on i7-6700k on a somewhat idle system at ~4.2GHz
DRAM and cache are barely involved, and the system was idle enough that nothing was taking cycles on the other logical core of the physical core, so the code had a whole CPU to itself the whole time to spam stores into a write-combining buffer.
- Original version, ported to run in 32-bit user-space: Godbolt - not fully timed, but perf stat -I1000 to print stats per second shows it's running about 52x slower than with align 64 before counter:. The pipeline nuke may include flushing the WC buffers, which would mean going to DRAM as well.
. The pipeline nuke may include flushing WC buffers which would mean going to DRAM as well. - Original version, with SMC pipeline nukes avoided: ~85.7 seconds, ~358 billion core clock cycles for 10^10 counts. 2.66 IPC
- Micro-optimized version of that: Godbolt - ~55.3 seconds, ~231 billion clock cycles for 10^10 counts. 4.56 IPC (but with more, simpler instructions, not lodsb)
- New inner loops, empty placeholder outer loop: Godbolt - ~2.93 seconds, ~12.25 billion core clock cycles. 2.73 IPC
The optimized version achieves close to 3 stores per 4 clocks. (Counting the low 2 digits from 00..99 takes 100 stores, the way it does it. I didn't time these final versions with clflushopt.)
If you fixed some of the stalls and stopped your loop when CF==0, this would result in a bottleneck on store/reload (store-forwarding) latency to the low element of the count array. You definitely want those digits in registers so they can be store-only, not load/adc/store.
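To illustrate (names mine, not from the timed versions): the first fragment below is loop-carried through memory, so each iteration waits for store-forwarding from its own previous store, while the second keeps the digit in a register so only a 1-cycle add is loop-carried and the stores feed nothing.

.mem_carried:                   ; loop-carried through memory: several cycles per iteration
    mov  al, [counter+9]        ; reloads the value stored on the previous iteration
    add  al, 1
    mov  [counter+9], al
    jmp  .mem_carried           ; (rest of loop omitted)

.reg_carried:                   ; loop-carried only through DL: 1 cycle per iteration
    add  dl, 1
    mov  [es:di], dl            ; the store is a side effect; nothing reloads it
    jmp  .reg_carried           ; (rest of loop omitted)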
TODO: comment and talk about the micro-optimizations that I applied for that version:
- Why doesn't GCC use partial registers? / How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent. Also, lodsb sucks; lodsd/q are ok. Use movzx to do narrow loads, instead of merging into the low byte. Fortunately inc/dec in an adc loop on Sandybridge-family is fine, not causing partial-flag stalls like it would on P6-family. Especially in Skylake, which doesn't do flag-merging at all, instead just reading the CF and/or SPAZO parts of FLAGS separately if needed. (Consequence: cmovbe and cmova are 2 uops to read 2 integer inputs and CF + ZF; other cmov are only 1 uop.)
- You can use 32-bit registers in 16-bit mode; you don't have to switch modes. The assembler just uses an operand-size prefix. Writing a 32-bit register has no dependency on the old value, but writing a 16- or 8-bit register does. I used this to break dependency chains that would otherwise be loop-carried, allowing the CPU to exploit instruction-level parallelism (ILP) across loop iterations (http://www.lighterra.com/papers/modernmicroprocessors/). See the short example after this list.
- Haswell/Skylake have taken-branch throughput of 1/clock, but can run a not-taken and a taken branch in the same cycle. Lay out branches to favour not-taken on the fast path (always a good idea in general).
- Which Intel microarchitecture introduced the ADC reg,0 single-uop special case? - adc al,0 is unfortunately 2 uops on Skylake, unlike adc eax,0 or adc bl,0. Crazy, right? This is basically a CPU performance bug or a missed optimization by the hardware designers, where the special-case opcodes for smaller encodings decode worse.
- 32-byte aligned routine does not fit the uops cache - Intel's recent JCC erratum makes the idq.mite_uops perf event worth checking. Skylake used to be pretty robust against code alignment, but now it's horrible for high-throughput code. Perf doesn't totally fall off a cliff, but a significant factor is possible because of front-end bottlenecks from having to use legacy decode for some 32-byte blocks of machine code that end with a jcc on a 32-byte boundary. I didn't spend a lot of effort on this optimization for this code, but the fast versions happen to avoid this problem according to perf counters.
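For example, the partial-register points above boil down to preferring instruction forms that write a full register (illustrative snippet, mine, not from the timed versions):

    movzx eax, byte [si]   ; writes all of EAX: no dependency on the old value
    mov   al, [si]         ; merges into AL: carries a (false) dependency on the old EAX
    inc   ecx              ; writes all of ECX: starts a fresh dependency chain
    inc   cx               ; 16-bit write must merge with the old upper half of ECX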
My version with nested loops, testable on GNU/Linux
This is only the inner loop; the outer loop just repeats it 10^10 / 10k times with no actual outer-loop work. We only leave the inner 4 loops once per 10k increments so pretending that part takes zero time doesn't change the result particularly.
The same pattern of 2 nested levels of looping per register could be repeated more times, or you could just do a chain of adc like you were doing.
;; nasm -felf32 decimal-counter.asm
;; ld -N -melf_i386 -o decimal-counter decimal-counter.o
;; writeable text segment like a bootloader
;; runs in 32-bit mode with prefixes for 16-bit operand-size
;;
;; taskset -c 3 perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles:u,branches:u,instructions:u,uops_issued.any:u,uops_executed.thread:u,resource_stalls.any:u,rs_events.empty_cycles:u,machine_clears.count:u -I1000 ./decimal-counter
%use smartalign
alignmode p6, 64
;org 7c00h
;pos equ vram + 2*(2*80-2) ;address on screen
pos equ vram + 2*(2*80-4) ;address on screen
; In GDB, use
; p ((char*)&vram) + 2*(2*80-4)-36
;init
;cli
;mov ax,3
;int 10h
;mov ax,0b800h
;mov es,ax
;jmp 0:start
; pick your poison, or let stores stay in the CPU, not reaching VRAM
%macro FLUSH 1
; clflushopt %1 ; all the way to DRAM
; mfence ; for mov to WB: just drain store buffer. For WC or movnt, IDK how guaranteed it is to hit DRAM
; lock xor byte [esp], 0 ; faster version of mfence (at least on Skylake)
%endmacro
;%define movnti mov ; for experiments
global _start
align 512
_start:
; push cs
; pop ds
; mov ebp, counter+9 ; save address in a register
; mov edi,pos
mov edi, pos - 10*4
mov eax, '0_0_'
mov ecx, 10
rep stosw ; memset the digits in VRAM
mov ebp, 10000000000 / 10000 ; outer loop iterations
mov edi, pos-4
; mov ah, 4Eh ; VGA attribute byte
; mov eax, '____'
align 32
.outer:
mov edx, '0_0_' ; thousands (low), hundreds (high) digits
.thousands:
.hundreds:
movnti [edi-4], edx
; don't want to flush yet; only after low digits are updated
add edx, 1<<16
mov eax, '0_0_' ; tens (low=AX), ones (high) digits
.tens:
.ones: ; do{
movnti [edi], eax ; store low 2 digits
FLUSH [edi]
lea ecx, [eax + (1<<16)] ; off the critical path of the EAX dep chain
movnti [edi], ecx
FLUSH [edi]
add eax, 2<<16 ; unroll by 2
cmp eax, '9_'<<16
jle .ones ; }while(ones<='9')
; mov byte [edi+2], '9' ; peel the last 2 iterations?
add eax, ('1_0_') - ('0_0_' + (10<<16)) ; increment the more-significant digit (AL), resetting less-significant digit back to '0'
cmp al, '9'
jle .tens
cmp edx, '9_9_'
jle .hundreds
add edx, ('1_0_') - ('0_0_' + (10<<16)) ; increment the more-significant digit (DL), resetting less-significant digit back to '0'
cmp dl, '9'
jle .thousands
;; TODO: increment the high 6 digits, propagating carry. Possibly clflushopt here only?
; pause
dec ebp
jnz .outer
; jmp $
mov eax, 1
int 0x80
;section .data ; avoids machine clears
; in original 16-bit code: counter starts at 00000037 30<rept>, ends at 00000040 (inclusive), in same cache line as the loop
align 64
counter:
times 10 db '0'
;section .text
times 510-($-$$) db 0
dw 0aa55h
section .bss
vram: resw 80*25
I have tested that this works for the low digits, single-stepping it in GDB and using display ((char*)&vram) + 2*(2*80-4)-36 or something like that to show the contents of that part of the BSS as a string every step.
Using dword stores means that when the ones place wraps, we don't need a separate store to update the tens place. It just has to update the low byte of the same register and let the first iteration of the inner loop do that store.
During rollover from 0099 to 0100, memory contents are temporarily 0199. But unless you use SSE to store 16 bytes at once, you can't really avoid one problem or the other. The other option would be to somehow arrange for 0000 before 0100, but that might waste a store to the tens/ones dword in the hundreds loop.
Here's my take on it. The following optimisations have been applied:
- the least significant digit has been unrolled completely for best performance
- the remaining digits have been unrolled to one section per digit
- BCD arithmetic has been used to reduce the code to one conditional branch per digit
- segment usage has been shuffled around to reduce the number of prefixes used
- instruction order has been optimised to move long-latency instructions out of the critical path
Additionally, I have altered the code to be a COM binary for easier testing. Turning it back into a boot loader is left as an exercise to the reader. One thing you can do once it's a boot loader is fixing the code such that CS and SS have a segment base of 0000. This avoids a penalty on loads and stores on some microarchitectures.
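A possible way to do that in a boot-sector version (my sketch, not part of the code below): a far jump normalises CS in case the firmware jumped to 07C0:0000 instead of 0000:7C00, and SS:SP is set explicitly right after.

org 7c00h
    jmp  0:start          ; far jump: CS = 0, so the CS base is 0000
start:
    xor  ax, ax
    mov  ss, ax           ; SS base 0000 too (mov ss inhibits interrupts for one instruction)
    mov  sp, 7c00h        ; stack grows down from just below the boot sector
    cld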
org 100h
pos equ 2*(2*80-12) ; address on screen
mov ax, 3 ; set up video mode
int 10h
mov ax, 0b800h
mov ds, ax
mov es, ax
mov di, pos
mov ax, 4e30h ; '0' + attribute byte 4e
mov cx, 10
cld
rep stosw ; set up initial display
xor ax, ax
sub sp, 10
push ax
push ax
push ax
push ax
push ax
mov bp, sp ; set up counter
dec di
dec di ; di points to the last digit on screen
mov bx, digits ; translation table
jmp countloop
%macro docarry 1 ; digits other than the last one
mov al, [bp+%1] ; second to last digit
inc ax ; add carry to al
aaa ; generate BCD carry
mov [bp+%1], al ; deposit to counter
cs xlat ; generate ASCII digit
mov [di-2*9+2*%1], al ; display digit
jnc countloop ; exit when carry dies
%endm
docarry2: ; place this here so jumps are in range
docarry 2
docarry 1
docarry 0
int 20h
align 16 ; for performance
countloop:
mov [di], byte '0' ; treat last digit separately
mov [di], byte '1'
mov [di], byte '2'
mov [di], byte '3'
mov [di], byte '4'
mov [di], byte '5'
mov [di], byte '6'
mov [di], byte '7'
mov [di], byte '8'
mov [di], byte '9'
docarry 8
docarry 7
docarry 6
docarry 5
docarry 4
docarry 3
jmp docarry2
digits:
db '0123456789'
This increases the speed by a factor of about 30 compared to the original code on my 8 MHz 80286-based machine and manages to increment the counter about 329000 times per second (about 3.04 µs per count). It's going to be a bit hard to test on a modern system, but I'll try to find a solution.