Assembly Language (x86): How to create a loop to calculate Fibonacci sequence
related: Code-golf print the first 1000 digits of Fib(10**9): my x86 asm answer using an extended-precision adc
loop, and converting binary to strings. The inner loop is optimized for speed, other parts for size.
Computing a Fibonacci sequence only requires keeping two pieces of state: the current and previous element. I have no idea what you're trying to do with fibInitial
, other than counting its length. This isn't perl where you do for $n (0..5)
.
I know you're just learning asm at all, but I'm still going to talk about performance. There's not much reason to learn asm without knowing what's fast and what's not. If you don't need performance, let a compiler make the asm for you, from C sources. Also see the other links at https://stackoverflow.com/tags/x86/info
Using registers for your state simplifies the problem of needing to look at a[-1]
while calculating a[1]
. You start with curr=1
, prev=0
, and start with a[0] = curr
. To produce the "modern" starting-with-zero sequence of Fibonacci numbers, start with curr=0
, prev=1
.
Lucky for you, I was just thinking about an efficient loop for fibonacci code recently, so I took the time to write up a complete function. See below for an unrolled and a vectorized version (saves on store instructions, but also makes 64bit ints fast even when compiling for a 32bit CPU):
; fib.asm
;void fib(int32_t *dest, uint32_t count);
; not-unrolled version. See below for a version which avoids all the mov instructions
global fib
fib:
; 64bit SysV register-call ABI:
; args: rdi: output buffer pointer. esi: count (and you can assume the upper32 are zeroed, so using rsi is safe)
;; locals: rsi: endp
;; eax: current edx: prev
;; ecx: tmp
;; all of these are caller-saved in the SysV ABI, like r8-r11
;; so we can use them without push/pop to save/restore them.
;; The Windows ABI is different.
test esi, esi ; test a reg against itself instead of cmp esi, 0
jz .early_out ; count == 0.
mov eax, 1 ; current = 1
xor edx, edx ; prev = 0
lea rsi, [rdi + rsi * 4] ; endp = &out[count]; // loop-end pointer
;; lea is very useful for combining add, shift, and non-destructive operation
;; this is equivalent to shl rsi, 4 / add rsi, rdi
align 16
.loop: ; do {
mov [rdi], eax ; *buf = current
add rdi, 4 ; buf++
lea ecx, [rax + rdx] ; tmp = curr+prev = next_cur
mov edx, eax ; prev = curr
mov eax, ecx ; curr=tmp
;; see below for an unrolled version that doesn't need any reg->reg mov instructions
; you might think this would be faster:
; add edx, eax ; but it isn't
; xchg eax, edx ; This is as slow as 3 mov instructions, but we only needed 2 thanks to using lea
cmp rdi, rsi ; } while(buf < endp);
jb .loop ; jump if (rdi BELOW rsi). unsigned compare
;; the LOOP instruction is very slow, avoid it
.early_out:
ret
An alternate loop condition could be
dec esi ; often you'd use ecx for counts, but we had it in esi
jnz .loop
AMD CPUs can fuse cmp/branch, but not dec/branch. Intel CPUs can also macro-fuse dec/jnz
. (Or signed less than zero / greater than zero). dec/inc
don't update the Carry flag, so you can't use them with above/below unsigned ja/jb
. I think the idea is that you could do an adc
(add with carry) in a loop, using inc/dec
for the loop counter to not disturb the carry flag, but partial-flags slowdowns make this bad on modern CPUs.
lea ecx, [eax + edx]
needs an extra byte (address-size prefix), which is why I used a 32bit dest and a 64bit address. (Those are the default operand sizes for lea
in 64bit mode). No direct impact on speed, just indirect through code size.
An alternate loop body could be:
mov ecx, eax ; tmp=curr. This stays true after every iteration
.loop:
mov [rdi], ecx
add ecx, edx ; tmp+=prev ;; shorter encoding than lea
mov edx, eax ; prev=curr
mov eax, ecx ; curr=tmp
Unrolling the loop to do more iterations would mean less shuffling. Instead of mov
instructions, you just keep track of which register is holding which variable. i.e. you handle assignments with a sort of register renaming.
.loop: ;; on entry: ; curr:eax prev:edx
mov [rdi], eax ; store curr
add edx, eax ; curr:edx prev:eax
.oddentry:
mov [rdi + 4], edx ; store curr
add eax, edx ; curr:eax prev:edx
;; we're back to our starting state, so we can loop
add rdi, 8
cmp rdi, rsi
jb .loop
The thing with unrolling is that you need to clean up any odd iterations that are left over. Power-of-two unroll factors can make the cleanup loop slightly easier, but adding 12 isn't any faster than adding 16. (See the previous revision of this post for a silly unroll-by-3 version using lea
to produce curr + prev
in a 3rd register, because I failed to realize that you don't actually need a temp. Thanks to rcgldr for catching that.)
See below for a complete working unrolled version which handles any count.
Test frontend (new in this version: a canary element to detect asm bugs writing past the end of the buffer.)
// fib-main.c
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
void fib(uint32_t *buf, uint32_t count);
int main(int argc, const char *argv[]) {
uint32_t count = 15;
if (argc > 1) {
count = atoi(argv[1]);
}
uint32_t buf[count+1]; // allocated on the stack
// Fib overflows uint32 at count = 48, so it's not like a lot of space is useful
buf[count] = 0xdeadbeefUL;
// uint32_t count = sizeof(buf)/sizeof(buf[0]);
fib(buf, count);
for (uint32_t i ; i < count ; i++){
printf("%u ", buf[i]);
}
putchar('\n');
if (buf[count] != 0xdeadbeefUL) {
printf("fib wrote past the end of buf: sentinel = %x\n", buf[count]);
}
}
This code is fully working and tested (unless I missed copying a change in my local file back into the answer >.<):
peter@tesla:~/src/SO$ yasm -f elf64 fib.asm && gcc -std=gnu11 -g -Og fib-main.c fib.o
peter@tesla:~/src/SO$ ./a.out 48
1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181 6765 10946 17711 28657 46368 75025 121393 196418 317811 514229 832040 1346269 2178309 3524578 5702887 9227465 14930352 24157817 39088169 63245986 102334155 165580141 267914296 433494437 701408733 1134903170 1836311903 2971215073 512559680
unrolled version
Thanks again to rcgldr for getting me thinking about how to handle odd vs. even count in the loop setup, rather than with a cleanup iteration at the end.
I went for branchless setup code, which adds 4 * count%2 to the starting pointer. That can be zero, but adding zero is cheaper than branching to see if we should or not. The Fibonacci sequence overflows a register very quickly, so keeping the prologue code tight and efficient is important, not just the code inside the loop. (If we're optimizing at all, we'd want to optimize for many calls with short length).
; 64bit SysV register-call ABI
; args: rdi: output buffer pointer. rsi: count
;; locals: rsi: endp
;; eax: current edx: prev
;; ecx: tmp
;; all of these are caller-saved in the SysV ABI, like r8-r11
;; so we can use them without push/pop to save/restore them.
;; The Windows ABI is different.
;void fib(int32_t *dest, uint32_t count); // unrolled version
global fib
fib:
cmp esi, 1
jb .early_out ; count below 1 (i.e. count==0, since it's unsigned)
mov eax, 1 ; current = 1
mov [rdi], eax
je .early_out ; count == 1, flags still set from cmp
;; need this 2nd early-out because the loop always does 2 iterations
;;; branchless handling of odd counts:
;;; always do buf[0]=1, then start the loop from 0 or 1
;;; Writing to an address you just wrote to is very cheap
;;; mov/lea is about as cheap as best-case for branching (correctly-predicted test/jcc for count%2==0)
;;; and saves probably one unconditional jump that would be needed either in the odd or even branch
mov edx, esi ;; we could save this mov by using esi for prev, and loading the end pointer into a different reg
and edx, eax ; prev = count & 1 = count%2
lea rsi, [rdi + rsi*4] ; end pointer: same regardless of starting at 0 or 1
lea rdi, [rdi + rdx*4] ; buf += count%2
;; even count: loop starts at buf[0], with curr=1, prev=0
;; odd count: loop starts at buf[1], with curr=1, prev=1
align 16 ;; the rest of this func is just *slightly* longer than 16B, so there's a lot of padding. Tempting to omit this alignment for CPUs with a loop buffer.
.loop: ;; do {
mov [rdi], eax ;; *buf = current
; on loop entry: curr:eax prev:edx
add edx, eax ; curr:edx prev:eax
;.oddentry: ; unused, we used a branchless sequence to handle odd counts
mov [rdi+4], edx
add eax, edx ; curr:eax prev:edx
;; back to our starting arrangement
add rdi, 8 ;; buf++
cmp rdi, rsi ;; } while(buf < endp);
jb .loop
; dec esi ; set up for this version with sub esi, edx; instead of lea
; jnz .loop
.early_out:
ret
To produce the starting-with-zero sequence, do
curr=count&1; // and esi, 1
buf += curr; // lea [rdi], [rdi + rsi*4]
prev= 1 ^ curr; // xor eax, esi
instead of the current
curr = 1;
prev = count & 1;
buf += count & 1;
We can also save a mov
instruction in both versions by using esi
to hold prev
, now that prev
depends on count
.
;; loop prologue for sequence starting with 1 1 2 3
;; (using different regs and optimized for size by using fewer immediates)
mov eax, 1 ; current = 1
cmp esi, eax
jb .early_out ; count below 1
mov [rdi], eax
je .early_out ; count == 1, flags still set from cmp
lea rdx, [rdi + rsi*4] ; endp
and esi, eax ; prev = count & 1
lea rdi, [rdi + rsi*4] ; buf += count & 1
;; eax:curr esi:prev rdx:endp rdi:buf
;; end of old code
;; loop prologue for sequence starting with 0 1 1 2
cmp esi, 1
jb .early_out ; count below 1, no stores
mov [rdi], 0 ; store first element
je .early_out ; count == 1, flags still set from cmp
lea rdx, [rdi + rsi*4] ; endp
mov eax, 1 ; prev = 1
and esi, eax ; curr = count&1
lea rdi, [rdi + rsi*4] ; buf += count&1
xor eax, esi ; prev = 1^curr
;; ESI:curr EAX:prev (opposite of other setup)
;;
;; optimized for code size, NOT speed. Prob. could be smaller, esp. if we want to keep the loop start aligned, and jump between before and after it.
;; most of the savings are from avoiding mov reg, imm32,
;; and from counting down the loop counter, instead of checking an end-pointer.
;; loop prologue for sequence starting with 0 1 1 2
xor edx, edx
cmp esi, 1
jb .early_out ; count below 1, no stores
mov [rdi], edx ; store first element
je .early_out ; count == 1, flags still set from cmp
xor eax, eax ; movzx after setcc would be faster, but one more byte
shr esi, 1 ; two counts per iteration, divide by two
;; shift sets CF = the last bit shifted out
setc al ; curr = count&1
setnc dl ; prev = !(count&1)
lea rdi, [rdi + rax*4] ; buf+= count&1
;; extra uop or partial register stall internally when reading eax after writing al, on Intel (except P4 & silvermont)
;; EAX:curr EDX:prev (same as 1 1 2 setup)
;; even count: loop starts at buf[0], with curr=0, prev=1
;; odd count: loop starts at buf[1], with curr=1, prev=0
.loop:
...
dec esi ; 1B smaller than 64b cmp, needs count/2 in esi
jnz .loop
.early_out:
ret
vectorized:
The Fibonacci sequence isn't particularly parallelizable. There's no simple way to get F(i+4) from F(i) and F(i-4), or anything like that. What we can do with vectors is fewer stores to memory. Start with:
a = [f3 f2 f1 f0 ] -> store this to buf
b = [f2 f1 f0 f-1]
Then a+=b; b+=a; a+=b; b+=a;
produces:
a = [f7 f6 f5 f4 ] -> store this to buf
b = [f6 f5 f4 f3 ]
This is less silly when working with two 64bit ints packed into a 128b vector. Even in 32bit code, you can use SSE to do 64bit integer math.
A previous version of this answer has an unfinished packed-32bit vector version that doesn't properly handle count%4 != 0
. To load the first 4 values of the sequence, I used pmovzxbd
so I didn't need 16B of data when I could use only 4B. Getting the first -1 .. 1 values of the sequence into vector registers is a lot easier, because there's only one non-zero value to load and shuffle around.
;void fib64_sse(uint64_t *dest, uint32_t count);
; using SSE for fewer but larger stores, and for 64bit integers even in 32bit mode
global fib64_sse
fib64_sse:
mov eax, 1
movd xmm1, eax ; xmm1 = [0 1] = [f0 f-1]
pshufd xmm0, xmm1, 11001111b ; xmm0 = [1 0] = [f1 f0]
sub esi, 2
jae .entry ; make the common case faster with fewer branches
;; could put the handling for count==0 and count==1 right here, with its own ret
jmp .cleanup
align 16
.loop: ; do {
paddq xmm0, xmm1 ; xmm0 = [ f3 f2 ]
.entry:
;; xmm1: [ f0 f-1 ] ; on initial entry, count already decremented by 2
;; xmm0: [ f1 f0 ]
paddq xmm1, xmm0 ; xmm1 = [ f4 f3 ] (or [ f2 f1 ] on first iter)
movdqu [rdi], xmm0 ; store 2nd last compute result, ready for cleanup of odd count
add rdi, 16 ; buf += 2
sub esi, 2
jae .loop ; } while((count-=2) >= 0);
.cleanup:
;; esi <= 0 : -2 on the count=0 special case, otherwise -1 or 0
;; xmm1: [ f_rc f_rc-1 ] ; rc = count Rounded down to even: count & ~1
;; xmm0: [ f_rc+1 f_rc ] ; f(rc+1) is the value we need to store if count was odd
cmp esi, -1
jne .out ; this could be a test on the Parity flag, with no extra cmp, if we wanted to be really hard to read and need a big comment explaining the logic
;; xmm1 = [f1 f0]
movhps [rdi], xmm1 ; store the high 64b of xmm0. There is no integer version of this insn, but that doesn't matter
.out:
ret
No point unrolling this further, the dep chain latency limits throughput, so we can always average storing one element per cycle. Reducing the loop overhead in uops can help for hyperthreading, but that's pretty minor.
As you can see, handling all the corner cases even when unrolling by two is quite complex to keep track of. It requires extra startup overhead, even when you're trying to optimize that to keep it to a minimum. It's easy to end up with a lot of conditional branches.
updated main:
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <stdlib.h>
#ifdef USE32
void fib(uint32_t *buf, uint32_t count);
typedef uint32_t buftype_t;
#define FMTx PRIx32
#define FMTu PRIu32
#define FIB_FN fib
#define CANARY 0xdeadbeefUL
#else
void fib64_sse(uint64_t *buf, uint32_t count);
typedef uint64_t buftype_t;
#define FMTx PRIx64
#define FMTu PRIu64
#define FIB_FN fib64_sse
#define CANARY 0xdeadbeefdeadc0deULL
#endif
#define xstr(s) str(s)
#define str(s) #s
int main(int argc, const char *argv[]) {
uint32_t count = 15;
if (argc > 1) {
count = atoi(argv[1]);
}
int benchmark = argc > 2;
buftype_t buf[count+1]; // allocated on the stack
// Fib overflows uint32 at count = 48, so it's not like a lot of space is useful
buf[count] = CANARY;
// uint32_t count = sizeof(buf)/sizeof(buf[0]);
if (benchmark) {
int64_t reps = 1000000000 / count;
for (int i=0 ; i<=reps ; i++)
FIB_FN(buf, count);
} else {
FIB_FN(buf, count);
for (uint32_t i ; i < count ; i++){
printf("%" FMTu " ", buf[i]);
}
putchar('\n');
}
if (buf[count] != CANARY) {
printf(xstr(FIB_FN) " wrote past the end of buf: sentinel = %" FMTx "\n", buf[count]);
}
}
Performance
For count just below 8192, the unrolled-by-two non-vector version runs near its theoretical-max throughput of 1 store per cycle (3.5 instructions per cycle), on my Sandybridge i5. 8192 * 4B/int = 32768 = L1 cache size. In practice, I see ~3.3 to ~3.4 insn / cycle. I'm counting the entire program with Linux perf
, though, not just the tight loop.
Anyway, there's not really any point unrolling further. And obviously this stopped being a Fibonacci sequence after count=47, since we use uint32_t. However, for large count
, the throughput is limited by memory bandwidth, down to ~2.6 insn / cycle. At this point we're basically looking at how to optimize memset.
The 64bit vector version runs at 3 insns per cycle (one 128b store per two clocks) up to an array size of about 1.5 times L2 cache size. (i.e. ./fib64 49152
). As the array size goes up to larger fractions of L3 cache size, performance decreases down to ~2 insn per cycle (one store per 3 clocks) at 3/4 of L3 cache size. It levels out to 1 store per 6 cycles at sizes > L3 cache.
So storing with vectors does better when we fit in L2 but not L1 cache.