Why are Rust stack frames so big?
Using formatting machinery like println! creates a number of things on the stack. Expanding the macros used in your code:
fn useless_function(x: usize) {
if x > 0 {
{
::std::io::_print(::core::fmt::Arguments::new_v1(
&["", "\n"],
&match (&get_rsp(),) {
(arg0,) => [::core::fmt::ArgumentV1::new(
arg0,
::core::fmt::LowerHex::fmt,
)],
},
));
};
useless_function(x - 1);
}
}
I believe that those structs consume the majority of the space. As an attempt to prove that, I printed the size of the value created by format_args!, which is used by println!:
let sz = std::mem::size_of_val(&format_args!("{:x}", get_rsp()));
println!("{}", sz);
This shows that it is 48 bytes.
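Note that size_of_val here measures only the fmt::Arguments value itself. That value holds references to the string pieces and to the argument array, which are separate temporaries in the same frame, so the real stack footprint of a formatting call is larger than this one number. A minimal sketch that illustrates the point (the helper names `one_arg_size` and `three_arg_size` are made up for illustration, and the exact size printed depends on the compiler version):

```rust
use std::fmt::Arguments;
use std::mem::{size_of, size_of_val};

/// Size of the `Arguments` value built for a single formatted argument.
fn one_arg_size(v: usize) -> usize {
    let sz = size_of_val(&format_args!("{:x}", v));
    sz
}

/// Size of the `Arguments` value built for three formatted arguments.
fn three_arg_size(a: u8, b: u16, c: u32) -> usize {
    let sz = size_of_val(&format_args!("{} {} {}", a, b, c));
    sz
}

fn main() {
    // All three numbers are identical: the argument array is referenced,
    // not stored inline, so `size_of_val` does not count it.
    println!("Arguments type:  {}", size_of::<Arguments>());
    println!("one argument:    {}", one_arg_size(7));
    println!("three arguments: {}", three_arg_size(1, 2, 3));
}
```

Adding more arguments therefore grows the frame without changing the measured size.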
See also:
- How do I see the expanded macro code that's causing my compile error?
Something like this should remove the printing from the equation, but in release mode the optimizer still collapses the recursion into a loop despite the inline(never) hint, so the sequential values all come out the same.
/// # Safety
///
/// The length of `rsp` and the value of `x` must always match.
#[inline(never)]
unsafe fn useless_function(x: usize, rsp: &mut [usize]) {
if x > 0 {
*rsp.get_unchecked_mut(0) = get_rsp();
useless_function(x - 1, rsp.get_unchecked_mut(1..));
}
}
fn main() {
unsafe {
let mut rsp = [0; 10];
useless_function(rsp.len(), &mut rsp);
for w in rsp.windows(2) {
println!("{}", w[0] - w[1]);
}
}
}
That said, you can make the function public and look at its assembly anyway (lightly cleaned):
playground::useless_function:
pushq %r15
pushq %r14
pushq %rbx
testq %rdi, %rdi
je .LBB6_3
movq %rsi, %r14
movq %rdi, %r15
xorl %ebx, %ebx
.LBB6_2:
callq playground::get_rsp
movq %rax, (%r14,%rbx,8)
addq $1, %rbx
cmpq %rbx, %r15
jne .LBB6_2
.LBB6_3:
popq %rbx
popq %r14
popq %r15
retq
The question notes that "in debug mode each frame still takes 80 bytes". Compare the unoptimized assembly:
playground::useless_function:
subq $104, %rsp
movq %rdi, 80(%rsp)
movq %rsi, 88(%rsp)
movq %rdx, 96(%rsp)
cmpq $0, %rdi
movq %rdi, 56(%rsp) # 8-byte Spill
movq %rsi, 48(%rsp) # 8-byte Spill
movq %rdx, 40(%rsp) # 8-byte Spill
ja .LBB44_2
jmp .LBB44_8
.LBB44_2:
callq playground::get_rsp
movq %rax, 32(%rsp) # 8-byte Spill
xorl %eax, %eax
movl %eax, %edx
movq 48(%rsp), %rdi # 8-byte Reload
movq 40(%rsp), %rsi # 8-byte Reload
callq core::slice::<impl [T]>::get_unchecked_mut
movq %rax, 24(%rsp) # 8-byte Spill
movq 24(%rsp), %rax # 8-byte Reload
movq 32(%rsp), %rcx # 8-byte Reload
movq %rcx, (%rax)
movq 56(%rsp), %rdx # 8-byte Reload
subq $1, %rdx
setb %sil
testb $1, %sil
movq %rdx, 16(%rsp) # 8-byte Spill
jne .LBB44_9
movq $1, 72(%rsp)
movq 72(%rsp), %rdx
movq 48(%rsp), %rdi # 8-byte Reload
movq 40(%rsp), %rsi # 8-byte Reload
callq core::slice::<impl [T]>::get_unchecked_mut
movq %rax, 8(%rsp) # 8-byte Spill
movq %rdx, (%rsp) # 8-byte Spill
movq 16(%rsp), %rdi # 8-byte Reload
movq 8(%rsp), %rsi # 8-byte Reload
movq (%rsp), %rdx # 8-byte Reload
callq playground::useless_function
jmp .LBB44_8
.LBB44_8:
addq $104, %rsp
retq
.LBB44_9:
leaq str.0(%rip), %rdi
leaq .L__unnamed_7(%rip), %rdx
movq core::panicking::panic@GOTPCREL(%rip), %rax
movl $33, %esi
callq *%rax
ud2
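The per-call cost of this version can be read directly off the prologue above: subq $104, %rsp plus the 8-byte return address pushed by callq comes to 112 bytes per level of recursion. The definition of get_rsp is not shown here; as a rough, portable stand-in, the address of a local variable can be sampled at each depth. A sketch under that assumption (`approx_sp`, `record`, and `frame_deltas` are hypothetical names, and an optimized build may report different or even zero deltas if the recursion is turned into a loop):

```rust
use std::hint::black_box;

/// Approximate the stack pointer with the address of a local.
/// This is a stand-in for `get_rsp`, not the original implementation.
#[inline(never)]
fn approx_sp() -> usize {
    let marker = 0u8;
    black_box(&marker) as *const u8 as usize
}

/// Record one sample per recursion level, outermost call first.
#[inline(never)]
fn record(depth: usize, samples: &mut Vec<usize>) {
    samples.push(approx_sp());
    if depth > 0 {
        record(depth - 1, samples);
    }
}

/// Distance in bytes between the samples of adjacent recursion levels.
fn frame_deltas(depth: usize) -> Vec<usize> {
    let mut samples = Vec::with_capacity(depth + 1);
    record(depth, &mut samples);
    samples.windows(2).map(|w| w[0].abs_diff(w[1])).collect()
}

fn main() {
    for delta in frame_deltas(9) {
        println!("{delta}");
    }
}
```

Because the samples are collected into a vector and printed afterwards, the formatting machinery never sits in the measured frames.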
This answer shows how this works in asm for an un-optimized C++ version.
This might not tell us as much as I thought about Rust; apparently Rust uses its own ABI / calling convention so it won't have "shadow space" making its stack frames bulkier on Windows. The first version of my answer guessed that it would follow the Windows calling convention for calls to other Rust functions, when targeting Windows. I've adjusted the wording, but I didn't delete it even though it's potentially not relevant to Rust.
After further research, at least in 2016 Rust's ABI happened to match the platform calling convention on Windows x64, at least if the disassembly of the debug-build binary in this random tutorial is representative of anything. heap::allocate::h80a36d45ddaa4ae3Lca in the disassembly clearly takes args in RCX and RDX (spilling and reloading them to the stack), then calls another function with those args, leaving 0x20 bytes of space unused above RSP before the call, i.e. shadow space.
If nothing has changed since 2016 (easily possible), I think this answer does reflect some of what Rust does when compiling for Windows.
The recursion gets optimised away in release mode, but in debug mode each frame still takes 80 bytes which is way more than I anticipated. Is this just the way stack frames work on x86? Do other languages do better?
Yes, C and C++ do better: 48 or 64 bytes per stack frame on Windows, 32 on Linux.
The Windows x64 calling convention requires a caller to reserve 32 bytes of shadow space (basically unused stack-arg space above the return address) for use by the callee. But it looks like un-optimized clang builds may not take advantage of that shadow space, allocating extra space to spill local vars.
Also, the return address takes 8 bytes, and re-aligning the stack by 16 before another call takes another 8 bytes, so the minimum you can hope for is 48 bytes on Windows (unless you enable optimization, then as you say, tail-recursion easily optimizes into a loop). GCC compiling a C or C++ version of that code does achieve that.
Compiling for Linux, or any other x86-64 target that uses the x86-64 System V ABI, gcc and clang manage 32 bytes per frame for a C or C++ version. Just ret addr, saved RBP, and another 16 bytes to keep alignment while making room to spill 8-byte x. (Compiling as C or as C++ makes no difference to the asm).
I tried GCC and clang on an un-optimized C++ version using the Windows calling convention on the Godbolt compiler explorer. To just look at the asm for useless_function, there was no need to write a main or get_rsp.
#include <stdlib.h>
#define MS_ABI __attribute__((ms_abi)) // for GNU C compilers. Godbolt link has an ifdeffed version of this
void * RSPS[10] = {0};
MS_ABI void *get_rsp(void);
MS_ABI void useless_function(size_t x) {
RSPS[x] = get_rsp();
if (x == 0) {
return;
}
useless_function(x - 1);
}
clang/LLVM un-optimized does push rbp / sub rsp, 48, so a total of 64 bytes per frame (including the return address). GCC does push rbp / sub rsp, 32, for a total of only 48 bytes per frame, as predicted.
So apparently un-optimized LLVM does allocate "unneeded" space because it fails to use the shadow space allocated by the caller. If Rust used shadow space, this might explain some of why your debug-mode Rust version uses more stack space than we might expect, even with printing done outside the recursive function. (Printing uses a lot of space for locals).
But part of that explanation must also include having some locals that take more space, e.g. perhaps for pointer locals or bounds checks? C and C++ map pretty directly to asm, with access to globals not needing any extra stack space. (Or even extra registers, when the global array can be assumed to be in the low 2GiB of virtual address space, so its address is usable as a 32-bit signed displacement in combination with other registers.)
# clang 10.0.1 -O0, for Windows x64
useless_function(unsigned long):
push rbp
mov rbp, rsp # set up a legacy frame pointer.
sub rsp, 48 # reserve enough for shadow space (32) + 16, maintaining stack alignment.
mov qword ptr [rbp - 8], rcx # spill incoming arg to newly reserved space above the shadow space
call get_rsp()
...
The only space for locals used on the stack is for x, no invented temporaries as part of array access. It's just a reload of x then mov qword ptr [8*rcx + RSPS], rax to store the function call return value.
# GCC10.2 -O0, for Windows x64
useless_function(unsigned long):
push rbp
mov rbp, rsp
sub rsp, 32 # just reserve enough for shadow space for callee
mov QWORD PTR [rbp+16], rcx # spill incoming arg to our own shadow space
call get_rsp()
...
Without the ms_abi attribute, both GCC and clang use sub rsp, 16.
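The per-frame totals quoted above follow from simple arithmetic: the return address pushed by call, plus 8 bytes per pushed register, plus whatever the prologue subtracts from RSP. A small sketch of that bookkeeping (the function name `frame_bytes` is made up for illustration):

```rust
/// Bytes of stack consumed by one level of recursion on x86-64:
/// the return address, each pushed register, and the `sub rsp, N` amount.
fn frame_bytes(pushed_regs: usize, sub_rsp: usize) -> usize {
    const RETURN_ADDRESS: usize = 8;
    const REGISTER: usize = 8;
    RETURN_ADDRESS + REGISTER * pushed_regs + sub_rsp
}

fn main() {
    // push rbp / sub rsp, 32 (GCC -O0, Windows x64 convention)
    println!("GCC, ms_abi:   {}", frame_bytes(1, 32));
    // push rbp / sub rsp, 48 (clang -O0, Windows x64 convention)
    println!("clang, ms_abi: {}", frame_bytes(1, 48));
    // push rbp / sub rsp, 16 (both compilers, x86-64 System V)
    println!("System V:      {}", frame_bytes(1, 16));
}
```

This yields 48, 64, and 32 bytes respectively. Each total is a multiple of 16, which is what keeps the stack aligned before the next call.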