Why Linux/gnu linker chose address 0x400000?
The start address is usually set by a linker script.
For example, on GNU/Linux, looking at /usr/lib/ldscripts/elf_x86_64.x
we see:
...
PROVIDE (__executable_start = SEGMENT_START("text-segment", 0x400000)); \
. = SEGMENT_START("text-segment", 0x400000) + SIZEOF_HEADERS;
The value 0x400000
is the default value for the SEGMENT_START()
function on this platform.
You can find out more about linker scripts by browsing the linker manual:
% info ld Scripts
ld
's default linker script has that 0x400000
value baked in for non-PIE executables.
PIEs (Position Independent Executables) don't have default base address; they're always relocated by the kernel, with the kernel's default being 0x0000555...
plus some ASLR offset unless ASLR is disabled for this process or system-wide. ld
has no control over this. Note that most modern systems configure GCC to use -fPIE -pie
by default, so it passes -pie
to ld
, and turns C into asm that's position-independent. Hand-written asm has to follow the same rules if you link it that way.
But what makes 0x400000
(4 MiB) a good default?
It has to be above mmap_min_addr
= 65536 = 64K by default.
And being plenty far away from 0 gives plenty more room to guard against NULL deref with an offset reading .text
or .data
/.bss
memory (array[i]
where array
is NULL). Even without increasing mmap_min_addr
(which this leave room for without breaking executables), usually mmap
randomly picks high addresses so in practice we have at least 4MiB of guard against NULL deref.
2M-aligned is good
This puts it at the start of a page-directory in the next level up of the page tables means the same number of 4K page-table-entries will be split across fewer 2M page directory entries, saving kernel page-table memory and helping page-walk hardware cache better. For big static arrays, close to the start of a 1G subtree of the next level up is also good.
IDK why 4MiB instead of 2MiB, or what the developers' reasoning was. 4MiB is the 32-bit largepage size without PAE (4-byte PTEs so 10 bits per level instead of 9), but a CPU has to be using x86-64 page tables to be in 64-bit mode.
A low start address allows nearly 2 GiB of static arrays
(Without using a larger code model, where at least large arrays have to be addressed in ways that are sometimes less efficient. See section 3.5.1 Architectural Constraints in the x86-64 System V ABI document for details on code models.)
The default code model for non-PIE executables ("small") lets programs assume that any static address is in the low 2GiB of virtual address space. So any absolute address in .text
/.rodata
, .data
, .bss
can be used as a 32-bit sign-extended immediate in the machine code where that's more efficient.
(This is not the case in a PIE or shared library: see 32-bit absolute addresses no longer allowed in x86-64 Linux? for the things you / the compiler can't do in x86-64 asm as a result, notably addss xmm0, [foo + rdi*4]
instead requires a RIP-relative LEA to get the array start address into a register. x86-64's only RIP-relative addressing mode is [RIP+rel32], without any general-purpose registers.)
Starting the executable's sections/segments near the bottom of virtual address space leaves almost the whole 2GiB available for text+data+bss to be that big. (It might have been possible to have a higher default, and have large executables make ld choose a lower address to make them fit, but that would be a more complicated linker script.)
This includes zero-initialized arrays in the .bss which don't make the executable file huge, just the process image in memory. In practice, Fortran programmers tend to run into this more than C and C++, since static arrays are popular there. For example gfortran for dummies: What does mcmodel=medium do exactly? has a good explanation of a build error with the default small
model, and the resulting x86-64 asm difference for medium
(where objects above a certain size threshold are not assumed to be in the low 2G or within +-2G of the code. But code and smaller static data still is so the speed penalty is minor.)
For example static float arr[1UL<<28];
is a 1 GiB array. If you had 3 of them, they couldn't all start inside the low 2 GiB (which may be all you need for hand-written asm), let alone have each element accessible.
gcc -fno-pie
expects to be able to compile float *p = &arr[size-1];
to mov $arr+1073741820, %edi
, a 5-byte mov $imm32
. RIP-relative won't work either if the target address is more than 2GiB away from the code generating the address (or loading from it with movss arr+1073741820(%rip), %xmm0
; RIP-relative is the normal way to load/store static data even in a non-PIE, when there's no runtime-variable index.) That's why the small-PIC model also has a 2GiB size limit on text+data+bss (plus gaps between segments): all static data and code needs to be within 2GiB of any other that might want to reach it.
If your code only ever accesses high elements or their addresses via runtime-variable indices, you only need the start of each array, the symbol itself, to be in the low 2 GiB. I forget if the linker enforces having the end-of-bss within the low 2GiB; it might since the linker script puts a symbol there that some CRT startup code might reference.
Footnote 1: There aren't any useful smaller sizes for a code model smaller than 2GiB. x86-64 machine code uses either 8 or 32-bit for immediates and addressing mode. 8-bit (256 bytes) is too small to be usable, and many important instructions like call rel32
, mov r32, imm32
, and [rip+rel32]
addressing, are only available with 4-byte not 1-byte constants anyway.
Limiting to the low 2 GiB (instead of 4) means that addresses can safely be zero-extended as with mov edi, OFFSET arr
, or sign-extended, as with mov eax, [arr + rdi*4]
. Remember that addresses aren't the only use-case for [reg + disp32]
addressing modes; [rbp - 256]
can often make sense, so it's good that x86-64 machine code sign-extends disp8 and disp32 to 64-bit, not zero-extends.
Implicit zero-extension to 64-bit happens when writing a 32-bit register, as with mov
-immediate to put an address in a register, where 32-bit operand-size is a smaller machine-code instruction than 64-bit operand-size. See How to load address of function or label into register (which also covers RIP-relative LEA).
Related for 32-bit Windows
Raymond Chen wrote an article about why the same 0x400000
base address is the default for 32-bit Windows.
He mentions that DLLs get loaded at high addresses by default, and a low address is far from that. x86-64 SysV shared objects can get loaded anywhere there's a large enough gap of address space, with the kernel defaulting to near the top of user-space virtual address-space, i.e. the top of the canonical range. But ELF shared objects are required to be fully relocatable so would work fine anywhere.
The 4MiB choice for 32-bit Windows was also motivated by avoiding the low 64K (NULL deref), and by picking the start of a page-directory for legacy 32-bit page tables. (Where the "largepage" size is 4M, not 2M for x86-64 or PAE.) With a bunch of Win95 and Win3.1 legacy memory-map reasons why at least 1MiB or 4MiB was partially necessary, and stuff like working around CPU bugs.