Intel x86 vs x64 system call
General part
EDIT: Linux irrelevant parts removed
While not totally wrong, narrowing down to int 0x80 and syscall oversimplifies the question, as with sysenter there is at least a third option.
Using 0x80 and eax for syscall number, ebx, ecx, edx, esi, edi, and ebp to pass parameters is just one of many possible other choices to implement a system call, but those registers are the ones the 32-bit Linux ABI chose.
Before taking a closer look at the techniques involved, it should be stated that they all circle around the problem of escaping the privilege prison every process runs in.
Besides the ones presented here, another choice offered by the x86 architecture would have been the use of a call gate (see: http://en.wikipedia.org/wiki/Call_gate).
The only other possibility present on all i386 machines is using a software interrupt, which allows the ISR (Interrupt Service Routine or simply an interrupt handler) to run at a different privilege level than before.
(Fun fact: some i386 OSes have used an invalid-instruction exception to enter the kernel for system calls, because that was actually faster than an int instruction on 386 CPUs. See OsDev syscall/sysret and sysenter/sysexit instructions enabling for a summary of possible system-call mechanisms.)
Software Interrupt
What exactly happens once an interrupt is triggered depends on whether switching to the ISR requires a privilege change or not:
(Intel® 64 and IA-32 Architectures Software Developer’s Manual)
6.4.1 Call and Return Operation for Interrupt or Exception Handling Procedures
...
If the code segment for the handler procedure has the same privilege level as the currently executing program or task, the handler procedure uses the current stack; if the handler executes at a more privileged level, the processor switches to the stack for the handler’s privilege level.
....
If a stack switch does occur, the processor does the following:
Temporarily saves (internally) the current contents of the SS, ESP, EFLAGS, CS, and EIP registers.
Loads the segment selector and stack pointer for the new stack (that is, the stack for the privilege level being called) from the TSS into the SS and ESP registers and switches to the new stack.
Pushes the temporarily saved SS, ESP, EFLAGS, CS, and EIP values for the interrupted procedure’s stack onto the new stack.
Pushes an error code on the new stack (if appropriate).
Loads the segment selector for the new code segment and the new instruction pointer (from the interrupt gate or trap gate) into the CS and EIP registers, respectively.
If the call is through an interrupt gate, clears the IF flag in the EFLAGS register.
Begins execution of the handler procedure at the new privilege level.
... sigh, this seems to be a lot to do, and even once we're done it doesn't get much better:
(excerpt taken from the same source as mentioned above: Intel® 64 and IA-32 Architectures Software Developer’s Manual)
When executing a return from an interrupt or exception handler from a different privilege level than the interrupted procedure, the processor performs these actions:
Performs a privilege check.
Restores the CS and EIP registers to their values prior to the interrupt or exception.
Restores the EFLAGS register.
Restores the SS and ESP registers to their values prior to the interrupt or exception, resulting in a stack switch back to the stack of the interrupted procedure.
Resumes execution of the interrupted procedure.
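The values pushed in the first excerpt can be pictured as a C struct laid over the handler's stack. This is only a sketch for 32-bit protected mode with a stack switch and no error code (the int 0x80 case); the names int_frame32 and interrupted_ring are made up for illustration and are not taken from any kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Frame found on the handler's stack after a privilege-changing interrupt,
 * lowest address first. Hypothetical name, 32-bit protected mode only. */
struct int_frame32 {
    uint32_t eip;     /* pushed last: where the interrupted code resumes    */
    uint32_t cs;      /* old code segment selector (upper 16 bits unused)   */
    uint32_t eflags;  /* old flags                                          */
    uint32_t esp;     /* old stack pointer, present only on a stack switch  */
    uint32_t ss;      /* pushed first: old stack segment selector           */
};

/* The low two bits of the saved CS give the ring that was interrupted. */
static unsigned interrupted_ring(const struct int_frame32 *f) {
    assert(f != NULL);
    return f->cs & 3u;
}
```

The return path described in the second excerpt pops the same five values in the opposite order, which is why the handler must leave this frame intact.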
Sysenter
Another option on the 32-bit platform, not mentioned in your question at all but nevertheless utilized by the Linux kernel, is the sysenter instruction.
(Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2 (2A, 2B & 2C): Instruction Set Reference, A-Z)
Description Executes a fast call to a level 0 system procedure or routine. SYSENTER is a companion instruction to SYSEXIT. The instruction is optimized to provide the maximum performance for system calls from user code running at privilege level 3 to operating system or executive procedures running at privilege level 0.
One disadvantage of this solution is that it is not present on all 32-bit machines, so the int 0x80 method still has to be provided in case the CPU doesn't know about it.
The SYSENTER and SYSEXIT instructions were introduced into the IA-32 architecture in the Pentium II processor. The availability of these instructions on a processor is indicated with the SYSENTER/SYSEXIT present (SEP) feature flag returned to the EDX register by the CPUID instruction. An operating system that qualifies the SEP flag must also qualify the processor family and model to ensure that the SYSENTER/SYSEXIT instructions are actually present.
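The check described in that excerpt could look roughly like the sketch below. It assumes GCC or Clang (for the cpuid.h helper __get_cpuid) and follows the manual's rule that a family 6 part with model and stepping below 3 reports SEP without implementing the instructions. The function names sep_usable and cpu_has_sysenter are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>   /* GCC/Clang helper header, x86 only */
#endif

#define CPUID1_EDX_SEP (1u << 11)   /* SEP feature flag in CPUID leaf 1, EDX */

/* Decide from the CPUID leaf 1 values whether SYSENTER/SYSEXIT can be used.
 * eax is the processor signature: stepping [3:0], model [7:4], family [11:8]. */
static bool sep_usable(uint32_t eax, uint32_t edx) {
    uint32_t stepping = eax & 0xf;
    uint32_t model    = (eax >> 4) & 0xf;
    uint32_t family   = (eax >> 8) & 0xf;

    if (!(edx & CPUID1_EDX_SEP))
        return false;               /* flag not set at all */
    if (family == 6 && model < 3 && stepping < 3)
        return false;               /* flag set, instructions missing */
    return true;
}

/* Query the running CPU; reports false on non-x86 builds. */
static bool cpu_has_sysenter(void) {
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return sep_usable(eax, edx);
#else
    return false;
#endif
}
```

Keeping the decision in a pure function (sep_usable) separate from the CPUID query makes the family/model rule easy to check against known signatures.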
Syscall
The last possibility, the syscall instruction, pretty much allows for the same functionality as the sysenter instruction. The existence of both is due to the fact that one (sysenter) was introduced by Intel while the other (syscall) was introduced by AMD.
Linux specific
In the Linux kernel any of the three possibilities mentioned above may be chosen to realize a system call.
See also The Definitive Guide to Linux System Calls.
As already stated above, the int 0x80 method is the only one of the 3 chosen implementations that can run on any i386 CPU, so this is the only one that is always available for 32-bit user-space.
(syscall is the only one that's always available for 64-bit user-space, and the only one you should ever use in 64-bit code; x86-64 kernels can be built without CONFIG_IA32_EMULATION, and int 0x80 still invokes the 32-bit ABI, which truncates pointers to 32-bit.)
To allow switching between all 3 choices, every process is given access to a special shared object that exposes the system call implementation chosen for the running system. This is the strange-looking linux-gate.so.1 you might already have encountered as an unresolved library when using ldd or the like.
(arch/x86/vdso/vdso32-setup.c)
    if (vdso32_syscall()) {
        vsyscall = &vdso32_syscall_start;
        vsyscall_len = &vdso32_syscall_end - &vdso32_syscall_start;
    } else if (vdso32_sysenter()) {
        vsyscall = &vdso32_sysenter_start;
        vsyscall_len = &vdso32_sysenter_end - &vdso32_sysenter_start;
    } else {
        vsyscall = &vdso32_int80_start;
        vsyscall_len = &vdso32_int80_end - &vdso32_int80_start;
    }
To use it, all you have to do is load your registers (system call number in eax, parameters in ebx, ecx, edx, esi, edi, as with the int 0x80 system call implementation) and call the main routine.
Unfortunately it is not all that easy: to minimize the security risk of a fixed predefined address, the location at which the vdso (virtual dynamic shared object) will be visible in a process is randomized, so you will have to figure out the correct location first.
This address is individual to each process and is passed to the process once it is started.
In case you didn't know, every process started in Linux finds two pointer arrays on its stack: pointers to the parameters it was started with, and pointers to the environment variables it is running under. Each array is terminated by NULL.
In addition to these, a third block of so-called ELF auxiliary vectors follows the ones mentioned before. The correct location is encoded in the one carrying the type identifier AT_SYSINFO.
So the stack layout looks like this (addresses grow downwards):
- parameter-0
- ...
- parameter-m
- NULL
- environment-0
- ...
- environment-n
- NULL
- ...
- auxiliary ELF vector: AT_SYSINFO
- ...
- auxiliary ELF vector: AT_NULL
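As a less manual route to the same aux-vector entries, glibc 2.16 and later provide getauxval(). The following is a minimal sketch, assuming a Linux/glibc build; the helper names vdso_base, vsyscall_entry and vdso_looks_like_elf are made up for illustration.

```c
#include <assert.h>
#include <elf.h>        /* AT_SYSINFO, AT_SYSINFO_EHDR, ELFMAG, SELFMAG */
#include <string.h>
#include <sys/auxv.h>   /* getauxval(), glibc 2.16 or newer */

/* Base address of the vDSO image mapped into this process, or 0 if the
 * kernel did not supply one. */
static unsigned long vdso_base(void) {
    return getauxval(AT_SYSINFO_EHDR);
}

/* Entry point for 32-bit system calls. Only 32-bit x86 processes receive
 * AT_SYSINFO; getauxval() returns 0 when the entry is absent. */
static unsigned long vsyscall_entry(void) {
    return getauxval(AT_SYSINFO);
}

/* Sanity check: the vDSO is an ordinary ELF image, so it should start
 * with the ELF magic bytes. */
static int vdso_looks_like_elf(void) {
    unsigned long base = vdso_base();
    return base != 0 && memcmp((const void *)base, ELFMAG, SELFMAG) == 0;
}
```

In a 32-bit build, the value returned by vsyscall_entry() is the address a program would call with its registers loaded as for int 0x80. In a 64-bit build it is expected to be 0.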
Usage example
To find the correct address you will have to first skip all arguments and all environment pointers and then start scanning for AT_SYSINFO, as shown in the example below:
    #include <stdio.h>
    #include <elf.h>

    /* Build as a 32-bit program (e.g. gcc -m32): the register usage and
       Elf32_auxv_t both assume i386. */

    /* write(1, &c, 1) through the int 0x80 gate */
    void putc_1 (char c) {
        __asm__ ("movl $0x04, %%eax\n"
                 "movl $0x01, %%ebx\n"
                 "movl $0x01, %%edx\n"
                 "int $0x80"
                 :: "c" (&c)
                 /* "memory": the kernel reads c through the pointer in ecx */
                 : "eax", "ebx", "edx", "memory");
    }

    /* write(1, &c, 1) through the vdso entry point passed in addr */
    void putc_2 (char c, void *addr) {
        __asm__ ("movl $0x04, %%eax\n"
                 "movl $0x01, %%ebx\n"
                 "movl $0x01, %%edx\n"
                 "call *%%esi"
                 :: "c" (&c), "S" (addr)
                 : "eax", "ebx", "edx", "memory");
    }

    int main (int argc, char *argv[]) {
        /* using int 0x80 */
        putc_1 ('1');

        /* rather nasty search for jump address */
        argv += argc + 1;          /* skip args */
        while (*argv != NULL)      /* skip env */
            ++argv;

        Elf32_auxv_t *aux = (Elf32_auxv_t*) ++argv;   /* aux vector start */
        while (aux->a_type != AT_SYSINFO) {
            if (aux->a_type == AT_NULL)   /* end reached, no AT_SYSINFO */
                return 1;
            ++aux;
        }

        putc_2 ('2', (void*) aux->a_un.a_val);
        return 0;
    }
As you will see by taking a look at the following snippet of /usr/include/asm/unistd_32.h on my system:
#define __NR_restart_syscall 0
#define __NR_exit 1
#define __NR_fork 2
#define __NR_read 3
#define __NR_write 4
#define __NR_open 5
#define __NR_close 6
The syscall I used is number 4 (write), passed in the eax register. It takes a file descriptor (ebx = 1), a data pointer (ecx = &c) and a size (edx = 1) as its arguments, each passed in the corresponding register.
To put a long story short
Comparing a supposedly slow-running int 0x80 system call on any Intel CPU with a (hopefully) much faster implementation using the (genuinely invented by AMD) syscall instruction is comparing apples to oranges.
IMHO: Most probably the sysenter instruction, instead of int 0x80, should be put to the test here.
There are three things that need to happen when you call the kernel (making a system call):
- The system goes from "user mode" to "kernel mode" (ring 0).
- The stack switches from "user mode" to "kernel mode".
- A jump is made to a suitable part of the kernel.
Obviously, once inside the kernel, the kernel code will need to know what you actually want the kernel to do, hence putting something in EAX, and often more things in other registers since there are things like "name of the file you want to open" or "buffer to read data from a file into" etc, etc.
Different processors have different ways to achieve the above three steps. In x86, there are several choices, but the two most popular for hand-written asm are int 0xnn (32-bit mode) or syscall (64-bit mode). (There's also 32-bit mode sysenter, introduced by Intel for the same reason AMD introduced the 32-bit mode version of syscall: as a faster alternative to the slow int 0x80. 32-bit glibc uses whichever efficient system-call mechanism is available, only using the slow int 0x80 if nothing better is available.)
The 64-bit version of the syscall instruction was introduced with the x86-64 architecture as a faster way to enter a system call. It relies on a set of model-specific registers (MSRs) that hold the RIP to jump to and the selector values to load into CS and SS for the Ring3 to Ring0 transition. It also stores the return address in ECX/RCX. [Please read the instruction set manual for all the details of this instruction - it is not entirely trivial!] Since the processor knows this will switch to Ring0, it can directly do the right thing.
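To make the MSR part a little more concrete, here is a sketch of how the selectors are derived from the STAR register, following the rules in the instruction set manuals: syscall takes CS from bits 47:32 and uses CS + 8 for SS, while a 64-bit sysret uses bits 63:48 plus 16 for CS and plus 8 for SS, both with RPL 3. The struct and function names are hypothetical, and reading the real MSR requires ring 0, so this only decodes a value handed to it.

```c
#include <assert.h>
#include <stdint.h>

/* MSR numbers that configure syscall/sysret. */
#define MSR_STAR   0xC0000081u  /* segment selector bases                   */
#define MSR_LSTAR  0xC0000082u  /* RIP loaded by syscall from 64-bit code   */
#define MSR_CSTAR  0xC0000083u  /* RIP loaded by syscall from compat mode   */
#define MSR_SFMASK 0xC0000084u  /* RFLAGS bits cleared on kernel entry      */

struct selectors { uint16_t cs, ss; };

/* Selectors loaded on entry: CS = STAR[47:32] with RPL cleared, SS = base + 8. */
static struct selectors syscall_selectors(uint64_t star) {
    uint16_t base = (uint16_t)(star >> 32);
    struct selectors s = { (uint16_t)(base & 0xFFFCu), (uint16_t)(base + 8) };
    return s;
}

/* Selectors loaded by a 64-bit sysret: CS = STAR[63:48] + 16, SS = STAR[63:48] + 8,
 * each with the RPL bits forced to 3. */
static struct selectors sysret64_selectors(uint64_t star) {
    uint16_t base = (uint16_t)(star >> 48);
    struct selectors s = { (uint16_t)((base + 16) | 3), (uint16_t)((base + 8) | 3) };
    return s;
}
```

With an illustrative STAR value of (0x23 << 48) | (0x10 << 32), entry uses CS 0x10 and SS 0x18, and a 64-bit return uses CS 0x33 and SS 0x2b. No descriptor table is read to find these numbers, which is part of why the instruction is cheap.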
One of the key points is that syscall only manipulates registers; it doesn't do any loads or stores. (This is why it overwrites RCX with the saved RIP and R11 with the saved RFLAGS.) Memory access depends on page tables, and page table entries have a bit that can make them only valid for the kernel, not user-space, so doing memory access while changing the privilege level might need to wait vs. just writing registers. Once in kernel mode, the kernel will normally use swapgs or some other way of finding the kernel stack. (syscall does not modify RSP; it's still pointing at the user stack on entry to the kernel.)
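From user-space, the register-only nature of syscall shows up in the clobber list of an inline-asm wrapper: RCX and R11 must be declared as destroyed. Below is a minimal sketch for x86-64 Linux, where write is system call number 1 in the 64-bit ABI. The name raw_write is hypothetical, and the fallback branch only exists so the sketch builds on other targets.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* write(2) through the raw syscall instruction.
 * Returns the byte count, or a negative errno value on failure. */
static long raw_write(int fd, const void *buf, size_t len) {
    assert(buf != NULL || len == 0);
#if defined(__x86_64__) && defined(__linux__)
    long ret;
    __asm__ volatile ("syscall"
                      : "=a" (ret)                 /* result comes back in rax */
                      : "a" (1L),                  /* __NR_write (64-bit ABI)  */
                        "D" ((long)fd), "S" (buf), "d" (len)
                      : "rcx", "r11", "memory");   /* saved RIP / saved RFLAGS */
    return ret;
#else
    /* Fallback via libc, mapped to the same return convention. */
    ssize_t n = write(fd, buf, len);
    return n < 0 ? -errno : (long)n;
#endif
}
```

For example, raw_write(1, "hi\n", 3) writes three bytes to standard output. Note that the arguments travel in rdi, rsi and rdx here, not in the ebx/ecx/edx registers of the 32-bit int 0x80 ABI.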
When returning using the SYSRET instruction, the values are restored from predetermined values in registers, so again, it's quick, because the processor just has to set a few registers up. The processor knows that it will change from Ring0 to Ring3, so can do the right things quickly.
(AMD CPUs support the syscall instruction from 32-bit user-space; Intel CPUs do not. x86-64 was originally AMD64; this is why we have syscall in 64-bit mode. AMD redesigned the kernel side of syscall for 64-bit mode, so the 64-bit syscall kernel entry point is significantly different from the 32-bit syscall entry point in 64-bit kernels.)
The int 0x80 variant used in 32-bit mode will decide what to do based on the value in the interrupt descriptor table, which means reading from memory. There it finds the new CS and EIP/RIP values. The new CS register determines the new "ring" level - Ring0 in this case. It will then use the new CS value to look into the Task State Segment (based on the TR register) to find out which stack pointer (ESP/RSP and SS) to use, and then finally jumps to the new address. Since this is a less direct and more generic solution, it is also slower. The old EIP/RIP, CS and EFLAGS are stored on the new stack, along with the old values of SS and ESP/RSP.
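The descriptor the processor has to fetch for int 0x80 can be sketched as a C struct. This shows the 8-byte gate format of 32-bit protected mode (64-bit mode uses 16-byte entries); the struct and function names are hypothetical, and the descriptor values in the usage note are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One 8-byte entry of the 32-bit protected-mode IDT. */
struct idt_gate32 {
    uint16_t offset_low;   /* handler address bits 15..0                    */
    uint16_t selector;     /* code segment selector loaded into CS          */
    uint8_t  reserved;     /* must be zero                                  */
    uint8_t  type_attr;    /* P (bit 7), DPL (bits 6..5), type (bits 3..0)  */
    uint16_t offset_high;  /* handler address bits 31..16                   */
};
_Static_assert(sizeof(struct idt_gate32) == 8, "gate descriptor is 8 bytes");

/* New EIP: the two offset halves glued together. */
static uint32_t gate_handler(const struct idt_gate32 *g) {
    return ((uint32_t)g->offset_high << 16) | g->offset_low;
}

/* Descriptor privilege level of the gate itself. */
static unsigned gate_dpl(const struct idt_gate32 *g) {
    return (g->type_attr >> 5) & 3u;
}

/* A software interrupt (int n) is only allowed if CPL <= gate DPL;
 * otherwise the processor raises a general-protection fault. */
static bool gate_callable_from(const struct idt_gate32 *g, unsigned cpl) {
    assert(cpl <= 3);
    return cpl <= gate_dpl(g);
}
```

A gate with type_attr 0xEE (present, DPL 3, 32-bit interrupt gate) can be invoked from ring 3, which is what a system-call vector needs. With 0x8E (DPL 0) the same int instruction from user-space would fault instead.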
When returning, using the IRET instruction, the processor reads the return address and the stack pointer values from the stack, loading the new stack segment and code segment values from the stack as well. Again, the process is generic, and takes quite a few memory reads. Since it's generic, the processor will also have to check "are we changing mode from Ring0 to Ring3? If so, change these things".
So, in summary, it's faster because it was meant to work that way.
For 32-bit code, yes, you can definitely use the slow and compatible int 0x80 if you want.
For 64-bit code, int 0x80 is slower than syscall and will truncate your pointers to 32-bit, so don't use it. See What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? Plus, int 0x80 isn't available in 64-bit mode on all kernels, so it's not safe even for a sys_exit which doesn't take any pointer args: CONFIG_IA32_EMULATION can be disabled, and notably is disabled on Windows Subsystem for Linux.