Fastest Linux system call
One that doesn't exist, and therefore returns -ENOSYS quickly.
From arch/x86/entry/entry_64.S:
#if __SYSCALL_MASK == ~0
cmpq $__NR_syscall_max, %rax
#else
andl $__SYSCALL_MASK, %eax
cmpl $__NR_syscall_max, %eax
#endif
ja 1f /* return -ENOSYS (already in pt_regs->ax) */
movq %r10, %rcx
/*
* This call instruction is handled specially in stub_ptregs_64.
* It might end up jumping to the slow path. If it jumps, RAX
* and all argument registers are clobbered.
*/
#ifdef CONFIG_RETPOLINE
movq sys_call_table(, %rax, 8), %rax
call __x86_indirect_thunk_rax
#else
call *sys_call_table(, %rax, 8)
#endif
.Lentry_SYSCALL_64_after_fastpath_call:
movq %rax, RAX(%rsp)
1:
Use an invalid system call number so the dispatching code simply returns with eax = -ENOSYS instead of dispatching to a system-call handling function at all.
That is, unless this causes the kernel to use the iret slow path instead of sysret / sysexit. That might explain the measurements showing an invalid number being 17 cycles slower than syscall(SYS_getpid), because glibc error handling (setting errno) probably doesn't explain it. But from my reading of the kernel source, I don't see any reason why it wouldn't still use sysret while returning -ENOSYS.
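To check a comparison like that on your own machine, a minimal sketch along these lines might do. It goes through glibc's syscall() wrapper, so it measures the native system-call path of whatever ABI you compile for, including the errno handling. The number 100000 and the helper names are assumptions made for this example, not taken from the kernel or from the question.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Assumed for illustration: a number far above any real syscall table entry. */
#define INVALID_SYSCALL_NR 100000L

/* Returns 1 if syscall(nr) fails with ENOSYS, 0 otherwise. */
static int returns_enosys(long nr)
{
    errno = 0;
    long ret = syscall(nr);
    return ret == -1 && errno == ENOSYS;
}

/* Average wall-clock nanoseconds per syscall(nr) over iters calls. */
static double ns_per_call(long nr, long iters)
{
    struct timespec t0, t1;

    assert(iters > 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        syscall(nr);            /* result ignored: only the round trip matters */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / iters;
}
```

Calling ns_per_call(INVALID_SYSCALL_NR, n) and ns_per_call(SYS_getpid, n) from main with a large n and printing both gives a rough comparison. Wall-clock time is only a proxy for cycle counts, so pin the CPU frequency or use perf if you need cycles.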
This answer is for sysenter, not syscall. The question originally said sysenter / sysret (which was weird because sysexit goes with sysenter, while sysret goes with syscall). I answered based on sysenter for a 32-bit process on an x86-64 kernel.
Native 64-bit syscall is handled more efficiently inside the kernel. (Update: with Meltdown / Spectre mitigation patches, it still dispatches via C do_syscall_64 in 4.16-rc2.)
My Q&A "What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code?" gives an overview of the kernel side of system-call entry points from compat mode into an x86-64 kernel (entry_64_compat.S). This answer is just taking the relevant parts of that.
The links in that answer and this one are to Linux 4.12 sources, which don't contain the Meltdown mitigation page-table manipulation, so expect significant extra overhead from that on a patched kernel.
int 0x80 and sysenter have different entry points. You're looking for entry_SYSENTER_compat. AFAIK, sysenter always goes there, even if you execute it in a 64-bit user-space process. Linux's entry point pushes a constant __USER32_CS as the saved CS value, so it will always return to user-space in 32-bit mode.
After pushing registers to construct a struct pt_regs on the kernel stack, there's a TRACE_IRQS_OFF hook (no idea how many instructions that amounts to), then call do_fast_syscall_32, which is written in C. (Native 64-bit syscall dispatching is done directly from asm, but 32-bit compat system calls are always dispatched through C.)
do_syscall_32_irqs_on in arch/x86/entry/common.c is pretty light-weight: just a check if the process is being traced (I think this is how strace can hook system calls via ptrace), then
...
if (likely(nr < IA32_NR_syscalls)) {
regs->ax = ia32_sys_call_table[nr]( ... arg );
}
syscall_return_slowpath(regs);
}
AFAIK, the kernel can use sysexit after this function returns.
So the return path is the same whether or not EAX had a valid system call number, and obviously returning without dispatching at all is the fastest path through that function, especially in a kernel with Spectre mitigation where the indirect branch on the table of function pointers would go through a retpoline and always mispredict.
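The shape of that range-checked dispatch can be modelled in user space. This is only a sketch of the pattern, not kernel code: demo_table, demo_dispatch, and the two handlers are hypothetical names invented for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef long (*handler_fn)(long arg);

/* Hypothetical handlers standing in for real system-call implementations. */
static long demo_getpid(long arg) { (void)arg; return 1234; }
static long demo_double(long arg) { return 2 * arg; }

static const handler_fn demo_table[] = { demo_getpid, demo_double };
#define DEMO_NR_SYSCALLS (sizeof demo_table / sizeof demo_table[0])

/*
 * The return value starts out as -ENOSYS, like pt_regs->ax in the kernel.
 * An out-of-range number skips the indirect call entirely, so it only pays
 * for the compare and branch.
 */
static long demo_dispatch(unsigned long nr, long arg)
{
    long ret = -ENOSYS;

    if (nr < DEMO_NR_SYSCALLS) {
        assert(demo_table[nr] != NULL);
        ret = demo_table[nr](arg);   /* indirect branch: the costly part */
    }
    return ret;
}
```

Taking nr as unsigned means a negative number wraps to a huge value and fails the same single comparison, which is also why the kernel's asm uses an unsigned ja after its cmp.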
If you want to really test sysenter/sysexit without all that extra overhead, you'll need to modify Linux to put a much simpler entry point without checking for tracing or pushing / popping all the registers.
You'd probably also want to modify the ABI to pass a return address in a register (like syscall does on its own) instead of saved on the user-space stack, which is what Linux's current sysenter ABI does; it has to get_user() to read the EIP value it should return to.
Or if all this overhead is part of what you want to measure, you're definitely all set with an eax that gives you -ENOSYS; at worst you'll be getting one extra branch miss from the range-check if branch predictors are hot for that branch based on normal 32-bit system calls.
In this benchmark by Brendan Gregg (linked from this blog post, which is interesting reading on the topic), close(999) (or some other fd not in use) is recommended.