Why is there a need to modify system call tables in Linux?
You can check if they are read-only by looking up the kernel symbols. The "R" means read-only.*
$ grep sys_call_table /proc/kallsyms
0000000000000000 R sys_call_table
0000000000000000 R ia32_sys_call_table
0000000000000000 R x32_sys_call_table
So they are read-only, and have been since kernel 2.6.16. However, a kernel rootkit has the ability to make them writable again. All it needs to do is execute a function like this† in kernel mode (directly or via sufficiently-flexible ROP gadgets, of which there are plenty) with each address as the argument:
static void set_addr_rw(const unsigned long addr)
{
    unsigned int level;
    pte_t *pte = lookup_address(addr, &level);

    /* lookup_address() returns NULL if the address is not mapped */
    if (pte && !(pte->pte & _PAGE_RW))
        pte->pte |= _PAGE_RW;
    local_flush_tlb();
}
This changes the permissions of the syscall table and makes it possible to edit it. If this doesn't work for whatever reason, write protection in the kernel can be globally disabled with the following ASM:
cli
mov %cr0, %rax
and $~0x10000, %rax
mov %rax, %cr0
sti
This disables interrupts, clears the WP (Write Protect) bit in CR0, and re-enables interrupts. Using assembly lets this work even though write_cr0(read_cr0() & ~0x10000) now fails, because the kernel's predefined function for writing to CR0 pins sensitive bits. Make sure you re-enable WP afterwards, though!
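To make the mask arithmetic concrete, here is a minimal userspace sketch that performs the same bit operations on an ordinary integer. It never touches the real register. The helper names and the sample value 0x80050033 are my own illustrative assumptions, not anything taken from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* WP is bit 16 of CR0, which is where the 0x10000 mask comes from. */
#define CR0_WP_MASK ((uint64_t)1 << 16)

/* Return the value with WP cleared; every other bit is preserved. */
static uint64_t cr0_clear_wp(uint64_t cr0)
{
    return cr0 & ~CR0_WP_MASK;
}

/* Return the value with WP set again. */
static uint64_t cr0_set_wp(uint64_t cr0)
{
    return cr0 | CR0_WP_MASK;
}

/* Non-zero if WP is set in the given value. */
static int cr0_wp_enabled(uint64_t cr0)
{
    return (cr0 & CR0_WP_MASK) != 0;
}

/* Simulate the clear/restore sequence on a plain integer (hypothetical helper). */
static uint64_t simulate_wp_toggle(uint64_t cr0)
{
    uint64_t unprotected = cr0_clear_wp(cr0);
    assert(!cr0_wp_enabled(unprotected)); /* window in which read-only pages would be writable */

    uint64_t restored = cr0_set_wp(unprotected);
    assert(cr0_wp_enabled(restored));
    return restored;
}
```

Calling cr0_clear_wp(0x80050033) gives 0x80040033, and setting the bit again gives back the starting value, so only bit 16 changes in either direction.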
So why is it marked as read-only if it's so easy to disable? One reason is that vulnerabilities exist which allow modifying kernel memory but not necessarily directly executing code. By marking critical areas of the kernel as read-only, it becomes more difficult to exploit them without finding an additional vulnerability to mark the pages as writable (or disable write-protection altogether). Now, this doesn't provide very strong security, so the main reason that it is marked as read-only is to make it easier to stop accidental overwrites from causing a catastrophic and unrecoverable system crash.
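A rough userspace analogy, assuming a Linux system with mmap and mprotect: the sketch below keeps a small table on a read-only page, so a stray write would fault, while code that deliberately flips the permission can still patch it. The function names are hypothetical and the table is just an array of long values, not a real syscall table.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map one anonymous page, store a value in it, then drop write permission. */
static long *make_readonly_table(size_t *len_out)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    long *table = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (table == MAP_FAILED)
        return NULL;

    table[0] = 42;
    if (mprotect(table, len, PROT_READ) != 0) {
        munmap(table, len);
        return NULL;
    }
    *len_out = len;
    return table;
}

/* Re-enable writes, patch one slot, then restore read-only. Returns 0 on success. */
static int patch_slot(long *table, size_t len, size_t idx, long value)
{
    assert(idx < len / sizeof(long));
    if (mprotect(table, len, PROT_READ | PROT_WRITE) != 0)
        return -1;
    table[idx] = value;
    return mprotect(table, len, PROT_READ);
}
```

Writing to the table without going through patch_slot would raise SIGSEGV, which is the userspace counterpart of an accidental overwrite being caught instead of silently corrupting the table.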
* The particular example given is for an x86_64 processor. The first table is for syscalls in native 64-bit mode (x64). The second is for syscalls in 32-bit mode (IA32). The third is for the rarely used x32 syscall ABI, which allows programs to use all the features of 64-bit mode (e.g. SSE instead of x87 for floating point operations) while using 32-bit pointers and values.
† The kernel's internal API changes all the time, so this exact function may not work on older or newer kernels. Globally disabling CR0.WP in ASM, however, is guaranteed to work on all x86 systems regardless of the kernel version.
As noted by forest, modern Linux does not allow this, but it's easy to override.
However, historically it was useful (and maybe still is) for security purposes: hot-patching against vulnerabilities. Back in the 1990s and early 2000s, whenever a new vulnerability was announced for a syscall I didn't absolutely need (ptrace was a really common one back then), I'd write a kernel module to overwrite the function address in the syscall table with the address of a function that just performed return -ENOSYS;. This eliminated the attack surface until an upgraded kernel was available. For some dubious syscalls I didn't need that repeatedly had vulnerabilities, I just preemptively did this for them and left the module enabled all the time.
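The idea can be sketched in plain userspace C with a toy dispatch table. This is only a simulation of the technique: every name here (demo_table, demo_disable and so on) is hypothetical, and a real kernel module would additionally have to locate the table and deal with its write protection.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef long (*syscall_fn)(long arg);

enum { DEMO_NR_GETVAL, DEMO_NR_RISKY, DEMO_NR_MAX };

static long demo_getval(long arg) { return arg + 1; }
static long demo_risky(long arg)  { return arg * 2; } /* stands in for the vulnerable call */

/* Replacement handler: behaves as if the syscall were not implemented. */
static long demo_disabled(long arg)
{
    (void)arg;
    return -ENOSYS;
}

/* Toy "syscall table": an array of function pointers indexed by number. */
static syscall_fn demo_table[DEMO_NR_MAX] = {
    [DEMO_NR_GETVAL] = demo_getval,
    [DEMO_NR_RISKY]  = demo_risky,
};

/* Swap in the stub and return the old handler so it can be restored later. */
static syscall_fn demo_disable(size_t nr)
{
    assert(nr < DEMO_NR_MAX);
    syscall_fn old = demo_table[nr];
    demo_table[nr] = demo_disabled;
    return old;
}

/* Look up the handler by number and call it, as a syscall entry path would. */
static long demo_dispatch(size_t nr, long arg)
{
    if (nr >= DEMO_NR_MAX)
        return -ENOSYS;
    return demo_table[nr](arg);
}
```

After demo_disable(DEMO_NR_RISKY), dispatching that number returns -ENOSYS while the other entry keeps working. Keeping the returned pointer lets you put the original handler back once a fixed kernel is installed.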