How system calls work on recent Linux x86 glibc
This post explains how system calls are implemented on recent Linux systems. It covers only the x86_32 platform, on a recent Linux kernel and GNU libc (where recent means “released after 2005”).
Processor facilities for making syscalls
On x86, userspace processes run in ring 3, while the kernel runs in ring
0. Only the kernel can act as the interface between processes and
resources.
A resource can be a hardware device, a kernel object or any kind of IPC.
In other words, each time a process needs to use one, it has to make a
request to the kernel. This request is what we call a system call
(syscall); basically, it is a transition from one ring to another.
Historically, on Linux and x86, the best-known method for performing a
syscall is to raise a software interrupt (the classic int $0x80
instruction), which is trapped by the kernel and then processed.
It was the most efficient way until the Pentium 4, where it became the
slowest mechanism available. The best method on x86_32 became the
sysenter/sysexit instructions, which are used in much the same way as
the interrupt. For instance, here is a simple call to _exit(42):
mov $1, %eax    # __NR_exit = 1
mov $42, %ebx   # status = 42
sysenter        # perform the syscall!
On AMD64, a similar mechanism exists: syscall/sysret, which is, by the
way, generally considered a cleaner and faster interface than its Intel
equivalent.
Usually, shellcode aside, syscalls are issued by the libc, and depending
on the processor, the choice of mechanism can have a strong impact on
performance: if the libc keeps using int $0x80 even on a modern CPU,
performance will suffer.
The problem is that, usually, Linux distributions provide only one
compiled version of the libc: it has to run equally well on all CPU
versions (486, 586 or 686). Thus, there was a need for an abstraction
layer called by the libc which would choose the best mechanism at
runtime.
This is done by the kernel: it is compiled with all the syscall
mechanisms and selects the best one at boot time. Once a method is
chosen, the kernel exposes to userspace a function that directly uses
the selected method. The page exposed this way is called a Virtual
Dynamic Shared Object, or VDSO.
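To make this concrete, a process can ask where the kernel mapped the object. The sketch below is only an illustration, not how the libc itself does it: it assumes a glibc that provides getauxval() (2.16 or later) and reads the AT_SYSINFO_EHDR entry of the auxiliary vector, which holds the base address of the VDSO. The helper names is_elf_image and vdso_base are made up for this example.

```c
#include <elf.h>       /* ELFMAG, SELFMAG */
#include <stddef.h>    /* NULL */
#include <string.h>    /* memcmp */
#include <sys/auxv.h>  /* getauxval, AT_SYSINFO_EHDR */

/* Returns 1 if the bytes at p start with the ELF magic number "\177ELF". */
static int is_elf_image(const void *p)
{
    return memcmp(p, ELFMAG, SELFMAG) == 0;
}

/*
 * Returns the base address of the VDSO mapped into this process,
 * or NULL if the kernel did not provide one.
 */
static const void *vdso_base(void)
{
    return (const void *)getauxval(AT_SYSINFO_EHDR);
}
```

On a kernel that maps a VDSO, passing the result of vdso_base() to is_elf_image() should report a valid ELF header, which is where the “Shared Object” part of the name comes from.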
On the other side, in the libc, making a system call is just a matter
of calling a function in the VDSO, without knowing whether it will use
the historical interrupt or sysenter.
If we rewrite our previous snippet to make it use the VDSO:
movl $1, %eax    # __NR_exit = 1
movl $42, %ebx   # status = 42
call *%gs:0x10   # Here, the offset (0x10) is platform-dependent.
                 # %gs:0x10 holds a pointer to the entry point located in the VDSO.
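From C, there is rarely a reason to write this by hand. As a rough illustration, glibc's generic syscall() wrapper takes a syscall number and lets the library pick the entry mechanism; on x86_32 it should end up going through the same VDSO entry point, though that is an assumption about the glibc build in use. raw_getpid is a made-up name for this example, and getpid is used instead of _exit so that the program keeps running.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>  /* SYS_getpid */
#include <sys/types.h>    /* pid_t */
#include <unistd.h>       /* syscall, getpid */

/*
 * Issue the getpid system call by number through glibc's generic
 * syscall() wrapper instead of the dedicated getpid() function.
 */
static pid_t raw_getpid(void)
{
    return (pid_t)syscall(SYS_getpid);
}
```

Both raw_getpid() and getpid() should return the same process ID; the caller never has to know whether an interrupt or sysenter was used underneath.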
Virtual Dynamic Shared Object
A Virtual Dynamic Shared Object (VDSO) is a page maintained by the
kernel and exposed to userspace by mapping it into the address space of
each process. For instance:
$ cat /proc/self/maps
08048000-08051000 r-xp 00000000 fd:01 14450888 /bin/cat
08051000-08052000 rw-p 00009000 fd:01 14450888 /bin/cat
083d7000-083f8000 rw-p 00000000 00:00 0 [heap]
b7475000-b7633000 r--p 00000000 fd:01 592041 /usr/lib/locale/locale-archive
b7633000-b7634000 rw-p 00000000 00:00 0
b7634000-b7775000 r-xp 00000000 fd:01 5769153 /lib/i686/cmov/libc-2.10.2.so
b7775000-b7777000 r--p 00141000 fd:01 5769153 /lib/i686/cmov/libc-2.10.2.so
b7777000-b7778000 rw-p 00143000 fd:01 5769153 /lib/i686/cmov/libc-2.10.2.so
b7778000-b777b000 rw-p 00000000 00:00 0
b7794000-b7796000 rw-p 00000000 00:00 0
b7796000-b7797000 r-xp 00000000 00:00 0 [vdso]
b7797000-b77b3000 r-xp 00000000 fd:01 2818106 /lib/ld-2.10.2.so
b77b3000-b77b4000 r--p 0001b000 fd:01 2818106 /lib/ld-2.10.2.so
b77b4000-b77b5000 rw-p 0001c000 fd:01 2818106 /lib/ld-2.10.2.so
bfafd000-bfb12000 rw-p 00000000 00:00 0 [stack]
Here, the VDSO is one page long (4096 bytes). It contains the syscall
abstraction interface, but also some shared variables (low-level
information like the rdtsc counter, real-time timer, stack canary, etc.).
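The size can also be computed rather than read off by eye. The following is a minimal sketch, assuming the /proc/<pid>/maps line format shown in the listing above; both function names are invented for this example.

```c
#include <stdio.h>   /* FILE, fopen, fgets, fclose, sscanf */
#include <string.h>  /* strstr */

/*
 * Parse one line of /proc/<pid>/maps. If it describes the [vdso]
 * mapping, return its size in bytes; otherwise return 0.
 */
static unsigned long vdso_size_from_maps_line(const char *line)
{
    unsigned long start, end;

    if (strstr(line, "[vdso]") == NULL)
        return 0;
    if (sscanf(line, "%lx-%lx", &start, &end) != 2)
        return 0;
    return end - start;
}

/* Scan /proc/self/maps and return the size of our own VDSO (0 if not found). */
static unsigned long vdso_size_of_self(void)
{
    char line[512];
    unsigned long size = 0;
    FILE *f = fopen("/proc/self/maps", "r");

    if (f == NULL)
        return 0;
    while (size == 0 && fgets(line, sizeof line, f) != NULL)
        size = vdso_size_from_maps_line(line);
    fclose(f);
    return size;
}
```

Fed the [vdso] line of the listing, vdso_size_from_maps_line() gives 0xb7797000 - 0xb7796000 = 0x1000, that is 4096 bytes.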
The selection of the right syscall method is done by the Linux kernel in
arch/x86/vdso/vdso32-setup.c, in the sysenter_setup function (which is
called very early at kernel initialization by identify_boot_cpu()).
int __init sysenter_setup(void)
{
    void *syscall_page = (void *)get_zeroed_page(GFP_ATOMIC);
    const void *vsyscall;
    size_t vsyscall_len;
    vdso32_pages[0] = virt_to_page(syscall_page);
#ifdef CONFIG_X86_32
    gate_vma_init();
#endif
    if (vdso32_syscall()) {
        vsyscall = &vdso32_syscall_start;
        vsyscall_len = &vdso32_syscall_end - &vdso32_syscall_start;
    } else if (vdso32_sysenter()) {
        vsyscall = &vdso32_sysenter_start;
        vsyscall_len = &vdso32_sysenter_end - &vdso32_sysenter_start;
    } else {
        vsyscall = &vdso32_int80_start;
        vsyscall_len = &vdso32_int80_end - &vdso32_int80_start;
    }
    memcpy(syscall_page, vsyscall, vsyscall_len);
    relocate_vdso(syscall_page);
    return 0;
}
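Stripped of the page handling, the function boils down to a priority order: syscall first, then sysenter, then int $0x80 as the fallback. Here is a small standalone sketch of that decision, with hypothetical names: the enum and pick_entry_method do not exist in the kernel, and the two flags merely stand in for the results of vdso32_syscall() and vdso32_sysenter().

```c
/* Hypothetical names for illustration; none of this is kernel code. */
enum entry_method {
    ENTRY_INT80,     /* legacy software interrupt, always available */
    ENTRY_SYSENTER,  /* Intel fast system call instruction */
    ENTRY_SYSCALL    /* AMD fast system call instruction */
};

/*
 * Mirror the order of the if/else chain in sysenter_setup():
 * prefer syscall, then sysenter, and fall back to int $0x80.
 */
static enum entry_method pick_entry_method(int has_syscall32, int has_sysenter)
{
    if (has_syscall32)
        return ENTRY_SYSCALL;
    if (has_sysenter)
        return ENTRY_SYSENTER;
    return ENTRY_INT80;
}
```

A CPU offering neither fast instruction (pick_entry_method(0, 0)) ends up with ENTRY_INT80, so the same libc binary keeps working on older processors.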
The implementation of the sysenter method is in
arch/x86/vdso/vdso32/sysenter.S.
The routine called by the libc (with the call *%gs:0x10) is named
__kernel_vsyscall:
__kernel_vsyscall:
.LSTART_vsyscall:
    push %ecx
.Lpush_ecx:
    push %edx
.Lpush_edx:
    push %ebp
.Lenter_kernel:
    movl %esp,%ebp
    sysenter
    /* 7: align return point with nop's to make disassembly easier */
    .space 7,0x90
    /* 14: System call restart point is here! (SYSENTER_RETURN-2) */
    jmp .Lenter_kernel
    /* 16: System call normal return point is here! */
VDSO32_SYSENTER_RETURN: /* Symbol used by sysenter.c via vdso32-syms.h */
    pop %ebp
.Lpop_ebp:
    pop %edx
.Lpop_edx:
    pop %ecx
.Lpop_ecx:
    ret
Linus Torvalds is the proud author of this code because he managed to
handle system call restarting thanks to a CPU peculiarity: when the
kernel is done with a system call and wants to give control back to the
process, it just has to execute the sysexit instruction.
Before that, the kernel tells the CPU which fixed address to jump to on
sysexit. This address is the VDSO32_SYSENTER_RETURN label seen in the
previous routine.
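The offsets 7, 14 and 16 quoted in the comments of __kernel_vsyscall can be checked by adding up instruction sizes. The table below is a sketch that assumes the usual x86_32 encodings (one byte for push/pop of a register and for ret, two bytes for movl %esp,%ebp, for sysenter and for a short jmp); the struct and function names are invented for this example.

```c
#include <stddef.h>  /* size_t */

/* One entry per instruction (or padding run) of __kernel_vsyscall. */
struct insn {
    const char *text;  /* assembly source, for reference only */
    size_t size;       /* assumed encoded size in bytes */
};

static const struct insn vsyscall_layout[] = {
    { "push %ecx",          1 },
    { "push %edx",          1 },
    { "push %ebp",          1 },
    { "movl %esp,%ebp",     2 },  /* .Lenter_kernel */
    { "sysenter",           2 },
    { ".space 7,0x90",      7 },  /* seven one-byte nops */
    { "jmp .Lenter_kernel", 2 },  /* short jump: opcode + 8-bit displacement */
    { "pop %ebp",           1 },  /* VDSO32_SYSENTER_RETURN */
    { "pop %edx",           1 },
    { "pop %ecx",           1 },
    { "ret",                1 },
};

/* Byte offset of entry `index` from the start of __kernel_vsyscall. */
static size_t offset_of(size_t index)
{
    size_t off = 0;

    for (size_t i = 0; i < index; i++)
        off += vsyscall_layout[i].size;
    return off;
}
```

With these sizes, offset_of(5) is 7, offset_of(6) is 14 and offset_of(7) is 16: the jmp sits exactly two bytes before VDSO32_SYSENTER_RETURN, which matches the SYSENTER_RETURN-2 remark in the source comments.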