Just Another Geek

20 Feb 2010

How system calls work on recent Linux x86 glibc

This post explains how system calls are implemented on recent Linux systems. It covers only the x86_32 platform, with a recent Linux kernel and GNU libc (where recent means “released after 2005”).

Processor facilities for making syscalls

On x86, userspace processes run in ring 3, while the kernel runs in ring 0. Only the kernel can act as the interface between resources and processes.
A resource can be a hardware device, a kernel object or any kind of IPC. In other words, each time a userspace application needs to perform such an action, it has to make a request to the kernel. This is what we call a system call (syscall): basically, a transition from one ring to another.
Historically, on Linux and x86, the best-known method for performing a syscall is to raise a software interrupt (the classic int $0x80 instruction), which is trapped by the kernel and then processed.
It was the most efficient way until the Pentium 4, where it became the slowest mechanism available. The best method on x86_32 then became the sysenter/sysexit instructions, which can be used the same way as the interrupt. For instance, here is a simple call to _exit(42):

mov $1, %eax   # __NR_exit = 1
mov $42, %ebx  # status = 42
sysenter       # perform the syscall!

On AMD64, a similar mechanism exists: syscall/sysret, which is, by the way, known to be a better interface and to perform better than its Intel equivalent. Anyway.

Usually, except in shellcode, syscalls are issued by the libc and, depending on the processor, choosing one mechanism or another can have a strong impact on performance: if the libc keeps using int $0x80 even on a modern CPU, performance will be bad.
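
To make the role of the libc concrete, here is a minimal sketch in C that issues a system call through glibc's generic syscall() wrapper rather than with hand-written assembly. The function name raw_getpid is only an illustration; which instruction is eventually used to enter the kernel is left to the libc.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_getpid */
#include <sys/types.h>     /* pid_t */
#include <unistd.h>        /* syscall(), getpid() */

/* Ask the kernel for our PID through the generic syscall() wrapper.
 * The libc decides how the kernel is actually entered
 * (interrupt, sysenter, ...); the caller never sees it. */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

Calling raw_getpid() should return the same value as the regular getpid() wrapper, since both end up performing the same system call.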

The problem is that Linux distributions usually provide only one compiled version of the libc, which has to run equally well on all CPU generations (486, 586 or 686). Thus, there was a need for an abstraction layer, called by the libc, that would choose the best mechanism at runtime.
This is done by the kernel: it is compiled with all the syscall mechanisms and selects the best one at boot time. Once a method is chosen, the kernel exposes to userspace a function that directly uses the selected method. The page exposed this way is called a Virtual Dynamic Shared Object, or VDSO.
On the other side, in the libc, making a system call is just a matter of calling a VDSO function, without knowing whether a historical interrupt or a sysenter will be performed.
Here is our previous snippet rewritten to use the VDSO:
movl $1, %eax      # __NR_exit = 1
movl $42, %ebx     # status = 42
call *%gs:0x10     # Here, the offset (0x10) is platform-dependent
                   # The pointer stored at %gs:0x10 points into the VDSO page
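
The pointer that the libc calls through has to come from somewhere: on x86_32 the kernel advertises the entry point to each process in the ELF auxiliary vector (AT_SYSINFO), next to the base address of the VDSO image (AT_SYSINFO_EHDR). The sketch below reads both values with getauxval(), which assumes glibc 2.16 or later, so newer than the libc versions this post was written against. The function names are illustrative, and on platforms without such an entry point (x86_64 for instance) AT_SYSINFO simply reads as 0.

```c
#include <elf.h>        /* AT_SYSINFO, AT_SYSINFO_EHDR, ELFMAG, SELFMAG */
#include <string.h>     /* memcmp() */
#include <sys/auxv.h>   /* getauxval() */

/* Address of the syscall entry point advertised by the kernel,
 * or 0 when the platform does not provide one. */
unsigned long vsyscall_entry(void)
{
    return getauxval(AT_SYSINFO);
}

/* Base address of the VDSO image, or 0 if none is advertised. */
unsigned long vdso_base(void)
{
    return getauxval(AT_SYSINFO_EHDR);
}

/* 1 if a VDSO is mapped and starts with the ELF magic number, else 0. */
int vdso_looks_like_elf(void)
{
    unsigned long base = vdso_base();

    return base != 0 && memcmp((const void *)base, ELFMAG, SELFMAG) == 0;
}
```

On a 32-bit x86 process, vsyscall_entry() is expected to return an address located inside the mapping that starts at vdso_base().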

Virtual Dynamic Shared Object

A Virtual Dynamic Shared Object (VDSO) is a page maintained by the kernel and exposed to userspace by mapping it into the address space of each process. For instance:
$ cat /proc/self/maps
08048000-08051000 r-xp 00000000 fd:01 14450888 /bin/cat
08051000-08052000 rw-p 00009000 fd:01 14450888 /bin/cat
083d7000-083f8000 rw-p 00000000 00:00 0 [heap]
b7475000-b7633000 r--p 00000000 fd:01 592041 /usr/lib/locale/locale-archive
b7633000-b7634000 rw-p 00000000 00:00 0
b7634000-b7775000 r-xp 00000000 fd:01 5769153 /lib/i686/cmov/libc-2.10.2.so
b7775000-b7777000 r--p 00141000 fd:01 5769153 /lib/i686/cmov/libc-2.10.2.so
b7777000-b7778000 rw-p 00143000 fd:01 5769153 /lib/i686/cmov/libc-2.10.2.so
b7778000-b777b000 rw-p 00000000 00:00 0
b7794000-b7796000 rw-p 00000000 00:00 0
b7796000-b7797000 r-xp 00000000 00:00 0 [vdso]
b7797000-b77b3000 r-xp 00000000 fd:01 2818106 /lib/ld-2.10.2.so
b77b3000-b77b4000 r--p 0001b000 fd:01 2818106 /lib/ld-2.10.2.so
b77b4000-b77b5000 rw-p 0001c000 fd:01 2818106 /lib/ld-2.10.2.so
bfafd000-bfb12000 rw-p 00000000 00:00 0 [stack]

Here, the VDSO is one page long (4096 bytes). It contains the syscall abstraction interface, but also some shared variables (low-level information such as the rdtsc counter, the real-time timer, the stack canary, etc.).
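
The size can be checked programmatically. The following sketch, with illustrative function names, extracts the size of the [vdso] mapping from a line in the /proc/<pid>/maps format and applies it to the current process. It assumes the line layout shown in the listing above.

```c
#include <stdio.h>
#include <string.h>

/* Parse one /proc/<pid>/maps line. Returns the mapping size in bytes
 * if the line describes the [vdso] mapping, 0 otherwise. */
unsigned long vdso_size_from_maps_line(const char *line)
{
    unsigned long start, end;

    if (strstr(line, "[vdso]") == NULL)
        return 0;
    if (sscanf(line, "%lx-%lx", &start, &end) != 2 || end < start)
        return 0;
    return end - start;
}

/* Scan /proc/self/maps; returns the [vdso] size, or 0 if not found. */
unsigned long current_vdso_size(void)
{
    char line[512];
    unsigned long size = 0;
    FILE *maps = fopen("/proc/self/maps", "r");

    if (maps == NULL)
        return 0;
    while (size == 0 && fgets(line, sizeof line, maps) != NULL)
        size = vdso_size_from_maps_line(line);
    fclose(maps);
    return size;
}
```

Fed with the [vdso] line of the listing, vdso_size_from_maps_line() gives 0xb7797000 - 0xb7796000 = 4096 bytes. More recent kernels may map a larger VDSO, so current_vdso_size() can return a multiple of the page size.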
The selection of the right syscall method is done by the Linux kernel in arch/x86/vdso/vdso32-setup.c in the sysenter_setup function (which is called very early at kernel initialization by identify_boot_cpu()).
int __init sysenter_setup(void)
{
    void *syscall_page = (void *)get_zeroed_page(GFP_ATOMIC);
    const void *vsyscall;
    size_t vsyscall_len;

    vdso32_pages[0] = virt_to_page(syscall_page);

#ifdef CONFIG_X86_32
    /* ... */
#endif

    /* Pick the best available entry mechanism for this CPU */
    if (vdso32_syscall()) {
        vsyscall = &vdso32_syscall_start;
        vsyscall_len = &vdso32_syscall_end - &vdso32_syscall_start;
    } else if (vdso32_sysenter()){
        vsyscall = &vdso32_sysenter_start;
        vsyscall_len = &vdso32_sysenter_end - &vdso32_sysenter_start;
    } else {
        vsyscall = &vdso32_int80_start;
        vsyscall_len = &vdso32_int80_end - &vdso32_int80_start;
    }

    /* Copy the chosen implementation into the page exposed to userspace */
    memcpy(syscall_page, vsyscall, vsyscall_len);

    return 0;
}
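
The logic of this function can be modelled in plain userspace C. This is only a sketch of the decision chain, not kernel code: the enum, the function names and the boolean parameters are hypothetical stand-ins for vdso32_syscall(), vdso32_sysenter() and the memcpy() into the shared page.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum entry_method { ENTRY_INT80, ENTRY_SYSENTER, ENTRY_SYSCALL };

/* Same priority order as the if/else chain: syscall first,
 * then sysenter, and int $0x80 as the fallback that always works. */
enum entry_method pick_entry_method(bool has_syscall, bool has_sysenter)
{
    if (has_syscall)
        return ENTRY_SYSCALL;
    if (has_sysenter)
        return ENTRY_SYSENTER;
    return ENTRY_INT80;
}

/* Copy the chosen stub into the shared page.
 * Returns the number of bytes copied, or 0 if the stub does not fit. */
size_t install_stub(unsigned char *page, size_t page_len,
                    const unsigned char *stub, size_t stub_len)
{
    if (stub_len > page_len)
        return 0;
    memcpy(page, stub, stub_len);
    return stub_len;
}
```

The point of the design is that the choice is made once, at boot time, and every process then calls the same entry point without having to test CPU features itself.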

The implementation of the sysenter method is in arch/x86/vdso/vdso32/sysenter.S. The routine called by the libc (with the call *%gs:0x10) is named __kernel_vsyscall:
__kernel_vsyscall:
.LSTART_vsyscall:
    push %ecx
.Lpush_ecx:
    push %edx
.Lpush_edx:
    push %ebp
.Lenter_kernel:
    movl %esp,%ebp
    sysenter

    /* 7: align return point with nop's to make disassembly easier */
    .space 7,0x90

    /* 14: System call restart point is here! (SYSENTER_RETURN-2) */
    jmp .Lenter_kernel
    /* 16: System call normal return point is here! */
VDSO32_SYSENTER_RETURN: /* Symbol used by sysenter.c via vdso32-syms.h */
    pop %ebp
    pop %edx
    pop %ecx
    ret

Linus Torvalds is the proud owner of this code because he managed to handle system call restarting thanks to a CPU particularity: when the kernel is done with a system call and wants to give control back to the process, it just has to execute the sysexit instruction.
Before doing so, the kernel tells the CPU which fixed address sysexit has to jump to. This address is the VDSO32_SYSENTER_RETURN label seen in the previous routine. To restart a system call, the kernel returns two bytes earlier instead (SYSENTER_RETURN-2), on the jmp .Lenter_kernel instruction, which executes sysenter again.
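
The offsets quoted in the comments of the routine (7, 14 and 16) can be verified by laying out its machine code. The sketch below assumes the usual i386 encoding of each instruction, including the final ret that ends the routine. The array and the helper name are illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed i386 encoding of __kernel_vsyscall (sysenter flavour). */
static const uint8_t vsyscall_code[] = {
    0x51,                   /*  0: push %ecx                          */
    0x52,                   /*  1: push %edx                          */
    0x55,                   /*  2: push %ebp                          */
    0x89, 0xe5,             /*  3: movl %esp,%ebp  (.Lenter_kernel)   */
    0x0f, 0x34,             /*  5: sysenter                           */
    0x90, 0x90, 0x90, 0x90,
    0x90, 0x90, 0x90,       /*  7: .space 7,0x90 (seven nops)         */
    0xeb, 0xf3,             /* 14: jmp .Lenter_kernel (restart point) */
    0x5d,                   /* 16: pop %ebp  (VDSO32_SYSENTER_RETURN) */
    0x5a,                   /* 17: pop %edx                           */
    0x59,                   /* 18: pop %ecx                           */
    0xc3,                   /* 19: ret                                */
};

enum { SYSENTER_RETURN = 16, RESTART_POINT = SYSENTER_RETURN - 2 };

/* Destination of the two-byte short jump (opcode 0xeb followed by a
 * signed 8-bit displacement) located at offset `at`. */
size_t short_jmp_target(const uint8_t *code, size_t at)
{
    return (size_t)((ptrdiff_t)at + 2 + (int8_t)code[at + 1]);
}
```

With this layout, resuming at SYSENTER_RETURN - 2 lands exactly on the jmp, whose target is offset 3 (.Lenter_kernel). The stack pointer is copied into %ebp again and sysenter is executed a second time, which is all a restart needs.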