Just another geek


Document Review of Qubes OS

Qubes OS

You must have heard about it: Invisible Things Lab released their own operating system, named Qubes OS (if you ask me, I would have referred to it as a Linux distribution instead). Their distribution focuses on security through isolation and builds on their virtualization experience (for the record, Joanna and Rafal are the people behind most of the virtualization vulnerabilities found in recent years).

Disclaimer: I have not had the occasion to test the system; this post is based only on my reading of their (great) Qubes OS architecture paper (version 0.3). I did not read the source code, so be careful with what follows :)

Target of the distribution

Maybe I am guessing wrong, but this distribution seems to be aimed at classified environments. Even if it is usable by anyone, some concepts make me believe this work will be sold to government or military customers because it fulfills most of their requirements. Anyway, it is awesome that it was released to the public under an Open Source license.

I am sure that this release will be really helpful for the people involved in the “SEC&SI challenge” organized by the French government. This research project is dedicated to building a secure (Linux-based) desktop platform usable by my grandma.

What is new?

There is nothing new by itself: only components already available in the community have been used (Linux kernel, Xen, Xorg, LUKS, device mapper, etc.) and little new code has been written.

The beauty of their solution is that every technique used is individually well known, but nobody had the idea to put them together the way they did. Bravo!

The big picture

User activities can be classified into multiple security levels: web browsing, banking, corporate work, social networking, e-shopping, etc.

Each activity represents a security domain: banking activities are more critical than social networking (right?). Usually, on most operating systems, all activities share the same “space”: if you are compromised while browsing youtube.com, the attacker can access your sensitive data (bank account or corporate email).

Qubes OS solves this problem by running each domain in its own virtual machine (called an AppVM in Qubes OS terminology). The virtualization solution they chose is Xen; they explain why they chose it over KVM in their architecture guide.

Thanks to the Xen hypervisor, each AppVM is isolated and cannot access any resources other than its own.

Architecture

Multimedia

  • Display

    Until Qubes OS, every Linux distribution providing “multi-level security” used one X server running on the host (in dom0), and each virtual machine reached the display through either:

    • VNC-over-ssh. Problem: VNC clients and servers have not really been audited and they are riddled with security issues.
    • X11 forwarding. Problem: there is basically no security control in X11; an X client can do anything to other windows, like capturing or injecting keystrokes, snooping on other applications, etc. This is problematic when windows do not have the same security level.

    A less common alternative is to run one Xorg server per virtual machine: each security level lives in its own virtual terminal. However, unless there is one video card per X session, all servers share the same hardware resource, so a vulnerability in Xorg would impact the other Xorg servers.

    The innovation of Qubes OS is to use none of these methods. Each AppVM runs an Xorg instance (with a “dummy graphics driver”, which I guess is not tied to any hardware device) and an AppVM Window Manager.

    The dom0 domain runs the “real” Xorg server tied to the graphics card and multiple AppViewers. Each AppViewer communicates with one AppVM Window Manager (which lives inside an AppVM) via the Xen ring buffer protocol. The task of each AppViewer is to proxy input devices to the right AppVM (depending on which one has the focus). Each virtual machine uses a homemade Xorg input driver called xf86-input-mfndev (which gets its input from the ring buffer).

    Each AppVM Window Manager sends notifications to its associated AppViewer. The events monitored are: creation of a new window, content refresh and change of window focus.

    When an AppViewer receives a content refresh notification, it asks the AppVM Window Manager for its composition buffer (in other words, the bitmap of the window content). It receives these bytes through the ring buffer and displays them on the screen.

    The optimization, which is still being investigated, is to ask for the address of the composition buffer instead of sending the raw bitmap. This is possible because dom0 has access to the address space of every AppVM, so it can use the bytes directly to render the window without a double copy.

    However, I do not know if this optimization would be sufficient to handle video playback: the paper suggests that the user can watch videos on YouTube, so it seems to work, but I don’t see how. Even on my “normal” desktop, the system gets slow if I simply disable the Xvideo overlay.

  • Audio

    Audio support is not yet implemented but will certainly be based on the same principle as the “composition buffer”: the audio stream will be “written” to a buffer readable by the AppViewer.

    That way, a dom0 daemon just has to mix the audio streams of all AppViewers and send the final stream to the sound card.

  • Clipboard

    “Applicative clipboard” operations (as opposed to the X11 clipboard mechanism) are supported between AppVMs. The user has to press a special shortcut (S-C-v), which is intercepted by dom0 and not passed to the AppVM. At this point, the AppViewer triggers a command on the virtual machine, which sends the content of the cursor selection through the Xen ring buffer. The bytes are then stored in a volatile file on dom0.

    When the special paste shortcut is pressed, dom0 injects the stored content through the ring buffer again and emulates the paste action.

Storage architecture

AppVMs share the same “base filesystem” in order not to waste disk space. To that end, each domain mounts a read-only block device and, on top of it, a copy-on-write block device (thanks to the kernel’s device-mapper) accessible only to that AppVM.

Each time an AppVM is started, the copy-on-write volume is deleted in order to start from a clean environment. Persistent data (like user documents) is stored in another, private volume which is restored at AppVM creation time.
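
To make the read-only base plus copy-on-write overlay idea concrete, here is a toy model in Python. It is only a sketch of the concept: the CowVolume class and the block contents are made up for illustration, and the real system works on block devices through device-mapper, not on Python objects.

```python
class CowVolume:
    """Toy copy-on-write view over a shared, read-only base image."""

    def __init__(self, base_blocks):
        self._base = base_blocks   # shared by every AppVM, never modified
        self._overlay = {}         # private to one AppVM

    def read(self, index):
        # A block written by this AppVM shadows the base block.
        return self._overlay.get(index, self._base[index])

    def write(self, index, data):
        # Writes never touch the base: they land in the private overlay.
        self._overlay[index] = data

    def reset(self):
        # Models "the copy-on-write volume is deleted" at AppVM start.
        self._overlay.clear()


base = (b"kernel", b"libc", b"browser")   # hypothetical base filesystem blocks
banking = CowVolume(base)
social = CowVolume(base)

social.write(2, b"browser+malware")
print(social.read(2))    # the social AppVM sees its own modification
print(banking.read(2))   # the banking AppVM still sees the clean base
social.reset()
print(social.read(2))    # clean again after a restart
```

The point of the design is visible in the last three lines: a modification made in one domain never reaches another one, and it does not survive a restart.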

All block devices are exported by the Storage Domain. This abstraction layer is needed to make file sharing between AppVMs possible (thanks to a homemade cryptographic protocol).

We can see that the Storage Domain has great powers. To counterbalance them, cryptography is used.

The “base read-only block device” is signed (on a per-block basis). The private key is available only to the TPM and the dom0.

Application-specific volumes (the copy-on-write overlay and the persistent block device) are encrypted (with LUKS) with a key available only to the AppVM and dom0.

Thanks to this design, a compromise of the Storage Domain would be worthless: any attempt to modify data would be detected, and persistent files are encrypted, so an attacker would be disappointed :)

Network architecture

Most of the remote vulnerabilities found in the Linux kernel have been discovered in device drivers such as network adapter drivers. Because any bug in the kernel endangers the whole system, it would be great to find a way to isolate these drivers.

Thanks to recent CPU features, this is now possible: Intel VT-d technology makes it possible to safely give a virtual machine access to a hardware device.

In other words, Qubes OS delegates the PCI wireless card to an AppVM, called the Network domain. If a vulnerability is found in the wifi driver, only that virtual machine is compromised.

The Network domain is the border router: every AppVM routes its traffic through it. One of its tasks is also to enforce the traffic policy: AppVMs are not allowed to communicate with each other, only HTTPS flows are allowed for the banking domain, only VPN traffic is allowed for the corporate domain, etc.
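
The kind of policy described above can be sketched as a simple lookup. This is purely illustrative: the domain names, the ALLOWED_FLOWS table, the is_allowed() helper and the VPN port are my own assumptions, not something taken from the Qubes OS paper, and a real border router would enforce this with firewall rules rather than Python.

```python
# Hypothetical AppVM names and allowed (protocol, port) pairs per domain.
APPVMS = {"banking", "corporate", "social"}
ALLOWED_FLOWS = {
    "banking": {("tcp", 443)},                 # HTTPS only
    "corporate": {("udp", 1194)},              # assumed VPN port, for illustration
    "social": {("tcp", 80), ("tcp", 443)},
}


def is_allowed(src_vm, destination, proto, port):
    """Decide whether the Network domain should forward this flow."""
    if destination in APPVMS:
        return False   # AppVMs never talk to each other
    return (proto, port) in ALLOWED_FLOWS.get(src_vm, set())


print(is_allowed("banking", "bank.example", "tcp", 443))   # HTTPS from banking
print(is_allowed("banking", "bank.example", "tcp", 80))    # plain HTTP from banking
print(is_allowed("social", "banking", "tcp", 443))         # AppVM to AppVM
```

An unknown domain gets an empty rule set, so the sketch denies by default.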

Conclusion

On paper, Qubes OS seems really well designed and robust from a security point of view. Judging from the screenshots, the user experience seems good. I don’t know how good or bad the performance is: memory usage must be really high (because AFAIK, Xen does not implement the “Kernel Samepage Merging” feature available in KVM since 2.6.32).

But, anyway, congratulations to “Invisible Things Lab” for this great architecture!

CVE-2010-0740: Record of Death Vulnerability in OpenSSL

A new vulnerability in OpenSSL (CVE-2010-0740), affectionately called “Record of death” (in reference to the ping of death vulnerability back in 1996), was fixed by the patch below:

--- ssl/s3_pkt.c 24 Jan 2010 13:52:38 -0000 1.57.2.9
+++ ssl/s3_pkt.c 24 Mar 2010 00:00:00 -0000
@@ -291,9 +291,9 @@
  if (version != s->version)
   {
   SSLerr(SSL_F_SSL3_GET_RECORD,SSL_R_WRONG_VERSION_NUMBER);
-  /* Send back error using their
-   * version number :-) */
-  s->version=version;
+                if ((s->version & 0xFF00) == (version & 0xFF00))
+                 /* Send back error using their minor version number :-) */
+   s->version = (unsigned short)version;
   al=SSL_AD_PROTOCOL_VERSION;
   goto f_err;
   }

Arno and I had a look at this vuln, but at a glance it’s hard to understand the consequences of these two modifications:

  • Comparison of the server version and the packet version
  • Use of cast for the assignment.

The latter is the interesting part. s->version is declared as an int (a 32-bit signed value on x86) and version is a short (a 16-bit signed value on x86).

When doing the following assignment:

int i_version;
short s_version;

i_version = s_version;

What are the problems? On x86, i_version is big enough to store s_version, so there are no truncation or overflow issues. However, both variables are signed and the C standard has the following rule:

Conversion of an operand value to a compatible type causes no change to the value or the representation.

In other words:

short s;
int i;

s = -1;
i = s; /* i must be equal to -1 */

To do this, the compiler has to perform a sign extension, which means that if the short value was negative, its integer value must stay negative.

Internally, the most significant bit (msb) of the short variable is propagated into the “upper bits” of the integer variable. Examples:

  |--------+-----+------------|
  |  short | msb | integer    |
  |--------+-----+------------|
  | 0x0000 |   0 | 0x00000000 |
  | 0x7000 |   0 | 0x00007000 |
  | 0x8000 |   1 | 0xffff8000 |
  | 0xffff |   1 | 0xffffffff |
  |--------+-----+------------|

So if version >= 0x8000, s->version ends up with a bit pattern >= 0xffff8000 (that is, a negative int). The (unsigned short) cast added by the patch prevents this sign extension.
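
The sign extension is easy to reproduce outside of C. The small Python sketch below reinterprets a 16-bit pattern as a signed short and shows the 32-bit pattern obtained after widening; short_to_int() is a helper name I made up for the illustration.

```python
import struct


def short_to_int(raw16):
    """Read a 16-bit pattern as a signed short, widen it, return the 32-bit pattern."""
    (as_short,) = struct.unpack("<h", struct.pack("<H", raw16))
    return as_short & 0xFFFFFFFF   # two's complement view of the widened value


for raw in (0x0000, 0x7000, 0x8000, 0xFFFF):
    print("0x%04x -> 0x%08x" % (raw, short_to_int(raw)))
# 0x0000 -> 0x00000000
# 0x7000 -> 0x00007000
# 0x8000 -> 0xffff8000
# 0xffff -> 0xffffffff
```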

According to the advisory, this bug can crash an OpenSSL endpoint through a read attempt at NULL.
The OpenSSL code makes extensive use of indirect function pointers for callbacks, so it is hard to follow the code path without spending some time on it; I can therefore confirm neither my hypothesis nor the impact of the bug.

GSM 7 Bits Encoding

I implemented some parts of the GSM protocols in scapy, so I had to implement the infamous “7-bit alphabet”.

This is used for SMS encoding, for example. The principle is simple: each character is coded on 7 bits and the characters are packed back to back, so a single byte can hold (parts of) two characters.

My google-fu was not sufficient to find a readable implementation so I gave it a try:

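def decode_gsm7bits(x):
    # Unpack a string of packed octets into one character per septet.
    shift=0    # number of bits carried over from the previous octet
    remain=0   # the carried-over bits themselves
    s=''
    if not x:
        return s
    for byte in x:
        i = (ord(byte) << shift) | remain
        remain = (i >> 7)
        i = i & 0x7f
        s+=chr(i)
        shift = (shift+1)%7
        if shift == 0:
            # every 7 octets, the carry is a complete 8th character
            s+=chr(remain)
            remain=0
    if s[-1] == '\x00': # padding issue
        s=s[:-1]
    return s

def encode_gsm7bits(x):
    # Pack a string of 7-bit characters into octets.
    shift=0    # number of bits of x[i] already emitted in the previous octet
    srclen  = len(x)
    i=0
    stream=''
    while i < srclen:
        if i+1 == srclen:
            nxt = 0
        else:
            nxt = ord(x[i+1]) << (7-shift)
        cur  = (ord(x[i]) >> shift) | nxt
        stream += chr(cur & 0xff)
        i+=1
        shift = (shift+1)%7
        if shift == 0:
            # this character was emitted entirely in the previous octet
            i+=1
    return stream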

As far as I can tell, it works like a charm: I successfully managed to send raw messages to mobiles :)
I will post the GSM layers on scapy’s trac as soon as possible.
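
For readers who want to play with the packing on a recent Python, here is a different way to get the same bit layout. It is not the scapy code: it treats the whole message as one big little-endian integer, and the names pack_septets() and unpack_septets() are mine. It also assumes that the input bytes are already GSM 7-bit codes (lowercase ASCII letters happen to have the same values), so no alphabet translation is done.

```python
def pack_septets(codes):
    """Pack 7-bit codes back to back, least significant bits first."""
    value = 0
    for index, code in enumerate(codes):
        if code > 0x7F:
            raise ValueError("not a 7-bit code: %r" % code)
        value |= code << (7 * index)
    # Round the bit count up to a whole number of octets.
    return value.to_bytes((7 * len(codes) + 7) // 8, "little")


def unpack_septets(data, count):
    """Extract `count` 7-bit codes from packed octets."""
    value = int.from_bytes(data, "little")
    return bytes((value >> (7 * i)) & 0x7F for i in range(count))


packed = pack_septets(b"hello")
print(packed.hex())                # e8329bfd06
print(unpack_septets(packed, 5))   # b'hello'
```

Passing the number of characters explicitly avoids the padding ambiguity: 7 characters and 8 characters both fit in 7 octets, so the length cannot be recovered from the packed data alone.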

SECCOMP as a Sandboxing Solution ?

Sandboxing technology?

SECCOMP is a Linux feature introduced in 2.6.12 (2005) by Andrea Arcangeli, initially designed for grid computing applications (the prctl() interface used below arrived in 2.6.23). The idea was to sell CPU time to the public by running untrusted binaries.

When a process goes into SECCOMP mode, it can only make 4 syscalls: read, write, _exit and sigreturn. The kernel enforces this limitation by killing the process (with a SIGKILL signal) if any other system call is made.
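
This behaviour can be observed from a script. The sketch below forks a child, switches it to strict SECCOMP mode through prctl() (called via ctypes) and lets it attempt a forbidden syscall, while the parent reports how the child ended. The function name is mine, the two constants are the values from the kernel headers, and the "unavailable" outcome covers kernels or containers where strict mode cannot be entered.

```python
import ctypes
import os
import signal

PR_SET_SECCOMP = 22        # value from <linux/prctl.h>
SECCOMP_MODE_STRICT = 1    # value from <linux/seccomp.h>


def run_forbidden_syscall_under_seccomp():
    """Fork a child that enters strict SECCOMP mode; report how it ended."""
    pid = os.fork()
    if pid == 0:
        # Child: if anything fails before SECCOMP is active, exit with 2.
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            rc = libc.prctl(PR_SET_SECCOMP,
                            ctypes.c_ulong(SECCOMP_MODE_STRICT),
                            ctypes.c_ulong(0), ctypes.c_ulong(0),
                            ctypes.c_ulong(0))
            if rc != 0:
                os._exit(2)
            os.getppid()   # not one of the 4 allowed syscalls
        except BaseException:
            pass
        os._exit(2)
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status) and os.WTERMSIG(status) == signal.SIGKILL:
        return "killed"
    return "unavailable"


print(run_forbidden_syscall_under_seccomp())
```

On a kernel built with CONFIG_SECCOMP and without another seccomp filter already in place, I would expect this to print "killed".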

The security guarantee here is pretty strong: the only ways to get around the protection are to use file descriptors that are already open or to access shared memory.

SECCOMP is the perfect solution for a sandbox because the kernel attack surface is really small! For the record, in the whole kernel security history, no vulnerability was ever found in these syscalls.

The downside of this feature is how restrictive it is! Once in SECCOMP mode, it is impossible to do anything except some arithmetic. Another SECCOMP problem is that entering SECCOMP mode is voluntary: the program has to issue a prctl() call itself, with the appropriate arguments, which means the application needs to be developed specifically for it.

The purpose of a sandbox is to run untrusted binaries without requiring source modifications. Currently, there are two main problems:

  • Entering SECCOMP mode
  • Preventing the untrusted process from issuing forbidden system calls

Both problems need to be solved without requiring a recompilation. How to do it despite this constraint?

Entering SECCOMP mode

Basically, we need to inject a call to prctl() into a given process. The best-known method is to write directly into the memory of the process using the ptrace() interface.

Besides the evident portability problems and the inherent difficulty of injecting instructions into a process, this solution was not investigated because of its hackish nature.

Instead, let’s take a look at a simple binary:

$ objdump -f a.out
a.out:     file format elf32-i386
architecture: i386, flags 0x00000112:
EXEC_P, HAS_SYMS, D_PAGED
start address 0x080482e0

The entry point of the binary, 0x080482e0, is the _start routine provided by the compiler and shown here:

080482e0 <_start>:
 80482e0:       31 ed                   xor    ebp,ebp
 80482e2:       5e                      pop    esi
 80482e3:       89 e1                   mov    ecx,esp
 80482e5:       83 e4 f0                and    esp,0xfffffff0
 80482e8:       50                      push   eax
 80482e9:       54                      push   esp
 80482ea:       52                      push   edx
 80482eb:       68 b0 83 04 08          push   0x80483b0
 80482f0:       68 c0 83 04 08          push   0x80483c0
 80482f5:       51                      push   ecx
 80482f6:       56                      push   esi
 80482f7:       68 94 83 04 08          push   0x8048394
 80482fc:       e8 c7 ff ff ff          call   80482c8 <__libc_start_main@plt>

It initializes the stack and then calls the “init function” of the GNU libc, which will eventually execute the main() function. At this point, the program is actually running.

The interesting property of this routine is how the libc function is called: by using the Procedure Linkage Table (PLT). In a few words, that means the dynamic linker has to resolve the symbol.

Thanks to the LD_PRELOAD feature, it’s possible to override ELF symbols. This is how we issue the prctl() call: we override the __libc_start_main function and then call the original one ourselves to be totally transparent. Here is how it’s done:

typedef int (*main_t)(int, char **, char **);
main_t realmain;

int __libc_start_main(main_t main,
                      int argc,
                      char *__unbounded *__unbounded ubp_av,
                      ElfW(auxv_t) *__unbounded auxvec,
                      __typeof (main) init,
                      void (*fini) (void),
                      void (*rtld_fini) (void), void *__unbounded
                      stack_end)
{
        void *libc;
        int (*libc_start_main)(main_t main,
                               int,
                               char *__unbounded *__unbounded,
                               ElfW(auxv_t) *,
                               __typeof (main),
                               void (*fini) (void),
                               void (*rtld_fini) (void),
                               void *__unbounded stack_end);

        libc = dlopen("libc.so.6", RTLD_LOCAL  | RTLD_LAZY);
        if (!libc)
                ERROR("  dlopen() failed: %s\n", dlerror());
        libc_start_main = dlsym(libc, "__libc_start_main");
        if (!libc_start_main)
                ERROR("     Failed: %s\n", dlerror());

        realmain = main;
        void (*__malloc_initialize_hook) (void) = my_malloc_init;
        return (*libc_start_main)(wrap_main, argc, ubp_av, auxvec,
        init, fini, rtld_fini, stack_end);
}

In a nutshell:

  1. The first parameter of the function is the address of the main
  2. We open the libc library object
  3. We find the location of the original __libc_start_main
  4. We save the original main function into a global variable
  5. We call the original __libc_start_main, replacing the original main with our own (wrap_main), shown here:

int wrap_main(int argc, char **argv, char **environ)
{
        if (prctl(PR_SET_SECCOMP, 1, 0, 0) == -1) {
                perror("prctl(PR_SET_SECCOMP) failed");
                printf("Maybe you don't have the CONFIG_SECCOMP support built into your kernel?\n");
                exit(1);
        }

        /* From here on, the original program runs under SECCOMP */
        return (*realmain)(argc, argv, environ);
}

At this point, the original main() is called and the program is executed under SECCOMP. The drawback of this method is its incompatibility with statically linked binaries: in that case, the _start routine calls the __libc_start_main function directly, without going through the PLT.

The big vulnerability here is the case of a malicious binary whose _start routine does not call __libc_start_main: the prctl() would never be issued and the program would run without sandboxing. This issue has been ignored for the moment, but it will require some thought…

There is still the option of modifying the memory with some ptrace() calls or rewriting some memory mapping thanks to the method of Sebastian Krahmer presented in lasso.

Interception of syscalls

Now that the application is running under SECCOMP, it can no longer make any syscall (except read, write, _exit and sigreturn). Because we assumed that the sandboxed program was not designed to run under SECCOMP, we have to prevent it from issuing such forbidden calls.

Thus, we need to intercept the syscall before the kernel sees it, process it if possible and emulate the kernel behavior. The interception of syscalls is usually done, again, with the ptrace() interface. The main drawback of this method is the lack of debugging means: because all debuggers use ptrace and a process can only be traced by one tracer at a time, each bug would be a nightmare.

Furthermore, the ptrace interface is known to be fragile and a lot of security bugs have been found in it. Fortunately, most of them were on the tracer side, but there were some advisories where the tracee could harm the tracer process.

Another solution, based on the analysis of how the libc handles syscalls, was investigated. We saw in my previous post “How system calls work on Linux?” that the GNU libc makes syscalls with a call *%gs:0x10 (where the 0x10 offset may vary).

Hijacking VDSO

In order to intercept (legit) syscalls, we need to intercept the call instruction mentioned above. This is easy: we just have to overwrite the pointer stored at %gs:0x10 and redirect the process to our own function.

This is what we do immediately after turning on SECCOMP:

static inline __attribute__((always_inline)) void hijack_vdso_gate(void) {
        /* Save the original VDSO entry point, then install our handler */
        asm("mov %%gs:0x10, %%ebx\n"
            "mov %%ebx, %0\n"

            "mov %1, %%ebx\n"
            "mov %%ebx, %%gs:0x10\n"

            : "=m" (real_handler)
            : "r" (handler)
            : "ebx");
}

From now on, every syscall is trapped by our handler, even the ones that are “allowed” by SECCOMP.

Demultiplexing syscalls

The purpose of the handler is to look at the requested syscall and see whether we need to honor it ourselves (because it’s a forbidden syscall) or run the original VDSO function.

Our handler needs to be carefully written in order not to mess with the registers: our function must not modify any register. That is the reason why it is written in assembly:

void handler(void) {
        /* syscall_proxy() is the "forbidden syscalls" handler */
        void (*syscall_proxy_addr)(void) = syscall_proxy;

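        asm("cmpl $4, %%eax\n"            /* __NR_write: allowed by SECCOMP */
            "je do_syscall\n"

            "cmpl $3, %%eax\n"            /* __NR_read: allowed by SECCOMP */
            "je do_syscall\n"

            "cmpl $0xfc, %%eax\n"         /* __NR_exit_group? */
            "jne wrapper\n"               /* anything else goes to the proxy */

            "movl $1, %%eax\n"            /* turn exit_group into __NR_exit */
            "jmp do_syscall\n"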

            "wrapper:\n"
            "                   call *%0\n"
            "                   jmp out\n"

            "do_syscall:\n"
            "                   call *%1\n"
            "out:               nop\n"

            : /* output */
            : "m" (syscall_proxy_addr),
              "m" (real_handler)); /* real_handler is the original
                                    * VDSO function, performing 
                                    * effectively the syscall 
                                    */
}

Each time the libc makes a syscall, we either perform the action directly or we call our “syscall proxy”. More on that later…
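
The decision taken by the assembly handler can be written down in a few lines of Python, which may be easier to read than the jumps. This is only a model of the logic above: demux() and the "kernel"/"proxy" labels are names I chose, and the numbers are the x86_32 syscall numbers used in the handler.

```python
NR_EXIT = 1
NR_READ = 3
NR_WRITE = 4
NR_EXIT_GROUP = 252   # 0xfc


def demux(nr):
    """Return where a syscall goes and under which number."""
    if nr in (NR_READ, NR_WRITE):
        return ("kernel", nr)        # do_syscall: allowed by SECCOMP
    if nr == NR_EXIT_GROUP:
        return ("kernel", NR_EXIT)   # rewritten to plain exit, then do_syscall
    return ("proxy", nr)             # wrapper: handled by syscall_proxy()


for nr in (NR_WRITE, NR_EXIT_GROUP, 5):
    print(nr, demux(nr))
```

Note that, exactly like the assembly, this sends everything except read, write and exit_group to the proxy.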

No More ASLR Bypass on Linux 2.6.30

While trying to exploit a local setuid application, I had the displeasure (as an attacker) of seeing that the Linux kernel’s ASLR has become more secure, killing a whole exploitation method. But let’s start from the beginning:
A minimal vulnerable example could be this vuln.c:
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main( int argc, char *argv[] )
{
        char buf[4];

        printf("%#p\n", &buf);
        strcpy( buf, argv[1] );
        return 0;
}
Because of Address Space Layout Randomization (ASLR), this bug is tough to exploit: if the binary is compiled with the right options and the kernel is configured to fully randomize the address space, it becomes impossible to guess where the buffer is, or where the library functions are located.
But there was a trick (first published by Jon Erickson in his book): the randomization is computed at exec*() time, and the seed used to generate the entropy was rekeyed every X milliseconds with the PID and the jiffies variable (the number of clock interrupts since boot). It was known to be cryptographically weak, but it was good enough for daemons: remotely, it’s not possible to guess either the PID or jiffies (except in case of a format string vulnerability or an information leak).
But locally, the entropy was just useless: a minimalistic process which just exec*()s another one would hand it the same memory layout, because both programs have the same PID and jiffies would not have been updated in between.
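The weakness is easy to picture with a toy model. The Python sketch below is not the kernel algorithm: toy_random_int() is a made-up stand-in that derives a "random" number from its inputs only, and the PID and jiffies values are arbitrary. It just shows that identical inputs give identical output, and that mixing in any fast-changing value (here a hypothetical extra counter) breaks the prediction.
```python
import hashlib


def toy_random_int(pid, jiffies, extra=0):
    """Toy random source: the output depends only on the arguments."""
    material = ("%d:%d:%d" % (pid, jiffies, extra)).encode()
    return int.from_bytes(hashlib.sha256(material).digest()[:4], "little")


# The wrapper and the program it exec*()s share the same PID, and
# jiffies has usually not ticked in between: same "random" value.
wrapper = toy_random_int(pid=4242, jiffies=1000000)
target = toy_random_int(pid=4242, jiffies=1000000)
print(wrapper == target)

# With a fast-changing input mixed in, the two values diverge.
later = toy_random_int(pid=4242, jiffies=1000000, extra=1)
print(wrapper == later)
```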
In practice, even on a fully randomized system, it was possible to guess the addresses. Here is a minimal program that just prints the address of its buffer and executes the vulnerable binary (which itself prints its buffer address):
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv) {
        char dummy[4] = "AAA";

        printf("%#p\n", dummy);
        execl("./vuln", dummy, NULL);
}
The following Python code, based on pexpect, runs the exploit repeatedly and computes the difference between the addresses printed by ./exploit and ./vuln:
#! /usr/bin/python

import pexpect

while True:
    child = pexpect.spawn('./exploit')
    child.sendeof()

    a=int(child.readline()[:-2], 16)
    b=int(child.readline()[:-2], 16)

    print 'offset=%#x' % (b-a)
    child.expect(pexpect.EOF)
Let’s do it:
lenny32:/tmp$ uname -a
Linux lenny32 2.6.26-2-686 #1 SMP Wed Aug 19 06:06:52 UTC 2009 i686 GNU/Linux
lenny32:/tmp$ cat /proc/sys/kernel/randomize_va_space
2
lenny32:/tmp$ ./guess_offset
offset=0x10
offset=0x148160
offset=0x10
offset=0x10
offset=0x10
offset=0x1bf9d0
offset=0x10
offset=0x1d91f0
offset=0x10
offset=-0x2ba2a0
offset=0x3d050
offset=0x10
offset=0x10
offset=0x10
offset=-0x19a990
offset=0x10
offset=0x10
offset=0x10
offset=0x10
offset=0x10
offset=0x10
offset=0x10
KeyboardInterrupt
Most of the time, the offset is equal to 0x10, great! But on a 2.6.32 kernel, the result is totally different:
$ ./guess_offset
offset=0x4fddb0
offset=0x69f330
offset=0x137e40
offset=0x6b49f0
offset=0x407600
offset=0x14cf50
offset=0x3f4930
offset=0x4d0f80
offset=0x107d20
offset=0x1969b0
offset=0x1ae360
offset=0x409b30
In other words, it’s now impossible to guess the address space layout with this method.
When was get_random_int(), the function in charge of the randomness, patched? Let’s use git-blame in order to annotate each source line with its modification date and commit:
% git blame -L 1688,1709 drivers/char/random.c
8a0a9bd4 DEFINE_PER_CPU(__u32 [4], get_random_int_hash);
^1da177e unsigned int get_random_int(void)
^1da177e {
8a0a9bd4  struct keydata *keyptr;
8a0a9bd4  __u32 *hash = get_cpu_var(get_random_int_hash);
8a0a9bd4  int ret;
8a0a9bd4 
8a0a9bd4  keyptr = get_keyptr();
26a9a418  hash[0] += current->pid + jiffies + get_cycles();
8a0a9bd4 
8a0a9bd4  ret = half_md4_transform(hash, keyptr->secret);
8a0a9bd4  put_cpu_var(get_random_int_hash);
8a0a9bd4 
8a0a9bd4  return ret;
^1da177e }
^1da177e 

Argh! It was patched in commit 8a0a9bd4 by Linus Torvalds in response to CVE-2009-3238 in May 2009. The first released kernel carrying this patch is 2.6.30, in June 2009.

Actually, I’m not aware of any generic trick to achieve the same goal (now that information leaks on /proc entries have been fixed too).

How System Calls Work on Recent Linux X86 Glibc

This post explains how system calls are implemented on recent Linux systems. It covers only the x86_32 platform, on a recent Linux kernel and GNU libc (where recent means “released after 2005”).

Processor facility for making syscall

On x86, userspace processes run in ring 3, while the kernel runs in ring 0. Only the kernel can act as the interface between resources and processes.
A resource can be a hardware device, a kernel object or any kind of IPC. In other words, each time a process needs to perform such an action, it has to make a request to the kernel. This is what we call a system call (syscall); basically, it is the transition from one ring to another.
Historically, on Linux and x86, the best-known method for performing a syscall is to raise an interrupt (the classic int $0x80 instruction), which is trapped by the kernel and then processed.
It was the most efficient way until the Pentium 4, where it became the slowest mechanism available. The best method became the sysenter/sysexit instructions on x86_32, which are used the same way as the interrupt. For instance, here is a simple call to _exit(42):
mov $1, %eax   ;; __NR_exit = 1
mov $42, %ebx  ;; status = 42
sysenter       ;; perform the syscall!
On AMD64, a similar mechanism exists: syscall/sysret, which is, by the way, known to be a better interface and to perform better than its Intel equivalent. Anyway.
Usually, except in shellcodes, syscalls are issued by the libc and, depending on the processor, using one mechanism or another can have a strong impact on performance: if the libc keeps using int $0x80 even on a modern CPU, performance will be bad.
The problem is that Linux distributions usually provide only one compiled version of the libc, which has to run equally well on all CPU versions (486, 586 or 686). Thus, there was a need for an abstraction layer, called by the libc, which would choose the best mechanism at runtime.
This is done by the kernel: it is compiled with all syscall mechanisms and selects the best one at boot time. Once a method is chosen, it exposes to userspace a function that directly invokes the selected method. The page exposed this way is called a Virtual Dynamic Shared Object, or VDSO.
On the other side, in the libc, making a system call is just a matter of calling a VDSO function, without knowing whether a historical interrupt or a sysenter will be used.
If we rewrite our previous snippet to make it use the VDSO:
movl $1, %eax   ;; __NR_exit = 1
movl $42, %ebx   ;; status   = 42
call *%gs:0x10  ;; Here, the offset (0x10) is platform-dependent
                ;; The pointer stored at %gs:0x10 targets a function in the VDSO

Virtual Dynamic Shared Object

A Virtual Dynamic Shared Object (VDSO) is a page maintained by the kernel and exposed to userspace by mapping it into the address space of each process. For instance:
$ cat /proc/self/maps
08048000-08051000 r-xp 00000000 fd:01 14450888   /bin/cat
08051000-08052000 rw-p 00009000 fd:01 14450888   /bin/cat
083d7000-083f8000 rw-p 00000000 00:00 0          [heap]
b7475000-b7633000 r--p 00000000 fd:01 592041     /usr/lib/locale/locale-archive
b7633000-b7634000 rw-p 00000000 00:00 0 
b7634000-b7775000 r-xp 00000000 fd:01 5769153    /lib/i686/cmov/libc-2.10.2.so
b7775000-b7777000 r--p 00141000 fd:01 5769153    /lib/i686/cmov/libc-2.10.2.so
b7777000-b7778000 rw-p 00143000 fd:01 5769153    /lib/i686/cmov/libc-2.10.2.so
b7778000-b777b000 rw-p 00000000 00:00 0 
b7794000-b7796000 rw-p 00000000 00:00 0 
b7796000-b7797000 r-xp 00000000 00:00 0          [vdso]
b7797000-b77b3000 r-xp 00000000 fd:01 2818106    /lib/ld-2.10.2.so
b77b3000-b77b4000 r--p 0001b000 fd:01 2818106    /lib/ld-2.10.2.so
b77b4000-b77b5000 rw-p 0001c000 fd:01 2818106    /lib/ld-2.10.2.so
bfafd000-bfb12000 rw-p 00000000 00:00 0          [stack]
Here, the VDSO is one page long (4096 bytes). It contains the syscall abstraction interface, but also some shared variables (low level information like rdtsc counter, real-time timer, stack canary, etc.)
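To check the size of a mapping without doing the hexadecimal subtraction by hand, a few lines of Python are enough. This is a small sketch: find_mapping() is a helper name of my own, and SAMPLE reuses three lines of the listing above instead of reading the live /proc/self/maps.
```python
def find_mapping(maps_text, name):
    """Return (start, size) of the first mapping called `name`, or None."""
    for line in maps_text.splitlines():
        fields = line.split()
        if fields and fields[-1] == name:
            start, end = (int(x, 16) for x in fields[0].split("-"))
            return start, end - start
    return None


SAMPLE = """\
b7794000-b7796000 rw-p 00000000 00:00 0
b7796000-b7797000 r-xp 00000000 00:00 0          [vdso]
b7797000-b77b3000 r-xp 00000000 fd:01 2818106    /lib/ld-2.10.2.so
"""

start, size = find_mapping(SAMPLE, "[vdso]")
print(hex(start), size)   # 0xb7796000 4096
```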
The selection of the right syscall method is done by the Linux kernel in arch/x86/vdso/vdso32-setup.c in the sysenter_setup function (which is called very early at kernel initialization by identify_boot_cpu()).
int __init sysenter_setup(void)
{
    void *syscall_page = (void *)get_zeroed_page(GFP_ATOMIC);
    const void *vsyscall;
    size_t vsyscall_len;

    vdso32_pages[0] = virt_to_page(syscall_page);

#ifdef CONFIG_X86_32
    gate_vma_init();
#endif

    if (vdso32_syscall()) {
        vsyscall = &vdso32_syscall_start;
        vsyscall_len = &vdso32_syscall_end - &vdso32_syscall_start;
    } else if (vdso32_sysenter()){
        vsyscall = &vdso32_sysenter_start;
        vsyscall_len = &vdso32_sysenter_end - &vdso32_sysenter_start;
    } else {
        vsyscall = &vdso32_int80_start;
        vsyscall_len = &vdso32_int80_end - &vdso32_int80_start;
    }

    memcpy(syscall_page, vsyscall, vsyscall_len);
    relocate_vdso(syscall_page);

    return 0;
}
The implementation of the sysenter method is in arch/x86/vdso/vdso32/sysenter.S. The routine called by the libc (with the call *%gs:0x10) is named __kernel_vsyscall:
__kernel_vsyscall:
.LSTART_vsyscall:
    push %ecx
.Lpush_ecx:
    push %edx
.Lpush_edx:
    push %ebp
.Lenter_kernel:
    movl %esp,%ebp
    sysenter
    /* 7: align return point with nop's to make disassembly easier */
    .space 7,0x90

    /* 14: System call restart point is here! (SYSENTER_RETURN-2) */
    jmp .Lenter_kernel
    /* 16: System call normal return point is here! */
VDSO32_SYSENTER_RETURN: /* Symbol used by sysenter.c via vdso32-syms.h */
    pop %ebp
.Lpop_ebp:
    pop %edx
.Lpop_edx:
    pop %ecx
.Lpop_ecx:
    ret
Linus Torvalds is the proud owner of this code because he managed to handle system call restarting thanks to a CPU particularity: when the kernel is done with a system call and wants to give control back to the process, it just has to execute the sysexit instruction.
Prior to that, the kernel tells the CPU that at sysexit it has to jump to a specific static address. This address is the VDSO32_SYSENTER_RETURN label seen in the previous routine.

New Blog, New Rules

Sometimes, I receive emails asking me to translate my papers or blog posts into English; each time, I procrastinate and never do it. My New Year’s resolution is to address this issue, which is why this blog switches to English now.

Of course, I will not translate previous posts: it would be so time-consuming that even just identifying which posts are interesting enough to be translated would take too long.

That’s why I will keep my previous posts available at this address: http://chdir.org/~nico/blog/posts/ (and also because they are indexed by search engines).

That is also the occasion to use a modern blogging engine again, like before my switch to ikiwiki, so I use the Blogspot services from now on.
Now that comments are back, feel free to nitpick :)