Tuesday, February 23, 2010

No more ASLR bypass on Linux 2.6.30

While trying to exploit a local setuid application, I had the displeasure (as an attacker) of discovering that ASLR in the Linux kernel has been hardened, which kills a whole exploitation method. But let's start from the beginning.
A minimal vulnerable example could be this vuln.c:
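#include <stdio.h>
#include <string.h>

int main( int argc, char *argv[] )
{
        char buf[4];

        /* leak the buffer address so that it can be compared between runs */
        printf("%p\n", (void *)buf);
        /* the bug: unbounded copy into a 4-byte stack buffer */
        strcpy( buf, argv[1] );
        return 0;
}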
Because of Address Space Layout Randomization (ASLR), this bug is tough to exploit: if the binary is compiled with the right options and the kernel is configured to fully randomize the address space, it becomes impossible to guess either the address of the buffer or the location of library functions.
But there was a trick (first published by Jon Erickson in his book). The randomization is computed at exec*() time, and the seed used to generate the entropy was rekeyed every X milliseconds and combined with the PID and the jiffies variable (the number of clock interrupts since boot). This was known to be cryptographically weak, but it was good enough for daemons: remotely, it is not possible to guess either the PID or jiffies (barring a format string vulnerability or another information leak).
Locally, however, the entropy was useless: a minimal process that just exec*()s another one got the same memory layout, because both programs have the same PID and jiffies has not had time to change.
In practice, even on a fully randomized system, it was possible to guess the addresses. Here is a minimal program that prints the address of its own buffer and then executes the vulnerable binary (which in turn prints its buffer address):
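#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv) {
        char dummy[4] = "AAA";

        printf("%p\n", (void *)dummy);
        fflush(stdout); /* do not lose buffered output across exec */
        /* dummy is passed as argv[0]; vuln prints its buffer address before it reads argv[1] */
        execl("./vuln", dummy, (char *)NULL);
        return 1; /* only reached if execl() fails */
}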
The following Python script, based on pexpect, runs the exploit repeatedly and computes the difference between the buffer addresses printed by ./exploit and ./vuln:
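#! /usr/bin/python

import pexpect

while True:
    child = pexpect.spawn('./exploit')
    child.sendeof()

    # first line: address printed by ./exploit, second line: address printed by ./vuln
    # ([:-2] strips the trailing "\r\n" added by the pty)
    a = int(child.readline()[:-2], 16)
    b = int(child.readline()[:-2], 16)

    print('offset=%#x' % (b - a))
    child.expect(pexpect.EOF)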
Let's do it:
lenny32:/tmp$ uname -a
Linux lenny32 2.6.26-2-686 #1 SMP Wed Aug 19 06:06:52 UTC 2009 i686 GNU/Linux
lenny32:/tmp$ cat /proc/sys/kernel/randomize_va_space
2
lenny32:/tmp$ ./guess_offset
offset=0x10
offset=0x148160
offset=0x10
offset=0x10
offset=0x10
offset=0x1bf9d0
offset=0x10
offset=0x1d91f0
offset=0x10
offset=-0x2ba2a0
offset=0x3d050
offset=0x10
offset=0x10
offset=0x10
offset=-0x19a990
offset=0x10
offset=0x10
offset=0x10
offset=0x10
offset=0x10
offset=0x10
offset=0x10
KeyboardInterrupt
In 16 of the 22 runs above, the offset is equal to 0x10, great! But on a 2.6.32 kernel, the result is totally different:
$ ./guess_offset
offset=0x4fddb0
offset=0x69f330
offset=0x137e40
offset=0x6b49f0
offset=0x407600
offset=0x14cf50
offset=0x3f4930
offset=0x4d0f80
offset=0x107d20
offset=0x1969b0
offset=0x1ae360
offset=0x409b30
In other words, it's now impossible to guess the address space layout with this method.
When was get_random_int(), the function in charge of the randomness, patched? Let's use git-blame to annotate each source line with the commit that last changed it:
% git blame -L 1688,1709 drivers/char/random.c
8a0a9bd4 DEFINE_PER_CPU(__u32 [4], get_random_int_hash);
^1da177e unsigned int get_random_int(void)
^1da177e {
8a0a9bd4  struct keydata *keyptr;
8a0a9bd4  __u32 *hash = get_cpu_var(get_random_int_hash);
8a0a9bd4  int ret;
8a0a9bd4 
8a0a9bd4  keyptr = get_keyptr();
26a9a418  hash[0] += current->pid + jiffies + get_cycles();
8a0a9bd4 
8a0a9bd4  ret = half_md4_transform(hash, keyptr->secret);
8a0a9bd4  put_cpu_var(get_random_int_hash);
8a0a9bd4 
8a0a9bd4  return ret;
^1da177e }
^1da177e 

Argh! It was patched in commit 8a0a9bd4 by Linus Torvalds in May 2009, in response to CVE-2009-3238. The first released kernel carrying this patch is 2.6.30, in June 2009.

At the moment, I'm not aware of any generic trick to achieve the same goal (now that the information leaks through /proc entries have been fixed too).

Saturday, February 20, 2010

How system calls work on recent Linux x86 glibc

This post explains how system calls are implemented on recent Linux systems. It covers only the x86_32 platform, with a recent Linux kernel and GNU libc (where recent means "released after 2005").

Processor facilities for making syscalls

On x86, userspace processes run in ring 3, while the kernel runs in ring 0. Only the kernel can act as the interface between resources and processes.
A resource can be a hardware device, a kernel object or any kind of IPC. In other words, each time a userspace application needs to access one of them, it has to make a request to the kernel. This is what we call a system call (syscall); basically, it is a transition from one ring to another.
Historically, on Linux and x86, the best-known method for performing a syscall is to raise an interrupt (the classic int $0x80 instruction), which is trapped by the kernel and then processed.
It was the most efficient way until the Pentium 4, where it became the slowest mechanism available. The best method on x86_32 became the sysenter/sysexit instructions, which can be used in the same way as the interrupt. For instance, here is a simple call to _exit(42):
mov $1, %eax   # __NR_exit = 1
mov $42, %ebx  # status = 42
sysenter       # perform the syscall!
On AMD64, a similar mechanism exists: syscall/sysret, which is, by the way, known to be a better interface and faster than its Intel equivalent. Anyway.
Apart from shellcodes, syscalls are usually issued by the libc and, depending on the processor, choosing one mechanism over another can have a strong impact on performance: if the libc keeps using int $0x80 even on a modern CPU, performance will be bad.
The problem is that Linux distributions usually provide only one compiled version of the libc, which has to run equally well on all CPU versions (486, 586 or 686). Thus, there was a need for an abstraction layer, called by the libc, which would choose the best mechanism at runtime.
This is done by the kernel: it is compiled with all syscall mechanisms and selects the best one at boot time. Once a method is chosen, it exposes to userspace a function that directly invokes the selected method. The page exposed this way is called a Virtual Dynamic Shared Object, or VDSO.
On the other side, in the libc, making a system call is just a matter of calling a function in the VDSO, without knowing whether it will raise the historical interrupt or execute sysenter.
Here is our previous snippet rewritten to use the VDSO:
movl $1, %eax   # __NR_exit = 1
movl $42, %ebx  # status   = 42
call *%gs:0x10  # Here, the offset (0x10) is platform-dependent
                # %gs:0x10 holds the address of the entry point located in the VDSO

Virtual Dynamic Shared Object

A Virtual Dynamic Shared Object (VDSO) is a page maintained by the kernel and exposed to userspace by mapping it into the address space of each process. For instance:
$ cat /proc/self/maps
08048000-08051000 r-xp 00000000 fd:01 14450888   /bin/cat
08051000-08052000 rw-p 00009000 fd:01 14450888   /bin/cat
083d7000-083f8000 rw-p 00000000 00:00 0          [heap]
b7475000-b7633000 r--p 00000000 fd:01 592041     /usr/lib/locale/locale-archive
b7633000-b7634000 rw-p 00000000 00:00 0 
b7634000-b7775000 r-xp 00000000 fd:01 5769153    /lib/i686/cmov/libc-2.10.2.so
b7775000-b7777000 r--p 00141000 fd:01 5769153    /lib/i686/cmov/libc-2.10.2.so
b7777000-b7778000 rw-p 00143000 fd:01 5769153    /lib/i686/cmov/libc-2.10.2.so
b7778000-b777b000 rw-p 00000000 00:00 0 
b7794000-b7796000 rw-p 00000000 00:00 0 
b7796000-b7797000 r-xp 00000000 00:00 0          [vdso]
b7797000-b77b3000 r-xp 00000000 fd:01 2818106    /lib/ld-2.10.2.so
b77b3000-b77b4000 r--p 0001b000 fd:01 2818106    /lib/ld-2.10.2.so
b77b4000-b77b5000 rw-p 0001c000 fd:01 2818106    /lib/ld-2.10.2.so
bfafd000-bfb12000 rw-p 00000000 00:00 0          [stack]
Here, the VDSO is one page long (4096 bytes). It contains the syscall abstraction interface, but also some shared variables (low-level information such as the rdtsc counter, the real-time timer, the stack canary, etc.).
The selection of the right syscall method is done by the Linux kernel in arch/x86/vdso/vdso32-setup.c, in the sysenter_setup() function (which is called very early during kernel initialization by identify_boot_cpu()):
int __init sysenter_setup(void)
{
    void *syscall_page = (void *)get_zeroed_page(GFP_ATOMIC);
    const void *vsyscall;
    size_t vsyscall_len;

    vdso32_pages[0] = virt_to_page(syscall_page);

#ifdef CONFIG_X86_32
    gate_vma_init();
#endif

    if (vdso32_syscall()) {
        vsyscall = &vdso32_syscall_start;
        vsyscall_len = &vdso32_syscall_end - &vdso32_syscall_start;
    } else if (vdso32_sysenter()){
        vsyscall = &vdso32_sysenter_start;
        vsyscall_len = &vdso32_sysenter_end - &vdso32_sysenter_start;
    } else {
        vsyscall = &vdso32_int80_start;
        vsyscall_len = &vdso32_int80_end - &vdso32_int80_start;
    }

    memcpy(syscall_page, vsyscall, vsyscall_len);
    relocate_vdso(syscall_page);

    return 0;
}
The implementation of the sysenter method is in arch/x86/vdso/vdso32/sysenter.S. The routine called by the libc (with the call *%gs:0x10) is named __kernel_vsyscall:
__kernel_vsyscall:
.LSTART_vsyscall:
    push %ecx
.Lpush_ecx:
    push %edx
.Lpush_edx:
    push %ebp
.Lenter_kernel:
    movl %esp,%ebp
    sysenter
    /* 7: align return point with nop's to make disassembly easier */
    .space 7,0x90

    /* 14: System call restart point is here! (SYSENTER_RETURN-2) */
    jmp .Lenter_kernel
    /* 16: System call normal return point is here! */
VDSO32_SYSENTER_RETURN: /* Symbol used by sysenter.c via vdso32-syms.h */
    pop %ebp
.Lpop_ebp:
    pop %edx
.Lpop_edx:
    pop %ecx
.Lpop_ecx:
    ret
Linus Torvalds is the proud owner of this code because he managed to handle system call restarting thanks to a CPU particularity: when the kernel is done with a system call and wants to give control back to the process, it just has to execute the sysexit instruction.
Before doing so, the kernel tells the CPU which fixed address to jump to at sysexit. This address is the VDSO32_SYSENTER_RETURN label seen in the previous routine. To restart a system call, the kernel returns two bytes earlier instead, on the jmp .Lenter_kernel instruction, which executes sysenter again.

New blog, new rules

Sometimes I receive emails asking me to translate my papers or blog posts into English; each time, I procrastinate and never do it. My resolution for this year is to address this issue, which is why this blog switches to English now.

Of course, I will not translate previous posts: it would be so time-consuming that even identifying which posts are interesting enough to be translated would take too long.

That's why I will keep my previous posts available at this address: http://chdir.org/~nico/blog/posts/ (and also because they are indexed by search engines).

This is also an opportunity to use a modern blogging engine again, as I did before my switch to ikiwiki, so I am using Blogspot from now on.
Now that comments are back, feel free to nitpick :)