
Linux Security in 2011, or My LKML’s Yearly Digest

Disclaimer: I have nothing to do with the following; all credit goes to the respective authors. I am just publishing my 2011 bookmarks about Linux kernel security, each with a one-line summary based on my (possibly wrong) understanding.
Do not hesitate to correct me (gently if possible :)) in the comments or by mail.

  • False boundaries of (certain) capabilities: Brad Spengler describes 19 capabilities (out of 35) that can be used to regain full privileges. Coincidentally, Vasily Kulikov discovered a “funny” behavior of CAP_NET_ADMIN which permitted loading any module available in /lib/modules/ instead of being limited to network-related modules. AFAIK, this hole was closed, but the fix was reverted a few weeks later because of some userspace breakage.
  • The PaX team introduced a range of new features built on the new GCC plugin infrastructure. At compile time, proactive code is automatically added to potentially dangerous paths:
    • constify_plugin.c introduces new constraints (__do_const and __no_const) enforcing read-only permissions at compile time and at run time. PaX then makes use of these new constraints by patching most of the “ops structures”. The plugin also automatically protects structures whose members are all function pointers; this on-the-fly patching is required because patching the kernel source directly would never be accepted upstream.
    • stackleak_plugin.c adds instrumentation code before alloca() calls. This code checks that the stack frame does not grow beyond the kernel task’s stack. It defeats the techniques described in “Large memory management vulnerabilities” by Gaël Delalleau (2005) and “The stack is back” by Jon Oberheide (2012).
    • GCC 4.6 introduced named address spaces. The feature was initially specified for embedded processors, but the PaX team uses it to represent user and kernel space. checker_plugin.c thus implements the __user, __kernel and __iomem address spaces to spot illegitimate flows between them.
    • kallocstat_plugin.c produces statistics about the sizes passed to the various memory allocation functions.
    • kernexec_plugin.c enforces non-executable kernel pages like the KERNEXEC PaX feature, but without a huge performance impact on AMD64.
  • pageexec also managed to compile the Linux kernel with clang by patching both Linux and clang. Now that GCC has gained plugin support this is less interesting, but LLVM was then the only compiler with easy access to its internal structures, allowing external applications to perform static analysis…
  • A userspace interface to the kernel Crypto API was submitted to kernel developers. An interesting use case was a way to isolate key material between processes: imagine process A in possession of the private keys and another one, B, actually performing the encryption/decryption. The idea was to initialize a “crypto socket” in A and pass its file descriptor to B (via a classic ancillary message).
  • Pseudo-files in /proc/<pid>/ have a different security model than “normal” files because of their ephemeral nature: checks need to happen during each system call and not at open() time, because permissions can change at any time. Halfdog discovered (and Kees Cook reported to LKML) that not all files were protected accordingly: if a program opens /proc/self/auxv and keeps the file descriptor open, then even after an execve() of a setuid binary the file descriptor remains available, leaking information! Fixing this vulnerability has been a long road, and an elegant solution came up with the introduction of revoke(), a new syscall invalidating file descriptors. Unfortunately, the thread didn’t survive and the ideas were lost… (by the way, it is funny that this kind of problem resurfaced lately in CVE-2012-0056…)
  • Over time, execve() has become almost magical: it has to support Set-User-Id, capabilities and file capabilities. Each feature added complexity and a different legacy behavior to maintain. Instead of dropping these POSIX features, Openwall (Owl) 3.0 took a different approach by removing all SUID binaries from its base install, thus preventing execve()’s voodoo. This change is just one line in Owl’s changelog but is in fact a major achievement: it required re-architecting important software like crontab and the user management tools.
    /bin/ping is setuid-root because it opens a raw socket and injects its packets directly on the wire. A new socket type, PROT_ICMP, was developed by the Openwall team: it makes it possible to send ICMP Echo messages without special privileges (the caller’s GID has to be within a range stored in a sysctl key). It is interesting to note that only the matching replies (based on the ICMP identifier field) are delivered to userspace, not the whole ICMP traffic as on Mac OS X.
  • The TCP Initial Sequence Number is now a 32-bit random number generated using MD5. The ISN was previously the concatenation of 24 random bits (an MD4 hash of the TCP endpoints with a secret rekeyed every 5 minutes) and an 8-bit counter (the number of times the secret key had been regenerated).
  • Vasiliy tried to push upstream additional checks for copy_{to,from}_user() (checking that the requested size fits within boundaries fixed at compile time). This patch was a cut-down version of PAX_USERCOPY but was NACKed by Linus, who asked him for more “balance and sanity”. However, he didn’t reject the idea itself, saying that a cleaner version might be accepted…

ld-linux.so ELF Hooker

Stéphane and I are releasing a new tool that injects code at runtime, right between the ELF loader and the target binary. It is an alternative to LD_PRELOAD: a little more intrusive, but 100% reliable :)

When a binary is execve()’d, the kernel extracts from the ELF headers the interpreter to be launched, usually /lib/ld-linux.so.2. The kernel creates a new process and prepares the environment (arguments and auxiliary data). The target ELF entry point is stored in the auxiliary vector entry of type AT_ENTRY.

The kernel then opens the requested interpreter, maps its memory regions and starts execution at ld’s ELF entry point. The loader analyzes the target ELF file, performs its loading work and sets EIP to the target ELF entry point (extracted from auxv). At this point, the program’s main() is eventually executed.

Our goal was to permit the execution of code for arbitrary dynamically linked binaries without patching each of them. So our interest moved to the loader, the common point between most executables. Thus, we decided to patch a normal ld in order to inject code. My awesome colleague Stéphane Duverger (the ramooflax author!) and I wrote ld-shatner. Its task is to patch the ld-linux.so file accordingly:

  1. After the ELF header, we shift the “ELF program header” a few pages away
  2. In this new section, we inject a “loader routine” (hooked.s) and the embedded code to be executed at runtime
  3. After saving ld’s ELF entry point in our section, we overwrite it to jump directly to our routine. This routine extracts the target ELF entry point from the auxiliary vectors and overwrites it with a pointer to our embedded code (func() in the payload).
  4. The original ld entry point is called and ld works as usual
  5. Eventually, ld calls the entry point set in the auxiliary vector (which was replaced by a pointer to our payload)
  6. The embedded code runs
  7. It returns to our routine, which finally jumps to the original target entry point
Some pictures before/after ld-shatner voodoo:

ld-shatner voodoo


$ make clean all
$ cp /lib/ld-linux.so.2 /bin/ls .
$ ./ld-shatner ld-linux.so.2 obj.elf
$ sudo cp ld-hook.so /lib/
$ ./interpatch ls
$ ./ls 
ld-hook <---------------------- output of obj.elf

(OK, we cheat for the moment because we have to patch the ls binary, but eventually we will not have to do that)

So what?

My ultimate goal for ld-shatner is to use this method for starting applications in my sandbox project, seccomp-nurse. For the moment I rely on the LD_PRELOAD feature, but this approach is… hackish, and I have to work around some bugs because of this special context…

Introducing a Bit of Web Paranoia Into My Habits…

When I’m not slacking in Emacs, I now spend most of my time in Google Chrome. Almost everything I do is in the “cloud” (I hate this buzz word): mail, blog, chats, voip and even version control.

With the explosion of “social buttons” everywhere, I have become much more paranoid about my privacy. And when I see the new Facebook “frictionless sharing” feature, I don’t regret my move. What did I do? Simple: I just use dedicated browser profiles for each task:

  • The most sensitive: the one I use only for my mail account and nothing else. I am even considering the clever proxy hacks mentioned by Chris Evans to only authorize outbound connections to my mail provider. I haven’t done it yet because it would prevent me from reading HTML mails linking to external images (OK, this is not a big loss and a potential privacy issue, but it is useful sometimes). This is a dedicated profile because if you have access to mails, you have access to every website (i.e. “I lost my password”)
  • Then there is my main profile (used for Google Reader, Google+, Twitter and Facebook). My biggest fear is to be tracked because of social buttons or because I clicked a link somewhere. So I changed my habits: instead of clicking, I drag and drop interesting pages into my sandbox profile
  • The sandbox profile is where I do searches, browse web pages, etc. It is configured to never send anything, nor to store information on disk. I never use this profile to log in to a website; if I have to do that, I go back to the main profile.
To do this efficiently, when I boot I spawn these browsers with specific profile directories (using the --user-data-dir Chrome option) and they are never closed. My window manager is configured to display the sandbox and main profiles side by side on the same workspace in order to switch rapidly.

For each profile, I use these Chrome extensions:
This setup works really well for me; I have been using it for more than 6 months now and it’s cool :)

The next step is to use dedicated UIDs for each profile. I haven’t done it yet because there is no “perfect solution”, due to the Xorg design: any X11 client can mess with the other X11 clients…

Net2pcap Revival

net2pcap is a packet capture tool written by Philippe Biondi back in 2003. It was designed to be as secure as possible in order to run in hostile environments. To that end, its code is minimalist, without any complicated features; the result is 406 lines of simple C. On top of its security, it is also the most reliable tool I have ever used on high-traffic links with regard to packet loss; even dumpcap does not perform better.

Unfortunately, feature requests and bugs were lost among hundreds of spams in Phil’s bug tracker. So that patches do not get lost, I have set up a net2pcap repository on GitHub. This is not a fork: it is still maintained in collaboration with Phil; it is just a way to relieve him of the maintenance burden.

For those interested in the project, the following patches were already applied:

  • Privileges drop
  • Chroot
  • Compatibility with 64-bit architectures
  • Large file support on x86_32
If you have any feature request or bug report, feel free to submit a ticket!

HOWTO Authenticate Ssh Server Through Certificates

In August 2010, OpenSSH 5.6 added support for certificate authentication (release notes); unfortunately, no real documentation exists at the moment (you are on your own with sshd_config(5), ssh-keygen(1) and ssh_config(5), good luck with that). This is surprising because this feature is awesome for system administrators, even for small deployments.

Certificates allow you to sign user or host keys. In other words:
  • Thanks to a single file (the CA certificate) on the server, it can accept any (signed) user key transparently
  • If every server’s host key is signed, clients only need to carry the CA to authenticate every server on your network, which means no more “The authenticity of host foobar can’t be established. Fingerprint is…” messages
Here is the HOWTO for the latter case.

Geek summary: Sign SSHd host key

$ ssh-keygen -f ~/.ssh/cert_signer
$ scp foobar.example.org:/etc/ssh/ssh_host_rsa_key.pub foobar.pub
$ ssh-keygen -h                             \ # sign host key
             -s ~/.ssh/cert_signer          \ # CA key
             -I foobar                      \ # Key identifier
             -V +1w                         \ # Valid only 1 week
             -n foobar,foobar.example.org   \ # Valid hostnames
             foobar.pub                       # Host pubkey file
$ scp foobar-cert.pub foobar.example.org:/etc/ssh/ssh_host_rsa_key-cert.pub

On foobar.example.org, add “HostCertificate /etc/ssh/ssh_host_rsa_key-cert.pub” in sshd_config and reload sshd. Now, configure the ssh client to use this authority:

$ (  echo -n '@cert-authority * '; \
     cat ~/.ssh/cert_signer.pub ) > ~/.ssh/known_hosts_cert
$ ssh -oUserKnownHostsFile=~/.ssh/known_hosts_cert foobar.example.org

At this point, you can connect to every server without any annoying messages. You don’t even have to care when a server is replaced without keeping its old ssh keys.

No-release of Seccomp-nurse

This post in a nutshell
This has been a draft since my presentation at Ekoparty; I will force myself not to procrastinate this time. This post announces the no-release of seccomp-nurse (it is not a release because it is still an advanced proof of concept). Quick links:

seccomp-nurse is a generic sandbox environment for Linux which doesn’t require any recompilation. Its purpose is to run legit applications in a hostile environment; I repeat, it is not designed to run malicious binaries.

How does it work? The following figure describes the architecture of seccomp-nurse. You can see two processes: one running the untrusted code, and a trusted one. The trusted process is in charge of intercepting syscalls and checking whether the action is allowed.

How do we intercept syscalls? By using an x86_32 hack. If you remember my previous post, I described how the GNU libc executes syscalls: by making an indirect call into the VDSO. seccomp-nurse overrides this page so that our own function is called instead of performing the syscall. Our handler retrieves the CPU registers and sends them directly to the trusted process through a socket. The trusted process then checks its policy engine: “can this process open this file?”

If the action is allowed, how do we execute it? SECCOMP only permits 4 syscalls, so what can we do? Well, the SECCOMP flag is limited to thread scope: if a process has two threads, one can be sandboxed (we call it the untrustee) while the other (the trustee) is free to do whatever it wants. Furthermore, since threads share everything, any action done in one thread has an impact on the other. This is pretty cool! But so dangerous!

Indeed, everything is shared; only the CPU registers are not shared between threads, that’s all! The trustee must consider its environment hostile: its code must not make any memory access, only registers can be used. That’s why this part is written in assembly, in order to control every instruction. It has been designed to be as simple as possible because it is the keystone of the sandbox: the security of the whole system relies on it.

This routine is completely dumb and has no intelligence at all; everything is done in the trusted process. The trustee understands only these commands:
  • Execute this syscall
  • Raise a SIGTRAP (for debugging purpose)
  • Native exit
  • Poke/Peek memory
How is information exchanged between the two processes? Through POSIX shared memory, mapped read-only in the untrusted process. That way, when the trusted process wants to delegate a syscall, it writes the values of all the registers into this shared memory and notifies the trustee to execute it. With this mechanism there is no race condition: all syscall arguments are copied, so they cannot be modified after the policy check.

Limitations: Because of the way we intercept syscalls, we can only run dynamically linked 32-bit binaries using the GNU libc. It is hoped that the situation will improve greatly in the following weeks… Stay tuned!

Performance: Ahem. I don’t know. Each time the untrustee makes a syscall, our sandbox does a lot of back and forth between the two processes (one round trip = at least one read and one write).

Linux Security, One Year Later…

This post (tries to) describe what happened in 2010 in GNU/Linux security. What this post is not is a long list of vulnerabilities; there are people doing that way better than me.

The first part of this post is dedicated to new vulnerability classes, while the second focuses on the defensive side, analyzing improvements made to the Linux kernel. Before closing, some selected quotes will be presented, pointing the finger at some of the Linux failures.

This post being (very) long and syndicated by a few “planets”, I will cut it on my feed, even if I know that a lot of people dislike this behavior.

Yang: New attacks, new vulnerability classes

Thanks to the generalization of userspace hardening in common Linux distributions (packages compiled with most of the protection options like stack-protector, PIE and FORTIFY_SOURCE, plus the writing of SELinux rules), vulnerability researchers had to find a softer target: the kernel.

In 2009, Tavis Ormandy and Julien Tinnes made a lot of noise with their NULL pointer dereference vulnerabilities.
Proactive measures were developed to mitigate this kind of bug, but the cat-and-mouse game never stopped and these protections keep being bypassed.

Bypassing of mmap_min_addr

Let’s recall that this protection consists of denying the allocation of memory pages below a limit called mmap_min_addr (/proc/sys/vm/mmap_min_addr). It thus prevents an attacker from dropping his shellcode at address 0-or-thereabouts and then triggering the NULL pointer dereference.

A lot of methods were found in 2009 to bypass this restriction (Update: as pointed out by Dan Rosenberg, the first one is not a mmap_min_addr bypass at all), whereas this year was less fruitful, with only two techniques (well, one, given the update):

  • Bug #1: Disabling the frontier: The kernel has to validate each user-provided pointer to check whether it comes from user or kernel space. This is done by access_ok() with a simple comparison of the address against a limit.
    Sometimes the kernel needs to use functions normally designed to be called from userspace, and as such these functions check the provenance of the pointer… which is embarrassing because the kernel only provides kernel pointers.
    So the kernel cheats by moving the boundary via set_fs() in order to make access_ok() always succeed. From that moment and until the kernel undoes its boundary manipulation, there is no protection left against malicious pointers provided by userland.
    Nelson Elhage found a brilliant way to get root: he triggers an assertion failure (via a BUG() or an Oops) that makes the kernel terminate the process with the do_exit() function. One Linux feature is the ability to notify the parent when one of its threads dies, and the notification mechanism is as simple as writing a zero at a given address.
    Normally, of course, this address is checked to be inside the parent’s address space, but if do_exit() is triggered in a context where the boundary was faked, access_ok(ptr) will always return true.
    This is what Nelson did: he registered a pointer belonging to kernel space for the notification, then triggered a NULL pointer dereference to enter such a “temporary” context. Boom!
  • Bug #2: Memory mapping: Tavis Ormandy discovered that when a process is instantiated, a carefully crafted ELF binary can cause the VDSO page to be mapped one page below mmap_min_addr. This is particularly interesting on Red Hat Enterprise Linux’s kernel because it is configured with mmap_min_addr equal to 4096 (PAGE_SIZE).
    In other words, the VDSO page can be mapped at addresses 0 to 4096; in theory, that means it could be used to “bounce” from a NULL pointer dereference.

Then, at the end of 2010, came the rediscovery of the impact of uninitialized variables, this time in the kernel.

Uninitialized kernel variables

A typical vulnerable code looks like the following:

struct { short a; char b; int c; } s;

s.a = X;
s.b = Y;
s.c = Z;

copy_to_user(to, &s, sizeof s);

The problem here is that we pay no attention to the padding byte added by the compiler between .b and .c, which is needed to align structure members on a CPU word.

The direct consequence in the kernel case is that copy_to_user() obviously copies the structure as a whole, padding included, and not “member by member”.
The user process can thus read the value of this uninitialized byte, which can be totally useless, or as sensitive as a key fragment.

The obvious fix?

The fix seems relatively simple: add a preliminary memset(&s, '\0', sizeof s). But it is not that trivial, because C99 leaves the compiler free to optimize the following cases:

  • Consider the memset() superfluous, since each structure member is assigned later, and remove it.
  • Later, overwrite the padding byte when .b is assigned: C99 does not protect this byte in any way, so if the compiler can optimize its code by doing a mov [ptr], eax instead of a mov [ptr], ax, it is free to do so.

Furthermore, this memset-ification can be troublesome in fast paths, as in the BPF filtering engine: netdev developers considered the array initialization too expensive to add (even though it is as small as 16*4 bytes).
Instead, they had to write a “BPF checker” validating the legitimacy of instructions accessing the array.

Impact of uninitialized variables

This kind of bug had already been demonstrated to be dangerous in userland, and it is even worse in kernel land!
However, motivating kernel developers to fix these issues was not easy. For instance, the netdev maintainer’s scepticism led Dan Rosenberg to answer with the blistering publication of an exploit on full-disclosure. A few days later, he admitted having published this exploit because the impact of this particular vulnerability had been called into doubt.

But this remains anecdotal (doesn’t it?), and kernel developers actively contributed to fixing dozens of occurrences of this kind of bug.

Kernel stack expansion

In 2005, Gaël Delalleau already discussed how interesting it was to make the stack and the heap collide in user land. In November 2010, Nelson Elhage, Ksplice founder, found a variant, but for the kernel this time.

The memory allocated to a kernel task is minimal: it cannot use more than two physical pages for its local variables (its stack). But this is merely a convention, given that there is no enforcement against abnormal expansion, such as a guard page.
Next to the task’s stack (so after the “two pages”) lies its thread_info structure, a critical element containing data and function pointers… which would be really interesting to overwrite!
To get there, you have to find a task whose stack usage you can control, such as an array whose size is somehow user-controlled. Eventually, this expansion will cross the two-page limit and offer you a way to overwrite some values in the thread_info structure. A concrete exploitation of this flaw overwrites one of the function pointers to redirect execution to shellcode.

Yin: New protections

Bug fixes

This year will not be the one where Linus’ mentality towards security bugs changes, but we make up for it thanks to the efforts of the security teams of various Linux distributions (Red Hat, SuSE and Ubuntu mainly).

It seems that they closely follow the kernel mailing lists looking for sensitive commits with a security impact. For each report, a CVE number is assigned, the kind of thing soooo useful for an admin because it permits some traceability and lets us know (more or less) how full of holes our servers are :)
Eugene Teo maintains an atypical git repository which tags every CVE. This is particularly useful in audits for quickly identifying the vulnerabilities affecting a given version. It is somewhat the whitehat equivalent of the kernel exploit lists used by hackers.

Proactive security

A lot of contributions were made to the kernel to improve its security proactively. These works try to make kernel exploitation more cumbersome because, frankly, we have to admit that the relative ease of exploiting a NULL pointer dereference is embarrassing :)

For instance, to understand the value of this kind of proactive measure, let’s look back at Nelson’s vulnerabilities: to be successful, Dan’s exploit had to combine three vulnerabilities to transform a denial of service into a privilege escalation.

This defense in depth shows how expensive exploiting a given vulnerability becomes. This is what we keep saying: there will always be a vulnerability somewhere in our systems, so our only option is to make its exploitation insanely hard.

But let’s see what these proactive measures are…

Permission hardening

Brad Spengler, author of grsecurity, has long been vocal about the fact that too much information is leaked to userland. Consequently, grsec includes a lot of restrictions to prevent these information leaks. But what are we talking about?

The /proc, /sys and /debug pseudo-filesystems contain files revealing kernel addresses, statistics, memory mappings, etc.
Outside of a debugging session, this information is totally useless. Nevertheless, most of these files are world-readable by default. This is a godsend if you are an attacker: no need to brute-force kernel addresses (and we know that brute-forcing this kind of thing in kernel land is never a good idea)!

Dan Rosenberg and Kees Cook (of the Ubuntu security team) worked hard to merge these restrictions into the official upstream tree:

  • dmesg_restrict: access to the kernel log buffer (used by dmesg(8)) now requires the CAP_SYS_ADMIN capability.
  • Removal of addresses in /proc/timer_list, /proc/kallsyms, etc. Upstream developers tried hard not to merge these patches, considering them useless (addresses are also readable in /boot/System.map) and, above all, arguing that they would greatly complicate the work of maintainers reading bug reports. That is why the netdev maintainer clearly NAKed this kind of patch. The zen and patience of Dan Rosenberg have to be highlighted here!
    Alternatives were suggested by both parties:
    • Since merely removing addresses from /proc files would break the ABI and thus a lot of scripts, it was proposed to replace them with a dummy value (0x000000) if the reader is unprivileged.
    • Changing the access permissions of these files: this “simple” change had a nasty effect on an ancient version of klogd, causing the machine to not boot anymore. This unfortunately led to the revert of the patch: never break userspace!
    • XOR displayed addresses with a secret value.
    • Etc.

The solution “retained” (there is never a formal “yes, this is it”; you have to write the code and then it gets discussed…) is the first one: replacing addresses with arbitrary values if the reader is not privileged.
However, in order to prevent code duplication, the special format specifier %pK was added to printk(). Depending on the kptr_restrict sysctl, this specifier censors the pointers it prints.

A new capability, CAP_SYSLOG, was also created for the occasion.

A lot of work is still needed, however. For example, thanks to his new fuzzer, Dave Jones discovered that the ACPI table loader was world-writable: anybody could load a new ACPI table if debugfs was mounted, oops :)

Marking kernel memory read only

Currently, the Linux kernel does not use all the possibilities offered by the processor for its own memory management: read-only segments are not really marked as such internally. Things could be improved to match what is now done in user space: data shall not be executable, code shall be read-only, etc.

This is still a work in progress, but developers are trying to remediate these issues. To succeed, a few actions are needed:

  • Really use hardware permissions for the .rodata segment, because for the moment the permissions of this segment are purely virtual despite the “ro” in its name.
  • Function pointers that are never modified shall be marked const whenever possible. Indeed, one of the simplest methods to exploit a kernel vulnerability is to overwrite a function pointer to jump into an attacker-controlled area.
    Once a variable is marked const, it is moved into the previously seen .rodata (you can guess that this move is only useful if the zone is really read-only in hardware). Of course, it will not be possible to const-ify every function pointer; there will still be room for an attacker, but that is not a reason to do nothing…
  • Disable some entry points leading to set_kernel_text_rw() (the kernel equivalent of mprotect()) so as not to let an attacker change the permissions back after all.

A priori, developers do not seem opposed to this patch; they would even be happy to merge it in order to optimize virtualized guests.

Disabling module auto-loading

Most vulnerabilities target code paths that are barely used. This could, by the way, be the reason why these bugs are still being found.

Linux distributions have no option other than compiling every feature and driver in order to ship a unique, universal kernel. To avoid bloating memory, this is done via modules, with a way to load them on demand.

This auto-loading feature is particularly interesting for attackers: they just have to request an X.25 socket to have the associated module loaded, ready to be exploited.

Dan Rosenberg (again!) proposed to automatically load modules only if the triggering process is privileged. Even though this restriction has long been part of the grsecurity patches, the “feature” was considered too dangerous for distributions and was NAKed to prevent any breakage :-/

UDEREF support for AMD64 (finally)

PaX developers have always been clear: AMD64 Linux systems will never be as secure as their i386 cousins, because of the lack of segmentation.

However, they did their best to implement UDEREF anyway.

As a reminder, UDEREF prevents the kernel from using memory owned by userland without stating it explicitly. This feature offers protection against NULL pointer dereference bugs.

On i386, this is easily done using segmentation logic. But on AMD64 it remains a (dirty) hack: the user-space zone is moved somewhere else and its permissions are changed.

The problem is that this just shifts the issue: instead of dereferencing a NULL pointer, the attacker now has to influence the kernel into dereferencing another address. But as pageexec said, if we are at this point, this should be the least of our concerns :)
As if this wasn’t enough, this hack “wastes” 5 bits of addressing (leaving 42 bits for the process) and some bits of ASLR along the way…
The icing on the cake is that performance is impacted on each user-to-kernel and kernel-to-user transition because of the TLB flush.

Network security?

Network security is not really “sexy” enough to receive the same level of contributions in the Linux kernel, maybe because researchers prefer to work on offensive things.
Besides the Netfilter rewrite (called nftables) started last year, not many things happened. One of the few remarkable ones was the implementation of TCP Cookie Transactions and improvements to the “old” syncookies.

When a system is overloaded, TCP syncookies are used to avoid storing state until the connection is really opened. This “old-school” protection was designed to resist SYN flood attacks. Nowadays it is almost pointless, since today’s DoS attacks saturate the network bandwidth rather than kernel memory.
Anyway, this is not a reason to do nothing :)

Previously, SYNcookies were considered a "last resort" mechanism because the TCP options carried by the first SYN packet (congestion bit, window scaling or selective acknowledgement) were lost, the kernel not saving them.

This is no longer true: when replying, the kernel now encodes this information into the 9 lower bits of the TCP Timestamp option of the SYN-ACK.
This means syncookies no longer hurt performance and can be used safely, despite what the tcp(7) manpage says (a bug was submitted to update the description).

Kernel confessions

While reading lists, I came across some interesting confessions:

The capabilities drama :
Quite frankly, the Linux capability system is largely a mess, with big bundled capacities that don’t make much sense and are hideously inconvenient with the capability system used in user space (groups).
Too many patches to review for the -stable branch :
> > I realise it wasn’t ready for stable as Linus only pulled it in
> > 2.6.37-rc3, but surely that means this neither of the changes
> > should have gone into
> Why didn’t you respond to the review??

I don’t actually read those review emails, there are too many of them.


A lot of good things happened in the Linux kernel last year thanks to the people cited in this post. Moreover, it is interesting to see that most of these features were written by security researchers rather than "upstream kernel developers" (except Ingo Molnar, who showed a lot of good will every time).
This may explain why every merged patch was the fruit of never-ending threads (we can applaud the authors' patience)…
It is only now that I start to understand how right Brad Spengler was when he declared war on LSM. Shouldn't "Security" subsystem maintainers leave their ivory tower and start understanding the real life of a sysadmin? The kind of guy who has no time to update every server to the latest git version, nor to write SELinux policies which, by the way, would be useless once a kernel vulnerability is found.
Anyway, this is only the opinion of a guy involved in the security circus

However, we can still be happy to see these changes finally merged. And with some luck, we can hope that someday mmap_min_addr will no longer be bypassable… And that proactive features will force researchers to combine multiple vulnerabilities to exploit one flaw.
I am not saying there will be no more bugs, perish the thought, but I hope the exploitation cost will become so high that only a tiny fraction of attackers will be able to afford it.
At that point, security researchers will have to dive into "logic bugs", like Taviso's LD_PRELOAD/LD_AUDIT vulnerabilities, which bypassed most available hardening protections.

Linux Security, One Year Later…

Note for English folks: this post was originally written in French; a translated and updated version is available here.

More than a long list of vulnerabilities, this post aims to describe what happened in 2010 in the GNU/Linux security ecosystem.

The first part is dedicated to new vulnerability classes. The second part focuses on defense, analyzing the various improvements aiming to harden our systems. Finally, to close this post, there are a few rather revealing quotes from kernel developers.

Since this post is rather long and I am syndicated on several "planets", I prefer to cut it; sorry Sid :)

Yang: New vulnerability classes

Thanks to the popularization of the various userspace protection mechanisms in mainstream distributions (packages built with the various hardening options: stack-protector, PIE, FORTIFY_SOURCE; SELinux access rules being written), vulnerability researchers had to find a new, more welcoming playground: the kernel. With the demonstrations of Tavis Ormandy and Julien Tinnes, 2009 had been the year of NULL pointer dereference vulnerabilities. Proactive features were developed to mitigate the impact of this kind of bug, but the cat-and-mouse game never stopped, and new ways of bypassing these protections kept being found.

Bypassing mmap_min_addr

As a reminder, the kernel's main protection against this class of vulnerability is to forbid allocating a memory page if its virtual address is below mmap_min_addr (/proc/sys/vm/mmap_min_addr), so that an attacker cannot place shellcode there and then trigger a NULL pointer dereference.

Many ways of bypassing this check had been found in 2009, yet two more bypass methods were published this year:

  • When the kernel uses pointers manipulated by userland, it checks that they really point from/to a user area. It is the role of access_ok() to verify that an address lies below the userspace/kernelspace boundary.
    From time to time, the kernel calls functions normally dedicated to userspace. Since these functions check that the addresses they manipulate are indeed in userland, this is inconvenient for the kernel when it wants to use them for itself (with kernelspace addresses).
    To get around this check, the kernel moves the "boundary" with set_fs() before the call and restores it on return, as if nothing happened. This means that temporarily, during the function's execution, no check is performed.
    Nelson Elhage brilliantly found how to exploit this peculiarity: when the kernel handles a kernel oops or a BUG(), it terminates the process that generated the exception with do_exit(). This function may notify other threads of the process's death by writing 0 at an arbitrary address controlled by access_ok().
    The exploit thus consists in triggering an exception while a function is running with access_ok() effectively disabled. When the exception fires, do_exit() is called and, since the check is disabled, the value 0 gets written at an arbitrary address. Boom! That is the first method.
  • Now for the second method. Tavis Ormandy noticed that when memory mappings are created, the VDSO could be mapped one page below mmap_min_addr, which is particularly interesting on Red Hat kernels since mmap_min_addr == 4096.
    In theory, this means a NULL pointer dereference exploit would have to bounce through the VDSO bytes.

At the end of 2010 came the rediscovery of uninitialized variable problems, but this time inside the kernel.

Uninitialized variables

Typical vulnerable code looks like this:

struct { short a; char b; int c; } s;

s.a = X;
s.b = Y;
s.c = Z;

copy_to_user(to, &s, sizeof s);

The problem here is that nobody pays attention to the padding byte added by the compiler between .b and .c to align the structure on a processor word. In practice, this means the userspace process can retrieve one byte of "random" kernel memory.


The fix might seem simple enough: add a memset(&s, '\0', sizeof s). However, things are not that easy since, according to the C99 standard, the compiler is free to perform the following optimizations:

  • Consider the memset() superfluous and remove it, since every member of the structure is initialized
  • Later on, clobber the padding byte when assigning to .b

Moreover, in the case of BPF filters, the netdev developers considered that forcing the initialization of an array (of 16 32-bit words) was far too costly, since it happens for every packet. Instead, they wrote a BPF code verifier to check that every access to the array is valid.


This type of bug had already been shown to be dangerous in userspace, and its consequences are worse in the kernel. Yet it took a few kicks in the anthill to make things move, as with the skepticism of the netdev maintainer: Dan Rosenberg's answer was scathing, with the publication of an exploit on full-disclosure, even if he later admitted he published it because he doubted its severity.

Despite this episode, kernel developers have taken this vulnerability class seriously, and dozens of fixes have been applied since.

Kernel stack expansion

Back in 2005, Gaël Delalleau discussed the benefits of making the stack and the heap meet in userspace; in November 2010, Nelson Elhage, the author of Ksplice, brought this attack up to date for the kernel.

The memory allocated for the kernel itself is minimal: a kernel task gets at most two memory pages for its local variables (its stack). But this limitation is merely "conventional", since no mechanism prevents the task from growing past it; there is no guard page, for instance.
In practice, if we manage to "grow" the stack of a kernel task beyond its two regulation pages (see CVE-2010-3848 for a concrete example), the stack overwrites the thread_info structure of the current task.

By overwriting some of the function pointers available inside this structure, we can hijack the execution flow.

Yin: New protections

Bug fixing

This year will not be the year Linus Torvalds changes his mind about security bugs, but we are getting closer thanks to the efforts of the Red Hat, SuSE and Ubuntu teams.

It seems they closely follow the kernel mailing lists to identify "sensitive" commits, and a CVE number is then assigned. Eugene Teo even maintains a git repository with all the tagged CVEs, which is particularly useful during audits since it makes it easy to identify the vulnerabilities of a given kernel. It is somewhat the whitehat equivalent of the per-kernel exploit lists used by attackers.

Proactive security

Many contributions were made to the Linux kernel to improve its security proactively. Many efforts were started to make exploit writers' job much harder. For instance, going back to Nelson Elhage's vulnerabilities, Dan Rosenberg's exploit required combining three vulnerabilities to turn a DoS into a privilege escalation.

This defense in depth shows how costly exploiting certain vulnerabilities is becoming. But let us get back to the work done in 2010.

Hardening permissions

Brad Spengler has repeated it many times over the past years: far too much information is available to the user. That is why his grsecurity patch restricts access rights on the kernel's special files as much as possible.

Indeed, these files contain the addresses of kernel objects, which is very handy when exploiting a vulnerability since it avoids brute-forcing, rarely a good thing to do in kernel land :)

Dan Rosenberg and Kees Cook therefore worked to get these restrictions into the official branch:

The best-positioned solution is to replace the addresses with an arbitrary value when the reader lacks sufficient privileges. To avoid code duplication, the %pK format specifier was implemented: depending on the kptr_restrict sysctl variable, the address is printed or not.

Along with these restrictions, a new capability, CAP_SYSLOG, was created. This privilege is what gates access to the addresses.

A lot of work remains. Thanks to his new fuzzer, Dave Jones discovered that any user could load a new ACPI table if debugfs was mounted, because of lax permissions.

Read-only marking

One effort still unfinished to this day is the marking of certain memory areas as read-only; this requires several steps:

  • Put real hardware permissions on the .rodata segment. For now the permissions are purely virtual; this patch physically marks the page read-only (enforced by the CPU)
  • Mark function pointers const whenever possible. One of the simplest techniques to exploit a kernel vulnerability is overwriting a function pointer; turning these pointers into constants moves them into the .rodata area and thus prevents the overwrite. Of course, some writable function pointers will always remain, but that is no reason to do nothing…
  • Disable the entry points to set_kernel_text_rw() so that an attacker cannot change a page's permissions.

A priori, the developers did not seem opposed to this patch; they would even be rather happy to merge it in order to make virtualization optimizations.

Preventing automatic module loading

Most exploited vulnerabilities affect rarely used parts of the code, which may well be why bugs are still found there.

In general, distributions have no choice but to build the kernel with every feature enabled, everything as modules, so as not to end up with a 30 MB monolithic kernel in memory.

To keep this transparent, the kernel can automatically load the module in charge of the requested operation, which is rather good news for attackers: just ask for X.25 support and it gets loaded, ready to be exploited.

Dan Rosenberg (again!) proposed to auto-load modules only if the triggering process is root. This restriction is already present in the grsecurity patch set, but it was deemed too disruptive for distributions and was therefore rejected for fear of breaking existing setups :-/

UDEREF support on AMD64

The PaX developers have always been clear: AMD64 systems will never be as well protected as i386 ones, because of the lack of segmentation.

Nevertheless, they do their best, and prove it once more with the implementation of UDEREF for this architecture.

As a reminder, UDEREF prevents the kernel from using userspace memory without asking for it explicitly, which blocks the exploitation of NULL pointer dereferences.

On i386 this is rather easy using segmentation. But on AMD64 it is a rather dirty hack: move the userspace memory area and mark it non-executable.

The problem is that this only moves the target: now, instead of dereferencing a NULL pointer, one would have to trick the kernel into dereferencing another address (but, as pageexec says, if we get there, it is the least of our worries).
Then, we lose 5 bits of addressing, so a process sees its address space reduced to 42 bits, and a bit of ASLR along the way…
And the icing on the cake: every user-to-kernel and kernel-to-user transition pays the cost of a TLB flush (due to the relocation of the memory area).


Network security?

Network security mirrors the submissions on the topic at security conferences: unfortunately, it is not sexy enough for researchers to take an interest. Apart from the beginning of the iptables rewrite called nftable in 2009, not much happened in 2010. Among the notable things are the support of TCP Cookie Transactions and improvements to the "old" syncookies.

TCP syncookies are used to avoid creating entries in the connection table until connections are actually established, which is particularly useful during a SYN flooding DoS.
Previously, SYNcookies were considered "to be used as a last resort" because the TCP negotiation options (congestion bit, window scaling or selective acknowledgement) were lost.

That is over now: the kernel stores this information in the 9 low-order bits of the TCP Timestamp option (note that the tcp(7) manpage has still not been updated). This means that using the feature is no longer as harmful to performance as it used to be.

Admissions of failure

The capabilities drama:
Quite frankly, the Linux capability system is largely a mess, with big bundled capacities that don’t make much sense and are hideously inconvenient with the capability system used in user space (groups).
Too many patches to review for the kernel's -stable branch:
> > I realise it wasn’t ready for stable as Linus only pulled it in
> > 2.6.37-rc3, but surely that means this neither of the changes
> > should have gone into
> Why didn’t you respond to the review??

I don’t actually read those review emails, there are too many of them.


Many good things landed in the Linux kernel, mostly thanks to the work of the various people cited in this post; it is also striking to realize that all these improvements are the result of security researchers rather than kernel developers. That may be why every patch was the subject of endless discussions (let us once more admire their patience)…
It is only now that I understand how right spender was in his declaration of war against the LSMs. Aren't the "Security" subsystem maintainers sitting in their ivory tower, out of touch with the problems of "real life"? Where the sysadmin has no time to run the latest kernel release on every server, nor the courage to write SELinux rules that would be bypassed by the first kernel bug anyway…
Then again, this is only the opinion of someone from the security circus

Still, we can only be happy about this year's progress. We can almost hope that someday mmap_min_addr will no longer be escapable… And that all the proactive changes that were made will require combining multiple vulnerabilities for an exploit to work. I am not saying there will be no more exploits, far from it, but rather that the exploitation cost will become too high for the average attacker. At that point, researchers will have to dive into "logic" bugs such as the LD_PRELOAD/LD_AUDIT vulnerabilities.

What Is Really the Attack Surface of the Kernel Running a SECCOMP Process?

In a previous post, I said the attack surface of the kernel for processes running SECCOMP was really low. To confirm this assumption, each vulnerability affecting the 2.6 kernel was reviewed.

Only those triggerable from a SECCOMPed process were kept. Out of 440 vulnerabilities, 13 qualified:

Severity  Description                                            Arch         Reference
--------  -----------------------------------------------------  -----------  -------------
HIGH      Infinite loop triggering signal handler                i386         CVE-2004-0554
MEDIUM    audit_syscall_entry bypass                             amd64        CVE-2009-0834
MEDIUM    SECCOMP bypass                                         amd64        CVE-2009-0835
MEDIUM    Non-sign extension of syscall arguments                s390         CVE-2009-0029
MEDIUM    EFLAGS leak on context switch                          amd64/i386   CVE-2006-5755
MEDIUM    Nested faults                                          amd64        CVE-2005-1767
MEDIUM    Not handling certain privileged instructions properly  s390         CVE-2004-0887
LOW       Fix register leak in 32-bit syscall auditing           amd64        81766741f
LOW       64-bit kernel register leak to 32-bit processes        amd64        24e35800c
LOW       Register leak                                          amd64        CVE-2009-2910
LOW       DoS by using malformed LDT                             amd64        CVE-2008-3247
LOW       DoS on floating point exceptions                       powerpc HTX  CVE-2007-3107
LOW       DoS on 32-bit compatibility mode                       amd64        CVE-2005-1765

In other words, if you are running a pure 32-bit environment, our initial intuition was almost right, with only two bugs so far (in 2004 and 2006). On AMD64, however, I wouldn't bet on it.

Disclaimer: Of course, these numbers are meaningless because of the kernel developers' non-disclosure policy.

Massive Reverse Address DNS Resolver

Just for the record (and for newsoft :), here is a basic reverse DNS bruteforce implemented with Node.js: thanks to this awesome event-based library, it is possible to write powerful tools in a few lines of JavaScript!

The following code will resolve a /24 netblock in less than 5 seconds.

#! /usr/bin/nodejs

var baseaddr = '88.191.98.';

var sys = require('sys');
var dns = require('dns');
var events = require('events');

// Resolve one address and report the result through an EventEmitter,
// so callers can chain listeners instead of nesting callbacks.
function reverse_addr(addr) {
    var e = new events.EventEmitter();
    dns.reverse(addr, function(err, domains) {
        if (err) {
            if (err.errno == dns.NOTFOUND)
                e.emit('response', addr, 'NOTFOUND');
            else
                e.emit('error', addr, err);
        } else
            e.emit('response', addr, domains);
    });
    return e;
}

// Fire all lookups at once; the event loop handles the concurrency.
for (var i = 0 ; i < 255 ; i++) {
    var currentaddr = baseaddr + i;

    reverse_addr(currentaddr).addListener('error', function (addr, err) {
        sys.debug(addr + ' failed: ' + err.message);
    }).addListener('response', function(addr, domains) {
        sys.puts(addr + ' = ' + domains);
    });
}
There is no retry mechanism if the remote server returns a SERVFAIL, but this is left as an exercise for the reader…