time to bleed by Joe Damato

technical ramblings from a wanna-be unix dinosaur


Digging out the craziest bug you never heard about from 2008: a linux threading regression


If you enjoy this article, subscribe (via RSS or e-mail) and follow me on twitter.


This blog post will show how a fix for XFree86 and linuxthreads ended up causing a major threading regression about 7 years after the fix was created.

The regression was in pthread_create. Thread creation performed very slowly as the number of threads in a process increased. This bug is present on CentOS 5.3 (and earlier) and other linux distros as well.

It is also very possible that this bug skewed research on building high performance threaded applications done before August 15, 2008 (and likely after, since Linux distro releases are slow to pick up fixes).

Digging this thing out was definitely one of the more interesting bug hunts in recent memory.

Hopefully, my long (and insane) story will encourage you to thoroughly investigate suspicious sources of performance degradation before applying the “[thing] is slow” label.

[thing] might be slow, but it may be useful to understand why.


This blog post will be talking about:

  • glibc 2.7. Earlier versions are probably affected, but you will need to go look and see for sure.
  • Linux kernels earlier than 2.6.27. Some linux distributions will backport fixes, so it is possible you could be running an earlier kernel version, but without the bug described below. You will need to go look and see for sure.

Linux’s mmap and MAP_32BIT

mmap is a system call that a program can use to map regions of memory into its address space. mmap can be used to map files into RAM and to share these mappings with other processes. mmap is also used by memory allocators to obtain large regions of memory that can be carved up and handed out to a program.

On June 29, 2001, Andi Kleen added a flag to Linux’s mmap called MAP_32BIT. The commit message reads, in part [1]:

This adds a new mmap flag to force mappings into the low 32bit address space.
Useful e.g. for XFree86′s ELF loader or linuxthreads’ thread local
data structures.

As Kleen mentions, XFree86 has its own ELF loader that appears to have been released as part of the 4.0.1 release back in 2000. The purpose of this ELF loader is to allow loadable module support for XFree86 even on systems that don’t necessarily have support for loadable modules. Another interesting side effect of the decision to include an ELF loader is that loadable modules can be built once and then reused on any system that XFree86 supports without recompiling the module source.

It appears that Kleen added MAP_32BIT to allow programs (like XFree86) which assumed mmap would always return 32-bit addresses to continue to work properly as 64-bit processors were beginning to enter the market.

Then, on November 11, 2002, Egbert Eich added some code to XFree86 to actually use the MAP_32BIT flag, the commit message says:

532. Fixed module loader to map memory in the low 32bit address space on
x86-64 (Egbert Eich).

Thus, 64-bit XFree86 builds would now have a working ELF loader since those builds would be able to get memory with 32-bit addresses.

I will touch on the threading implications mentioned in Kleen’s commit message a bit later.

ELF small code execution model and a tweak to MAP_32BIT

The AMD64 ABI lists several different code models which differ in addressing, code size, data size, and address range.

Specifically, the spec defines something called the small code model.

The small code model is defined such that all symbols are known to be located in the range from 0 to 0x7EFFFFFF (among other things that are way beyond the scope of this blog post).

To support this code execution model, Kleen added a small tweak to MAP_32BIT that limits the range of addresses mmap will return.

Unfortunately, I have not been able to track down the exact commit with Kleen’s commit message (if there was one), but it occurred sometime between November 28, 2002 (kernel 2.4.20) and June 13, 2003 (kernel 2.4.21).

I did find what looks like a merge commit or something. It shows the code Kleen added and a useful comment explaining why the address range was being limited:

+	} else if (flags & MAP_32BIT) { 
+		/* This is usually used needed to map code in small
+		   model: it needs to be in the first 31bit. Limit it
+		   to that.  This means we need to move the unmapped
+		   base down for this case.  This may give conflicts
+		   with the heap, but we assume that malloc falls back
+		   to mmap. Give it 1GB of playground for now. -AK */ 
+		if (!addr) 
+			addr = 0x40000000; 
+		end = 0x80000000;		

Unfortunately, the flag’s name MAP_32BIT is now inaccurate.

The range has been limited to a single gigabyte, from 0x40000000 (1GB) to 0x80000000 (2GB). This is good enough for the ELF small code model mentioned above, but it means any memory successfully mapped with MAP_32BIT is actually mapped within the first 31 bits, so this flag should probably be called MAP_31BIT or something else that more accurately describes its behavior.


pthread_create and thread stacks

When you create a thread using pthread_create there are two ways to allocate a region of memory for that thread to use as its stack:

  • Allow libpthread to allocate the stack itself. You do this by simply calling pthread_create. This is the common case for most programs. Use pthread_attr_getstacksize and pthread_attr_setstacksize to get and set the stack size.


  • A three step process:

    1. Allocate a region of memory possibly with mmap, malloc (which may just call mmap for large allocations), or statically.
    2. Use pthread_attr_setstack to set the address and size of the stack in a pthread attribute object.
    3. Pass said attribute object along to pthread_create and the thread which is created will have your memory region set as its stack.

Slow context switches, glibc, thread local storage, … wow …

A lot of really crazy shit happened in 2003, so I will try my best to split it into digestible chunks.

Slow context switching on AMD k8

On February 12, 2003, it was reported that early AMD K8 CPUs were very slow when executing the wrmsr instruction. This instruction is used to write to model specific registers (MSRs). It was used a few times in context switch code, and removing it would help speed up context switches. This code was refactored, but the data gathered here would be used as justification for using MAP_32BIT in glibc a few months later.

MAP_32BIT being introduced to glibc

On March 4, 2003, it appears that Ulrich Drepper added code to glibc to use the MAP_32BIT flag. As far as I can tell, this was the first time MAP_32BIT was introduced to glibc [2].

An interesting comment is presented with a small piece of the patch:

+/* For Linux/x86-64 we have one extra requirement: the stack must be
+   in the first 4GB.  Otherwise the segment register base address is
+   not wide enough.  */
+  if ((uintptr_t) stackaddr > 0x100000000ul                                  \
+      || (uintptr_t) stackaddr + stacksize > 0x100000000ul)                  \
+    /* We cannot handle that stack address.  */                                      \
+    return EINVAL

To understand this comment it is important to understand how Linux deals with thread local storage.


  • Each thread has a thread control block (TCB) that contains various internal information that nptl needs, including some data that can be used to access thread local storage.
  • The TCB is written to the start of the thread stack by nptl.
  • The address of the TCB (and thus the thread stack) needs to be stored in such a way that nptl, code generated by gcc, and the kernel can access the thread control block of the currently running thread. In other words, it needs to be context switch friendly.
  • The register set of Intel processors is saved and restored each time a context switch occurs.
  • Saving the address of the TCB in a register would be ideal.

x86 and x86_64 processors are notorious for not having many registers available; however, Linux does not use the FS and GS segment selectors for segmentation. So, the address of the TCB can be stored in FS or GS if it will fit.

Unfortunately, the segment selectors FS and GS can only store 32-bit addresses and this is why Ulrich added the above code. Addresses above 4gb could not be stored in FS or GS.

It appears that this comment is correct for all of the Linux 2.4 series kernels and all Linux 2.5 kernels less than 2.5.65. On these kernel versions, only the segment selector is used for storing the thread stack address and as a result no thread stack above 4gb can be stored.

32-bit and 64-bit thread stacks

Starting with Linux 2.5.65 (and more practically, Linux 2.6.0), support for both 32-bit and 64-bit thread stacks had made its way into the kernel. 32-bit thread stacks would be stored in a segment selector while 64-bit thread stack addresses would be stored in a model specific register (MSR).

Unfortunately, as was reported back in February 12, 2003, writing to MSRs is painfully slow on early AMD K8 processors. To avoid writing to an MSR, you would need to supply a 32-bit thread stack address, and thus Ulrich added the following code to glibc on May 9, 2003:

+/* We prefer to have the stack allocated in the low 4GB since this
+   allows faster context switches.  */
+/* If it is not possible to allocate memory there retry without that
+   flag.  */
+#define ARCH_RETRY_MMAP(size) \
+  mmap (NULL, size, PROT_READ | PROT_WRITE | PROT_EXEC,                              \

This code is interesting for two reasons:

  1. The justification for using MAP_32BIT has changed from the kernel not supporting addresses above 4gb to decreasing context switch cost.
  2. A retry mechanism is added so that if no memory is available when using MAP_32BIT, a thread stack will be allocated somewhere else.

At some unknown (by me) point in 2003, MAP_32BIT had been sterilized as explained earlier to deal with the ELF small code model in the AMD64 ABI.

The end result being that user programs have only 1gb of space with which to allocate all their thread stacks (or other low memory requested with MAP_32BIT).

This seems bad.

Fast forward to 2008: a bug

On August 13, 2008 an individual named “Pardo” from Google posted a message to the Linux kernel mailing list about a regression in pthread_create:

mmap() is slow on MAP_32BIT allocation failure, sometimes causing
NPTL’s pthread_create() to run about three orders of magnitude slower.
As example, in one case creating new threads goes from about 35,000
cycles up to about 25,000,000 cycles — which is under 100 threads per second.

Pardo had filled the 1gb region that MAP_32BIT tries to use for thread stacks, causing glibc to fall back to the retry mechanism that Drepper added back in 2003.

Unfortunately, the failing mmap call with MAP_32BIT was doing a linear fucking search of all the “low address space” memory regions trying to find a fit before falling back and calling mmap a second time without MAP_32BIT.

And so after a few thousand threads, every new pthread_create call would trigger two system calls, the first of which would do a linear search of low memory before failing. The linear search and retry dramatically increased the time to create new threads.

This is pretty bad.

bugfixes for the kernel and glibc

So, how do glibc and the kernel fix this problem?

Ingo Molnar convinced everyone that the best solution was to add a new flag to the Linux kernel called MAP_STACK. This flag would be defined as “give out an address that is best suited for process/thread stacks”. This flag would actually be ignored by the kernel. This change appeared in Linux kernel 2.6.27 and was added on August 13, 2008.

Ulrich Drepper updated glibc to use MAP_STACK instead of MAP_32BIT and he removed the retry mechanism he added in 2003 since MAP_STACK should always succeed if there is any memory available. This change was added on August 15, 2008.

Unfortunately, MAP_32BIT cannot be removed from the kernel, because there are many programs out in the wild (older versions of glibc, XFree86, older versions of ocamlc) that rely on this flag existing to actually work.

And so, MAP_32BIT will remain. A misnamed, strange looking wart that will probably be around forever to remind us that computers are hard.

Bad research (or not)?

I recall reading a paper that benchmarked thread creation time from back in the 2005-2008 time period which claimed that as the number of threads increased the time to create new threads also increased and thus, threads were bad.

I can’t seem to find that paper and it is currently 3:51 AM PST, so, who knows, I could be misremembering things. If someone knows what paper I am talking about, please let me know.

If such a paper exists (and I think it does), this blog post explains why thread creation benchmarks would have resulted in really bad looking graphs.


I really don’t even know what to say, but:

  • Computers are hard
  • Ripping the face off of bugs like this is actually pretty fun
  • Be careful when adding linear time searches to potential hot paths
  • Make sure to carefully read and consider the effects of low level code
  • Git is your friend and you should use it to find data about things from years long gone
  • Any complicated application, kernel, or whatever will have a large number of ugly scars and warts that cannot be changed to maintain backward compatibility. Such is life.

If you enjoyed this article, subscribe (via RSS or e-mail) and follow me on twitter.


  1. http://web.archiveorange.com/archive/v/iv9U0zrDmBRAagTHyhHz
  2. The sha for this change is: 0de28d5c71a3b58857642b3d5d804a790d9f6981 for those who are curious to read more.

Written by Joe Damato

May 6th, 2013 at 4:13 am

Posted in bugfix,linux

realtalk.io: a podcast for technical discussion


James Golick and I have released the inaugural episode of our new, highly technical podcast realtalk.io.

We will be doing frequent technical deep dives and releasing our conversations raw and unedited with all errors, omissions, awkward pauses, and curse words intact.

Check out the website, tune in, and subscribe.

Written by Joe Damato

April 28th, 2013 at 10:23 pm

Posted in Uncategorized

Notes about an odd, esoteric, yet incredibly useful library: libthread_db




This blog post will examine one of the weirder libraries I’ve come across: libthread_db.

libthread_db is typically used by debuggers, tracers, and other low level debugging/profiling applications to gather information about the threads in a running target process. Unfortunately, the documentation about how to use this library is a bit lacking and using it is not straightforward at all.

This library is pretty strange and there are several gotchas when trying to write a debugger or tracing program that makes use of the various features libthread_db provides.

Loading the library (and probably failing)

As strange as it may seem to those who haven’t used this library before, loading and linking to libthread_db is not as straightforward as simply adding -lthread_db to your linker flags.

The key thing to understand is that different target programs may use different threading libraries. Individual threading libraries may or may not have a corresponding libthread_db that works with a particular threading library, or even with a particular version of a particular threading library.

So until you attach to a target process, you have no idea which of the possibly several libthread_db libraries on the system you will need to use to gather threading information from a target process.

You don’t even know where the corresponding libthread_db library may live.

So, to load libthread_db in your debugger/tracer, you must:

  1. Attach to your target process, usually via ptrace.
  2. Traverse the target process’ link map to determine which libraries are currently loaded. Your program should search for the threading library of the process (often libpthread, but maybe your target program uses something else instead).
  3. Once found, your program can search in nearby directories for the location of libthread_db. In the most common case, a program will use libpthread as its threading library and the corresponding libthread_db will be located in the same directory. Of course, you could also allow the user to specify the exact location.
  4. Once found, simply use libdl to dlopen the library.
  5. If your target process is a Linux process which uses libpthread (a common case), libthread_db will fail to load with libdl. Other libthread_db libraries may or may not load fine.

libthread_db’s numerous undefined symbols

If you’ve followed the above steps to attempt to locate libthread_db and are targeting a linux process that uses libpthread, you have now most likely failed to load it due to a number of undefined symbols.

Let’s use ldd to figure out what is going on:

joe@ubuntu:~$ ldd -r /lib/x86_64-linux-gnu/libthread_db.so.1 | grep undefined
undefined symbol: ps_pdwrite	(/lib/x86_64-linux-gnu/libthread_db.so.1)
undefined symbol: ps_pglobal_lookup	(/lib/x86_64-linux-gnu/libthread_db.so.1)
undefined symbol: ps_lsetregs	(/lib/x86_64-linux-gnu/libthread_db.so.1)
undefined symbol: ps_getpid	(/lib/x86_64-linux-gnu/libthread_db.so.1)
undefined symbol: ps_lgetfpregs	(/lib/x86_64-linux-gnu/libthread_db.so.1)
undefined symbol: ps_lsetfpregs	(/lib/x86_64-linux-gnu/libthread_db.so.1)
undefined symbol: ps_lgetregs	(/lib/x86_64-linux-gnu/libthread_db.so.1)
undefined symbol: ps_pdread	(/lib/x86_64-linux-gnu/libthread_db.so.1)

Bring your own symbols to this party

libthread_db will fail to load due to undefined symbols because the library expects your program to provide the implementations of these symbols. Unfortunately, the only way to determine which functions must be implemented is to examine the source code of the libthread_db implementation(s) you are targeting.

The libthread_db implementations that come with glibc include a header file named proc_service.h which list all the functions and prototypes that your program must provide. I’ve noticed that other libthread_db implementations also provide a similar header file.

These functions are all very platform specific; to maximize the portability of the various implementations of libthread_db, providing these functions is left to the program using libthread_db.

In general, your program must provide implementations of:

  • Functions to read from and write to the address space of a targeted process. Typically implemented with ptrace.
  • Functions to read and write the general purpose registers and floating point registers (if there are any). Typically implemented with ptrace.
  • A function to locate a specified shared object and search that object for a particular symbol. This function is significantly more complex than the other functions. Your program could use something like libbfd or libelf to make locating a library and searching its symbol tables easier. If you are implementing a debugger or tracer, you likely already have the pieces you need to implement this function.
  • A structure struct ps_prochandle that libthread_db will pass through to the functions you implemented that are described above. You can put whatever data your functions will need in this structure. Typically this is something like a pid that you can pass through to ptrace.

libthread_db still fails to load

So, you’ve implemented the symbols you were required to implement, but you are still unable to load libthread_db with libdl because you are getting undefined symbol: ... errors.

Even stranger, you are getting these errors even though you are providing the symbols listed in the error messages!

The problem that you are running into is that the symbols are not being placed into the correct ELF symbol table. When you build an executable with gcc, the exported symbols of the executable are placed in the ELF section named .symtab. When libthread_db gets loaded with libdl, only the symbols in the .dynsym symbol table are examined to resolve dependencies. Thus, your symbols will not be found and libthread_db will fail to load.

Why this happens is beyond the scope of this blog post, but I’ve written about dynamic linking and symbol tables before here and here, if you are curious to learn a bit more.

Use this one weird trick for getting your symbols in the dynamic symbol table

There are actually two ways to make sure your symbols end up in the dynamic symbol table.

The first way to do it is to use the large hammer approach and pass the flag --export-dynamic to ld. This will add all exported symbols to the dynamic symbol table and you will be able to load libthread_db.

The second way to do it is much cleaner and strongly recommended over the previous method.

  • Create a file which specifies the symbol names you want added to the dynamic symbol table.
  • Use the linker flag --dynamic-list=FILENAME to let ld know which symbols you want placed in the dynamic symbol table.

Your file might look something like this:

   {
     ps_pdread;
     ps_pdwrite;
     ps_pglobal_lookup;
     /* more symbol names would go here... */
   };

If you are using gcc, you can then simply pass the flag: -Wl,--dynamic-list=FILENAME and your executable will have the symbols listed in the file placed in the dynamic symbol table.

Regardless of which method you use be sure to verify the results by using readelf to determine if the symbols actually made it to the correct symbol table.

Calling the initialize function and allocating a libthread_db handle

So, after all that work you will finally be able to load the library.

Since the library was loaded with libdl, you will need to use dlsym to grab function pointers to all the functions you intend to use. This is kind of tedious, but you can make clever use of C macros to help, as long as you also document how they work.

So, to find and call the initialize function (without any macros for sanity and clarity):

   /* find the init function */
   td_init = dlsym(handle, "td_init");
   if (td_init == NULL) {
     fprintf(stderr, "Unable to find td_init");
     return -1;
   }

   /* call the init function */
   err = td_init();
   if (err != TD_OK) {
     fprintf(stderr, "td_init: %d\n", err);
     return -1;
   }

   /* find the libthread_db handle allocator function */
   td_ta_new = dlsym(handle, "td_ta_new");
   if (td_ta_new == NULL) {
     fprintf(stderr, "Unable to find td_ta_new");
     return -1;
   }

   /* call td_ta_new */
   err = td_ta_new(&somestructure->ph, &somestructure->ta);
   if (err != TD_OK) {
     fprintf(stderr, "td_ta_new failed: %d\n", err);
     return -1;
   }

   /* XXX don't forget about td_ta_delete */

A cool version check

td_ta_new performs a rather interesting version check when called before allocating a handle:

  1. First, it uses the ps_pglobal_lookup symbol you implemented to search for the symbol nptl_version in the libpthread library linked into the remote process. Your function should find this symbol and return the address.
  2. Next, td_ta_new reads several bytes from the target process at the address your ps_pglobal_lookup returned using your ps_pdread function.
  3. Lastly, the bytes read from the target process are checked against libthread_db's internal version to determine if the versions match.

So, the library you load calls functions you implemented to search the symbol tables of a process you are attached to in order to read a series of bytes out of that process’ address space to determine if that process’ threading library matches the version of libthread_db you loaded into your debugger.

Fucking rad.

By the way, if you were wondering why libpthread is one of the few libraries that is not stripped on Linux, now you know. If it were stripped, this check would fail, unless of course your ps_pglobal_lookup function searched debug information.

Now you can use the library

At this point, you’ve done enough setup to be able to dlsym search for and call various functions to iterate over the threads in a remote process, to be notified asynchronously when threads are created or destroyed, and to access thread local data if you want to.


Here’s a summary of the steps you need to go through to load, link, and use libthread_db:

  • Implement a series of functions and structures specified in the libthread_db implementation(s) you are targeting. You can find these in the header file called proc_service.h.
  • Attach to the remote process, determine the path of the threading library it is using and look nearby to find libthread_db. Alternatively, allow the user to specify the location of libthread_db.
  • Use libdl to load the library by calling dlopen.
  • Use dlsym to find td_init and td_ta_new. Call these functions to initialize the library.
  • Ensure you are using either --export-dynamic or --dynamic-list=FILENAME to place the symbols in the correct symbol table so that the runtime dynamic linker will find them when you load libthread_db.
  • Make sure to use lots of error checking and debug output to ensure that your implemented functions are being hit and that they are returning the proper return values as specified in proc_service.h.
  • Sit back and consider that this entire process actually works and allows you to debug or trace processes with multiple threads.


Written by Joe Damato

April 22nd, 2013 at 2:24 am

How a crazy GNU assembler macro helps you debug GRUB with GDB




Debugging boot loaders and other low level pieces of a computer system can be pretty tricky, especially because you may not have multiprocess support or access to a hard drive or other devices.

This blog post examines one way of debugging these sorts of systems by examining an insanely clever GNU assembler macro and some GDB stub code in the GRUB boot loader.

This piece of stub code allows a programmer to debug GRUB with GDB over a serial cable to help diagnose a broken boot loader.


Firstly, the macro that will be examined is truly a thing of beauty. The macro generates assembly code stubs for a range of interrupts via recursion and, with very clever use of labels, automatically writes the addresses of the generated assembly to an array so that those addresses can later be used as interrupt handler offsets.

Secondly, I think debugging is actually much more interesting than programming in most cases. In particular, debugging low level things like GRUB is particularly interesting to me because of the weird situations that arise. Imagine you are trying to debug something, but you have no keyboard, maybe video is only sort of working, you don’t have multiprocess support, and you aren’t able to communicate with your hard drive.

How do you debug a page fault in a situation like this?

This blog post will attempt to explain how GRUB overcomes this by using some really clever code coupled with GDB’s remote debug functionality.

overview of what you are about to see

  • GRUB’s GDB module is loaded.
  • The module calls a function named grub_gdb_idtinit.
  • grub_gdb_idtinit loads the interrupt descriptor table with addresses of functions to be executed when various interrupts are raised on the system.
  • The addresses of the interrupt handlers are from an array called grub_gdb_trapvec.
  • The code for two different types of interrupt handlers is generated with a series of insanely clever macros, explained in detail below. The main macro named ent uses recursion and clever placement of labels to automatically generate the assembly code stubs and write their addresses to grub_gdb_trapvec.
  • The addresses of the interrupt handler code are filled into the grub_gdb_trapvec array by using labels.
  • The generated code of the interrupt handlers themselves call grub_gdb_trap.
  • grub_gdb_trap reads and writes packets according to GDB’s remote serial protocol.
  • The remote debugger is now able to set breakpoints, dump register contents, or step through instructions via serial cable.

Prepare yourself.

GRUB’s GDB module initialization

The GRUB 2.0 boot loader supports dynamically loaded modules to extend the functionality of GRUB. I’m going to dive right into the GDB module, but you can read more about writing your own modules here.

GRUB’s GDB module has an init function that looks like this [1]:


GRUB_MOD_INIT (gdb)
{
  grub_gdb_idtinit ();

  cmd = grub_register_command ("gdbstub", grub_cmd_gdbstub,
                               N_("Start GDB stub on given port"));

  cmd_break = grub_register_command ("gdbstub_break", grub_cmd_gdb_break,
                                     0, N_("Break into GDB"));

  /* other code */
}

This module init function starts by calling a function named grub_gdb_idtinit which has a lot of interesting code that we will examine shortly. As we will see, this function creates a set of interrupt handlers and installs them so that any exceptions (divide by 0, page fault, etc) that are generated will trigger GDB on the remote computer.

After that, two commands named gdbstub and gdbstub_break are registered with GRUB. If the GRUB user issues one of these commands, the corresponding functions are executed.

The first command, gdbstub, attaches a specified serial port to the GDB module so that the remote GDB session can communicate with this computer.

The second command, gdbstub_break, simply raises a debug interrupt on the system by calling the function grub_gdb_breakpoint after some error checking [2]:

void
grub_gdb_breakpoint (void)
{
  asm volatile ("int $3");
}

This works just fine because the grub_gdb_idtinit has registered a handler for the debug interrupt.

entering the rabbit hole: grub_gdb_idtinit

The grub_gdb_idtinit function which is called during initialization is pretty straightforward. It simply creates interrupt descriptor table (IDT) entries which point at interrupt handlers for interrupt numbers 0 through 31. The basic idea here is that something bad happens (page fault, general protection fault, divide by zero, …) and the CPU calls a handler function to report the exception or error condition.

You can read more about interrupt and exception handling on Intel 64 and IA-32 CPUs by reading the Intel® 64 and IA-32 Architectures: Software Developer’s Manual volume 3A, chapter 6, available from Intel here.

Take a look at the C code for grub_gdb_idtinit [3], paying close attention to the for loop:

/* Set up interrupt and trap handler descriptors in IDT.  */
void
grub_gdb_idtinit (void)
{
  int i;
  grub_uint16_t seg;

  asm volatile ("xorl %%eax, %%eax\n"
                "mov %%cs, %%ax\n" :"=a" (seg));

  for (i = 0; i <= GRUB_GDB_LAST_TRAP; i++)
    grub_idt_gate (&grub_gdb_idt[i],
                   grub_gdb_trapvec[i], seg,
                   GRUB_CPU_TRAP_GATE, 0);

  grub_gdb_idt_desc.base = (grub_addr_t) grub_gdb_idt;
  grub_gdb_idt_desc.limit = sizeof (grub_gdb_idt) - 1;
  asm volatile ("sidt %0" : : "m" (grub_gdb_orig_idt_desc));
  asm volatile ("lidt %0" : : "m" (grub_gdb_idt_desc));
}

You'll notice that this function maps interrupt numbers to handler function addresses in a for-loop. The function addresses come from an array named grub_gdb_trapvec.

The grub_idt_gate function called above simply constructs the interrupt descriptor table entry, given:

  • a memory location for the entry to live (above: grub_gdb_idt[i])
  • the address of the handler function from the grub_gdb_trapvec array (above: grub_gdb_trapvec[i])
  • the segment selector (above: seg)
  • and finally the gate type (above: GRUB_CPU_TRAP_GATE) and privilege bits (above: 0)

Note that the last two inline assembly statements store existing IDT descriptor and set a new IDT descriptor, respectively.

Naturally, the next question is: where do the function addresses in grub_gdb_trapvec come from and what, exactly, do those handler functions do when executed?

grub_gdb_trapvec: a series of clever macros

It turns out that grub_gdb_trapvec is an array which is constructed through a series of really fucking sexy macros in an assembly file.

Let's first examine grub_gdb_trapvec4:

/* some things removed for brevity */
.data VECTOR
        ent EC_ABSENT,  0, 7
        ent EC_PRESENT, 8
        ent EC_ABSENT,  9
        ent EC_PRESENT, 10, 14
        ent EC_ABSENT,  15, GRUB_GDB_LAST_TRAP

This code creates a global symbol named grub_gdb_trapvec in the data section of the compiled object. The contents of grub_gdb_trapvec are constructed by a series of invocations of the ent macro.

Let's take a look at the ent macro (I removed some Apple specific code for brevity) and go through it piece by piece5:

.macro ent ec beg end=0
#define EC \ec
#define BEG \beg
#define END \end

        .text
1:
        .if EC
                add $4, %esp
        .endif

This is the start of the ent macro. This code creates a macro named ent and gives names to the arguments handed over to the macro. It assigns a default value of 0 to end, the third argument.

After that, it uses C preprocessor macros named EC, BEG, and END. This is done to assist with cross-platform builds of this source (specifically for dealing with OSX weirdness).

Next, some code is added to the text section of the object. The start of the code is going to be given the label 1, so that it can be easily referred to later. This label is what will be used to automatically fill in the addresses of the assembly code stubs a little later.

Finally, the add $4, %esp code is included in the assembled object only if EC is non-zero.

EC is the first argument to the ent macro which could either be EC_ABSENT (0) or EC_PRESENT (1) as you saw above. EC stands for "error code." Some interrupts/exceptions that can be generated put an error code on the stack when they occur. If this interrupt/exception places an error code on the stack, this line of code is adding to the stack pointer (remember: on x86 the stack grows down, from higher addresses to lower addresses) to position the stack pointer above the error code. This is done to ensure the stack is at the same position regardless of whether or not an error code is inserted on the stack. In the case where an error code does exist, it is ignored by the code below.

        save_context
        mov     $EXT_C(grub_gdb_stack), %esp
        mov     $(BEG), %eax    /* trap number */
        call    EXT_C(grub_gdb_trap)
        load_context
        iret

This next piece of code begins by using another macro called save_context which writes out the current register values to memory. Next, the address of a piece of memory called grub_gdb_stack is written to %esp. After this instruction, all future code that runs will be using stack space backed by a section of memory named grub_gdb_stack. The interrupt number is written to the %eax register and then the C function grub_gdb_trap is called. We'll take a look at what this function does in a bit. The load_context macro does the opposite of save_context and restores all register values from memory.

Finally, an iret instruction is used to resume execution wherever it left off before the interrupt. In most cases, this returns the system to a broken state where it will hang, trigger another exception, or simply reboot itself, depending on how many levels deep in exceptions you have gotten yourself.

        /*
         * Address entry in trapvec array.
         */

        .data VECTOR
        .long 1b

        /*
         * Next... (recursion).
         */

        .if END-BEG > 0
                ent \ec "(\beg+1)" \end
        .endif
.endm

This is the last piece of the amazing ent macro. It switches back to the data section created earlier, when the grub_gdb_trapvec symbol was being defined, and writes out the address where label 1 lives.

Thus, the address of the code which saves the CPU context, switches out the stack, and invokes grub_gdb_trap is written out.

The ent macro ends by re-invoking itself to generate more code in the .text section and fill in more addresses in the .data section for each interrupt/exception in the range passed to ent as BEG and END.

Wow. What a macro.


the guts: grub_gdb_trap
The code generated by the ent macro calls grub_gdb_trap. In other words, this function is called whenever an interrupt/exception is raised while GRUB is running and the GDB module is loaded.

This function pulls data off the serial port (which you set up when you ran gdbstub in GRUB, as seen earlier). The data coming in on the serial port are packets in GDB's remote serial protocol, carrying commands from the remote GDB session. grub_gdb_trap parses these packets, executes the commands, and replies: registers are read or updated, memory is written or read, and results are passed back over the serial port. This is what allows a remote GDB session on another computer, connected via serial cable, to set breakpoints, examine registers, or single-step code.
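For a flavor of what those packets look like: each one is framed as $&lt;payload&gt;#&lt;checksum&gt;, where the checksum is the payload's byte sum modulo 256, written as two hex digits. A tiny illustrative helper (not GRUB code; rsp_frame is a made-up name) that frames a reply the way any stub must:

```c
#include <stdio.h>
#include <string.h>

/* Frame a GDB remote-serial-protocol packet: "$<payload>#<xx>",
   where <xx> is the payload's byte sum modulo 256, in hex.  */
static void
rsp_frame (const char *payload, char *out, size_t outlen)
{
  unsigned sum = 0;
  for (const char *p = payload; *p; p++)
    sum += (unsigned char) *p;
  snprintf (out, outlen, "$%s#%02x", payload, sum & 0xff);
}
```

Framing the payload "OK" this way yields $OK#9a, the protocol's standard success reply; grub_gdb_trap has to both parse incoming frames of this shape and emit its answers in the same format.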


Conclusion

  • GDB's remote serial protocol is very powerful.
  • Likewise, knowing how to use GNU as can help you construct really clever macros to generate repetitive assembly code easily.
  • Writing a C-stub to parse GDB's remote serial protocol and carry out the commands can allow you to debug weird things, even if the target system lacks multiprocess support, system calls, or a hard drive.
  • Go read the GRUB source code. It's pretty interesting.

If you enjoyed this article, subscribe (via RSS or e-mail) and follow me on twitter.


  1. grub-2.00/grub-core/gdb/gdb.c
  2. grub-2.00/grub-core/gdb/i386/idt.c
  3. grub-2.00/grub-core/gdb/i386/idt.c
  4. grub-2.00/grub-core/gdb/i386/machdep.S
  5. grub-2.00/grub-core/gdb/i386/machdep.S

Written by Joe Damato

November 26th, 2012 at 1:09 am
