technical ramblings from a wanna-be unix dinosaur

## tl;dr

This blog post will examine one of the weirder libraries I’ve come across: libthread_db.

libthread_db is typically used by debuggers, tracers, and other low level debugging/profiling applications to gather information about the threads in a running target process. Unfortunately, the documentation about how to use this library is a bit lacking and using it is not straightforward at all.

This library is pretty strange and there are several gotchas when trying to write a debugger or tracing program that makes use of the various features libthread_db provides.

As strange as it may seem to those who haven’t used this library before, loading and linking to libthread_db is not as straightforward as simply adding -lthread_db to your linker flags.

The key thing to understand is that different target programs may use different threading libraries. A given threading library may or may not have a corresponding libthread_db, and a given libthread_db may only work with particular versions of its threading library.

So until you attach to a target process, you have no idea which of the possibly several libthread_db libraries on the system you will need to use to gather threading information from a target process.

You don’t even know where the corresponding libthread_db library may live.

So, to load libthread_db in your debugger/tracer, you must:

1. Attach to your target process, usually via ptrace.
2. Traverse the target process’ link map to determine which libraries are currently loaded. Your program should search for the threading library of the process (often libpthread, but maybe your target program uses something else instead).
3. Once found, your program can search in nearby directories for the location of libthread_db. In the most common case, a program will use libpthread as its threading library and the corresponding libthread_db will be located in the same directory. Of course, you could also allow the user to specify the exact location.
4. Once found, simply use libdl to dlopen the library.
5. If your target process is a Linux process which uses libpthread (a common case), libthread_db will fail to load with libdl at this point. Other libthread_db libraries may or may not load fine.
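Steps 2 and 3 above can be sketched roughly like this. It is a minimal sketch, not code from any real debugger: it assumes you have already pulled the threading library's path out of the target's link map, and both the helper name guess_thread_db_path and the file name libthread_db.so.1 are assumptions made for illustration.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: given the full path of the threading library found
 * in the target's link map, build a candidate path for libthread_db in the
 * same directory. Returns 0 on success, -1 if the result does not fit.
 */
int guess_thread_db_path(const char *threadlib_path, char *out, size_t outlen)
{
    /* Assumed file name; other systems may name the library differently. */
    static const char name[] = "libthread_db.so.1";
    const char *slash = strrchr(threadlib_path, '/');
    /* Keep the directory part, including the trailing slash, if there is one. */
    size_t dirlen = slash ? (size_t)(slash - threadlib_path) + 1 : 0;
    int n;

    n = snprintf(out, outlen, "%.*s%s", (int)dirlen, threadlib_path, name);
    if (n < 0 || (size_t)n >= outlen)
        return -1;
    return 0;
}
```

The resulting path is what you would hand to dlopen. If that fails, falling back to a user-specified location is probably the sane thing to do.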

If you’ve followed the above steps to attempt to locate libthread_db and are targeting a linux process that uses libpthread, you have now most likely failed to load it due to a number of undefined symbols.

Let’s use ldd to figure out what is going on:

joe@ubuntu:~$ ldd -r /lib/x86_64-linux-gnu/libthread_db.so.1 | grep undefined
undefined symbol: ps_pdwrite (/lib/x86_64-linux-gnu/libthread_db.so.1)
undefined symbol: ps_pglobal_lookup (/lib/x86_64-linux-gnu/libthread_db.so.1)
undefined symbol: ps_lsetregs (/lib/x86_64-linux-gnu/libthread_db.so.1)
undefined symbol: ps_getpid (/lib/x86_64-linux-gnu/libthread_db.so.1)
undefined symbol: ps_lgetfpregs (/lib/x86_64-linux-gnu/libthread_db.so.1)
undefined symbol: ps_lsetfpregs (/lib/x86_64-linux-gnu/libthread_db.so.1)
undefined symbol: ps_lgetregs (/lib/x86_64-linux-gnu/libthread_db.so.1)
undefined symbol: ps_pdread (/lib/x86_64-linux-gnu/libthread_db.so.1)

## Bring your own symbols to this party

libthread_db will fail to load due to undefined symbols because the library expects your program to provide the implementations of these symbols.

Unfortunately, the only way to determine which functions must be implemented is to examine the source code of the libthread_db implementation(s) you are targeting. The libthread_db implementations that come with glibc include a header file named proc_service.h which lists all the functions and prototypes that your program must provide. I’ve noticed that other libthread_db implementations also provide a similar header file.

These functions are all very platform specific and, to maximize the portability of the various implementations of libthread_db, the implementations are left to the program using libthread_db.

In general, your program must provide implementations of:

• Functions to read from and write to the address space of a targeted process. Typically implemented with ptrace.
• Functions to read and write the general purpose registers and floating point registers (if there are any). Typically implemented with ptrace.
• A function to locate a specified shared object and search that object for a particular symbol. This function is significantly more complex than the other functions. Your program could use something like libbfd or libelf to make locating a library and searching its symbol tables easier. If you are implementing a debugger or tracer, you likely already have the pieces you need to implement this function.
• A structure struct ps_prochandle that libthread_db will pass through to the functions you implemented that are described above. You will place whatever data your functions will need in it. Typically this is something like a pid that you can pass through to ptrace.

## libthread_db still fails to load

So, you’ve implemented the symbols you were required to implement, but you are still unable to load libthread_db with libdl because you are getting undefined symbol: ... errors.

Even stranger, you are getting these errors even though you are providing the symbols listed in the error messages!

The problem that you are running into is that the symbols are not being placed into the correct ELF symbol table. When you build an executable with gcc, the exported symbols of the executable are placed in the ELF section named .symtab. When libthread_db gets loaded with libdl, only the symbols in the .dynsym symbol table are examined to resolve dependencies. Thus, your symbols will not be found and libthread_db will fail to load.

Why this happens is beyond the scope of this blog post, but I’ve written about dynamic linking and symbol tables before here and here, if you are curious to learn a bit more.

## Use this one weird trick for getting your symbols in the dynamic symbol table

There are actually two ways to make sure your symbols end up in the dynamic symbol table.

The first way to do it is to use the large hammer approach and pass the flag --export-dynamic to ld. This will add all exported symbols to the dynamic symbol table and you will be able to load libthread_db.

The second way to do it is much cleaner and strongly recommended over the previous method.

• Create a file which specifies the symbol names you want added to the dynamic symbol table.
• Use the linker flag --dynamic-list=FILENAME to let ld know which symbols you want placed in the dynamic symbol table.

Your file might look something like this:

{
  ps_pdread;
  ps_pdwrite;
  ps_pglobal_lookup;
  /* more symbol names would go here... */
};

If you are using gcc, you can then simply pass the flag: -Wl,--dynamic-list=FILENAME and your executable will have the symbols listed in the file placed in the dynamic symbol table.

Regardless of which method you use, be sure to verify the results by using readelf to determine if the symbols actually made it to the correct symbol table.

## Calling the initialize function and allocating a libthread_db handle

So, after all that work you will finally be able to load the library. Since the library was loaded with libdl, you will need to use dlsym to grab function pointers to all the functions you intend to use. This is kind of tedious, but you can make clever use of C macros to help you, as long as you also make use of documentation to explain how they work.

So, to find and call the initialize function (without any macros for sanity and clarity):

/* find the init function */
td_init = dlsym(handle, "td_init");
if (td_init == NULL) {
  fprintf(stderr, "Unable to find td_init");
  return -1;
}

/* call the init function */
err = td_init();
if (err != TD_OK) {
  fprintf(stderr, "td_init: %d\n",err);
  return -1;
}

/* find the libthread_db handle allocator function */
td_ta_new = dlsym(handle, "td_ta_new");
if (td_ta_new == NULL) {
  fprintf(stderr, "Unable to find td_ta_new");
  return -1;
}

/* call td_ta_new */
err = td_ta_new(&somestructure->ph, &somestructure->ta);
if (err != TD_OK) {
  fprintf(stderr, "td_ta_new failed: %d\n", err);
  return -1;
}

/* XXX don't forget about td_ta_delete */

## A cool version check

td_ta_new performs a rather interesting version check when called before allocating a handle:

1. First, it uses the ps_pglobal_lookup symbol you implemented to search for the symbol nptl_version in the libpthread library linked into the remote process. Your function should find this symbol and return the address.
2. Next, td_ta_new reads several bytes from the target process at the address your ps_pglobal_lookup returned using your ps_pdread function.
3. Lastly, the bytes read from the target process are checked against libthread_db’s internal version to determine if the versions match.

So, the library you load calls functions you implemented to search the symbol tables of a process you are attached to in order to read a series of bytes out of that process’ address space to determine if that process’ threading library matches the version of libthread_db you loaded into your debugger.

Fucking rad.

By the way, if you were wondering why libpthread is one of the few libraries that is not stripped on Linux, now you know. If it were stripped, this check would fail, unless of course your ps_pglobal_lookup function searched debug information.

## Now you can use the library

At this point, you’ve done enough setup to be able to dlsym search for and call various functions to iterate over the threads in a remote process, to be notified asynchronously when threads are created or destroyed, and to access thread local data if you want to.

## Conclusion

Here’s a summary of the steps you need to go through to load, link, and use libthread_db:

• Implement a series of functions and structures specified in the libthread_db implementation(s) you are targeting. You can find these in the header file called proc_service.h.
• Attach to the remote process, determine the path of the threading library it is using and look nearby to find libthread_db. Alternatively, allow the user to specify the location of libthread_db.
• Use libdl to load the library by calling dlopen.
• Use dlsym to find td_init and td_ta_new. Call these functions to initialize the library.
• Ensure you are using either --export-dynamic or --dynamic-list=FILENAME to place the symbols in the correct symbol table so that the runtime dynamic linker will find them when you load libthread_db.
• Make sure to use lots of error checking and debug output to ensure that your implemented functions are being hit and that they are returning the proper return values as specified in proc_service.h.
• Sit back and consider that this entire process actually works and allows you to debug or trace processes with multiple threads.

If you enjoyed this article, subscribe (via RSS or e-mail) and follow me on twitter.

Written by Joe Damato

April 22nd, 2013 at 2:24 am

## How a crazy GNU assembler macro helps you debug GRUB with GDB

## tl;dr

Debugging boot loaders and other low level pieces of a computer system can be pretty tricky, especially because you may not have multiprocess support or access to a hard drive or other devices.

This blog post examines one way of debugging these sorts of systems by examining an insanely clever GNU assembler macro and some GDB stub code in the GRUB boot loader.

This piece of stub code allows a programmer to debug GRUB with GDB over a serial cable to help diagnose a broken boot loader.

## why?

Firstly, the macro that will be examined is truly a thing of beauty. The macro generates assembly code stubs for a range of interrupts via recursion and, with very clever use of labels, automatically writes the addresses of the generated assembly to an array so that those addresses can later be used as interrupt handler offsets.

Secondly, I think debugging is actually much more interesting than programming in most cases. In particular, debugging low level things like GRUB is particularly interesting to me because of the weird situations that arise. Imagine you are trying to debug something, but you have no keyboard, maybe video is only sort of working, you don’t have multiprocess support, and you aren’t able to communicate with your hard drive. How do you debug a page fault in a situation like this?

This blog post will attempt to explain how GRUB overcomes this by using some really clever code coupled with GDB’s remote debug functionality.

## overview of what you are about to see

• GRUB’s GDB module is loaded.
• The module calls a function named grub_gdb_idtinit.
• grub_gdb_idtinit loads the interrupt descriptor table with addresses of functions to be executed when various interrupts are raised on the system.
• The addresses of the interrupt handlers are from an array called grub_gdb_trapvec.
• The code for two different types of interrupt handlers is generated with a series of insanely clever macros, explained in detail below. The main macro named ent uses recursion and clever placement of labels to automatically generate the assembly code stubs and write their addresses to grub_gdb_trapvec.
• The addresses of the interrupt handler code are filled into the grub_gdb_trapvec array by using labels.
• The generated code of the interrupt handlers themselves calls grub_gdb_trap.
• grub_gdb_trap reads and writes packets according to GDB’s remote serial protocol.
• The remote debugger is now able to set breakpoints, dump register contents, or step through instructions via serial cable.

Prepare yourself.

## GRUB’s GDB module initialization

The GRUB 2.0 boot loader supports dynamically loaded modules to extend the functionality of GRUB. I’m going to dive right into the GDB module, but you can read more about writing your own modules here.

GRUB’s GDB module has an init function that looks like this [1]:

GRUB_MOD_INIT (gdb)
{
  grub_gdb_idtinit ();
  cmd = grub_register_command ("gdbstub", grub_cmd_gdbstub,
                               N_("PORT"), N_("Start GDB stub on given port"));
  cmd_break = grub_register_command ("gdbstub_break", grub_cmd_gdb_break,
                                     0, N_("Break into GDB"));
  /* other code */

This module init function starts by calling a function named grub_gdb_idtinit which has a lot of interesting code that we will examine shortly. As we will see, this function creates a set of interrupt handlers and installs them so that any exceptions (divide by 0, page fault, etc) that are generated will trigger GDB on the remote computer.

After that, two commands named gdbstub and gdbstub_break are registered with GRUB. If the GRUB user issues one of these commands, the corresponding functions are executed.

The first command, gdbstub, attaches a specified serial port to the GDB module so that the remote GDB session can communicate with this computer.

The second command, gdbstub_break, simply raises a debug interrupt on the system by calling the function grub_gdb_breakpoint after some error checking [2]:

void
grub_gdb_breakpoint (void)
{
  asm volatile ("int $3");
}


This works just fine because grub_gdb_idtinit has registered a handler for the debug interrupt.

## entering the rabbit hole: grub_gdb_idtinit

The grub_gdb_idtinit function which is called during initialization is pretty straightforward. It simply creates interrupt descriptor table (IDT) entries which point at interrupt handlers for interrupt numbers 0 through 31. The basic idea here is that something bad happens (page fault, general protection fault, divide by zero, …) and the CPU calls a handler function to report the exception or error condition.

You can read more about interrupt and exception handling on the Intel 64 and IA-32 CPUs by reading the Intel® 64 and IA-32 Architectures: Software Developer’s Manual volume 3A, chapter 6 available from Intel here.

Take a look at the C code for grub_gdb_idtinit [3], paying close attention to the for loop:

/* Set up interrupt and trap handler descriptors in IDT.  */
void
grub_gdb_idtinit (void)
{
  int i;
  grub_uint16_t seg;

  asm volatile ("xorl %%eax, %%eax\n"
                "mov %%cs, %%ax\n" :"=a" (seg));

  for (i = 0; i <= GRUB_GDB_LAST_TRAP; i++)
    {
      grub_idt_gate (&grub_gdb_idt[i],
                     grub_gdb_trapvec[i], seg,
                     GRUB_CPU_TRAP_GATE, 0);
    }

  grub_gdb_idt_desc.limit = sizeof (grub_gdb_idt) - 1;
  asm volatile ("sidt %0" : : "m" (grub_gdb_orig_idt_desc));
  asm volatile ("lidt %0" : : "m" (grub_gdb_idt_desc));
}


You'll notice that this function maps interrupt numbers to handler function addresses in a for-loop. The function addresses come from an array named grub_gdb_trapvec.

The grub_idt_gate function called above simply constructs the interrupt descriptor table entry, given:

• a memory location for the entry to live (above: grub_gdb_idt[i])
• the address of the handler function from the grub_gdb_trapvec array (above: grub_gdb_trapvec[i])
• the segment selector (above: seg)
• and finally the gate type (above: GRUB_CPU_TRAP_GATE) and privilege bits (above: 0)

Note that the last two inline assembly statements store the existing IDT descriptor (sidt) and load the new IDT descriptor (lidt), respectively.
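To make the gate construction a bit more concrete, here is a rough sketch of packing a 32-bit protected-mode IDT gate. This is not GRUB's grub_idt_gate: the struct idt_gate layout follows the Intel manual, but the names make_idt_gate, GATE_PRESENT, and GATE_TRAP32 are made up for illustration.

```c
#include <stdint.h>

/* One 32-bit protected-mode IDT gate descriptor (8 bytes). */
struct idt_gate {
    uint16_t offset_lo;   /* handler address, bits 0..15 */
    uint16_t selector;    /* code segment selector */
    uint8_t  reserved;    /* always zero */
    uint8_t  type_attr;   /* present bit, privilege level, gate type */
    uint16_t offset_hi;   /* handler address, bits 16..31 */
} __attribute__((packed));

#define GATE_PRESENT 0x80u   /* P bit */
#define GATE_TRAP32  0x0Fu   /* 32-bit trap gate type */

/* Hypothetical stand-in for grub_idt_gate; illustrative only. */
void make_idt_gate(struct idt_gate *gate, uint32_t handler,
                   uint16_t selector, uint8_t type, uint8_t dpl)
{
    gate->offset_lo = (uint16_t)(handler & 0xffffu);
    gate->selector  = selector;
    gate->reserved  = 0;
    gate->type_attr = (uint8_t)(GATE_PRESENT | ((dpl & 3u) << 5) | (type & 0x0fu));
    gate->offset_hi = (uint16_t)(handler >> 16);
}
```

The handler address gets split across the first and last 16 bits of the descriptor, which is why a helper like this is nicer than filling the entries in by hand 32 times.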

Naturally, the next question is: where do the function addresses in grub_gdb_trapvec come from and what, exactly, do those handler functions do when executed?

## grub_gdb_trapvec: a series of clever macros

It turns out that grub_gdb_trapvec is an array which is constructed through a series of really fucking sexy macros in an assembly source file.

Let's first examine grub_gdb_trapvec [4]:

/* some things removed for brevity */
.data VECTOR
VARIABLE(grub_gdb_trapvec)
ent EC_ABSENT,  0, 7
ent EC_PRESENT, 8
ent EC_ABSENT,  9
ent EC_PRESENT, 10, 14
ent EC_ABSENT,  15, GRUB_GDB_LAST_TRAP


This code creates a global symbol named grub_gdb_trapvec in the data section of the compiled object. The contents of grub_gdb_trapvec are constructed by a series of invocations of the ent macro.

Let's take a look at the ent macro (I removed some Apple specific code for brevity) and go through it piece by piece [5]:

.macro ent ec beg end=0
#define EC \ec
#define BEG \beg
#define END \end

.text
1:
.if EC
add $4, %esp
.endif


This is the start of the ent macro.

This code creates a macro named ent and gives names to the arguments handed over to the macro. It assigns a default value of 0 to end, the third argument. After that, it uses C preprocessor macros named EC, BEG, and END. This is done to assist with cross-platform builds of this source (specifically for dealing with OSX weirdness).

Next, some code is added to the text section of the object. The start of the code is going to be given the label 1, so that it can be easily referred to later. This label is what will be used to automatically fill in the addresses of the assembly code stubs a little later.

Finally, the add $4, %esp code is included in the assembled object only if EC is non-zero.

EC is the first argument to the ent macro which could either be EC_ABSENT (0) or EC_PRESENT (1) as you saw above. EC stands for "error code." Some interrupts/exceptions that can be generated put an error code on the stack when they occur. If this interrupt/exception places an error code on the stack, this line of code is adding to the stack pointer (remember: on x86 the stack grows down, from higher addresses to lower addresses) to position the stack pointer above the error code. This is done to ensure the stack is at the same position regardless of whether or not an error code is inserted on the stack. In the case where an error code does exist, it is ignored by the code below.
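If it helps, the EC_PRESENT / EC_ABSENT split can be written down as a tiny C predicate. This is only a sketch that mirrors the ranges passed to ent in the table above; the function names are hypothetical and nothing like this appears in GRUB.

```c
/* 1 if the vector is in one of the EC_PRESENT ranges from the ent table. */
int trap_pushes_error_code(int vector)
{
    return vector == 8 || (vector >= 10 && vector <= 14);
}

/* Bytes the stub must skip so the stack looks the same either way. */
int error_code_bytes(int vector)
{
    return trap_pushes_error_code(vector) ? 4 : 0;
}
```

So a page fault (vector 14) needs the 4 byte adjustment, while a divide error (vector 0) does not.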

        save_context
mov     $EXT_C(grub_gdb_stack), %esp
mov     $(BEG), %eax    /* trap number */
call    EXT_C(grub_gdb_trap)
load_context
iret


This next piece of code begins by using another macro called save_context which writes out the current register values to memory. Next, the address of a piece of memory called grub_gdb_stack is written to %esp. After this instruction, all future code that runs will be using stack space backed by a section of memory named grub_gdb_stack. The interrupt number is written to the %eax register and then the C function grub_gdb_trap is called. We'll take a look at what this function does in a bit. The load_context macro does the opposite of save_context and restores all register values from memory.

Finally, an iret instruction is used to continue execution. In most cases, this instruction restores the system to a broken state where it will hang, trigger another exception, or just reboot itself depending on how many levels deep you have gotten yourself in exceptions.

        /*
* Address entry in trapvec array.
*/

.data VECTOR
.long 1b

/*
* Next... (recursion).
*/

.if END-BEG > 0
ent \ec "(\beg+1)" \end
.endif
.endm


This is the last piece of the amazing ent macro. It refers to a data section created earlier when the grub_gdb_trapvec symbol was being created and in this section the address where label 1 exists is written.

Thus, the address of the code which saves the CPU context, switches out the stack, and invokes grub_gdb_trap is written out.

The ent macro ends by re-invoking itself to generate more code in the .text section and fill in more addresses in the .data section for each interrupt/exception in the range passed in to ent as BEG and END.

Wow. A macro.
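For anyone who finds recursive assembler macros hard to picture, here is a loose C analogue of what ent does. It is purely an illustration: struct stub stands in for a generated code stub, the trapvec array stands in for grub_gdb_trapvec, and LAST_TRAP being 31 is an assumption based on the "0 through 31" range mentioned earlier.

```c
#include <stddef.h>

#define LAST_TRAP 31   /* assumed value of GRUB_GDB_LAST_TRAP */

/* Stands in for one generated assembly stub. */
struct stub { int ec; int trap; };

struct stub stubs[LAST_TRAP + 1];            /* plays the role of .text */
const struct stub *trapvec[LAST_TRAP + 1];   /* plays the role of grub_gdb_trapvec */
size_t nstubs;

/* C analogue of ".macro ent ec beg end=0". */
void ent(int ec, int beg, int end)
{
    struct stub *s = &stubs[nstubs];   /* the "1:" label */
    s->ec = ec;
    s->trap = beg;
    trapvec[nstubs++] = s;             /* ".long 1b" */
    if (end - beg > 0)                 /* ".if END-BEG > 0" */
        ent(ec, beg + 1, end);         /* recursion */
}

/* The same five invocations as the grub_gdb_trapvec definition. */
void build_trapvec(void)
{
    nstubs = 0;
    ent(0, 0, 7);
    ent(1, 8, 0);
    ent(0, 9, 0);
    ent(1, 10, 14);
    ent(0, 15, LAST_TRAP);
}
```

Note how the default end of 0 makes END-BEG negative for the single-vector invocations, so the recursion stops after exactly one stub.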

## grub_gdb_trap

The code generated by the ent macro calls grub_gdb_trap. In other words, this function is called whenever an interrupt/exception is raised while GRUB is running and the GDB module is loaded.

This function pulls data off the serial port (which you set up when you ran gdbstub in GRUB as seen earlier). The data coming in on the serial port are packets as per GDB's remote serial protocol. These packets contain commands from the remote GDB session. grub_gdb_trap parses these packets, executes the commands, and replies. So, packets are parsed and registers are updated, memory is written or read, and data is passed back over the serial port. This is what allows a remote GDB session on another computer connected via serial cable to set breakpoints, examine registers, or single step code.
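The packet framing itself is simple enough to sketch. In GDB's remote serial protocol a packet looks like $payload#XX, where XX is the sum of the payload bytes modulo 256 written as two hex digits. The helper names below are hypothetical (this is not GRUB's code), and the sketch ignores escaping and run-length encoding.

```c
#include <stdio.h>
#include <string.h>

/* Sum of all payload bytes, modulo 256. */
unsigned char rsp_checksum(const char *payload)
{
    unsigned char sum = 0;
    while (*payload)
        sum = (unsigned char)(sum + (unsigned char)*payload++);
    return sum;
}

/* Write "$<payload>#<checksum>" into out. Returns 0, or -1 if out is too small. */
int rsp_frame(const char *payload, char *out, size_t outlen)
{
    int n = snprintf(out, outlen, "$%s#%02x", payload, rsp_checksum(payload));
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

For example, framing the reply OK produces $OK#9a, since 'O' (0x4f) plus 'K' (0x4b) is 0x9a.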

## Conclusion

• GDB's remote serial protocol is very powerful.
• Likewise, knowing how to use GNU as can help you construct really clever macros to generate repetitive assembly code easily.
• Writing a C-stub to parse GDB's remote serial protocol and carry out the commands can allow you to debug weird things, even if the target system lacks multiprocess support, system calls, or a hard drive.
• Go read the GRUB source code. It's pretty interesting.

## References

1. grub-2.00/grub-core/gdb/gdb.c
2. grub-2.00/grub-core/gdb/i386/idt.c
3. grub-2.00/grub-core/gdb/i386/idt.c
4. grub-2.00/grub-core/gdb/i386/machdep.S
5. grub-2.00/grub-core/gdb/i386/machdep.S

Written by Joe Damato

November 26th, 2012 at 1:09 am

## the setup

So, you have some sort of OSX app. Maybe it’s Twitter.app, TweetDeck, or something else that has a secret stored inside the binary. You want to extract this secret, maybe because you want to impersonate the official client of a service or maybe just because you want to see if you can hack the gibson.

I don’t actually really care about Twitter clients, personally. I just wanted to see if I could rip the OAuth token out of some official clients and how long it would take me.

## strings, MITM, objdump, gdb, et al.

Not surprisingly, there are many different ways to rip data out of a binary. You can use strings to dump printable strings, play with mitmproxy, or simply reverse engineer the binary by reading objdump (or GDB or whatever) output. I’ve used all of these methods before with great success when attempting to hack the planet, but I had an idea for something a little bit more interesting that can be easily reused.

## what happens at a low level

Turns out that, at least at a low level, malloc/calloc/whatever and free usually end up getting called to allocate and deallocate the memory regions used by applications. Sure, some apps only use static memory and other apps are built in languages that have a custom allocator, but enough apps out there, once you peel away the various candy coated layers of abstraction, just end up calling the malloc and free in the vendor-provided libc on their system.

So, I assumed that TweetDeck, Twitter.app, and everyone else would be doing something like this underneath all the fancy frameworks:

/* pseudo code, obviously */
buf = malloc(N);
memcpy(buf, "secretkey", strlen("secretkey") + 1); /* + 1 copies the terminating NUL */
/* some functions that talk to the api server and do other stuff */
free(buf);

## malloc shims

Many malloc implementations provide an interface for the user to create custom shim functions to execute in place of the system-provided malloc/calloc/realloc/free functions. These shim interfaces are useful for many reasons, including but not limited to memory profilers, leak checkers, and other useful debugging tools.

## abuse

What if I abuse malloc’s shim interface on OSX and provide a free function that prints the contents of every buffer it is supposed to free before actually freeing it?

## shim code

About 50 lines of horrible C code (also available here.):

#include <malloc/malloc.h>
#include <string.h>

/* pointers to the original zone functions, saved by my_init */
void (*real_free)(malloc_zone_t *zone, void *ptr);
void (*real_free_definite_size)(malloc_zone_t *zone, void *ptr, size_t size);

void my_free(malloc_zone_t *zone, void *ptr)
{
  char *tmp = ptr;
  char tmp_buf[1025] = {0};
  size_t total = 0;

  /* lol its fine */
  /* free(NULL) is legal, so never dereference a NULL pointer here */
  while (tmp != NULL && *tmp != '\0') {
    tmp_buf[total] = *tmp;
    total++;
    if (total == 1024)
      break;
    tmp++;
  }

  malloc_printf("%s\n", tmp_buf);
  real_free(zone, ptr);
}

void my_free_definite_size(malloc_zone_t *zone, void *ptr, size_t size)
{
  char tmp_buf[1024] = {0};

  /* copy at most 1023 bytes so the buffer stays NUL terminated */
  if (ptr != NULL) {
    if (size < 1024) {
      memcpy(tmp_buf, ptr, size);
    } else {
      memcpy(tmp_buf, ptr, 1023);
    }
  }

  malloc_printf("%s\n", tmp_buf);
  real_free_definite_size(zone, ptr, size);
}

/* runs automatically when the dylib is loaded */
void __attribute__((constructor)) my_init() {
  malloc_zone_t *zone = malloc_default_zone();

  /* save the addresses of the REAL free functions */
  real_free = zone->free;
  real_free_definite_size = zone->free_definite_size;

  /* replace them with my shims */
  zone->free_definite_size = my_free_definite_size;
  zone->free = my_free;
}


## insertion

All you have to do is build a dylib of the above C code and insert it like this:

% DYLD_INSERT_LIBRARIES="mallshim.dylib" /Applications/Twitter.app/Contents/MacOS/Twitter


And, boom. A lot of strings will get printed out, so you should use grep to help sift through the output.

## output

Let’s see what happens if we insert this little guy into TweetDeck and Twitter.app:

% DYLD_INSERT_LIBRARIES="mallshim.dylib" /Applications/Twitter.app/Contents/MacOS/Twitter 2>&1| egrep -i "oauth_token|oauth_consumer|oauth_timestamp|oauth_nonce" --color=auto

(I censored out some of the good stuff)

### TweetDeck

DYLD_INSERT_LIBRARIES="mallshim.dylib" /Applications/TweetDeck.app/Contents/MacOS/TweetDeck -psn_0_12827707 2>&1 | egrep -i "oauth_token|oauth_consumer|oauth_timestamp|oauth_nonce" --color=auto

## arms race

There isn’t much the app can do about this sort of hack. I mean, sure, the app could zero the memory before freeing it. But, then I’ll just use GDB or a hexeditor or whatever to disable the call to memset. So on and so forth.
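For completeness, the zero-before-free countermeasure could look something like the sketch below. The helper names wipe and wipe_and_free are made up, and this only raises the bar a little: whoever controls the machine can still patch the wipe out.

```c
#include <stdlib.h>
#include <string.h>

/* Zero a buffer through a volatile pointer so the stores are not optimized away. */
void wipe(void *buf, size_t len)
{
    volatile unsigned char *p = buf;
    while (len--)
        *p++ = 0;
}

/* Hypothetical helper: wipe the buffer, then hand it back to the allocator. */
void wipe_and_free(void *buf, size_t len)
{
    if (buf == NULL)
        return;
    wipe(buf, len);
    free(buf);
}
```

A plain memset right before free is a classic thing for compilers to delete as a dead store, which is why the sketch writes through a volatile pointer instead.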

If you ship a binary to a person’s computer and that binary has a secret embedded in it, that secret will eventually be discovered.

## other apps

What other interesting strings fall out of OSX apps you use every day?

Written by Joe Damato

August 20th, 2012 at 9:14 am

## tl;dr

This post describes the relatively undocumented API for debuggers (or other low level programs) that can be used to enumerate the existing threads in a process and receive asynchronous notifications when threads are created or destroyed. This API also provides asynchronous notifications of other interesting thread-related events and feels very similar to the interface exposed by libdl for notifying debuggers when libraries are loaded dynamically at run time.

## amd64 and gnu syntax

As usual, everything below refers to amd64 unless otherwise noted. Also, all assembly is in AT&T syntax.

## software breakpoints

It’s important to begin first by examining how software breakpoints work. We’ll see shortly why this is important, but for now just trust me.

A debugger sets a software breakpoint by using the ptrace system call to write a special instruction into a target process’ address space. That instruction raises software interrupt #3 which is defined as the Breakpoint Exception in the Intel 64 Architecture Developers Manual [1]. When this interrupt is raised, the processor undergoes a privilege level change and calls a function specified by the kernel to handle the exception.

The exception handler in the kernel executes to deliver the SIGTRAP signal to the process. However, if a debugger is attached to a process with ptrace, all signals are first delivered to the debugger. In the case of SIGTRAP, the debugger can examine the list of breakpoints set by the user and take the appropriate action (draw a UI, update the console, or whatever).

The debugger finishes up by masking this signal from the process it is attached to, preventing that process from being killed (most processes will not have a signal handler for SIGTRAP).

In practice most binaries generated by compilers will not have this instruction; it is up to the debugger to write this instruction into the process’ address space during runtime. If you are so inclined, you can raise interrupt #3 via inline assembly or by calling an assembly stub yourself. Many debuggers will catch this signal and trigger an update of some form in the UI.

All that said, this is what the instruction looks like when disassembled:

int 0x03
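On x86, debuggers typically use the one-byte encoding of this instruction, opcode 0xCC (int3). Since ptrace reads and writes a word at a time, setting a breakpoint boils down to swapping one byte inside a word. Here is a minimal sketch of just that bit twiddling, with hypothetical helper names and without the actual ptrace calls:

```c
#include <stdint.h>

#define INT3_OPCODE 0xCCu   /* one-byte x86 breakpoint instruction */

/*
 * Given the word read from the breakpoint address, return the word to
 * write back. x86 is little-endian, so the byte at the lowest address
 * is the low byte of the word.
 */
uint64_t word_with_breakpoint(uint64_t original_word)
{
    return (original_word & ~(uint64_t)0xff) | INT3_OPCODE;
}

/* The original byte, which must be restored when the breakpoint is removed. */
uint8_t saved_byte(uint64_t original_word)
{
    return (uint8_t)(original_word & 0xff);
}
```

A debugger would read the word with PTRACE_PEEKTEXT, stash saved_byte somewhere, and write word_with_breakpoint back with PTRACE_POKETEXT.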


You may find it useful to check out an earlier and more in-depth article I wrote a while ago about signal handling.

## Enumerating threads when first attaching

When a debugger first attaches to a program, the program has an unknown number of threads that must be enumerated. glibc exposes a straightforward API for this called td_ta_thr_iter [2], found in glibc at nptl_db/td_ta_thr_iter.c. This function takes a callback as one of its arguments. The callback is called once per thread and is passed a handle to an object describing each thread in the process.

We can see the code in GDB [3] which uses this API to hand over a callback which will be hit to enumerate the existing threads in a process:
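A stripped down version of that callback dance might look like the sketch below. It assumes glibc's thread_db.h is available and that you already got a pointer to td_ta_thr_iter out of dlsym. The names td_ta_thr_iter_fn, count_thread_cb, and count_threads are hypothetical.

```c
#include <signal.h>
#include <stddef.h>
#include <thread_db.h>

/* Signature of td_ta_thr_iter, for a pointer obtained with dlsym. */
typedef td_err_e (*td_ta_thr_iter_fn)(const td_thragent_t *ta,
                                      td_thr_iter_f *callback, void *cbdata,
                                      td_thr_state_e state, int ti_pri,
                                      sigset_t *ti_sigmask,
                                      unsigned int ti_user_flags);

/* Called once per thread; returning 0 tells the library to keep iterating. */
int count_thread_cb(const td_thrhandle_t *th, void *data)
{
    (void)th;            /* a real debugger would inspect this handle */
    ++*(int *)data;
    return 0;
}

/* Hypothetical wrapper: count the threads the library can currently see. */
td_err_e count_threads(td_ta_thr_iter_fn iter, const td_thragent_t *ta,
                       int *count)
{
    *count = 0;
    return iter(ta, count_thread_cb, count, TD_THR_ANY_STATE,
                TD_THR_LOWEST_PRIORITY, TD_SIGNO_MASK, TD_THR_ANY_USER_FLAGS);
}
```

In a real tool the callback would stash each handle (or call something like td_thr_get_info on it) instead of just counting.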

static int
td_err_e *errp)
{
volatile struct gdb_exception except;
struct callback_data data;
td_err_e err = TD_ERR;

data.info = info;

{
&data,
TD_THR_ANY_STATE,
TD_THR_LOWEST_PRIORITY,
TD_THR_ANY_USER_FLAGS);
}
/* ... */


That’s pretty straightforward, but there are some hairy race conditions, as we can see in this code snippet from thread_db_find_new_threads_2 which calls find_new_threads_once:

if (until_no_new)
{
/* Require 4 successive iterations which do not find any new threads.
The 4 is a heuristic: there is an inherent race here, and I have
seen that 2 iterations in a row are not always sufficient to
...  */
for (i = 0, loop = 0; loop < 4; ++i, ++loop)
if (find_new_threads_once (info, i, NULL) != 0)
/* Found some new threads.  Restart the loop from beginning.  */
loop = -1;
}


It's fiiiiiiiiiinnnneeee.

Now, on to the more interesting interface that is, IMHO, much less straightforward.

A debugger can also gather thread create and destroy events through an interesting asynchronous interface. Let's go step by step and see how a debugger can listen for create and destroy events.

First, process wide event notification has to be enabled. This API looks very much like some pieces of the signal API. First we have to create a set of events we care about (from GDB [4]):

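static void
enable_thread_event_reporting (void)
{
  td_thr_events_t events;
  td_err_e err;

  /* ... */

  /* Set the process wide mask saying which events we're interested in.  */
  td_event_emptyset (&events);
  td_event_addset (&events, TD_CREATE);

  /* ... */

  td_event_addset (&events, TD_DEATH);

  /* NB: the following is just a pointer to the function td_ta_set_event on linux */
  err = info->td_ta_set_event_p (info->thread_agent, &events);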

The above code adds TD_CREATE and TD_DEATH to the (empty) set of events that GDB wants to get notifications about. Then the event mask is handed over to glibc with a call to the function td_ta_set_event, which just happens to be stored in a function pointer named td_ta_set_event_p in GDB.

The next step is interesting.

The debugger must use an API to get the addresses of functions that will be called whenever a thread is created or destroyed. The debugger will then set a software breakpoint at those addresses. When the program creates a thread or a thread is killed, the breakpoint will be triggered, and the debugger can walk the thread list and update its internal state that describes the threads in the process.

This API is td_ta_event_addr. Let's check out how GDB uses this API. This code is from the same function as above, but happens after the code shown above:

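static void
enable_thread_event_reporting (void)
{

  /* ... code above here ... */

  /* Delete previous thread event breakpoints, if any.  */
  remove_thread_event_breakpoints ();
  info->td_create_bp_addr = 0;
  info->td_death_bp_addr = 0;

  /* Set up the thread creation event.  */
  err = enable_thread_event (TD_CREATE, &info->td_create_bp_addr);

  /* ... */

  /* Set up the thread death event.  */
  err = enable_thread_event (TD_DEATH, &info->td_death_bp_addr);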

GDB's helper function enable_thread_event is pretty straightforward:

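static td_err_e
enable_thread_event (int event, CORE_ADDR *bp)
{
  td_notify_t notify;
  td_err_e err;

  /* Access an lwp we know is stopped.  */
  info->proc_handle.ptid = inferior_ptid;

  /* ... */

  /* Set up the breakpoint.  */
  gdb_assert (exec_bfd);
  (*bp) = (gdbarch_convert_from_func_ptr_addr
           (target_gdbarch,
            /* Do proper sign extension for the target.  */
            (bfd_get_sign_extend_vma (exec_bfd) > 0
             ? (CORE_ADDR) (intptr_t) notify.u.bptaddr
             : (CORE_ADDR) (uintptr_t) notify.u.bptaddr),
            &current_target));
  create_thread_event_breakpoint (target_gdbarch, *bp);

  return TD_OK;
}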


So, GDB stores the addresses of the functions that get called on TD_CREATE and TD_DEATH in td_create_bp_addr and td_death_bp_addr, respectively, and sets breakpoints on these addresses in enable_thread_event.

### Check if the event has been triggered and drain the event queue

The next time a thread is stopped because a breakpoint has been hit, the debugger needs to check if the breakpoint occurred on an address that is associated with the registered events. If so, the thread event queue needs to be drained with calls to td_ta_event_getmsg, and each thread's information can be retrieved with a call to td_thr_get_info.

GDB does all this in a function called check_event:

/* Check if PID is currently stopped at the location of a thread event
   breakpoint location.  If it is, read the event message and act upon
   the event.  */

static void
check_event (ptid_t ptid)
{
  /* ... */
  td_event_msg_t msg;
  td_thrinfo_t ti;
  td_err_e err;
  int loop = 0;

  /* Bail out early if we're not at a thread event breakpoint.  */
  stop_pc = /* ... */;
  if (stop_pc != info->td_create_bp_addr
      && stop_pc != info->td_death_bp_addr)
    return;

  /* Access an lwp we know is stopped.  */
  info->proc_handle.ptid = ptid;

  /* ... */

  /* If we are at a create breakpoint, we do not know what new lwp
     was created and cannot specifically locate the event message for it.
     We have to call td_ta_event_getmsg() to get
     the latest message.  Since we have no way of correlating whether
     the event message we get back corresponds to our breakpoint, we must
     loop and read all event messages, processing them appropriately.
     This guarantees we will process the correct message before continuing
     from the breakpoint.

     Currently, death events are not enabled.  If they are enabled,
     the death event can use the td_thr_event_getmsg() interface to
     get the message specifically for that lwp and avoid looping
     below.  */

  loop = 1;

  do
    {
      err = info->td_ta_event_getmsg_p (info->thread_agent, &msg);
      /* ... */

      err = info->td_thr_get_info_p (msg.th_p, &ti);
      /* ... */

      ptid = ptid_build (GET_PID (ptid), ti.ti_lid, 0);

      switch (msg.event)
        {
        case TD_CREATE:
          /* ... */
          break;

        case TD_DEATH:
          /* ... */
          break;

        default:
          /* ... */
          break;
        }
    }
  while (loop);
}


And that is how GDB finds out about existing threads and gets notified about new threads being created or existing threads dying. This asynchronous breakpoint interface is very similar to the interface exposed by libdl that I described briefly toward the end of a blog post I wrote a while ago.

## Notifications for other interesting events

The API defines other interesting events, but glibc does not currently implement them. A motivated programmer could build a shim which implements these events. Doing so would allow you to build some very interesting visualization applications for lock contention and scheduling:

/* Events reportable by the thread implementation.  */
typedef enum
{
  TD_ALL_EVENTS,                 /* Pseudo-event number.  */
  TD_EVENT_NONE = TD_ALL_EVENTS, /* Depends on context.  */
  TD_READY,                      /* Is executable now. */
  TD_SLEEP,                      /* Blocked in a synchronization obj.  */
  TD_SWITCHTO,                   /* Now assigned to a process.  */
  TD_SWITCHFROM,                 /* Not anymore assigned to a process.  */
  TD_LOCK_TRY,                   /* Trying to get an unavailable lock.  */
  TD_CATCHSIG,                   /* Signal posted to the thread.  */
  TD_IDLE,                       /* Process getting idle.  */
  TD_CREATE,                     /* New thread created.  */
  TD_DEATH,                      /* Thread terminated.  */
  TD_PREEMPT,                    /* Preempted.  */
  TD_PRI_INHERIT,                /* Inherited elevated priority.  */
  TD_REAP,                       /* Reaped.  */
  TD_CONCURRENCY,                /* Number of processes changing.  */
  TD_TIMEOUT,                    /* Conditional variable wait timed out.  */
  TD_MIN_EVENT_NUM = TD_READY,
  TD_MAX_EVENT_NUM = TD_TIMEOUT,
  TD_EVENTS_ENABLE = 31          /* Event reporting enabled.  */
} td_event_e;


## Take my shovel and flashlight and go look around

Check the reference section below, which has links to some of the source files mentioned above. Also, be sure to check out the header file:

/usr/include/thread_db.h

That header lists the exported functions from glibc as well as the various flags and types necessary for interacting with this interface.

## Conclusion

• Debuggers have really interesting ways of interacting with lower level system libraries.
• Comments found tucked away in these pits of despair are pretty amazing.
• Don't be scared. Grab a shovel and see what other interesting things you can dig up in glibc or elsewhere.

## References

Written by Joe Damato

July 2nd, 2012 at 7:30 am

## tl;dr

This post is going to explain a serious design flaw of the object system used in MRI/REE/YARV. This flaw causes seemingly random segfaults and other hard to track corruption. One popular incarnation of this bug is the “rake aborted! not in gzip format.”

## theme song

This blog post was inspired by one of my favorite Papoose verses. If you don’t listen to this while reading, you probably won’t understand what I’m talking about: get in the zone.

## rake aborted! not in gzip format [BUG] Segmentation fault

If you’ve seen either of these error messages you are hitting a fundamental flaw of the object model in MRI/YARV. An example of a fix for a single instance of this bug can be seen in this patch. Let’s examine this specific patch so that we can gain some understanding of the general case.

FACT: What you are about to read is absolutely not a compiler bug.

## A small, but important piece of background information

The amd64 ABI states that some registers are caller saved, while others are callee saved. In particular, the register rax is caller saved. The callee will overwrite the value in this register to store its return value for the caller, so if the caller cares about what is stored in this register, it must copy the value elsewhere prior to a function call.

## stare into the abyss part 1

Let’s look at the C code for gzfile_read_raw_ensure WITHOUT the fix from above:

#define zstream_append_input2(z,v)\
    zstream_append_input((z), (Bytef*)RSTRING_PTR(v), RSTRING_LEN(v))

static int
gzfile_read_raw_ensure(struct gzfile *gz, int size)
{
    VALUE str;

    while (NIL_P(gz->z.input) || RSTRING_LEN(gz->z.input) < size) {
        str = gzfile_read_raw(gz);
        if (NIL_P(str)) return Qfalse;
        zstream_append_input2(&gz->z, str);
    }
    return Qtrue;
}


It looks relatively sane at first glance, but to understand this bug we’ll need to examine the assembly generated for this thing. I’m going to rearrange the assembly a bit to make it easier to follow and add a few comments along the way.

First, the code begins by setting the stage:

  push   %rbp
  movslq %esi,%rbp    # sign extend "size" into rbp
  push   %rbx
  mov    %rdi,%rbx    # rbx = gz
  sub    $0x8,%rsp    # make room on the stack for "str"

The above is pretty basic. It is your typical amd64 prologue. After things are all set up, it is time to enter the while loop in the C code above:

  jmp    1180         # JUMP IN to the loop

Next comes the NIL_P(gz->z.input) portion of the while-loop condition:

  mov    0x18(%rbx),%rax    # rax = gz->z.input
  cmp    $0x4,%rax          # in Ruby, nil is represented as 4.
  je     1190 [gzfile_read_raw_ensure+0x30]  # if gz->z.input is nil, enter the loop


Now the RSTRING_LEN(gz->z.input) < size portion:

  cmp    %rbp,0x10(%rax)                     # compare size and gz->z.input->len
  jge    11b0 [gzfile_read_raw_ensure+0x50]  # jump out of loop
                                             # if gz->z.input->len is >= size


Next comes the call to gzfile_read_raw and the NIL_P(str) check. If this check fails, the code just falls through and exits the loop:

  mov    %rbx,%rdi    # rdi = gz, rdi holds the first argument to a function.
  cmp    $0x4,%rax    # compare return value (%rax) to nil


The return value of gzfile_read_raw (an address of a ruby object) is stored in rax.

And finally, the good stuff. The call to zstream_append_input:

  mov    0x10(%rax),%rdx # RSTRING_LEN(v) as 3rd arg
mov    0x18(%rax),%rsi # RSTRING_PTR(v) as 2nd arg
mov    %rbx,%rdi       # set gz->z as the 1st arg
callq  10e0 [zstream_append_input]  # let it rip


Note that the arguments to zstream_append_input are moved into registers by offsetting from rax, and that when the call to zstream_append_input occurs, the ruby object returned from gzfile_read_raw is still stored in rax and not written to its slot on the stack because the extra write is unnecessary.

## stare into the abyss part 2

Alright, so the patch changes the zstream_append_input2 macro to this:

#define zstream_append_input2(z,v)\
RB_GC_GUARD(v),\
zstream_append_input((z), (Bytef*)RSTRING_PTR(v), RSTRING_LEN(v))


And, RB_GC_GUARD is defined as:

#define RB_GC_GUARD_PTR(ptr) \
__extension__ ({volatile VALUE *rb_gc_guarded_ptr = (ptr); rb_gc_guarded_ptr;})

#define RB_GC_GUARD(v) (*RB_GC_GUARD_PTR(&(v)))


That code is just a hack to access the memory location holding v through a pointer marked with the volatile type qualifier. This tells the compiler that the memory backing v acts in ways that the compiler is too stupid to understand, so the compiler must ensure that reads and writes to this location are not optimized out.

A common usage of this qualifier is for memory mapped registers. Reads from memory mapped registers should not be optimized away since a hardware device may update the value stored at that location. The compiler wouldn't know when these updates could happen so it must make sure to re-read the value from this memory location when it is needed. Similarly, writes to memory mapped registers may modify the state of a hardware device and should not be optimized away.

Most of the code generated with the patch applied is the same as without except for a few slight differences before zstream_append_input is called. Let's take a look:

  mov    %rax,-0x18(%rbp)    # write str to the stack
mov    -0x18(%rbp),%rax    # read the value in str back to rax
mov    0x10(%rcx),%rdx      # RSTRING_LEN(v)
mov    0x18(%rcx),%rsi       # RSTRING_PTR(v)
mov    %rbx,%rdi                # z
callq  1f60 [_zstream_append_input]


The key difference is that the return value of gzfile_read_raw is written back to its memory location (which, in this case, happens to be on the stack and is called str).

## the bug

The bug is triggered because:

1. The address of the ruby object str is stored in a caller saved register, rax.
2. The callee (zstream_append_input) does not save the value of rax (it is not required to) and rax is overwritten in the function, leaving no references to the ruby object returned by gzfile_read_raw.
3. The callee (zstream_append_input) eventually calls rb_newobj. rb_newobj may trigger a GC run, if there are no available objects on the freelist.
4. The GC run finds the object returned by gzfile_read_raw but sees no references to it and frees the memory associated with it.
5. The freed object is used as if it were valid, and memory corruption occurs, causing the VM to explode.

The patch prevents this bug from happening because:

1. The address of the ruby object str is stored in a caller saved register, rax.
2. The volatile type qualifier causes the compiler to generate code which writes the return value back into its memory location on the stack.
3. The callee (zstream_append_input) eventually calls rb_newobj. rb_newobj may trigger a GC run, if there are no available objects on the freelist.
4. The GC run finds the object returned by gzfile_read_raw and finds a reference to it and therefore does not free it.
5. Everyone is happy.

## The general case

Given valid C code, gcc will generate machine instructions that correctly do what you want. Of course, there are bugs in gcc just like any other piece of software. The problem in this case is not gcc. The problem is that the object and garbage collection implementations in REE/MRI/YARV are not valid C code, so it is not possible for gcc to generate machine instructions that do the right thing. In other words, Ruby's object and GC implementations are breaking their contract with gcc.

The end result is the need for shit like RB_GC_GUARD in REE/MRI/YARV and also in Ruby gems to selectively paper over valid gcc optimizations. Having an API that might cause the Ruby VM to fucking explode unless you proactively mark things with RB_GC_GUARD is not on the path of least resistance toward building a maintainable, safe, and performant system. Very few people out there know that the volatile type qualifier exists, let alone what it does. Essentially, this means that authors of Ruby gems must understand how GC works in the VM to prevent their gems from causing GC to break the universe.

That is fucking beyond stupid.

## How to detect this bug class

This could be detected by building a simple static analysis tool. You won't catch 100% of cases, and you will definitely have false positives, but it is better than nothing. Something like this should work:

1. Build a call digraph of the VM and/or the set of gems you care about.
2. Find all paths leading to the rb_newobj sink.
3. Find all paths which call rb_newobj, but do not save rax prior to making another function call which is also on a path to rb_newobj.
4. The functions found are very likely to be causing corruption. A human will need to examine the found cases to weed out false positives and to fix the code.

If you have found yourself wondering "who the fuck would write such a test?", it is important for you to note that rtld in Linux does not save the SSE registers (which are supposed to be caller saved) prior to entering the fixup function. However, to ensure that such an optimization does not cause the fucking universe to come crashing down, a test ships with the code to run objdump after building the binary. The objdump output is then grepped for any instructions which might modify the SSE registers. As long as no one touches the SSE registers, there is no need to save and restore them.

If Ruby's object and GC subsystems want to prevent the universe from exploding, they must supply an equivalent test to ensure that corruption is impossible.

## Conclusion

• MRI/YARV/REE are inherently fatally flawed.
• I'm never writing another Ruby-related blog post.
• I'm not a Ruby programmer.