time to bleed by Joe Damato

technical ramblings from a wanna-be unix dinosaur

6 Line EventMachine Bugfix = 2x faster GC, +1300% requests/sec

View Comments




Nothing is possible without lunch

So Aman Gupta (tmm1) and I were eating lunch at the Oaxacan Kitchen on Tuesday and as usual, we were talking about scaling Ruby. We got into a small debate about which phase of garbage collection took the most CPU time.

Aman’s claim:

  • The mark phase, specifically the stack marking phase because of the huge stack frames created by rb_eval

My claim:

  • The sweep phase, because every single object has to be touched and some freeing happens.

I told Aman that I didn’t believe the stack frames were that large, and we bet on how big we thought they would be. Couldn’t be more than a couple kilobytes, could it? Little did we know how wrong our estimates were.

Quick note about Ruby’s GC

Ruby MRI has a mark-and-sweep garbage collector. As part of the mark phase, it scans the process stack. This is required because a pointer to a Ruby object can be passed to a C extension (like Eventmachine, or Hpricot, or whatever). If that happens, it isn’t safe to free the object yet. So Ruby does a simple scan and checks if each word on the stack is a pointer to the Ruby heap, if so, that item cannot be freed.

GDB to the rescue

We get back from lunch, launch our application, attach GDB and set a breakpoint. The breakpoint gets triggered and we see this seemingly innocuous stack trace [Note: To help with debugging, we compiled the EventMachine gem with -fno-omit-frame-pointer]:

#0 0x00007ffff77629ac in epoll_wait () from /lib/libc.so.6
#1 0x00007ffff6c0b220 in EventMachine_t::_RunEpollOnce (this=0x158d7e0) at em.cpp:461
#2 0x00007ffff6c0b86c in EventMachine_t::_RunOnce (this=0x158d7e0) at em.cpp:423
#3 0x00007ffff6c0bbd6 in EventMachine_t::Run (this=0x158d7e0) at em.cpp:404
#4 0x00007ffff6c06638 in evma_run_machine () at cmain.cpp:83
#5 0x00007ffff6c1897f in t_run_machine_without_threads (self=26066936) at rubymain.cpp:154
#6 0x000000000041d598 in call_cfunc (func=0x7ffff6c1896e , recv=26066936, len=0, argc=0, argv=0x0) at eval.c:5759
#7 0x000000000041c92f in rb_call0 (klass=26065816, recv=26066936, id=29417, oid=29417, argc=0, argv=0x0, body=0x18dba10, flags=0) at eval.c:5911
#8 0x000000000041e0ad in rb_call (klass=26065816, recv=26066936, mid=29417, argc=0, argv=0x0, scope=2, self=26066936) at eval.c:6158
#9 0x00000000004160d5 in rb_eval (self=26066936, n=0x1940330) at eval.c:3514
#10 0x00000000004150b7 in rb_eval (self=26066936, n=0x1941018) at eval.c:3357
#11 0x000000000041d196 in rb_call0 (klass=26065816, recv=26066936, id=5393, oid=5393, argc=0, argv=0x0, body=0x1941018, flags=0) at eval.c:6062
#12 0x000000000041e0ad in rb_call (klass=26065816, recv=26066936, mid=5393, argc=0, argv=0x0, scope=0, self=47127864) at eval.c:6158
#13 0x0000000000415d01 in rb_eval (self=47127864, n=0x2cf5298) at eval.c:3493
#14 0x00000000004148b2 in rb_eval (self=47127864, n=0x2cf4380) at eval.c:3223
#15 0x000000000041d196 in rb_call0 (klass=47127808, recv=47127864, id=5313, oid=5313, argc=0, argv=0x0, body=0x2cf4380, flags=0) at eval.c:6062
#16 0x000000000041e0ad in rb_call (klass=47127808, recv=47127864, mid=5313, argc=0, argv=0x0, scope=0, self=9606072) at eval.c:6158
#17 0x0000000000415d01 in rb_eval (self=9606072, n=0x194b2a0) at eval.c:3493
#18 0x00000000004148b2 in rb_eval (self=9606072, n=0x19587b0) at eval.c:3223
#19 0x000000000041072c in eval_node (self=9606072, node=0x19587b0) at eval.c:1437
#20 0x0000000000410dff in ruby_exec_internal () at eval.c:1642
#21 0x0000000000410e4f in ruby_exec () at eval.c:1662
#22 0x0000000000410e72 in ruby_run () at eval.c:1672
#23 0x000000000040e78a in main (argc=3, argv=0x7fffffffebd8, envp=0x7fffffffebf8) at main.c:48

Looks pretty normal, nothing to worry about, right?

We started checking the rb_eval frames because we assumed that those would be the largest stack frames. The rb_eval function inlines other functions and call itself recursively. So how big is one of the rb_eval frames?

(gdb) frame 10
#10 0x00000000004150b7 in rb_eval (self=26066936, n=0x1941018) at eval.c:3357
3357 result = rb_eval(self, node->nd_head);
(gdb) p $rbp-$rsp
$2 = 1904

1,904 bytes – pretty large. If all the stack frames are that large, we are looking at around 47,600 bytes. Pretty serious. Let’s verify that Ruby thinks the stack is a sane size. There is a global in the Ruby interpreter called rb_gc_stack_start. It gets set when the Ruby stack is created in Init_stack(). When Ruby calculates the stack size it subtracts the current stack pointer from rb_gc_stack_start [remember on x86_64, the stack grows from high addresses to low addresses]. Let’s do that and see how big Ruby thinks the stack is.

(gdb) p (unsigned int)rb_gc_stack_start - (unsigned int)$rsp
$3 = 802688

Wait, wait, wait. 802,688 bytes with only 23 stack frames? WTF?! Something is wrong. We started at the top and checked all the rb_eval stack frames, but none of them are larger than 2kb. We did find something quite a bit larger than 2kb, though.

(gdb) frame 1
#1 0x00007ffff6c0b220 in EventMachine_t::_RunEpollOnce (this=0x158d7e0) at em.cpp:461
461 s = epoll_wait (epfd, ev, MaxEpollDescriptors, timeout == 0 ? 5 : timeout);
(gdb) p $rbp-$rsp
$28 = 786816

Uh, the RunEpollOnce stack frame is 786,816 bytes? That’s got to be wrong. WTF?

Time to bring out the big guns.

objdump + x86_64 asm FTW

I pumped EventMachine’s shared object into objdump and captured the assembly dump:

objdump -d rubyeventmachine.so > em.S

I headed down to the RunEpollOnce function and saw the following:

2f12b: 48 81 ec 78 01 0c 00 sub $0xc0178,%rsp

Interesting. So the code is moving %rsp down by 786,808 bytes to make room for something big. So, let’s see if the EventMachine code matches up with the assembly output.

struct epoll_event ev [MaxEpollDescriptors];

Where MaxEpollDescriptors = 64*1024 and sizeof(struct epoll_event) == 12. That matches up with the assembly dump and the GDB output.

Usually, doing something like that in C/C++ is (usually) OK. Avoiding the heap whenever you can is a good idea because you avoid heap-lock contention, fragmenting the heap, and memory overhead for tracking the memory region. When writing Ruby extensions, this isn’t necessarily true. Remember, Ruby’s GC algorithm scans the entire process stack searching for references to Ruby objects. This EventMachine code causes Ruby to search an extra ~800,000 bytes drastically slowing down garbage collection.

The patch

Get the patch HERE

The patch simply moves the stack allocated struct epoll_event ev to the class definition so that it is allocated on the heap when an instance of the class is created with new. This does not change the memory usage of the process at all. It just moves the object off the stack. This makes all the difference because Ruby’s GC scans the process stack and not the process heap.

On top of all that, this patch helps with Ruby’s green threads, too. If the epoll_wait causes a Ruby event to fire and that event creates a Ruby thread, that Ruby thread gets an entire copy of the existing stack. Each time that thread is switched into and out of, that thread stack has to be memcpy’d into and out of place. Reducing those memcpys by ~800,000 bytes is a HUGE performance win. Want to learn more about threading implementations? Check out my threading models post: here.

Fixing this turned out to be pretty simple. A six (6!!) line patch:

  • Speeds up GC by 2-3x because of the huge decrease in stack frame size.
  • Fixes an open bug in EventMachine where using threads with Epoll causes lots of slowness. The reason is that each thread will inherit an ~800,000 byte stack that gets copied in and out every context switch.
  • This results in an increase from 500 requests/sec to 7000 requests/sec when using Sinatra+Thin+Epoll+Threads. That is pretty ill.

Conclusion

All in all, a productive debugging session lasting about an hour. The result was a simple patch, with 2 big performance improvements.

A couple things to take away from this experience:

  • Spend time learning your debugging tools because it pays off, especially nm, objdump, and of course GDB.
  • Getting familiar with x86_64 assembly is crucial if you hope to debug complex software and optimize it correctly.

Keep your eyes open for up-coming blog posts about x86_64 assembly! Don’t forget to subscribe via RSS or follow me on twitter

Written by Joe Damato

April 29th, 2009 at 1:36 am

  • Anon
    This is way over my head. :(
  • darekjones
    Interesting post. I have stumbled and twittered this for my friends. Hope others find it as interesting as I did.

    Darek from classifieds posting service
  • I'm always very impressed when people go to these lengths to track a problem. Very very nice work, and thanks!
  • Awesome post! I really like such a write-ups!
  • Kate
    Cool story, bro.
  • I spend most of my time programming in higher level language, but have always wanted to learn more about lower level tools like gdb, nm, objdump, etc. What resources would you recommend for learning these tools? I have tried reading guides online, but I have always found them lacking, particularly on how one goes about creating a mental model of the running system from all the data that can be seen with the tools.

    Also, how does one go about honing their assembly skills? Other than the architecture class in college where I learned MIPS, I never had the opportunity to write any assembly, not in MIPS and not in x86. Do you have any recommendations on that too?
  • Another ruby superhero saves the day. Why not use something where you don't have to debug language internals? Sounds awful productive to me.
  • Nice work. I love small changes with big results. And thanks for the quick intro to gdb.
  • Thanks! If you like slightly-larger changes with big results you should check out my other post about a fibers implementation for Ruby 1.8.7: http://timetobleed.com/fibers-implemented-for-r...

    Thanks for reading and don't forget to subscribe =]
  • Great post and a really nice job. I suffered from this and I was clueless but now I know better :)
  • Jon
    I would check to make sure that thread-switch will cause a full memcpy of the whole stack. It doesn't seem likely.
  • Why doesn't it seem likely? I've fixed a couple bugs in the Ruby threading implementation and implemented Fibers for Ruby 1.8.7. So I've messed with the threading implementation quite a bit =]

    Anyway, please see:

    http://github.com/ice799/matzruby/blob/039eaea4...

    http://github.com/ice799/matzruby/blob/039eaea4...
  • mark
    I am not sure if i understood this. Does this mean that this patches speeds up ruby's GC collection?
  • This speeds up Ruby's GC -if- you are using EventMachine. The patch fixed a bug in EventMachine which was slowing down Ruby's GC.

    Does that clear it up?
  • Just for extreme clarity:

    This patch affects ruby code running on EventMachine with the epoll reactor.

    require 'eventmachine'; EM.epoll; EM.run {} # 800k stack on linux on old EM

    There may be a similar patch coming for the kqueue reactor, but that is less affected (40k vs. 800k).

    require 'eventmachine'; EM.kqueue; EM.run {} # 40k stack on OSX

    The issue does not exist in any extreme with the select reactor.

    require 'eventmachine'; EM.run {}

    You can see the effects of the issue most drastically when using EventMachines defer method, because it spawns up 20 threads, each of which will carry a heavy stack.

    I have recently done some research on performance metrics of ruby's green thread implementation. Whilst there is a much more significant impact to ruby's performance when the stack size is large, the specific impact of the number of threads does not become noticeable until high numbers of threads (two orders of magnitude and above (100+ threads)). The choice of 20 threads seems sensible, given this research.

    Each spawned thread carries a copy of the stack, though, so if you spawn threads 'later' in your execution stack (like Thin does), then you will see extra performance degradation from these issues. As such the real highlight comes from an app like this:

    require 'eventmachine'; EM.epoll
    require 'sinatra'; def (Sinatra::Application).deferred?(env); true; end
    get '/' do escape_html Thread.current.inspect end

    Some of the performance impact can be slightly subdued by doing the following:

    EM.defer {}

    Prior to thin loading, as the 20 threads for doing work in will be sitting on a significantly smaller stack. The main eventmachine stack would still be very large though, so this 'quick fix' isn't really a fix at all.

    Some of the general impacts of stack size on ruby performance are demonstrated here:
    http://pastie.textmate.org/private/iynyujlrzorq...

    And can be generated like so:
    http://gist.github.com/104213

    Maybe I'll collaborate or separately post some ruby programmer oriented explanations of these posts soon. :)
  • terry
    Good post, but I would REALLY like to look at the six line bug fix along with commentary from the author about why this was the change he picked to fix the issue. Otherwise, very good bug hunting story...
  • tmm1 and I wrote the patch together, it is listed under the section titled "The patch".

    This blog post explains why the patch was chosen.

    Not sure I understand what you are asking.
  • terry
    @ice799 You are correct and my mistake, I missed the patch link when I originally looked over the article. However, like Ken I'm curious about how the patch works but now that I'm aware of the link I'll do some detective work of my own. :-)
  • Let me know if the explanation I gave Ken clears it up. I'll add a little more to the article itself in the "patch" section.
  • terry
    I read the explanation and it makes perfect sense now. I haven't coded much in C++ since college, mostly .NET apps nowadays, so my analysis was admittedly going slower than I would've liked. Anyway thanks for the explanation.
  • I think perhaps terry was wondering about how the patch itself works (hence the "commentary" request).

    It looks to me like the patch moves the array allocation out of _RunEpollOnce() into the header file.

    My question: why does this help? If _RunEpollOnce() is only getting called once, shouldn't it be the same as allocating in the header file? Why does this decrease the allocation size?

    Thanks!
  • AH, Ok.

    Moving it into the class definition causes the object to be allocated on the heap when an object of the class is created with new.

    The answer to your question is that it doesn't decrease the memory usage. It decreases the size of the stack because it moves the object over to the heap.

    Ruby scans the process stack, not the process heap (it scans its own internal heap which is not the same as the process heap).

    Does that make more sense? I'll add a little bit of info to the article in the "Patch" section above.
  • Ah, yes, that makes sense, thanks ice.
  • jamesaguilar
    Edit: Also, amazing writeup. Thanks for it.
  • jamesaguilar
    It's not really OK to make stack frames this large even when coding pure C/C++. If you are writing threaded code and you need to make space for stacks this big on all threads you are going to run into trouble.
  • Not necessarily. Depends on the threading implementation. If you use something like pthreads, it doesn't matter all that much - the stacks aren't getting memcopy'd in and out all the time.

    It matters for a user-land thread implementation (like Ruby's) because the process has to swap the stacks in and out.

    The memcpys are what kills you here.
  • jamesaguilar
    I think it may also depend on how many threads you're using and how many stack frames are like this, as well as your target architecture and the OS's handling of virtual memory. For example, let's assume that we have a codebase where two or three of these big stack frame functions are scattered around. In order to thread safely, that means every thread stack needs to have at least enough virtual space to fit a call to one of these functions. What if two are called in a single stack?

    So let's say you only reserve 4M per stack (which would allow a call to three of these functions on teh same call stack safely. Then you have 1000 threads each reserving this much space. So now you're looking at 4G of virtual memory just for these thread stacks. And this is assuming 4M per stack which is not very safe if this idiom is used frequently in your codebase. The program in question is also now completely unbuildable if you're targeting a 32 bit arch.

    So, no, it may not be a performance problem in a program that only runs a few threads, or a program where these frames are very rare. But when large frames are common and you use a lot of threads, it can be an issue. I'm not really an expert on this issue but I've observed other people deal with it and I know it's not as simple as, "This is fine in C/C++."
  • OK, so I'll agree that it isn't as simple as I said it was. I should be more careful; I didn't think anyone would actually read this in the first place.

    It's only a performance issue on certain threading implementations. It could be a portability issue, though.

    It depends on what your invariant is - if you just need to stack allocate a huge amount to use and toss out once your function returns, I don't see that necessarily being problematic. If your invariant is that that stack frame doesn't return for a while (for some definition of "a while") it could be a problem.

    If your argument is simply "using a lot of memory is bad" then sure I agree with that argument.

    However, stack allocating stuff is useful and *can* be done safely, even for large objects.
  • jamesaguilar
    > However, stack allocating stuff is useful and *can* be done safely, even for large objects.

    I'm not really concerned about memory use here. (Although, haha, I think we can both agree that you should attempt to conserve memory.) I'm concerned about virtual address space use.

    Do you know of a threading implementation that does do this safely? As far as I know (and again, I'm no expert), pthreads does not solve this problem. If a function that uses a lot of memory in a frame can be called from any thread, all stacks have to be spaced at a distance wide enough in virtual memory to accomodate that frame, even if it's almost never called and never simultaneously.

    I'm thinking of a large thread pool executing a variety of periodic tasks, among which is one that uses the large function or functions in question. In such a pool pthread_attr_getstacksize for each thread in the pool must be at least as large as the largest possible stack generated by a call from that pool, even if the average stack is very small. And that stack size affects the overall usage of virtual space by pthread stacks as a whole.

    You can make it better by guaranteeing that the big function is only called from one thread, then setting the stack size of that thread much larger and setting all the rest small. But in my inexperienced knowledge there is no thread implementation that will handle this detail automatically. Please let me know if I'm wrong; I'd love to find that this is something I don't have to worry about.
  • We can chat about this via e-mail if you like - this is difficult via blog comments.
  • jamesaguilar
    Sounds great. Hit me up. lastname.firstname@gmail.com
  • Jack Tang
    Well, one of my friends drop me one formula. Let x to be the rate of GC time out of total running time, and we can get

    (1-2x) * 14 = 1- x => x = 13/27

    which means now ruby spends 13/27 time on GC, fast enough? (before patch, it spends 26/27 time on GC, really bad)
  • Very cool! Great writeup.
  • Amazing! Fantastic job.
  • Jack Tang
    Well, one of my friends drop me one formula. Let x to be the rate of GC time out of total running time, and we can get

    (1-2x) * 14 = 1- x => x = 13/27

    which means now ruby spends 13/27 time on GC, fast enough? (before patch, it spends 26/27 time on GC, really bad)
  • Great writeup - thanks for sharing the gritty details!
  • awesome writeup. Thanks!
  • Thanks for reading guys!

    Don't forget to subscribe, lots of cool posts in the queue =]
  • Whoa, nice job! I'll try this ASAP on Babylon!
  • Guys, the beer (or whatever you like) is on me. Cheers!
  • Like you say, *ill*

    Really nice catch guys, well done :-)

    You forgot something though, what was the excellent brain food you ate?
blog comments powered by Disqus