time to bleed by Joe Damato

technical ramblings from a wanna-be unix dinosaur

Archive for the ‘patch’ tag

Fixing Threads in Ruby 1.8: A 2-10x performance boost


Quick notes before things get crazy

OK, things might get a little crazy in this blog post so let’s clear a few things up before we get moving.

  • I like the gritty details, and this article in particular has a lot of gritty info. To reduce the length of the article for the casual reader, I’ve put a portion of the really gritty stuff in the Epilogue below. Definitely check it out if that is your thing.
  • This article, the code, and the patches below are for Linux and OSX for the x86 and x86_64 platforms, only.
  • Even though there are code paths for both x86 and x86_64, I’m going to use the 64bit register names and (briefly) mention the 64bit binary interface.
  • Let’s assume the binary is built with -fno-omit-frame-pointer; the patches don’t care, but it’ll make the explanation a bit simpler later.
  • If you don’t know what the above two things mean, don’t worry; I got your back chief.

How threads work in Ruby

Ruby 1.8 implements pre-emptible userland threads, also known as “green threads.” (Want to know more about threading models? See this post.) The major performance killer in Ruby’s implementation of green threads is that the entire thread stack is copied to and from the heap on every context switch. Let’s take a look, at a high level, at what happens when you run:

Thread.new{
	a = []
	10000.times {
		a << "a"
		a.pop
	}
}

  1. A thread control block (tcb) is allocated in Ruby.
  2. The infamous thread timer is initialized, either as a pthread or as an itimer.
  3. Ruby scope information is copied to the heap.
  4. The new thread is added to the list of threads.
  5. The current thread is set as the new thread.
  6. rb_thread_yield is called to yield to the block you passed in.
  7. Your block starts executing.
  8. The timer interrupts the executing thread.
  9. The current thread’s state is stored:
    • memcpy() #1 (sometimes): If the stack has grown since the last save, realloc is called. If the allocator cannot extend the size of the current block in place, it may decide to move the data to a new block that is large enough. If that happens memcpy() is called to move the data over.
    • memcpy() #2 (always): A copy of this thread’s entire stack (starting from the top of the interpreter’s stack) is put on the heap.
  10. The next thread’s state is restored.
    • memcpy() #3 (always): A copy of this thread’s entire stack is placed on the stack.

Steps 9 and 10 crush performance when even small amounts of Ruby code are executed.

Many of the functions the interpreter uses to evaluate code are massive. They allocate a large number of local variables creating stack frames up to 4 kilobytes per function call. Those functions also call themselves recursively many times in a single expression. This leads to huge stacks, huge memcpy()s, and an incredible performance penalty.

If we can eliminate the memcpy()s we can get a lot of performance back. So, let’s do it.

Increase performance by putting thread stacks on the heap

[Remember: we are only talking about x86_64]

How stacks work – a refresher

Stacks grow downward from high addresses to low addresses. As data is pushed on to the stack, it grows downward. As stuff is popped, it shrinks upward. The register %rsp serves as a pointer to the bottom of the stack. When it is decremented or incremented the stack grows or shrinks, respectively. The special property of the program stack is that it will grow until you run out of memory (or are killed by the OS for being bad). The operating system handles the automatic growth. See the Epilogue for some more information about this.

How to actually switch stacks

The %rsp register can be (and is) changed and adjusted directly by user code. So all we have to do is put the address of our stack in %rsp, and we’ve switched stacks. Then we can just call our thread start function. Pretty easy. A small blob of inline assembly should do the trick:

__asm__ __volatile__ ("movq %0, %%rsp\n\t"
                      "callq *%1\n"
                      :: "r" (th->stk_base),
                         "r" (rb_thread_start_2));

Two instructions, not too bad.

  1. movq %0, %%rsp moves a quad-word (th->stk_base) into %rsp. Quad-word is Intel speak for 4 words (8 bytes), where 1 Intel word is 2 bytes.
  2. callq *%1 calls a function at the address “rb_thread_start_2.” This has a side-effect or two, which I’ll mention in the Epilogue below, for those interested in a few more details.

The above code is called once per thread. Calling rb_thread_start_2 spins up your thread and it never returns.

Where do we get stack space from?

When the tcb is created, we’ll allocate some space with mmap and set a pointer to it.

/* error checking omitted for brevity, but exists in the patch =] */
stack_area = mmap(NULL, total_size, PROT_READ | PROT_WRITE | PROT_EXEC,
			MAP_PRIVATE | MAP_ANON, -1, 0);

th->stk_ptr = th->stk_pos = stack_area;
th->stk_base = th->stk_ptr + (total_size - sizeof(int))/sizeof(VALUE *);

Remember, stacks grow downward, so that last line (th->stk_base = ...) is necessary because the base of the stack is actually at the top of the memory region returned by mmap(). The ugly math in there is for alignment, to comply with the x86_64 binary interface. Those curious about more gritty details should see the Epilogue below.

BUT WAIT, I thought stacks were supposed to grow automatically?

Yeah, the OS does that for the normal program stack. Not gonna happen for our mmap‘d regions. The best we can do is pick a good default size and export a tuning lever so that advanced users can adjust the stack size as they see fit.

BUT WAIT, isn’t that dangerous? If you fall off your stack, wouldn’t you just overwrite memory below?

Yep, but there is a fix for that too. It’s called a guard page. We’ll create a guard page below each stack that has its permission bits set to PROT_NONE. This means, if a thread falls off the bottom of its stack and tries to read, write, or execute the memory below the thread stack, a signal (usually SIGSEGV or SIGBUS) will be sent to the process.

The code for the guard page is pretty simple, too:

/* omit error checking for brevity */
mprotect(th->stk_ptr, getpagesize(), PROT_NONE);

Cool, let’s modify the SIGSEGV and SIGBUS signal handlers to check for stack overflow:

/* if the address which generated the fault is within the current thread's guard page... */
if(fault_addr <= (caddr_t)rb_curr_thread->guard &&
   fault_addr >= (caddr_t)rb_curr_thread->stk_ptr) {
  /* we hit the guard page, print out a warning to help app developers */
  rb_bug("Thread stack overflow! Try increasing it!");
}

See the epilogue for more details about this signal handler trick.

Patches

As always, this is super-alpha software.

Ruby 1.8.6 github raw .patch
Ruby 1.8.7 github raw .patch

Benchmarks

The computer language shootout has a thread test called thread-ring; let’s start with that.

require 'thread'
THREAD_NUM = 403
number = ARGV.first.to_i

threads = []
for i in 1..THREAD_NUM
   threads << Thread.new(i) do |thr_num|
      while true
         Thread.stop
         if number > 0
            number -= 1
         else
            puts thr_num
            exit 0
         end
      end
   end
end

prev_thread = threads.last
while true
   for thread in threads
      Thread.pass until prev_thread.stop?
      thread.run
      prev_thread = thread
   end
end

Results (ARGV[0] = 50000000):

Ruby 1.8.6 1389.52s
Ruby 1.8.6 w/ heap stacks 793.06s
Ruby 1.9.1 752.44s

A speed up of about 2.3x compared to Ruby 1.8.6. A bit slower than Ruby 1.9.1.

That is a pretty strong showing, for sure. Let’s modify the test slightly to illustrate the true power of this implementation.

Since our implementation does no memcpy()s, we expect the cost of context switching to stay constant regardless of thread stack size. Moreover, unmodified Ruby 1.8.6 should perform worse as thread stack size increases, because the CPU spends more and more time doing memcpy()s.

Let’s test this hypothesis by modifying thread-ring slightly so that it increases the size of the stack after spawning threads.

def grow_stack n=0, &blk
  unless n > 100
    grow_stack n+1, &blk
  else
    yield
  end
end

require 'thread'
THREAD_NUM = 403
number = ARGV.first.to_i

threads = []
for i in 1..THREAD_NUM
  threads << Thread.new(i) do |thr_num|
    grow_stack do
      while true
        Thread.stop
        if number > 0
          number -= 1
        else
          puts thr_num
          exit 0
        end
      end
    end
  end
end

prev_thread = threads.last
while true
   for thread in threads
      Thread.pass until prev_thread.stop?
      thread.run
      prev_thread = thread
   end
end

Results (ARGV[0] = 50000000):

Ruby 1.8.6 7493.50s
Ruby 1.8.6 w/ heap stacks 799.52s
Ruby 1.9.1 680.92s

A speed up of about 9.4x compared to Ruby 1.8.6. A bit slower than Ruby 1.9.1.

Now, let’s benchmark mongrel+sinatra.

require 'rubygems'
require 'sinatra'

disable :reload

set :server, 'mongrel' 

get '/' do 
  'hi' 
end

Results:

Ruby 1.8.6 1395.43 request/sec
Ruby 1.8.6 w/ heap stacks 1770.26 request/sec

An increase of about 1.26x in the most naive case possible.

Of course, if the handler did anything more than simply write “hi” (like use memcache or make sql queries) there would be more function calls, more context switches, and a much greater savings.

Conclusion

A couple lessons learned this time:

  • Hacking a VM like Ruby is kind of like hacking a kernel. Some subset of the tricks used in kernel hacking are useful in userland.
  • The x86_64 ABI is a must read if you plan on doing any low-level hacking.
  • Keep your CPU manuals close by, they come in handy even in userland.
  • Installing your own signal handlers is really useful for debugging, even if they are dumping architecture specific information.

Hope everyone enjoyed this blog post. I’m always looking for things to blog about. If there is something you want explained or talked about, send me an email or a tweet!

Don’t forget to subscribe and follow me and Aman on twitter.

Epilogue

Automatic stack growth

This can be achieved pretty easily with a little help from virtual memory and the CPU’s page fault mechanism. The idea is pretty simple. When you (or your shell on your behalf) call exec() to execute a binary, the OS will map a bunch of pages of memory for the stack and set the stack pointer of the process to the top of that memory. Once the stack space is exhausted, and the stack pointer is pushed onto un-mapped memory, a page fault will be generated.

The OS’s page fault handler (installed via the interrupt descriptor table) will fire. The OS can then check the address that generated the exception and see that you fell off the bottom of your stack. This works very similarly to the guard page idea we added to protect Ruby thread stacks. It can then just map more memory to that area, and tell your process to continue executing. Your process doesn’t know anything bad happened.

I hope to chat a little bit about interrupt and exception handlers in an upcoming blog post. Stay tuned!

callq side-effects

When a callq instruction is executed, the CPU pushes the return address on to the stack and then begins executing the function that was called. This is important because when the function you are calling executes a ret instruction, a quad-word is popped from the stack and put into the instruction pointer (%rip).

x86_64 Application Binary Interface

The x86_64 ABI is an extension of the x86 ABI. It specifies architecture programming information such as the fundamental types, caller and callee saved registers, alignment considerations and more. It is a really important document for any programmer messing with x86_64 architecture specific code.
The particular piece of information relevant to this blog post is found buried in section 3.2.2:

The end of the input argument area shall be aligned on a 16 … byte boundary.

This is important to keep in mind when constructing thread stacks. We decided to avoid messing with alignment issues, so we did not pass any arguments to rb_thread_start_2. We wanted to avoid the mathematical errors that could happen if we tried to align the memory ourselves after pushing some data. We also wanted to avoid writing more assembly than we had to, so we avoided passing the arguments in registers, too.

Signal handler trick

The signal handler “trick” to check if you have hit the guard page is made possible by the sigaltstack() system call and the POSIX sa_sigaction interface.

sigaltstack() lets us specify a memory region to be used as the stack when a signal is delivered. This is extremely important for the signal handler trick because once we fall off our thread stack, we certainly cannot expect to handle a signal using that stack space.

POSIX provides two ways for signals to be handled:

  • sa_handler interface: calls your handler and passes in the signal number.
  • sa_sigaction interface: calls your handler and passes in the signal number, a siginfo_t struct, and a ucontext_t. The siginfo_t struct contains (among other things) the address which generated the fault. Simply check this address to see if it’s in the guard page and, if so, let the user know they just overflowed their stack. Another useful, but extremely non-portable, modification added to Ruby’s signal handlers was a dump of the contents of the ucontext_t to provide useful debugging information. This structure contains the register state at the time of the signal; dumping it can help debugging by showing which values are in what registers.

Written by Joe Damato

May 18th, 2009 at 5:00 am

Fix a bug in Ruby’s configure.in and get a ~30% performance boost.


Special thanks…

Going out to Jake Douglas for pushing the initial investigation and getting the ball rolling.

The whole --enable-pthread thing

Ask any Ruby hacker how to easily increase performance in a threaded Ruby application and they’ll probably tell you:

Yo dude… Everyone knows you need to configure Ruby with --disable-pthread.

And it’s true; configure Ruby with --disable-pthread and you get a ~30% performance boost. But… why?

For this, we’ll have to turn to our handy tool strace. We’ll also need a simple Ruby program to test with. How about something like this:

def make_thread
  Thread.new {
    a = []
    10_000_000.times {
      a << "a"
      a.pop
    }
  }
end

t = make_thread 
t1 = make_thread 

t.join
t1.join

Now, let's run strace on a version of Ruby configure'd with --enable-pthread and point it at our test script. The output from strace looks like this:

22:46:16.706136 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706177 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706218 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706259 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000005>
22:46:16.706301 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706342 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706383 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706425 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706466 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>

Pages and pages and pages of sigprocmask system calls. (Actually, running with strace -c, I count about 20,054,180 calls to sigprocmask. WOW.) Running the same test script against a Ruby built with --disable-pthread shows only 3 sigprocmask calls in total, a HUGE reduction.

OK, so let's just set a breakpoint in GDB... right?

OK, so we should just be able to set a breakpoint on sigprocmask and figure out who is calling it.

Well, not exactly. You can try it, but the breakpoint won't trigger (we'll see why a little bit later).

Hrm, that kinda sucks and is confusing. This will make it harder to track down who is calling sigprocmask in the threaded case.

Well, we know that when you run configure the script creates a config.h with a bunch of defines that Ruby uses to decide which functions to use for what. So let's compare ./configure --enable-pthread with ./configure --disable-pthread:

[joe@mawu:/home/joe/ruby]% diff config.h config.h.pthread 
> #define _REENTRANT 1
> #define _THREAD_SAFE 1
> #define HAVE_LIBPTHREAD 1
> #define HAVE_NANOSLEEP 1
> #define HAVE_GETCONTEXT 1
> #define HAVE_SETCONTEXT 1


OK, now if we grep the Ruby source code, we see that whenever HAVE_[SG]ETCONTEXT are set, Ruby uses the system calls setcontext() and getcontext() to save and restore state for context switching and for exception handling (via the EXEC_TAG).

What about when HAVE_[SG]ETCONTEXT are not define'd? Well in that case, Ruby uses _setjmp/_longjmp.

Bingo!

That's what's going on! From the _setjmp/_longjmp man page:

... The _longjmp() and _setjmp() functions shall be equivalent to longjmp() and setjmp(), respectively, with the additional restriction that _longjmp() and _setjmp() shall not manipulate the signal mask...

And from the [sg]etcontext man page:

... uc_sigmask is the set of signals blocked in this context (see sigprocmask(2)) ...


The issue is that getcontext calls sigprocmask on every invocation but _setjmp does not.

BUT WAIT if that's true why didn't GDB hit a sigprocmask breakpoint before?

x86_64 assembly FTW, again

Let's fire up gdb and figure out this breakpoint-not-breaking thing. First, let's start by disassembling getcontext (snipped for brevity):

(gdb) p getcontext
$1 = {<text variable, no debug info>} 0x7ffff7825100 <getcontext>
(gdb) disas getcontext
...
0x00007ffff782517f <getcontext+127>: mov $0xe,%rax
0x00007ffff7825186 <getcontext+134>: syscall
...

Yeah, that's pretty weird. I'll explain why in a minute, but let's look at the disassembly of sigprocmask first:

(gdb) p sigprocmask
$2 = {<text variable, no debug info>} 0x7ffff7817340 <__sigprocmask>
(gdb) disas sigprocmask
...
0x00007ffff7817383 <__sigprocmask+67>: mov $0xe,%rax
0x00007ffff7817388 <__sigprocmask+72>: syscall
...

Yeah, this is a bit confusing, but here's the deal.

Recent Linux kernels implement a faster mechanism for making system calls: on x86_64 it is the syscall/sysret instruction pair (the 32-bit analogue is sysenter/sysexit). These were created because the old way (int $0x80) turned out to be pretty slow, so Intel added new instructions to execute system calls without such huge overhead.

All you need to know right now (I'll try to blog more about this in the future) is that the %rax register holds the system call number. The syscall instruction transfers control to the kernel and the kernel figures out which syscall you wanted by checking the value in %rax. Let's just make sure that sigprocmask is actually 0xe:

[joe@pluto:/usr/include]% grep -Hrn "sigprocmask" asm-x86_64/unistd.h 
asm-x86_64/unistd.h:44:#define __NR_rt_sigprocmask                     14


Bingo. It's calling sigprocmask (albeit a bit obscurely).

OK, so getcontext isn't calling sigprocmask directly, instead it replicates a bunch of code that sigprocmask has in its function body. That's why we didn't hit the sigprocmask breakpoint; GDB was going to break if you landed on the address 0x7ffff7817340 but you didn't.

Instead, getcontext reimplements the wrapper code for sigprocmask itself and GDB is none the wiser.

Mystery solved.

The patch

Get it HERE

The patch works by adding a new configure flag called --disable-ucontext that specifically prevents [sg]etcontext from being called. You use it in conjunction with --enable-pthread, like this:

./configure --disable-ucontext --enable-pthread


After you build Ruby configured like that, its performance is on par with (and sometimes slightly faster than) Ruby built with --disable-pthread, for about a 30% performance boost compared to --enable-pthread.

I added the switch because I wanted to preserve the original Ruby behavior: if you just pass --enable-pthread without --disable-ucontext, Ruby will do the old thing and generate piles of sigprocmasks.

Conclusion

  1. Things aren't always what they seem - GDB may lie to you. Be careful.
  2. Use the source, Luke. Libraries can do unexpected things, debug builds of libc can help!
  3. I know I keep saying this, assembly is useful. Start learning it today!

If you enjoyed this blog post, consider subscribing (via RSS) or following (via twitter).

You'll want to stay tuned; tmm1 and I have been on a roll the past week. Lots of cool stuff coming out!

Written by Joe Damato

May 5th, 2009 at 3:20 am

6 Line EventMachine Bugfix = 2x faster GC, +1300% requests/sec


Nothing is possible without lunch

So Aman Gupta (tmm1) and I were eating lunch at the Oaxacan Kitchen on Tuesday and as usual, we were talking about scaling Ruby. We got into a small debate about which phase of garbage collection took the most CPU time.

Aman’s claim:

  • The mark phase, specifically the stack marking phase because of the huge stack frames created by rb_eval

My claim:

  • The sweep phase, because every single object has to be touched and some freeing happens.

I told Aman that I didn’t believe the stack frames were that large, and we bet on how big we thought they would be. Couldn’t be more than a couple kilobytes, could it? Little did we know how wrong our estimates were.

Quick note about Ruby’s GC

Ruby MRI has a mark-and-sweep garbage collector. As part of the mark phase, it scans the process stack. This is required because a pointer to a Ruby object can be passed to a C extension (like EventMachine, or Hpricot, or whatever). If that happens, it isn’t safe to free the object yet. So Ruby does a simple scan and checks whether each word on the stack is a pointer into the Ruby heap; if so, that item cannot be freed.

GDB to the rescue

We get back from lunch, launch our application, attach GDB and set a breakpoint. The breakpoint gets triggered and we see this seemingly innocuous stack trace [Note: To help with debugging, we compiled the EventMachine gem with -fno-omit-frame-pointer]:

#0 0x00007ffff77629ac in epoll_wait () from /lib/libc.so.6
#1 0x00007ffff6c0b220 in EventMachine_t::_RunEpollOnce (this=0x158d7e0) at em.cpp:461
#2 0x00007ffff6c0b86c in EventMachine_t::_RunOnce (this=0x158d7e0) at em.cpp:423
#3 0x00007ffff6c0bbd6 in EventMachine_t::Run (this=0x158d7e0) at em.cpp:404
#4 0x00007ffff6c06638 in evma_run_machine () at cmain.cpp:83
#5 0x00007ffff6c1897f in t_run_machine_without_threads (self=26066936) at rubymain.cpp:154
#6 0x000000000041d598 in call_cfunc (func=0x7ffff6c1896e , recv=26066936, len=0, argc=0, argv=0x0) at eval.c:5759
#7 0x000000000041c92f in rb_call0 (klass=26065816, recv=26066936, id=29417, oid=29417, argc=0, argv=0x0, body=0x18dba10, flags=0) at eval.c:5911
#8 0x000000000041e0ad in rb_call (klass=26065816, recv=26066936, mid=29417, argc=0, argv=0x0, scope=2, self=26066936) at eval.c:6158
#9 0x00000000004160d5 in rb_eval (self=26066936, n=0x1940330) at eval.c:3514
#10 0x00000000004150b7 in rb_eval (self=26066936, n=0x1941018) at eval.c:3357
#11 0x000000000041d196 in rb_call0 (klass=26065816, recv=26066936, id=5393, oid=5393, argc=0, argv=0x0, body=0x1941018, flags=0) at eval.c:6062
#12 0x000000000041e0ad in rb_call (klass=26065816, recv=26066936, mid=5393, argc=0, argv=0x0, scope=0, self=47127864) at eval.c:6158
#13 0x0000000000415d01 in rb_eval (self=47127864, n=0x2cf5298) at eval.c:3493
#14 0x00000000004148b2 in rb_eval (self=47127864, n=0x2cf4380) at eval.c:3223
#15 0x000000000041d196 in rb_call0 (klass=47127808, recv=47127864, id=5313, oid=5313, argc=0, argv=0x0, body=0x2cf4380, flags=0) at eval.c:6062
#16 0x000000000041e0ad in rb_call (klass=47127808, recv=47127864, mid=5313, argc=0, argv=0x0, scope=0, self=9606072) at eval.c:6158
#17 0x0000000000415d01 in rb_eval (self=9606072, n=0x194b2a0) at eval.c:3493
#18 0x00000000004148b2 in rb_eval (self=9606072, n=0x19587b0) at eval.c:3223
#19 0x000000000041072c in eval_node (self=9606072, node=0x19587b0) at eval.c:1437
#20 0x0000000000410dff in ruby_exec_internal () at eval.c:1642
#21 0x0000000000410e4f in ruby_exec () at eval.c:1662
#22 0x0000000000410e72 in ruby_run () at eval.c:1672
#23 0x000000000040e78a in main (argc=3, argv=0x7fffffffebd8, envp=0x7fffffffebf8) at main.c:48

Looks pretty normal, nothing to worry about, right?

We started checking the rb_eval frames because we assumed those would be the largest stack frames. The rb_eval function inlines other functions and calls itself recursively many times. So how big is one of the rb_eval frames?

(gdb) frame 10
#10 0x00000000004150b7 in rb_eval (self=26066936, n=0x1941018) at eval.c:3357
3357 result = rb_eval(self, node->nd_head);
(gdb) p $rbp-$rsp
$2 = 1904

1,904 bytes – pretty large. If all the stack frames are that large, we are looking at around 47,600 bytes. Pretty serious. Let’s verify that Ruby thinks the stack is a sane size. There is a global in the Ruby interpreter called rb_gc_stack_start. It gets set when the Ruby stack is created in Init_stack(). When Ruby calculates the stack size it subtracts the current stack pointer from rb_gc_stack_start [remember on x86_64, the stack grows from high addresses to low addresses]. Let’s do that and see how big Ruby thinks the stack is.

(gdb) p (unsigned int)rb_gc_stack_start - (unsigned int)$rsp
$3 = 802688

Wait, wait, wait. 802,688 bytes with only 23 stack frames? WTF?! Something is wrong. We started at the top and checked all the rb_eval stack frames, but none of them are larger than 2kb. We did find something quite a bit larger than 2kb, though.

(gdb) frame 1
#1 0x00007ffff6c0b220 in EventMachine_t::_RunEpollOnce (this=0x158d7e0) at em.cpp:461
461 s = epoll_wait (epfd, ev, MaxEpollDescriptors, timeout == 0 ? 5 : timeout);
(gdb) p $rbp-$rsp
$28 = 786816

Uh, the RunEpollOnce stack frame is 786,816 bytes? That’s got to be wrong. WTF?

Time to bring out the big guns.

objdump + x86_64 asm FTW

I pumped EventMachine’s shared object into objdump and captured the assembly dump:

objdump -d rubyeventmachine.so > em.S

I headed down to the RunEpollOnce function and saw the following:

2f12b: 48 81 ec 78 01 0c 00 sub $0xc0178,%rsp

Interesting. So the code is moving %rsp down by 786,808 bytes to make room for something big. So, let’s see if the EventMachine code matches up with the assembly output.

struct epoll_event ev [MaxEpollDescriptors];

Where MaxEpollDescriptors = 64*1024 and sizeof(struct epoll_event) == 12. That matches up with the assembly dump and the GDB output.

Doing something like that in C/C++ is usually OK. Avoiding the heap whenever you can is a good idea because you avoid heap-lock contention, heap fragmentation, and the memory overhead of tracking the region. When writing Ruby extensions, this isn’t necessarily true. Remember, Ruby’s GC algorithm scans the entire process stack searching for references to Ruby objects. This EventMachine code causes Ruby to search an extra ~800,000 bytes, drastically slowing down garbage collection.

The patch

Get the patch HERE

The patch simply moves the stack allocated struct epoll_event ev to the class definition so that it is allocated on the heap when an instance of the class is created with new. This does not change the memory usage of the process at all. It just moves the object off the stack. This makes all the difference because Ruby’s GC scans the process stack and not the process heap.

On top of all that, this patch helps with Ruby’s green threads, too. If the epoll_wait causes a Ruby event to fire and that event creates a Ruby thread, that Ruby thread gets an entire copy of the existing stack. Each time that thread is switched into and out of, that thread stack has to be memcpy’d into and out of place. Reducing those memcpys by ~800,000 bytes is a HUGE performance win. Want to learn more about threading implementations? Check out my threading models post: here.

Fixing this turned out to be pretty simple. A six (6!!) line patch:

  • Speeds up GC by 2-3x because of the huge decrease in stack frame size.
  • Fixes an open bug in EventMachine where using threads with Epoll causes lots of slowness. The reason is that each thread will inherit an ~800,000 byte stack that gets copied in and out every context switch.
  • This results in an increase from 500 requests/sec to 7000 requests/sec when using Sinatra+Thin+Epoll+Threads. That is pretty ill.

Conclusion

All in all, a productive debugging session lasting about an hour. The result was a simple patch, with 2 big performance improvements.

A couple things to take away from this experience:

  • Spend time learning your debugging tools because it pays off, especially nm, objdump, and of course GDB.
  • Getting familiar with x86_64 assembly is crucial if you hope to debug complex software and optimize it correctly.

Keep your eyes open for up-coming blog posts about x86_64 assembly! Don’t forget to subscribe via RSS or follow me on twitter

Written by Joe Damato

April 29th, 2009 at 1:36 am

Fibers implemented for Ruby 1.8.{6,7}


At Kickball Labs, Aman Gupta (http://github.com/tmm1) and I (http://github.com/ice799) have been working on an implementation of Fibers for Ruby 1.8.{6,7}. It is API compatible to Fibers in Ruby 1.9, except for the “transfer” method, which is currently unimplemented. This patch will allow you to use fibers with mysqlplus and neverblock.

THIS IS ALPHA SOFTWARE (we are using it in production, though), USE WITH CAUTION.

Raw patches

Patch against ruby-1.8.7_p72: HERE.

Patch against ruby-1.8.6_p287: HERE.

To use the patch:
Download the Ruby source: Ruby 1.8.7_p72 or, if you prefer, Ruby 1.8.6-p287

Then, perform the following:

cd your-ruby-src-directory/
wget http://timetobleed.com/files/fibers-RUBY_VERSION.patch
patch -p1 < fibers-RUBY_VERSION.patch
./configure --disable-pthread --prefix=/tmp/ruby-with-fibers/ && make && sudo make install
/tmp/ruby-with-fibers/bin/ruby test/test_fiber.rb

This will patch ruby and install it to a custom location: /tmp/ruby-with-fibers so you can test and play around with it without overwriting your existing Ruby installation.

Github

I am currently working on getting the ruby 1.8.6 patched code up on github, but Aman has a branch of ruby 1.8.7_p72 called fibers with the code at http://github.com/tmm1/ruby187/tree/fibers

What are fibers?

Fibers are (usually) non-preemptible lightweight user-land threads.

But I thought Ruby 1.8.{6,7} already had green threads?

You are right; it does. Fibers are simply ruby green threads, without preemption. The programmer (you) gets to decide when to pause and resume execution of a fiber instead of a timer.

Why would I use fibers?

Bottom line: Your I/O should be asynchronous whenever possible, but sometimes re-writing your entire code base to be asynch and have callbacks can be difficult or painful. A simple solution to this problem is to create or use (see: NeverBlock) some middleware that wraps code paths which make I/O requests in a fiber.

The middleware can issue the asynch I/O operation in a fiber, and yield. Once the middleware’s asynch callback is hit, the Fiber can be resumed. Using NeverBlock (or rolling something similar yourself), should require only minimal code changes to your application, and will essentially make all of your I/O requests asynchronous without much pain at all.

How do I use fibers?

There are already lots of great tutorials about fibers basics here and here.

Let’s take a look at something that drives home the point about being able to drop in some middleware to make synchronous code act asynchronous with minimal changes.

Consider the following code snippet:

require 'rubygems'
require 'sinatra'

# eventmachine/thin
require 'eventmachine'
require 'thin'

# mysql
require 'mysqlplus'

# single threaded
DB = Mysql.connect

disable :reload

get '/' do
  4.times do
    DB.query('select sleep(0.25)')
  end
  'done'
end
 

This code snippet creates a simple webservice which connects to a mysql database and issues long running queries (in this case, 4 queries which execute for a total of 1 second).

In this implementation, only one request can be handled at a time; the DB.query blocks, so the other users have to wait to have their queries executed.

This sucks because certainly mysql can handle more than just 4 sleep(0.25) queries a second! But, what are our options?

Well, we could rewrite the code to be asynchronous and string together some callbacks. For my contrived example, doing that would be pretty easy, and the result would be only slightly harder to read. But let’s use our imaginations: pretend the code snippet I just showed you was some huge, ugly, scary blob of code, and that rewriting it to be asynchronous would not only take a long time, it would also make the code very ugly and difficult to read.

Now, let’s drop in fibers:

require 'rubygems'
require 'sinatra'

# eventmachine/thin
require 'eventmachine'
require 'thin'

# mysql
require 'mysqlplus'

# fibered
require 'neverblock'
require 'never_block/servers/thin'
require 'neverblock-mysql'

class Thin::Server
  def fiber_pool
    @fiber_pool ||= NB::Pool::FiberPool.new(20)
  end
end

DB = NB::DB::PooledDBConnection.new(20) { NB::DB::FMysql.connect }

disable :reload

get '/' do
  4.times do
    DB.query('select sleep(0.25)')
  end
  'done'
end
 

NOTICE: The application code hasn’t changed; we simply monkey patched Thin to use a pool of fibers.

Suddenly, our application can handle 20 concurrent connections. This is all handled by NeverBlock and mysqlplus.

  • NeverBlock uses the fiber pool to issue an asynch DB query via mysqlplus.
  • After the asynch query is issued, NeverBlock pauses the executing fiber.
  • At this point, other requests can be serviced.
  • When the data comes back from the mysql server, a callback in NeverBlock is executed.
  • The callback resumes the paused fiber, which continues executing.

Pretty sick, right?

Memory consumption, context switches, cooperative multi-threading, oh my!

In our implementation, fibers are ruby green threads, but with no scheduler or preemption. In fact, our fiber implementation shares many code-paths with the existing green thread implementation. As a result, there is very little difference in memory consumption between green threads and our fiber implementation.

Context switches are a different matter altogether. The whole point of building a fiber implementation is to let the programmer decide when context switching is appropriate. In most circumstances, an application will undergo far fewer context switches with fibers, and the context switches that do happen occur precisely when needed. As a result, the application tends to run faster (fewer context switches ==> fewer stack copies ==> fewer CPU cycles).

The major advantage of fibers over green threads is that you get to control when execution starts and stops. The major disadvantage is that you have to code carefully to ensure that you are starting and stopping your fibers appropriately.

Future Directions

Next stop will be “stackless” fibers. I have a fork of the fibers implementation in the works that pre-allocates fiber stacks on the ruby process’ heap. I am hoping to eliminate the overhead associated with switching between fibers by simply shuffling pointers around.

A preliminary version seems to work, although a few bugs that crop up when you use fibers and threads together need to be squashed before the code can be considered “alpha” stage. When it’s done, you’ll find it right here.

Written by Joe Damato

February 5th, 2009 at 5:25 pm

Plugging Ruby Memory Leaks: heap/stack dump patches to help take out the trash.

View Comments

EDIT: For all those tuning into my blog for the first time be sure to check out the Ruby threading bugfix from an earlier post: Ruby Threading Bugfix: Small fix goes a long way

In this blog post I’m releasing some patches to MRI Ruby 1.8.7p72 that add heap dumping, an object reference finder, stack dumping, object allocation/deallocation tracking, and some more goodies. These are changes to Ruby I’ve been working on at the startup I am with (Kickball Labs!), but I think they are generally useful to other people who are looking to track down memory problems in Ruby. Let’s get started…

Memory leaks? In Ruby?!

Many people working on large Ruby applications have complained about memory consumption and the lack of tools available to dump the object heap and search for which objects hold references to other objects. In a future blog post, I hope to talk about different methods of garbage collection and dissect some popular languages and their GC algorithms. Very briefly though, let’s take a quick look at how Ruby’s GC works and how leaks can occur.

The garbage collector in MRI Ruby marks objects as in use during the mark phase if a reference to the object is found. The traversal is started at a set of top level root objects which are marked as in use and recurses on the objects which are referenced from the current object. In this way all reachable objects are marked as in use. The sweep phase then hits every object on the heap and marks objects as free if they are not in use.

We all try to write code which is decoupled, clean, and well documented. Let’s face it though: we are human and we make mistakes. Imagine a pile of objects all holding references to each other. As long as this pile of objects has no reachable reference along a path originating from a root object, these objects will get garbage collected. Unfortunately, from time to time we forget to clear out all references to old objects we are no longer using. In our pile-of-objects example, all it takes is one reachable reference from a root object to keep the entire pile of objects in memory.
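A contrived sketch of that scenario, with made-up class and variable names: two objects reference each other, and a single forgotten reference from a root (a global here) keeps the whole pile alive across GC runs.

```ruby
class LeakyNode
  attr_accessor :peer
end

a = LeakyNode.new
b = LeakyNode.new
a.peer, b.peer = b, a   # the pile: a cycle of references

$registry = [a]         # the one forgotten root reference
a = b = nil
GC.start                # the pile survives: it is reachable via $registry

# Only once the root lets go can the pile be collected:
#   $registry.clear; GC.start
```

Mark and sweep handles the cycle itself just fine; it is the lone path from the root object that defeats it.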

Here at Kickball Labs we thought we were encountering such a situation, but we had no easy way to verify and track it down. I hacked up a heap dumper, stack dumper, reference finder, and some other stuff to help in our quest to take out the trash.

Memtrack APIs

To start, I did a bunch of google searching to find out if anyone else had gone down this path before. I stumbled across a company called Software Verify, which released a really cool set of APIs, the Ruby Memory Tracking API, for tracking object allocation and deallocation, getting the root heap objects, and getting references to objects. The API was really clean and allowed me to hook in my own callbacks for each of those events. One small downside though: the memtrack APIs were for ruby 1.8.6, and when I created a diff they did not apply cleanly to 1.8.7. With a tiny bit of work I was able to get the APIs to apply to 1.8.7p72. The patch can be found below and applies to MRI Ruby 1.8.7-p72.

Heap/Stack dumper overview

I hooked some callbacks into the Memtrack API and added some additional code for heap_find and stack_dump. My algorithm for heap_dump and heap_find is pretty simple: I traverse each object on the heap and either dump it (for heap_dump) or check whether it holds a reference to the object passed into heap_find.

I also copied some code from the mark phase of Ruby’s GC to check the contents of the registers and stack frames. I did this because Ruby checks this information before marking an object as in use. You may be wondering how this can possibly work; couldn’t any FIXNUM that gets passed around in the Ruby internals look like a heap pointer? Well… not quite. Ruby FIXNUMs are 31 bits and as a result they can be bit shifted by 1 bit and their LSB can be set to 1 without losing any data. Doing this ensures that the resulting value cannot look like a pointer to the heap.
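You can see the tagging trick in MRI’s own behavior: object_id on a Fixnum returns the tagged representation directly.

```ruby
# A Fixnum n is stored as (n << 1) | 1: the low bit is always set, so a
# tagged integer can never be mistaken for a word-aligned heap pointer.
n = 21
tagged = (n << 1) | 1
puts tagged         # => 43
puts tagged & 1     # => 1 (real heap pointers end in 0)
puts n.object_id    # => 43 on MRI: the tagged value itself
```

This is why the conservative stack scan can safely treat odd values as non-pointers.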

HOWTO use the stack/heap dumper

Four functions were added to GC: heap_dump, heap_find, heap_counts, and stack_dump. After applying the patches below and rebuilding Ruby, you can use these functions. There are a bunch of environment variables you can set to control how much output you get. WARNING: There can be a lot of output.

Environment vars and what they do:

RUBY_HEAP_LOG – setting this environment var will cause heap dumping output to be written to the specified file. If this is not set, data will be written to stderr.

RUBY_HEAP_ALLOC – setting this environment var to anything greater than or equal to 1 will cause output whenever a ruby object is allocated or deallocated. This creates a LOT of output.

RUBY_HEAP_SETUP – setting this environment var to anything greater than or equal to 1 will cause output whenever a ruby object has its klass field set.

Ok – now let’s take a quick look at the four added functions and what they do when you use them:

GC.heap_dump – this function writes to stderr (unless RUBY_HEAP_LOG is set) the contents of the Ruby heap. It outputs each object that is not marked as free and information about that object and the objects it references. For strings, for example, the value and length of the string will be output. For classes, the name of the class (unless it is an anonymous class) and the superclass.

GC.heap_find – this function takes one argument and traverses the heap looking for objects which hold a reference to the argument. Once found, it outputs information about the object holding the reference and also information about the object you passed in.

GC.heap_counts – this function is poorly implemented and can only be called after you have called GC.heap_dump. This function outputs the number of each type of Ruby object that has been created (object, node, class, module, string, fixnum, float, etc).

GC.stack_dump – this function checks the registers and stack frames for things which look like pointers to the heap. If the pointer looks like it is pointed at the heap, the address of the pointer is output as well as information about the object it references.
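A hypothetical session against the patched interpreter might look like this (the GC.heap_* functions only exist after applying the patches below, so this sketch guards on their presence):

```ruby
leaky  = "find me"
holder = [leaky]   # holder keeps a reference to leaky

if GC.respond_to?(:heap_dump)
  GC.heap_dump        # dump every live (non-free) object on the heap
  GC.heap_find(leaky) # report objects (here, holder) referencing the argument
  GC.heap_counts      # per-type object totals; only valid after heap_dump
  GC.stack_dump       # scan registers/stack frames for heap pointers
else
  warn "heap/stack dumper patches are not applied to this interpreter"
end
```

Remember to set RUBY_HEAP_LOG first if you want the output in a file rather than on stderr.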

The good stuff: Patches for Ruby 1.8.7p72

Apply the patches below to the MRI Ruby 1.8.7p72 code base and rebuild Ruby. The code below is a bit ugly as it was written in a rush. Use with caution and be sure to email me with any bugfixes or cool additions you come up with!

Patch to apply Memtrack API to MRI Ruby 1.8.7-p72:
memtrack-api-187p72.patch

Patch to apply heap dumper, stack dumper, ref finder, alloc/dealloc tracker, and everything else which is awesome:

heapdumper.patch

Conclusion

Ruby is lacking tools for serious memory analysis. Using the Memtrack API and some other code, tracking down memory issues can be less painful. The tools included in the patches above are only a first cut at developing real, useful memory analysis tools for Ruby. I hope to see more work in creating tools for Ruby.

Thanks for reading! If you have problems using the patches or you create a cool memory analysis tool send me some mail.

Written by Joe Damato

October 15th, 2008 at 11:30 am