time to bleed by Joe Damato

technical ramblings from a wanna-be unix dinosaur

Fix a bug in Ruby’s configure.in and get a ~30% performance boost.

View Comments


Special thanks…

Going out to Jake Douglas for pushing the initial investigation and getting the ball rolling.

The whole --enable-pthread thing

Ask any Ruby hacker how to easily increase performance in a threaded Ruby application and they’ll probably tell you:

Yo dude… Everyone knows you need to configure Ruby with --disable-pthread.

And it’s true; configure Ruby with --disable-pthread and you get a ~30% performance boost. But… why?

For this, we’ll have to turn to our handy tool strace. We’ll also need a simple Ruby program to this one. How about something like this:

def make_thread
  Thread.new {
    a = []
    10_000_000.times {
      a << "a"
      a.pop
    }
  }
end

t = make_thread 
t1 = make_thread 

t.join
t1.join

Now, let's run strace on a version of Ruby configure'd with --enable-pthread and point it at our test script. The output from strace looks like this:

22:46:16.706136 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706177 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706218 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706259 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000005>
22:46:16.706301 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706342 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706383 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706425 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706466 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>

Pages and pages and pages of sigprocmask system calls (Actually, running with strace -c, I get about 20,054,180 calls to sigprocmask, WOW). Running the same test script against a Ruby built with --disable-pthread and the output does not have pages and pages of sigprocmask calls (only 3 times, a HUGE reduction).

OK, so let's just set a breakpoint in GDB... right?

OK, so we should just be able to set a breakpoint on sigprocmask and figure out who is calling it.

Well, not exactly. You can try it, but the breakpoint won't trigger (we'll see why a little bit later).

Hrm, that kinda sucks and is confusing. This will make it harder to track down who is calling sigprocmask in the threaded case.

Well, we know that when you run configure the script creates a config.h with a bunch of defines that Ruby uses to decide which functions to use for what. So let's compare ./configure --enable-pthread with ./configure --disable-pthread:

[joe@mawu:/home/joe/ruby]% diff config.h config.h.pthread 
> #define _REENTRANT 1
> #define _THREAD_SAFE 1
> #define HAVE_LIBPTHREAD 1
> #define HAVE_NANOSLEEP 1
> #define HAVE_GETCONTEXT 1
> #define HAVE_SETCONTEXT 1


OK, now if we grep the Ruby source code, we see that whenever HAVE_[SG]ETCONTEXT are set, Ruby uses the system calls setcontext() and getcontext() to save and restore state for context switching and for exception handling (via the EXEC_TAG).

What about when HAVE_[SG]ETCONTEXT are not define'd? Well in that case, Ruby uses _setjmp/_longjmp.

Bingo!

That's what's going on! From the _setjmp/_longjmp man page:

... The _longjmp() and _setjmp() functions shall be equivalent to longjmp() and setjmp(), respectively, with the additional restriction that _longjmp() and _setjmp() shall not manipulate the signal mask...

And from the [sg]etcontext man page:

... uc_sigmask is the set of signals blocked in this context (see sigprocmask(2)) ...


The issue is that getcontext calls sigprocmask on every invocation but _setjmp does not.

BUT WAIT if that's true why didn't GDB hit a sigprocmask breakpoint before?

x86_64 assembly FTW, again

Let's fire up gdb and figure out this breakpoint-not-breaking thing. First, let's start by disassembling getcontext (snipped for brevity):

(gdb) p getcontext
$1 = {} 0x7ffff7825100
(gdb) disas getcontext
...
0x00007ffff782517f : mov $0xe,%rax
0x00007ffff7825186 : syscall
...

Yeah, that's pretty weird. I'll explain why in a minute, but let's look at the disassembly of sigprocmask first:

(gdb) p sigprocmask
$2 = {} 0x7ffff7817340 <__sigprocmask>
(gdb) disas sigprocmask
...
0x00007ffff7817383 <__sigprocmask+67>: mov $0xe,%rax
0x00007ffff7817388 <__sigprocmask+72>: syscall
...

Yeah, this is a bit confusing, but here's the deal.

Recent Linux kernels implement a shiny new method for calling system calls called sysenter/sysexit. This new way was created because the old way (int $0x80) turned out to be pretty slow. So Intel created some new instructions to execute system calls without such huge overhead.

All you need to know right now (I'll try to blog more about this in the future) is that the %rax register holds the system call number. The syscall instruction transfers control to the kernel and the kernel figures out which syscall you wanted by checking the value in %rax. Let's just make sure that sigprocmask is actually 0xe:

[joe@pluto:/usr/include]% grep -Hrn "sigprocmask" asm-x86_64/unistd.h 
asm-x86_64/unistd.h:44:#define __NR_rt_sigprocmask                     14


Bingo. It's calling sigprocmask (albeit a bit obscurely).

OK, so getcontext isn't calling sigprocmask directly, instead it replicates a bunch of code that sigprocmask has in its function body. That's why we didn't hit the sigprocmask breakpoint; GDB was going to break if you landed on the address 0x7ffff7817340 but you didn't.

Instead, getcontext reimplements the wrapper code for sigprocmask itself and GDB is none the wiser.

Mystery solved.

The patch

Get it HERE

The patch works by adding a new configure flag called --disable-ucontext to allow you to specifically disable [sg]etcontext from being called, you use this in conjunction with --enable-pthread, like this:

./configure --disable-ucontext --enable-pthread


After you build Ruby configured like that, its performance is on par with (and sometimes slightly faster) than Ruby built with --disable-pthread for about a 30% performance boost when compared to --enable-pthread.

I added the switch because I wanted to preserve the original Ruby behavior, if you just pass --enable-pthread without --disable-ucontext Ruby will do the old thing and generate piles of sigprocmasks.

Conclusion

  1. Things aren't always what they seem - GDB may lie to you. Be careful.
  2. Use the source, Luke. Libraries can do unexpected things, debug builds of libc can help!
  3. I know I keep saying this, assembly is useful. Start learning it today!

If you enjoyed this blog post, consider subscribing (via RSS) or following (via twitter).

You'll want to stay tuned; tmm1 and I have been on a roll the past week. Lots of cool stuff coming out!

Written by Joe Damato

May 5th, 2009 at 3:20 am

  • This is not fixed in 1.8.7 p330 (latest 187 2 months ago). Comment 3 of mine: http://www.ruby-forum.com/topi...

    """
    Solaris 10 on SPARC, Ruby 1.8.7 p330, --enable-pthread

    Ruby application 'puppet'

    "puppet resource user root" issues over 2 million getcontext() calls and
    takes 25 seconds to complete

    On a Linux box, same build args, the command takes 1-2 seconds.
    """

  • Same story on FreeBSD:

    3049 ruby18 CALL _umtx_op(0x8017d1060,0x8,0x1,0x8017d1040,0x7fffe7bfdec0)
    3049 ruby18 RET _umtx_op -1 errno 60 Operation timed out
    3049 ruby18 CALL clock_gettime(0,0x7fffe7bfdf80)
    3049 ruby18 RET clock_gettime 0
    3049 ruby18 CALL clock_gettime(0,0x7fffe7bfded0)
    3049 ruby18 RET clock_gettime 0
    3049 ruby18 CALL _umtx_op(0x8017d1060,0x8,0x1,0x8017d1040,0x7fffe7bfdec0)
    3049 ruby18 RET _umtx_op -1 errno 60 Operation timed out
    3049 ruby18 CALL clock_gettime(0,0x7fffe7bfdf80)
    3049 ruby18 RET clock_gettime 0
    3049 ruby18 CALL clock_gettime(0,0x7fffe7bfded0)
    3049 ruby18 RET clock_gettime 0
    3049 ruby18 CALL _umtx_^Cop(0x8017d1060,0x8,0x1,0x8017d1040,0x7fffe7bfdec0)

  • lazzarello

    Heya. I applied your patch to ruby 1.8.7 (2008-08-11 patchlevel 72) [x86_64-linux] in Debian stable and rebuilt the package. It does not remove the dominance of the rt_sigprocmask syscall in your test script. I ran the script for 24 seconds and it called that syscall 416614 times.

  • cryptoromantic

    so what is the underlying bug here - that get/setcontext is completely unnecessary because green threads are being used even with pthreads enabled, or that some work get/setcontext is doing is necessary, but it also always calls sigprocmasks (which is unnecessary), and this calls for a better decomposition of the get/setcontext call... ?

  • Great work! I've been investigating these issues around --enable-pthread recently as I *just* became aware of their significance.. and I'm surprised, frankly.. there seem to be a lot of people that knew this, yet a lot of people who still don't and who still run into tutorials like the popular Hivelogic one which recommends you use --enable-pthread.

blog comments powered by Disqus