time to bleed by Joe Damato

technical ramblings from a wanna-be unix dinosaur

Ruby threading bugfix: small fix goes a long way.

View Comments

threads

Quick Overview of Ruby Threads

Ruby 1.8.7 (MRI) implements threads completely in userland (also called “green threads” for short) even if built with pthreads. This means that underlying OS kernel has no knowledge about any threads created in ruby programs. In the view of the kernel, it only sees a process with one thread. This one thread is the ruby interpreter which has its own scheduler and threading implementation built-in. What this means for the Ruby developer is that any thread which does I/O will cause the entire ruby process (the ruby interpretter and all ruby green threads) to block.

Implementing threads in userland has some interesting design questions, one of which is: How does the interpretter start and stop executing ruby threads? One way to implement this is to create a timer which interrupts the interpretter at some interval. Ruby (depending on your platform and build options) creates either:

  1. An interval timer with setitimer, which delivers a SIGVTALRM signal to the process at the specified interval, or
  2. A real native OS thread (via pthreads) which sleeps for the length of the interval

In either case, a flag called rb_thread_pending is set (for those of you following along with the Ruby source, the flag is checked with the CHECK_INTS macro). It is important to note, however that the timer created with setitimer is of type ITIMER_VIRTUAL which means time will be measured only when the interpretter is executing (and not during system calls executed on behalf of ruby) whereas the sleeping OS thread is always measuring time, regardless of whether or not Ruby is executing.

strace saves the day

I am working on an event-based real-time distributed (insert more buzzwords) system built in ruby. As a result I am constantly trying to push ruby to its limits, like many other people out there. I noticed that the latency of my eventloop started to increase and after I spawned threads to do short tasks (like send an email, for example). The weird thing was that the latency didn’t go down even after the thread had finished executing! To debug this problem I attached strace to my running ruby process and I saw this:

[joe@mawu]% strace -ttTp `pidof ruby` 2>&1 | egrep '(sigret|setitimer|timer|exit_group)'
19:41:21.282700 setitimer(ITIMER_VIRTUAL, {it_interval={0, 10000}, itvalue={0, 10000}}, NULL) = 0 <0.000022>
19:41:26.778386 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.780578 sigreturn()             = ? (mask now []) <0.000022>
19:41:26.814172 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.823761 sigreturn()             = ? (mask now []) <0.000022>
19:41:26.888419 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.890691 sigreturn()             = ? (mask now []) <0.000041>
19:41:26.904949 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.907327 sigreturn()             = ? (mask now []) <0.000040>
19:41:26.995445 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.997699 sigreturn()             = ? (mask now []) <0.000041>
19:41:27.144428 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:27.147146 sigreturn()             = ? (mask now []) <0.000023>
19:41:27.303472 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:27.306825 sigreturn()             = ? (mask now []) <0.000021>
...

Weird! Looks like the timer is interrupting the executing Ruby process causing it to enter the thread scheduler only to schedule the only thread in the app and start executing again. This was really bad for our system because our main eventloop was being constantly interrupted to the point where under high load the eventloop was unable to service connection requests fast enough and timing out our test scripts. This is also a big problem if you use ruby gems piled on top of ruby gems because the more layers of gem code executing for the short time quanta means that less of your actual app code gets to execute! Not cool, but before getting excited I decided to try to reproduce this on a smaller scale, so:

[joe@mawu]% strace -ttT ruby -e 't1 = Thread.new{ sleep(5) }; t1.join; 10000.times{"aaaaa" * 1000};' 2>&1 | egrep '(sigret|setitimer|timer|exit_group)'
19:41:21.282700 setitimer(ITIMER_VIRTUAL, {it_interval={0, 10000}, itvalue={0, 10000}}, NULL) = 0 <0.000022>
19:41:26.778386 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.780578 sigreturn()             = ? (mask now []) <0.000022>
19:41:26.814172 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.823761 sigreturn()             = ? (mask now []) <0.000022>
19:41:26.888419 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.890691 sigreturn()             = ? (mask now []) <0.000041>
19:41:26.904949 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.907327 sigreturn()             = ? (mask now []) <0.000040>
19:41:26.995445 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.997699 sigreturn()             = ? (mask now []) <0.000041>
19:41:27.144428 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:27.147146 sigreturn()             = ? (mask now []) <0.000023>
19:41:27.303472 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:27.306825 sigreturn()             = ? (mask now []) <0.000021>
19:41:27.314461 exit_group(0)           = ?

Definitely starting to look like a bug from the strace output.

I decided to dive into the ruby 1.8.7 MRI source code (eval.c for those following along in the source) and found that a timer is created whenever a thread is created, but the timer is not destroyed when the thread terminates! Definitely a bug. A quick fix to eval.c fixed the problem and my latency dropped like a rock!

Patch for ruby 1.8.7

I posted a patch to ruby-core and some code was added to fix pthread-enabled Ruby. NOTE: You should ALWAYS test new patches before applying them to your live site, this is no exception!
Ruby MRI 1.8.7p72 patch

Future directions

I’ve been asked a bunch of different questions about threads and threading models, so my next couple blog posts will be about different threading models. I’m going to dive into the details, go through the pros and cons, and try to clear things up a bit, so stay tuned and thanks for reading!

Written by Joe Damato

October 5th, 2008 at 10:17 pm

  • rogerpack

    Appears the patch has been accepted upstream.
    http://blog.phusion.nl/2010/01...

  • Georgia Sewall

    do you have a thread called "shrinking thread"...happened to see a quilting show on TV...guest was showing a pillow she had covered and one of the swatches was made w/shrinking thread...made the surface look like it was wrinkled....very atractive....

  • We're using 1.8.5 p115 and are running into this SIGVTALRM weirdness with Phusion Passenger. I tried adapting the patch for 1.8.5 and came up with the following:

    http://www.elliterate.com/file...

    Any reason why this couldn't work? It doesn't seem to be causing any problems (yet) on our staging server.

  • Thanks for the patch, it finally fixed that SIGVTALRM weirdness that I was experiencing. Too bad that I ran in circles trying to understand what I'm doing wrong, when in fact it was a VM bug. But it seems that your fix is not even in the latest 1.8.7 snapshot. Perhaps you need to bug ruby-core again in order to include it.

  • joe

    @roger - This patch has been submit to ruby-core. Also, ruby uses asynchronous blocking I/O.

    Stay tuned for a new timetobleed article about different I/O models.

    Thanks for reading!

blog comments powered by Disqus