time to bleed by Joe Damato

technical ramblings from a wanna-be unix dinosaur

Ruby threading bugfix: small fix goes a long way.

View Comments

threads

Quick Overview of Ruby Threads

Ruby 1.8.7 (MRI) implements threads completely in userland (also called “green threads” for short) even if built with pthreads. This means that underlying OS kernel has no knowledge about any threads created in ruby programs. In the view of the kernel, it only sees a process with one thread. This one thread is the ruby interpreter which has its own scheduler and threading implementation built-in. What this means for the Ruby developer is that any thread which does I/O will cause the entire ruby process (the ruby interpretter and all ruby green threads) to block.

Implementing threads in userland has some interesting design questions, one of which is: How does the interpretter start and stop executing ruby threads? One way to implement this is to create a timer which interrupts the interpretter at some interval. Ruby (depending on your platform and build options) creates either:

  1. An interval timer with setitimer, which delivers a SIGVTALRM signal to the process at the specified interval, or
  2. A real native OS thread (via pthreads) which sleeps for the length of the interval

In either case, a flag called rb_thread_pending is set (for those of you following along with the Ruby source, the flag is checked with the CHECK_INTS macro). It is important to note, however that the timer created with setitimer is of type ITIMER_VIRTUAL which means time will be measured only when the interpretter is executing (and not during system calls executed on behalf of ruby) whereas the sleeping OS thread is always measuring time, regardless of whether or not Ruby is executing.

strace saves the day

I am working on an event-based real-time distributed (insert more buzzwords) system built in ruby. As a result I am constantly trying to push ruby to its limits, like many other people out there. I noticed that the latency of my eventloop started to increase and after I spawned threads to do short tasks (like send an email, for example). The weird thing was that the latency didn’t go down even after the thread had finished executing! To debug this problem I attached strace to my running ruby process and I saw this:

[joe@mawu]% strace -ttTp `pidof ruby` 2>&1 | egrep '(sigret|setitimer|timer|exit_group)'
19:41:21.282700 setitimer(ITIMER_VIRTUAL, {it_interval={0, 10000}, itvalue={0, 10000}}, NULL) = 0 <0.000022>
19:41:26.778386 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.780578 sigreturn()             = ? (mask now []) <0.000022>
19:41:26.814172 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.823761 sigreturn()             = ? (mask now []) <0.000022>
19:41:26.888419 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.890691 sigreturn()             = ? (mask now []) <0.000041>
19:41:26.904949 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.907327 sigreturn()             = ? (mask now []) <0.000040>
19:41:26.995445 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.997699 sigreturn()             = ? (mask now []) <0.000041>
19:41:27.144428 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:27.147146 sigreturn()             = ? (mask now []) <0.000023>
19:41:27.303472 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:27.306825 sigreturn()             = ? (mask now []) <0.000021>
...

Weird! Looks like the timer is interrupting the executing Ruby process causing it to enter the thread scheduler only to schedule the only thread in the app and start executing again. This was really bad for our system because our main eventloop was being constantly interrupted to the point where under high load the eventloop was unable to service connection requests fast enough and timing out our test scripts. This is also a big problem if you use ruby gems piled on top of ruby gems because the more layers of gem code executing for the short time quanta means that less of your actual app code gets to execute! Not cool, but before getting excited I decided to try to reproduce this on a smaller scale, so:

[joe@mawu]% strace -ttT ruby -e 't1 = Thread.new{ sleep(5) }; t1.join; 10000.times{"aaaaa" * 1000};' 2>&1 | egrep '(sigret|setitimer|timer|exit_group)'
19:41:21.282700 setitimer(ITIMER_VIRTUAL, {it_interval={0, 10000}, itvalue={0, 10000}}, NULL) = 0 <0.000022>
19:41:26.778386 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.780578 sigreturn()             = ? (mask now []) <0.000022>
19:41:26.814172 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.823761 sigreturn()             = ? (mask now []) <0.000022>
19:41:26.888419 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.890691 sigreturn()             = ? (mask now []) <0.000041>
19:41:26.904949 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.907327 sigreturn()             = ? (mask now []) <0.000040>
19:41:26.995445 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.997699 sigreturn()             = ? (mask now []) <0.000041>
19:41:27.144428 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:27.147146 sigreturn()             = ? (mask now []) <0.000023>
19:41:27.303472 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:27.306825 sigreturn()             = ? (mask now []) <0.000021>
19:41:27.314461 exit_group(0)           = ?

Definitely starting to look like a bug from the strace output.

I decided to dive into the ruby 1.8.7 MRI source code (eval.c for those following along in the source) and found that a timer is created whenever a thread is created, but the timer is not destroyed when the thread terminates! Definitely a bug. A quick fix to eval.c fixed the problem and my latency dropped like a rock!

Patch for ruby 1.8.7

I posted a patch to ruby-core and some code was added to fix pthread-enabled Ruby. NOTE: You should ALWAYS test new patches before applying them to your live site, this is no exception!
Ruby MRI 1.8.7p72 patch

Future directions

I’ve been asked a bunch of different questions about threads and threading models, so my next couple blog posts will be about different threading models. I’m going to dive into the details, go through the pros and cons, and try to clear things up a bit, so stay tuned and thanks for reading!

Written by Joe Damato

October 5th, 2008 at 10:17 pm

  • rogerpack
    Appears the patch has been accepted upstream.
    http://blog.phusion.nl/2010/01/20/ruby-enterpri...
  • Georgia Sewall
    do you have a thread called "shrinking thread"...happened to see a quilting show on TV...guest was showing a pillow she had covered and one of the swatches was made w/shrinking thread...made the surface look like it was wrinkled....very atractive....
  • We're using 1.8.5 p115 and are running into this SIGVTALRM weirdness with Phusion Passenger. I tried adapting the patch for 1.8.5 and came up with the following:

    http://www.elliterate.com/files/ruby_1.8.5_sigv...

    Any reason why this couldn't work? It doesn't seem to be causing any problems (yet) on our staging server.
  • Thanks for the patch, it finally fixed that SIGVTALRM weirdness that I was experiencing. Too bad that I ran in circles trying to understand what I'm doing wrong, when in fact it was a VM bug. But it seems that your fix is not even in the latest 1.8.7 snapshot. Perhaps you need to bug ruby-core again in order to include it.
  • joe
    @roger - This patch has been submit to ruby-core. Also, ruby uses asynchronous blocking I/O.

    Stay tuned for a new timetobleed article about different I/O models.

    Thanks for reading!
  • rogerdpack
    @hongli the mysqlplus library doesn't block if you use async_query :) [oh wait, you probably already knew that].

    MRI 1.8.x does use a "non blocking" internal IO--kind of slow [1] but it works.

    I'd say definitely submit a patch to core for this one -- others may benefit :) -- wonder if 1.9 is affected [though know nothing about it].

    Maybe somebody should actually buy something from software verify :)

    Thanks for doing these.

    -=R

    [1] http://www.ruby-forum.com/topic/164329#new
  • joe
    @Hongli Lai: Thanks for the comments - I will try to make my posts more clear in the future.

    You are correct that ruby attempts to do non-blocking I/O whenever possible, but it cannot control what the ruby extensions do. So if your ruby extension does blocking I/O, your Ruby process blocks.

    I also want to point out that when the actual I/O is occurring (in your example, when there is data for read) that I/O operation WILL block all of Ruby and all green threads.

    I plan on writing a blog post or two soon that will talk about different I/O models (blocking vs nonblocking vs async) and I'll touch more on these details there.

    Again, thanks for reading and I'll make some corrections to my post to make it more clear.
  • My phone interview question at YouTube was "How would you build a thread scheduler?"

    A stupid question for a phone interview (and the interviewer was entirely too please with himself), but I'm interested in seeing the answer from someone who knows what they're talking about, i.e., not me.
  • Hongli Lai
    Nice! Have you considered submitting this patch to upstream?


    "What this means for the Ruby developer is that any thread which does I/O will cause the entire ruby process (the ruby interpretter and all ruby green threads) to block."

    Sorry, gotta disagree with you here. The MRI core I/O implementation is essentially non-blocking. If it detects that an I/O operation will block, it will simply set the thread's state to 'blocking on I/O' and schedule another green thread instead.

    Consider this:
    require 'thread'
    require 'socket'

    thread = Thread.new do
    while true
    puts rand
    sleep 1
    end
    end

    TCPSocket.new('www.google.com', 80).read

    Look! The thread keeps running even though we're blocked on a read operation.

    What is true however, is that native extensions that:
    1. use blocking I/O
    and
    2. aren't explicitly programmed to support green threads (i.e. to cause a context switch when it detects a potentially blocking condition)
    will block all green threads. The MySQL library is one such example.
blog comments powered by Disqus