time to bleed by Joe Damato

technical ramblings from a wanna-be unix dinosaur

Archive for the ‘scaling’ tag

Fibers implemented for Ruby 1.8.{6,7}


At Kickball Labs, Aman Gupta (http://github.com/tmm1) and I (http://github.com/ice799) have been working on an implementation of Fibers for Ruby 1.8.{6,7}. It is API-compatible with Fibers in Ruby 1.9, except for the “transfer” method, which is currently unimplemented. This patch will allow you to use fibers with mysqlplus and NeverBlock.

THIS IS ALPHA SOFTWARE (we are using it in production, though), USE WITH CAUTION.

Raw patches

Patch against ruby-1.8.7_p72: HERE.

Patch against ruby-1.8.6_p287: HERE.

To use the patch:
Download the Ruby source for Ruby 1.8.7-p72, or if you prefer, Ruby 1.8.6-p287

Then, perform the following:

cd your-ruby-src-directory/
wget http://timetobleed.com/files/fibers-RUBY_VERSION.patch
patch -p1 < fibers-RUBY_VERSION.patch
./configure --disable-pthread --prefix=/tmp/ruby-with-fibers/ && make && sudo make install
/tmp/ruby-with-fibers/bin/ruby test/test_fiber.rb

This will patch Ruby and install it to a custom location (/tmp/ruby-with-fibers) so you can test and play around with it without overwriting your existing Ruby installation.

Github

I am currently working on getting the patched Ruby 1.8.6 code up on GitHub, but Aman has a branch of Ruby 1.8.7-p72 called fibers with the code at http://github.com/tmm1/ruby187/tree/fibers.

What are fibers?

Fibers are (usually) non-preemptible lightweight user-land threads.

But I thought Ruby 1.8.{6,7} already had green threads?

You are right; it does. Fibers are simply ruby green threads without preemption. The programmer (you) decides when to pause and resume execution of a fiber, rather than a timer deciding for you.
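If you haven’t seen fibers before, the core API is small. Here is a minimal sketch of creating a fiber, pausing it with Fiber.yield, and resuming it from the caller:

# A minimal Fiber example: execution switches only when you say so.
fiber = Fiber.new do
  puts "step 1"
  Fiber.yield        # pause here; control returns to whoever called resume
  puts "step 2"
end

fiber.resume         # prints "step 1", then stops at the yield
puts "back in the caller"
fiber.resume         # prints "step 2"; the fiber finishes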

Why would I use fibers?

Bottom line: your I/O should be asynchronous whenever possible, but sometimes rewriting your entire code base to be async and use callbacks can be difficult or painful. A simple solution to this problem is to create or use (see: NeverBlock) some middleware that wraps code paths which make I/O requests in a fiber.

The middleware can issue the async I/O operation in a fiber and yield. Once the middleware’s async callback is hit, the fiber can be resumed. Using NeverBlock (or rolling something similar yourself) should require only minimal code changes to your application, and will essentially make all of your I/O requests asynchronous without much pain at all.
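The pattern looks roughly like this. The sketch below is not NeverBlock’s actual code; the async_query helper and the send_query/callback API are assumptions for illustration, modeled on an evented client that returns a deferrable:

# A rough sketch of the fiber-wrapping pattern (hypothetical async_query
# helper and send_query/callback API -- not NeverBlock's implementation).
require 'fiber'   # for Fiber.current on Ruby 1.9

def async_query(db, sql)
  fiber = Fiber.current
  deferrable = db.send_query(sql)                   # assumed: returns immediately
  deferrable.callback { |rows| fiber.resume(rows) } # resume once results arrive
  Fiber.yield                                       # pause; other fibers run meanwhile
end

# From the caller's perspective it still reads like synchronous code:
#   rows = async_query(DB, 'select sleep(0.25)')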

How do I use fibers?

There are already lots of great tutorials about fibers basics here and here.

Let’s take a look at something that drives home the point about being able to drop in some middleware to make synchronous code act asynchronous with minimal changes.

Consider the following code snippet:

require 'rubygems'
require 'sinatra'

# eventmachine/thin
require 'eventmachine'
require 'thin'

# mysql
require 'mysqlplus'

# single threaded
DB = Mysql.connect

disable :reload

get '/' do
  4.times do
    DB.query('select sleep(0.25)')
  end
  'done'
end
 

This code snippet creates a simple web service which connects to a mysql database and issues long-running queries (in this case, 4 queries which execute for a total of 1 second).

In this implementation, only one request can be handled at a time; the DB.query blocks, so the other users have to wait to have their queries executed.

This sucks because certainly mysql can handle more than just 4 sleep(0.25) queries a second! But, what are our options?

Well, we can rewrite the code to be asynchronous and string together some callbacks. For my contrived example, doing that would be pretty easy and it’d be only slightly harder to read. But let’s use our imaginations: let’s pretend the code snippet I just showed you was some huge, ugly, scary blob of code, and rewriting it to be asynchronous would not only take a long time, it would also make the code very ugly and difficult to read.

Now, let’s drop in fibers:

require 'rubygems'
require 'sinatra'

# eventmachine/thin
require 'eventmachine'
require 'thin'

# mysql
require 'mysqlplus'

# fibered
require 'neverblock'
require 'never_block/servers/thin'
require 'neverblock-mysql'

class Thin::Server
  def fiber_pool() @fiber_pool ||= NB::Pool::FiberPool.new(20) end
end

DB = NB::DB::PooledDBConnection.new(20){ NB::DB::FMysql.connect }

disable :reload

get '/' do
  4.times do
    DB.query('select sleep(0.25)')
  end
  'done'
end
 

NOTICE: the application code hasn’t changed; we simply monkey-patched Thin to use a pool of fibers.

Suddenly, our application can handle 20 concurrent connections. This is all handled by NeverBlock and mysqlplus.

  • NeverBlock uses the fiber pool to issue an async DB query via mysqlplus.
  • After the async query is issued, NeverBlock pauses the executing fiber.
  • At this point other requests can be serviced.
  • When the data comes back from the mysql server, a callback in NeverBlock is executed.
  • The callback resumes the paused fiber, which continues executing.

Pretty sick, right?

Memory consumption, context switches, cooperative multi-threading, oh my!

In our implementation, fibers are ruby green threads, but with no scheduler or preemption. In fact, our fiber implementation shares many code-paths with the existing green thread implementation. As a result, there is very little difference in memory consumption between green threads and our fiber implementation.

Context switches are a different matter altogether. The whole point of building a fiber implementation is to allow the programmer to decide when context switching is appropriate. In most circumstances, the application should undergo far fewer context switches with fibers, and the context switches that do happen occur precisely when needed. As a result, the application tends to run faster (fewer context switches ==> fewer stack copies ==> fewer CPU cycles).

The major advantage of fibers over green threads is that you get to control when execution starts and stops. The major disadvantage of fibers is that you have to code carefully to ensure that you are starting and stopping your fibers appropriately.

Future Directions

Next stop will be “stackless” fibers. I have a fork of the fibers implementation in the works that pre-allocates fiber stacks on the ruby process’ heap. I am hoping to eliminate the overhead associated with switching between fibers by simply shuffling pointers around.

A preliminary version seems to work, although a few bugs that crop up when you use fibers and threads together need to be squashed before the code can be considered “alpha” stage. When it’s done, you’ll find it right here.

Written by Joe Damato

February 5th, 2009 at 5:25 pm

PXE booting: easily getting what you want on to remote servers




Scaling out your service to multiple servers can be a painful process once you consider package management, configuration, OS installation, and provisioning. The pain can be exacerbated if you want to use a version of linux that your hosting provider does not provide. In this next set of blog posts I’m going to talk about a few of the different ways to deal with these issues.

This post will address PXE booting (pronounced “pixie”). PXE booting can help you easily install an OS and provision a server. All your provider has to do is turn your new system on and PXE can handle the rest!

Before talking about provisioning, let’s talk about booting a custom kernel image for a linux network installer.

But why would I not use the provider’s linux?

Plenty of reasons for this one. Maybe you started out with some distro of linux, had to switch providers, and now your new provider doesn’t have the distro you want. Perhaps you want to use some less popular distro, don’t like the way the system was installed, want to install on to software raid, or whatever. A simple solution is to use PXE booting to launch an installer image on multiple machines as they boot.

What is PXE booting?

PXE booting is a process that can occur during boot time on a computer system. PXE was designed as an option ROM for the x86 BIOS, and you get this functionality by having a NIC that supports PXE. Many NICs these days support PXE; do some googling or contact your provider to see if your NIC supports it.

After the system BIOS comes up, the PXE option ROM code is executed. The NIC on the system broadcasts a DHCPDISCOVER packet with some extra information that lets anyone listening know that it supports PXE. If a DHCP server hears this packet (and also supports PXE) it can transmit data back to the NIC including a file name (the image you will be booting) and an IP address (the server that has the image). The image is transferred via TFTP, loaded into RAM, and booted.

What do I need to get the party started?

You will need to do some initial bootstrapping. You may need to have your provider install whatever generic distro they have on one machine, and then use that machine to PXE boot the others. As painful as this may sound, it is easier IMHO than installing linux on top of linux. Hopefully, your hosting provider includes a NIC with an IP address on an internal vlan and that NIC is listed early in the boot order in the BIOS.

You will also need some software that should be available via your package management system:

  • dhcpd – DHCP server
  • pxelinux (sometimes bundled with syslinux, check your package manager) – PXE boot files
  • tftp-hpa – TFTP server
  • OS kernel image – the kernel you want to boot (if you don’t have one, you can use memtest86 to test)

Configure your DHCP server

You’ll need to set up your DHCP server by editing dhcpd.conf (or the equivalent file for your DHCP server). I’ve included the config file I’m using in production below, and I’ll go over the important parts after the file.

ddns-update-style interim;
subnet 10.16.73.0 netmask 255.255.255.192 {
        range 10.16.73.4 10.16.73.20;
        default-lease-time 3600;
        max-lease-time 4800;
        option routers 10.16.73.2;
        option domain-name-servers 10.16.73.2;
        option subnet-mask 255.255.255.192;
        option domain-name "cool-domain.com";
        option time-offset -7;
        option ntp-servers 10.16.73.2;
}

host host1 {
        hardware ethernet aa:bb:cc:dd:ee:ff;
        fixed-address 10.16.73.10;
        option host-name "host1";
        filename "pxelinux.0";
        next-server 10.16.73.2;
}

There are a few important things to note about the above config file:

  • I’ve decided to specify host1 explicitly by listing its MAC address. You don’t have to do this; you can specify an entire subnet (see man dhcpd(8) for more information).
  • The filename line – this line specifies the pxelinux file to download and execute (we’ll get to this soon).
  • The next-server line – this line specifies the server where the pxelinux file can be downloaded from. This can be the same server that is running dhcpd or a different one – it doesn’t matter.

For more info on other config options, check man dhcpd(8).

Configure the TFTP server

The tftp server configuration is pretty simple. A couple things to remember when setting up the tftp server:

  • It sounds obvious, but make sure the TFTP server can read from the directory it is pointed at.
  • Make sure hosts.allow and hosts.deny are setup properly to allow only your servers to access the tftp server.

Be sure to test your tftp server setup with a tftp client before moving on to the next step.

Setup pxelinux

Once you’ve used your package management system to install the pxelinux package (if that doesn’t exist, try syslinux; the two are sometimes packaged together), copy the pxelinux.0 file included in the pxelinux package to the directory your tftp server is serving files from.

Create a directory called pxelinux.cfg under the directory your tftp server is pointed at. If your tftp server is serving from /tftpboot, pxelinux.0 would be under /tftpboot and you’d want to create /tftpboot/pxelinux.cfg/. Under this directory you will create configuration files for the different hosts.

Configure pxelinux

Under the pxelinux.cfg directory you can create configuration files for PXE to use. PXE decides which file to use based on the filename.

  1. The first filename searched is the MAC address of the client with “01-” prepended to it. For example: 01-aa-bb-cc-11-22-33.
  2. If that file is not found, the next filename searched is the client’s IP address written as eight hexadecimal digits.
  3. After that comes the IP address in hexadecimal with the last digit removed.
  4. Removing the last digit repeats until there is only one digit left (see the example after this list).
  5. If none of those files are found, the last resort is to search for a file called default.
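To make the search order concrete, here is a small Ruby sketch (not part of pxelinux, just an illustration of the rules above) that prints the candidate filenames for the host1 entry from the dhcpd.conf earlier:

# Prints the filenames pxelinux would look for under pxelinux.cfg/, in order,
# for a given client MAC address and IP address. Illustration only.
def pxelinux_search_order(mac, ip)
  names = ["01-" + mac.downcase.gsub(":", "-")]              # 1. MAC with "01-" prepended
  hex_ip = ip.split(".").map { |o| "%02X" % o.to_i }.join    # IP as 8 uppercase hex digits
  hex_ip.length.downto(1) { |len| names << hex_ip[0, len] }  # 2-4. drop one digit at a time
  names << "default"                                         # 5. last resort
  names
end

puts pxelinux_search_order("aa:bb:cc:dd:ee:ff", "10.16.73.10")
# prints: 01-aa-bb-cc-dd-ee-ff, 0A10490A, 0A10490, ..., 0A, 0, default (one per line)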

If your server has an IPMI interface, this is a perfect opportunity to use it. pxelinux will output debugging information to the console as it searches for files and will let you know which files it finds, if any.

The configuration file itself will look something like this (this is from our actual production config):

prompt 1
timeout 300
display boot.msg
F1 boot.msg
F2 options.msg
default arch
label arch
kernel vmlinuz
append initrd=initrd.img rootdelay=5

The configuration file is pretty straightforward.

  • timeout – how long to wait before the default label is booted, in units of 1/10 second (so the 300 above means 30 seconds)
  • display – prints the ascii data in the specified file to the screen before doing anything else
  • default – the default label to execute if the timeout is reached
  • label – name for a specific configuration
  • kernel – the kernel to boot
  • append – additional data to pass to the kernel, in this case I’ve specified the initrd to use

That should be all you need to get PXE boot working. You should now be able to boot into the image of your choice, whether it be a network install image or a bootstrapping image. If it is an install image, you can use IPMI to give you remote KVM to guide the install.

Cool, what else can I do?

Pretty much anything you want, including provisioning. Since you can specify an initrd for the kernel to use, you can roll your own initrd. initrds are just gzipped cpio archives. You can create your own initrd which could:

  • Run scripts to provision the system as a database or app server.
  • Download a disk image from an NFS/CIFS/FTP/whatever server and dd it to disk.
  • Script the installation of your favorite linux.
  • Anything else you can think of.

Conclusion

PXE boot is a fast, effective, and easy system that can be set up with minimal effort and provides plenty of flexibility. It provides you with a simple way to bring up new systems when scaling out and can also be used for provisioning and deployment. Of course, it is just one of many possible solutions to custom OS installation and provisioning.

I have used PXE booting for installing a lesser-known linux distro in a remote datacenter and also for kernel development. I hope this blog post inspires you to give it a shot.

Written by Joe Damato

November 3rd, 2008 at 8:59 am

Posted in scaling, systems


I/O models: how you move your data matters


Above picture was shamelessly stolen from: http://computer-history.info/Page4.dir/pages/IBM.7030.Stretch.dir/

In this blog post I’m going to follow suit on my threading models post (here) and talk about different types of I/O, how they work, and when you might want to consider using them. Much like with threading models, I/O models have terminology which can be confusing. The confusion leads to misconceptions which will hopefully be cleared up here.

Let’s start first by going over some operating system basics.

System Calls

A system call is a common interface which allows user applications and the operating system kernel to interact with one another. Some familiar functions which are system calls: open(), read(), and write(). These are system calls which ask the kernel to do I/O on behalf of the user process.

There is a cost associated with making system calls. In Linux, system calls are implemented via a software interrupt which causes a privilege level change in the processor – this switch from user mode to kernel mode is often (somewhat loosely) called a context switch.

User applications typically execute at the most restricted privilege level available where interaction with I/O devices (and other stuff) is not allowed. As a result user applications use system calls to get the kernel to complete privileged I/O (and other) operations.

Synchronous blocking I/O

This is the most familiar and most common type of I/O out there. When an I/O operation is initiated in this model (maybe by calling a system call such as read(), write(), ioctl(), …), the user application making the system call is put into a waiting state by the kernel. The application sleeps until the I/O operation has completed (or has generated an error) at which point it is scheduled to run again. Data is transferred from the device to memory and possibly into another buffer for the user-land application.

Pros:

  • Easy to use and well understood
  • Ubiquitous

Cons:

  • Does not maximize I/O throughput
  • Causes all threads in a process to block if that process uses green threads

This method of I/O is very straightforward and simple to use, but it has many downsides. In a previous post about threading models, I mentioned that doing blocking I/O in a green thread causes all green threads to stop executing until the I/O operation has completed.

This happens because there is only one kernel context which can be scheduled, so that context is put into a waiting state in the kernel until the I/O has been copied to the user buffer and the process can run again.
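For completeness, this is the case every Ruby programmer already writes without thinking about it; a minimal sketch (any readable file will do):

# Synchronous blocking I/O: the read call puts the process to sleep in the
# kernel until the data has been copied into our buffer.
File.open("/etc/passwd") do |f|
  data = f.read(4096)   # blocks until up to 4096 bytes have been read
  puts data
end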

Synchronous non-blocking I/O

This model of I/O is not very well known compared to the other models. That’s fine, because this model isn’t very useful.

In this model, a file descriptor is created via open(), but a flag is passed in (O_NONBLOCK on most Linux kernels) to tell the kernel: If data is not available immediately, do not put me to sleep. Instead let me know so I can go on with my life. I’ll try back later.

Pros:

  • If no I/O is available other work can be completed in the meantime
  • When I/O is available, it does not block the thread (even in models with green threads)

Cons:

  • Does not maximize I/O throughput for the application
  • Lots of system call overhead – constantly making system calls to see if I/O is ready
  • Can be high latency if I/O arrives and a system call is not made for a while

This model of I/O is typically very inefficient because the I/O system call made by the application may return EAGAIN or EWOULDBLOCK repeatedly. The application can either:

  • wait around for the data to finish (repeatedly calling its I/O system call over and over)  — or
  • try to do other work for a bit, and retry the I/O system call later

At some point the I/O will either return an error or it will be able to complete.

If this type of I/O is used in a system with green threads, the entire process is not blocked, but the efficiency is very poor due to the constant polling with system calls from user-land. Each time a system call is invoked, a privilege level change occurs on the processor and the execution state of the application has to be saved out to memory (or disk!) so that the kernel can execute.
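Here is a rough sketch of this polling pattern in Ruby using read_nonblock (which sets O_NONBLOCK under the hood). The exact exception raised when no data is ready varies by Ruby version (Errno::EAGAIN on 1.8.7, IO::WaitReadable subclasses on newer rubies, which still inherit from Errno::EAGAIN), and the host below is just an example:

# Synchronous non-blocking I/O: keep asking, and do other work when the
# kernel says "try again later" (EAGAIN/EWOULDBLOCK).
require 'socket'

sock = TCPSocket.new('example.com', 80)   # assumes the host is reachable
sock.write("GET / HTTP/1.0\r\nHost: example.com\r\n\r\n")

loop do
  begin
    print sock.read_nonblock(4096)        # returns immediately if data is ready
  rescue Errno::EAGAIN, Errno::EWOULDBLOCK
    # nothing ready yet -- do some other useful work here, then poll again
    sleep 0.01
    retry
  rescue EOFError                         # server closed the connection
    break
  end
end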

Asynchronous blocking I/O

This model of I/O is much more well known. In fact, this is how Ruby implements I/O for its green threads.

In this model, non-blocking file descriptors are created (similar to the previous model) and they are monitored by calling either select() or poll(). The system call to select()/poll() blocks the process (the process is put into a sleeping state in the kernel), and the system call returns when either an error has occurred or when one or more of the file descriptors are ready to be read from or written to.

Pros:

  • When I/O is available, it does not block
  • Lots of I/O can be issued to execute in parallel
  • Notifications occur when one or more file descriptors are ready (helps to improve I/O throughput)

Cons:

  • Calling select(), poll(), or epoll_wait() blocks the calling thread (entire application if using green threads)
  • Lots of file descriptors for I/O means lots that have to be checked (can be avoided with epoll)

What is important to note here is that more than one file descriptor can be monitored and when select/poll returns, more than one of the file descriptors may be able to do non-blocking I/O. This is great because it increases the application’s I/O throughput by allowing many I/O operations to occur in parallel.

Of course there are two main drawbacks of using this model:

  • select()/poll() block – so if they are used in a system with green threads, all the threads are put to sleep while these system calls are executing.
  • You must check the entire set of file descriptors to determine which are ready. This can be bad if you have a lot of file descriptors, because you can potentially spend a lot of time checking file descriptors which aren’t ready (epoll() fixes this problem).

This model is important for all you Ruby programmers out there — this is the type of I/O that Ruby uses internally. The calls to select cause Ruby to block while they are being executed.

There are some work-arounds though:

  • Timeouts – select() and poll() let you set timeouts so your app doesn’t have to sleep endlessly if there is no I/O to process – it can continue executing other code in the meantime. This is what Ruby does.
  • epoll (or kqueue on BSD) – epoll allows you to register a set of file descriptors you are interested in. You then make blocking epoll_wait() calls (which accept timeouts) that return only the file descriptors which are ready for I/O. This allows you to avoid searching through all your file descriptors every time.

At the very least you should set a timeout so that you can do other work if no I/O is ready. If possible though, use epoll().
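As a concrete (if simplified) example, here is a tiny echo-server sketch using Ruby’s IO.select, which wraps select(2), with a timeout so the loop can do other work when nothing is ready. Port and buffer sizes are arbitrary:

# Asynchronous blocking I/O: block in select() for many descriptors at once.
require 'socket'

server = TCPServer.new(8080)
clients = []

loop do
  # Block for at most 0.5 seconds waiting for any descriptor to become
  # readable; IO.select returns nil on timeout.
  readable, = IO.select([server] + clients, nil, nil, 0.5)

  if readable.nil?
    # timeout hit -- do other work here instead of sleeping forever
    next
  end

  readable.each do |io|
    if io == server
      clients << server.accept              # a new connection is ready
    else
      data = io.read_nonblock(4096) rescue nil
      if data.nil?                          # client went away
        clients.delete(io)
        io.close
      else
        io.write(data)                      # trivial echo
      end
    end
  end
end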

Asynchronous non-blocking I/O

This is probably the least widely known model of I/O out there. This model of I/O is implemented via the libaio library on Linux.

In this I/O model, you can initiate I/O using aio_read(), aio_write(), and a few others. Before using these functions, you must set up a struct aiocb including fields which indicate how you’d like to get notifications and where the data can be read from or written to. Notifications can be delivered in a couple different ways:

  • Signal – a SIGIO is delivered to the process when the I/O has completed
  • Callback – a callback function is called when the I/O has completed

Pros:

  • Helps maximize I/O throughput by allowing lots of I/O to be issued in parallel
  • Allows application to continue processing while I/O is executing, callback or POSIX signal when done

Cons:

  • Wrapper for libaio may not exist for your programming environment
  • Network I/O may not be supported

This method of I/O is really awesome because it does not block the calling application and allows multiple I/O operations to be executed in parallel, which increases the I/O throughput of the application.

The downsides to using libaio are:

  • Wrapper may not exist for your favorite programming language.
  • Unclear whether libaio supports network I/O on all systems — may only support disk I/O. When this happens, the library falls back to using normal synchronous blocking I/O.

You should try out this I/O model if your programming environment has support for it, and either network I/O is supported or you don’t need network I/O.

Conclusion

In conclusion, you should use synchronous blocking I/O when you are writing small apps which won’t see much traffic. For more intense applications, you should definitely use one of the two asynchronous models. If possible, avoid synchronous non-blocking I/O at all costs.

Remember that the goal is to increase I/O throughput to scale your application to withstand thousands of requests per second. Doing any sort of blocking I/O in your application can (depending on threading model) cause your entire application to block, increasing latency and slowing the user experience to a crawl.

Written by Joe Damato

October 27th, 2008 at 8:58 am

Posted in systems


Threading models: So many different ways to get stuff done.



Why do I care?

Threading models are often very confusing; there are many different models with different trade-offs, and dissecting the details can be tough the first time around. It is important for any large-scale project to consider what threading model(s) a programming language supports and what implications the model(s) will have on the system design, so that you can build a software system that performs as optimally as possible.

Probably the source of a lot of the confusion surrounding threading models is the terminology used to describe the different components. I am going to try to explain some terminology which, to my knowledge, is the most commonly used.

User-land? Kernel-land?

This could be a blog post in and of itself, but let’s try to stay high level here. When I write “user-land” I am referring to the context in which normal applications run, such as a web-browser, or email client. When I write “kernel-land” I am referring to the context in which the kernel executes, typically a more privileged execution context that allows interaction with memory, I/O ports, process scheduling, and other funky stuff.

What is a process?

A process is a collection of various pieces of state for an executable that includes things such as virtual address space, per process flags, file descriptors, and more.

What is a thread?

A thread is just a collection of execution state for a program. Depending on the implementation this can include register values, execution stack, and more. Each process has at least one thread, the main thread. Some processes will create more threads. How new threads are created is where we begin considering the trade-offs.

Let’s look at some different threading models. I’m going to list the Pros and Cons first in case you don’t feel like reading the full explanation :) Let’s get started.

1:1

The 1:1 model, or one kernel thread for each user thread, is a very widespread model that is seen in many operating system implementations like Linux. It is sometimes referred to as “native threads.”

Pros:

  • Threads can execute on different CPUs
  • Threads do not block each other
  • Shared memory

Cons:

  • Setup overhead
  • Linux kernel bug with lots of threads (read more here)
  • Low limits on the number of threads which can be created

What does this mean? This means that each user-thread (execution state in user-land) is paired with a kernel-thread (execution state in kernel-land). The two commonly interact via system calls and signals. Since state exists in the kernel, the scheduler can schedule threads created in the 1:1 model across different CPUs to execute in parallel. A side effect of this is that if a thread executes a system call that blocks, the other threads in the process can be scheduled and executed in the mean time. In this model, different threads can share the same virtual address space but care must be taken to synchronize access to the same memory regions. Unfortunately, since the kernel has to be notified when a new thread is created in userland so corresponding state can be created in the kernel, this setup cost is overhead that must be paid each time a thread is created and there is an upper bound on the number of threads and thread state that the kernel can track before performance begins to degrade.

You may be familiar with libpthread and the function pthread_create. On Linux, this creates user and kernel state.

1:N

The 1:N model, or one kernel thread for N user threads, is a model that is commonly called “green threads” or “lightweight threads.”

Pros:

  • Thread creation, execution, and cleanup are cheap
  • Lots of threads can be created (10s of thousands or more)

Cons:

  • Kernel scheduler doesn’t know about threads so they can’t be scheduled across CPUs or take advantage of SMP
  • Blocking I/O operations can block all the green threads

In this model a process manages thread creation, termination, and scheduling completely on its own without the help or knowledge of the kernel. The major upside of this model is that thread creation, termination, cleanup, and synchronization are extremely cheap, and it is possible to create huge numbers of threads in user-land. This model has several downsides, though. One of the major downsides is not being able to utilize the kernel’s scheduler. As a result, all the user-land threads execute on the same CPU and cannot take advantage of true parallel execution. One way to cope with this is to create multiple processes (perhaps via fork()) and then have the processes communicate with each other. A model like this begins to look very much like the M:N model described below.

MRI Ruby 1.8.7 has green threads. Early versions of Java also had green threads.
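A quick way to see the SMP limitation for yourself is a rough sketch like the one below (timings will vary by machine): on a green-threaded Ruby such as MRI 1.8, CPU-bound work split across threads takes about as long as doing it serially, because all the green threads share one kernel thread and therefore one CPU.

# On a green-threaded Ruby (e.g. MRI 1.8), the threaded version is not
# faster than the serial one: there is only one kernel thread to schedule.
def burn
  200_000.times { |i| Math.sqrt(i + 1) }
end

t0 = Time.now
4.times { burn }
puts "serial:   #{Time.now - t0}s"

t0 = Time.now
(1..4).map { Thread.new { burn } }.each { |t| t.join }
puts "threaded: #{Time.now - t0}s  (roughly the same with green threads)"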

M:N

The M:N model, or M kernel threads for N user threads, is a model that is a hybrid of the previous two models.

Pros:

  • Take advantage of multiple CPUs
  • Not all threads are blocked by blocking system calls
  • Cheap creation, execution, and cleanup

Cons:

  • Need scheduler in userland and kernel to work with each other
  • Green threads doing blocking I/O operations will block all other green threads sharing the same kernel thread
  • Difficult to write, maintain, and debug code

This hybrid model appears to be a best-of-both-worlds solution that includes all the advantages of 1:1 and 1:N threading without any of the downsides. Unfortunately, the cost of the downsides outweighs many of the advantages to such an extent that in many cases it isn’t worth it to build or use an M:N threading model. In general, building and synchronizing a user-land scheduler with the kernel scheduler makes programming in this model extremely difficult and error prone. Research into the performance implications and use cases of M:N threading vs 1:1 threading was done for the Linux kernel to determine how its threading should evolve, and it showed the 1:1 model to be superior in general. On the other hand, in specific problem domains that are well understood, M:N may be the right choice.

Erlang has what many consider to be an M:N threading model. Prior to Solaris 9, Solaris supported an M:N threading model.

So which should I use?

Well, it is a tough call. You need to sit and think awhile about what your specific system needs and how intelligent your libraries are. In some implementations of the 1:N threading model, I/O operations are all abstracted away into a non-blocking I/O subsystem. If your library of choice does not (or cannot, due to language design) hook into this non-blocking I/O subsystem, your library may block all your green threads, clobbering your performance.

You should strongly consider the threading model(s) supported by the programming language(s) and libraries you choose, because this decision will have an impact on your performance, application execution time, and I/O operations.

Thanks for reading!

Written by Joe Damato

October 8th, 2008 at 1:00 pm

Posted in systems


Ruby threading bugfix: small fix goes a long way.



Quick Overview of Ruby Threads

Ruby 1.8.7 (MRI) implements threads completely in userland (also called “green threads” for short) even if built with pthreads. This means that the underlying OS kernel has no knowledge of any threads created in ruby programs. From the kernel’s point of view, it sees only a process with one thread. That one thread is the ruby interpreter, which has its own scheduler and threading implementation built in. What this means for the Ruby developer is that any thread which does I/O will cause the entire ruby process (the ruby interpreter and all ruby green threads) to block.

Implementing threads in userland raises some interesting design questions, one of which is: how does the interpreter start and stop executing ruby threads? One way to implement this is to create a timer which interrupts the interpreter at some interval. Ruby (depending on your platform and build options) creates either:

  1. An interval timer with setitimer, which delivers a SIGVTALRM signal to the process at the specified interval, or
  2. A real native OS thread (via pthreads) which sleeps for the length of the interval

In either case, a flag called rb_thread_pending is set (for those of you following along with the Ruby source, the flag is checked with the CHECK_INTS macro). It is important to note, however, that the timer created with setitimer is of type ITIMER_VIRTUAL, which means time is measured only while the interpreter is executing (and not during system calls executed on behalf of ruby), whereas the sleeping OS thread is always measuring time, regardless of whether or not Ruby is executing.

strace saves the day

I am working on an event-based, real-time, distributed (insert more buzzwords) system built in ruby. As a result I am constantly trying to push ruby to its limits, like many other people out there. I noticed that the latency of my eventloop started to increase after I spawned threads to do short tasks (like sending an email, for example). The weird thing was that the latency didn’t go down even after the thread had finished executing! To debug this problem I attached strace to my running ruby process and I saw this:

[joe@mawu]% strace -ttTp `pidof ruby` 2>&1 | egrep '(sigret|setitimer|timer|exit_group)'
19:41:21.282700 setitimer(ITIMER_VIRTUAL, {it_interval={0, 10000}, itvalue={0, 10000}}, NULL) = 0 <0.000022>
19:41:26.778386 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.780578 sigreturn()             = ? (mask now []) <0.000022>
19:41:26.814172 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.823761 sigreturn()             = ? (mask now []) <0.000022>
19:41:26.888419 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.890691 sigreturn()             = ? (mask now []) <0.000041>
19:41:26.904949 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.907327 sigreturn()             = ? (mask now []) <0.000040>
19:41:26.995445 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.997699 sigreturn()             = ? (mask now []) <0.000041>
19:41:27.144428 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:27.147146 sigreturn()             = ? (mask now []) <0.000023>
19:41:27.303472 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:27.306825 sigreturn()             = ? (mask now []) <0.000021>
...

Weird! It looks like the timer is interrupting the executing Ruby process, causing it to enter the thread scheduler only to schedule the one and only thread in the app and start executing again. This was really bad for our system because our main eventloop was being constantly interrupted, to the point where, under high load, it was unable to service connection requests fast enough and our test scripts were timing out. This is also a big problem if you use ruby gems piled on top of ruby gems, because the more layers of gem code executing in each short time quantum, the less of your actual app code gets to execute! Not cool, but before getting excited I decided to try to reproduce this on a smaller scale, so:

[joe@mawu]% strace -ttT ruby -e 't1 = Thread.new{ sleep(5) }; t1.join; 10000.times{"aaaaa" * 1000};' 2>&1 | egrep '(sigret|setitimer|timer|exit_group)'
19:41:21.282700 setitimer(ITIMER_VIRTUAL, {it_interval={0, 10000}, itvalue={0, 10000}}, NULL) = 0 <0.000022>
19:41:26.778386 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.780578 sigreturn()             = ? (mask now []) <0.000022>
19:41:26.814172 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.823761 sigreturn()             = ? (mask now []) <0.000022>
19:41:26.888419 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.890691 sigreturn()             = ? (mask now []) <0.000041>
19:41:26.904949 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.907327 sigreturn()             = ? (mask now []) <0.000040>
19:41:26.995445 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.997699 sigreturn()             = ? (mask now []) <0.000041>
19:41:27.144428 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:27.147146 sigreturn()             = ? (mask now []) <0.000023>
19:41:27.303472 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:27.306825 sigreturn()             = ? (mask now []) <0.000021>
19:41:27.314461 exit_group(0)           = ?

Definitely starting to look like a bug from the strace output.

I decided to dive into the ruby 1.8.7 MRI source code (eval.c, for those following along in the source) and found that a timer is created whenever a thread is created, but the timer is not destroyed when the thread terminates! Definitely a bug. A quick fix to eval.c solved the problem and my latency dropped like a rock!

Patch for ruby 1.8.7

I posted a patch to ruby-core and some code was added to fix pthread-enabled Ruby. NOTE: you should ALWAYS test new patches before applying them to your live site; this is no exception!
Ruby MRI 1.8.7p72 patch

Future directions

I’ve been asked a bunch of different questions about threads and threading models, so my next couple blog posts will be about different threading models. I’m going to dive into the details, go through the pros and cons, and try to clear things up a bit, so stay tuned and thanks for reading!

Written by Joe Damato

October 5th, 2008 at 10:17 pm