Archive for the ‘syscall’ tag
Descent into Darkness: Understanding your system’s binary interface is the only way out
Download as PDF (3mb)
Debugging Ruby: Understanding and Troubleshooting the VM and your Application
Download the PDF here.
Ruby Hoedown Slides
Below are the slides for a talk that Aman Gupta and I gave at Ruby Hoedown
Download the PDF here
Fix a bug in Ruby’s configure.in and get a ~30% performance boost.

Special thanks…
Going out to Jake Douglas for pushing the initial investigation and getting the ball rolling.
The whole --enable-pthread thing
Ask any Ruby hacker how to easily increase performance in a threaded Ruby application and they’ll probably tell you:
Yo dude… Everyone knows you need to configure Ruby with --disable-pthread.
And it’s true; configure Ruby with --disable-pthread and you get a ~30% performance boost. But… why?
For this, we’ll have to turn to our handy tool strace. We’ll also need a simple Ruby program to test with. How about something like this:
def make_thread
  Thread.new {
    a = []
    10_000_000.times {
      a << "a"
      a.pop
    }
  }
end

t = make_thread
t1 = make_thread
t.join
t1.join
Now, let's run strace on a version of Ruby configure'd with --enable-pthread and point it at our test script. The output from strace looks like this:
22:46:16.706136 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706177 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706218 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706259 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000005>
22:46:16.706301 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706342 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706383 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706425 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706466 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
Pages and pages and pages of sigprocmask system calls. (Actually, running with strace -c, I get about 20,054,180 calls to sigprocmask. WOW.) Running the same test script against a Ruby built with --disable-pthread, the output does not have pages and pages of sigprocmask calls: there are only 3, a HUGE reduction.
OK, so let's just set a breakpoint in GDB... right?
OK, so we should just be able to set a breakpoint on sigprocmask and figure out who is calling it.
Well, not exactly. You can try it, but the breakpoint won't trigger (we'll see why a little bit later).
Hrm, that kinda sucks and is confusing. This will make it harder to track down who is calling sigprocmask in the threaded case.
Well, we know that when you run configure the script creates a config.h with a bunch of defines that Ruby uses to decide which functions to use for what. So let's compare ./configure --enable-pthread with ./configure --disable-pthread:
[joe@mawu:/home/joe/ruby]% diff config.h config.h.pthread
> #define _REENTRANT 1
> #define _THREAD_SAFE 1
> #define HAVE_LIBPTHREAD 1
> #define HAVE_NANOSLEEP 1
> #define HAVE_GETCONTEXT 1
> #define HAVE_SETCONTEXT 1
OK, now if we grep the Ruby source code, we see that whenever HAVE_[SG]ETCONTEXT are set, Ruby uses the libc functions setcontext() and getcontext() to save and restore state for context switching and for exception handling (via the EXEC_TAG).
What about when HAVE_[SG]ETCONTEXT are not define'd? Well in that case, Ruby uses _setjmp/_longjmp.
Bingo!
That's what's going on! From the _setjmp/_longjmp man page:
... The _longjmp() and _setjmp() functions shall be equivalent to longjmp() and setjmp(), respectively, with the additional restriction that _longjmp() and _setjmp() shall not manipulate the signal mask...
And from the [sg]etcontext man page:
... uc_sigmask is the set of signals blocked in this context (see sigprocmask(2)) ...
The issue is that getcontext calls sigprocmask on every invocation but _setjmp does not.
BUT WAIT if that's true why didn't GDB hit a sigprocmask breakpoint before?
x86_64 assembly FTW, again
Let's fire up gdb and figure out this breakpoint-not-breaking thing. First, let's start by disassembling getcontext (snipped for brevity):
(gdb) p getcontext
$1 = {
(gdb) disas getcontext
...
0x00007ffff782517f
0x00007ffff7825186
...
Yeah, that's pretty weird. I'll explain why in a minute, but let's look at the disassembly of sigprocmask first:
(gdb) p sigprocmask
$2 = {
(gdb) disas sigprocmask
...
0x00007ffff7817383 <__sigprocmask+67>: mov $0xe,%rax
0x00007ffff7817388 <__sigprocmask+72>: syscall
...
Yeah, this is a bit confusing, but here's the deal.
Recent Linux kernels support shiny new methods for making system calls, because the old way (int $0x80) turned out to be pretty slow. The CPU vendors created new instructions to execute system calls without such huge overhead: sysenter/sysexit from Intel and syscall/sysret from AMD. On x86_64, syscall is the one you'll see, as in the disassembly above.
All you need to know right now (I'll try to blog more about this in the future) is that the %rax register holds the system call number. The syscall instruction transfers control to the kernel and the kernel figures out which syscall you wanted by checking the value in %rax. Let's just make sure that sigprocmask is actually 0xe:
[joe@pluto:/usr/include]% grep -Hrn "sigprocmask" asm-x86_64/unistd.h
asm-x86_64/unistd.h:44:#define __NR_rt_sigprocmask 14
Bingo. It's calling sigprocmask (albeit a bit obscurely).
OK, so getcontext isn't calling sigprocmask directly, instead it replicates a bunch of code that sigprocmask has in its function body. That's why we didn't hit the sigprocmask breakpoint; GDB was going to break if you landed on the address 0x7ffff7817340 but you didn't.
Instead, getcontext reimplements the wrapper code for sigprocmask itself and GDB is none the wiser.
Mystery solved.
The patch
Get it HERE
The patch works by adding a new configure flag called --disable-ucontext that lets you specifically disable [sg]etcontext from being called. You use it in conjunction with --enable-pthread, like this:
./configure --disable-ucontext --enable-pthread
After you build Ruby configured like that, its performance is on par with (and sometimes slightly faster than) Ruby built with --disable-pthread, for about a 30% performance boost when compared to plain --enable-pthread.
I added the switch because I wanted to preserve the original Ruby behavior: if you just pass --enable-pthread without --disable-ucontext, Ruby will do the old thing and generate piles of sigprocmasks.
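If you want to compare builds yourself, one rough way is to time the same two-thread workload under each Ruby binary. The sketch below is only an illustration: the helper name time_workload and the iterations parameter are made up here (the original script hardcodes 10_000_000), and wall-clock timing with Time.now is noisy, so run it a few times per build.

```ruby
# Same workload as the test script: each thread pushes and pops a string
# `iterations` times. `iterations` is a hypothetical knob for quicker runs.
def make_thread(iterations)
  Thread.new do
    a = []
    iterations.times do
      a << "a"
      a.pop
    end
  end
end

# Runs two worker threads to completion and returns the elapsed
# wall-clock time in seconds (hypothetical helper, not part of the patch).
def time_workload(iterations)
  started = Time.now
  [make_thread(iterations), make_thread(iterations)].each { |t| t.join }
  Time.now - started
end

elapsed = time_workload(100_000)
puts format("two threads, 100_000 iterations each: %.3fs", elapsed)
```

Run the same file with each build's ruby binary and compare the printed times; raise the iteration count toward the original 10_000_000 for a more stable measurement.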
Conclusion
- Things aren't always what they seem - GDB may lie to you. Be careful.
- Use the source, Luke. Libraries can do unexpected things, debug builds of libc can help!
- I know I keep saying this, assembly is useful. Start learning it today!
You'll want to stay tuned; tmm1 and I have been on a roll the past week. Lots of cool stuff coming out!
I/O models: how you move your data matters

Above picture was shamelessly stolen from: http://computer-history.info/Page4.dir/pages/IBM.7030.Stretch.dir/
In this blog post I’m going to follow suit on my threading models post (here) and talk about different types of I/O, how they work, and when you might want to consider using them. Much like with threading models, I/O models have terminology which can be confusing. The confusion leads to misconceptions which will hopefully be cleared up here.
Let’s start first by going over some operating system basics.
System Calls
A system call is a common interface which allows user applications and the operating system kernel to interact with one another. Some familiar functions which are system calls: open(), read(), and write(). These are system calls which ask the kernel to do I/O on behalf of the user process.
There is a cost associated with making system calls. In Linux, system calls are implemented via a software interrupt (or, on newer processors, a dedicated instruction such as sysenter or syscall) which causes a privilege level change in the processor. This switch from user to kernel mode is commonly, if loosely, called a context switch.
User applications typically execute at the most restricted privilege level available where interaction with I/O devices (and other stuff) is not allowed. As a result user applications use system calls to get the kernel to complete privileged I/O (and other) operations.
Synchronous blocking I/O
This is the most familiar and most common type of I/O out there. When an I/O operation is initiated in this model (maybe by calling a system call such as read(), write(), ioctl(), …), the user application making the system call is put into a waiting state by the kernel. The application sleeps until the I/O operation has completed (or has generated an error) at which point it is scheduled to run again. Data is transferred from the device to memory and possibly into another buffer for the user-land application.
Pros:
- Easy to use and well understood
- Ubiquitous
Cons:
- Does not maximize I/O throughput
- Causes all threads in a process to block if that process uses green threads
This method of I/O is very straightforward and simple to use, but it has many downsides. In a previous post about threading models, I mentioned that doing blocking I/O in a green thread causes all green threads to stop executing until the I/O operation has completed.
This happens because there is only one kernel context which can be scheduled, so that context is put into a waiting state in the kernel until the data has been copied to the user buffer and the process can run again.
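To make the waiting state concrete, here is a small sketch using a pipe, assuming a stock Ruby. The variable names are just for illustration. The thread that calls read sleeps until the five bytes it asked for show up; nothing it does can speed that up.

```ruby
reader, writer = IO.pipe

# The consumer issues a blocking read: it is put to sleep until
# 5 bytes are available on the pipe (or the pipe is closed).
consumer = Thread.new { reader.read(5) }

sleep 0.1                        # nothing has been written yet
still_waiting = consumer.alive?  # the read has not returned

writer.write("hello")            # data arrives, the read can complete
result = consumer.value          # joins the thread and returns what it read

puts "waiting before write: #{still_waiting}, read: #{result.inspect}"
```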
Synchronous non-blocking I/O
This model of I/O is not very well known compared to other models. This is good because this model isn’t very useful.
In this model, a file descriptor is created via open(), but a flag is passed in (O_NONBLOCK on most Linux kernels) to tell the kernel: If data is not available immediately, do not put me to sleep. Instead let me know so I can go on with my life. I’ll try back later.
Pros:
- If no I/O is available other work can be completed in the meantime
- When I/O is available, it does not block the thread (even models with green threads)
Cons:
- Does not maximize I/O throughput for the application
- Lots of system call overhead – constantly making system calls to see if I/O is ready
- Can be high latency if I/O arrives and a system call is not made for a while
This model of I/O is typically very inefficient because the I/O system call made by the application may return EAGAIN or EWOULDBLOCK repeatedly. The application can either:
- wait around for the data to finish (repeatedly calling its I/O system call over and over) — or
- try to do other work for a bit, and retry the I/O system call later
At some point the I/O will either return an error or it will be able to complete.
If this type of I/O is used in a system with green threads, the entire process is not blocked but the efficiency is very poor due to the constant polling with system calls from user-land. Each time a system call is invoked a privilege level change occurs on the processor and the execution state of the application has to be saved out to memory so that the kernel can execute.
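The retry loop described above might look roughly like this in Ruby, where read_nonblock raises IO::WaitReadable when the kernel answers EAGAIN or EWOULDBLOCK. The helper poll_read and its max_attempts parameter are hypothetical names for this sketch; the block passed in stands for the "other work" done between polls.

```ruby
# Tries a non-blocking read. When no data is ready (EAGAIN/EWOULDBLOCK,
# surfaced by Ruby as IO::WaitReadable) it runs the caller's block as
# "other work" and retries, giving up after max_attempts tries.
# Returns [data_or_nil, number_of_attempts].
def poll_read(io, max_attempts)
  attempts = 0
  loop do
    attempts += 1
    begin
      return [io.read_nonblock(64), attempts]
    rescue IO::WaitReadable
      return [nil, attempts] if attempts >= max_attempts
      yield
    end
  end
end

reader, writer = IO.pipe
empty_polls = 0

data, attempts = poll_read(reader, 5) do
  empty_polls += 1
  # Simulate data showing up only after the second failed poll.
  writer.write("ping") if empty_polls == 2
end

puts "got #{data.inspect} after #{attempts} attempts"
```

Every failed attempt here is a wasted system call, which is exactly the overhead this model suffers from.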
Asynchronous blocking I/O
This model of I/O is much more well known. In fact, this is how Ruby implements I/O for its green threads.
In this model, non-blocking file descriptors are created (similar to the previous model) and they are monitored by calling either select() or poll(). The system call to select()/poll() blocks the process (the process is put into a sleeping state in the kernel) and the system call returns when either an error has occurred or when the file descriptors are ready to be read from or written to.
Pros:
- When I/O is available it does not block
- Lots of I/O can be issued to execute in parallel
- Notifications occur when one or more file descriptors are ready (helps to improve I/O throughput)
Cons:
- Calling select(), poll(), or epoll_wait() blocks the calling thread (entire application if using green threads)
- Lots of file descriptors for I/O means lots that have to be checked (can be avoided with epoll)
What is important to note here is that more than one file descriptor can be monitored and when select/poll returns, more than one of the file descriptors may be able to do non-blocking I/O. This is great because it increases the application’s I/O throughput by allowing many I/O operations to occur in parallel.
Of course there are two main drawbacks of using this model:
- select()/poll() block – so if they are used in a system with green threads, all the threads are put to sleep while these system calls are executing.
- You must check the entire set of file descriptors to determine which are ready. This can be bad if you have a lot of file descriptors, because you can potentially spend a lot of time checking file descriptors which aren’t ready (epoll() fixes this problem).
This model is important for all you Ruby programmers out there — this is the type of I/O that Ruby uses internally. The calls to select cause Ruby to block while they are being executed.
There are some work-arounds though:
- Timeouts – select() and poll() let you set timeouts so your app doesn’t have to sleep endlessly if there is no I/O to process – it can continue executing other code in the meantime. This is what Ruby does.
- epoll (or kqueue on BSD) – epoll allows you to register a set of file descriptors you are interested in. You then make blocking epoll_wait() calls (they accept timeouts) which will return only the file descriptors which are ready for I/O. This allows you to avoid searching through all your file descriptors every time.
At the very least you should set a timeout so that you can do other work if no I/O is ready. If possible though, use epoll().
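Here is a minimal sketch of select with a timeout, using Ruby's IO.select on two pipes. The pipe names and the 0.05 second timeout are arbitrary choices for the example. With nothing to read, the call sleeps for the timeout and returns nil; once one pipe has data, only that descriptor comes back as readable.

```ruby
idle_r, _idle_w = IO.pipe
busy_r, busy_w = IO.pipe

# Nothing has been written: select blocks for up to 0.05s, then gives up.
timed_out = IO.select([idle_r, busy_r], nil, nil, 0.05)

# Make one of the two descriptors readable.
busy_w.write("x")

# select now returns right away, listing only the ready descriptor.
readable, = IO.select([idle_r, busy_r], nil, nil, 0.05)
data = readable.first.read_nonblock(16)

puts "first call returned #{timed_out.inspect}"
puts "second call: #{readable.size} descriptor ready, read #{data.inspect}"
```

After the nil return, the application is free to do other work and call select again later.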
Asynchronous non-blocking I/O
This is probably the least widely known model of I/O out there. On Linux, this model of I/O is available through the POSIX AIO interface and through the libaio library.
In this I/O model, you can initiate I/O using aio_read(), aio_write(), and a few others (these are the POSIX AIO names; libaio's native interface uses io_submit() instead). Before using these functions, you must set up a struct aiocb including fields which indicate how you’d like to get notifications and where the data can be read from or written to. Notifications can be delivered in a couple different ways:
- Signal – a SIGIO is delivered to the process when the I/O has completed
- Callback – a callback function is called when the I/O has completed
Pros:
- Helps maximize I/O throughput by allowing lots of I/O to be issued in parallel
- Allows application to continue processing while I/O is executing, callback or POSIX signal when done
Cons:
- Wrapper for libaio may not exist for your programming environment
- Network I/O may not be supported
This method of I/O is really awesome because it does not block the calling application and allows multiple I/O operations to be executed in parallel, which increases the I/O throughput of the application.
The downsides to using libaio are:
- Wrapper may not exist for your favorite programming language.
- Unclear whether libaio supports network I/O on all systems — may only support disk I/O. When this happens, the library falls back to using normal synchronous blocking I/O.
You should try out this I/O model if your programming environment has support for it and it either has support for network I/O or you don’t need it.
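Ruby's standard library has no wrapper for aio_read() and friends, so the following is not real kernel AIO. It is only a sketch of the callback-notification shape of this model, emulated with a background thread. The helper name async_read is hypothetical.

```ruby
require "tempfile"

# Hypothetical helper: starts the read on a background thread and calls
# the given block with the data once the read has completed. The returned
# thread acts as a handle for the in-flight request.
def async_read(path, &on_complete)
  Thread.new do
    data = File.read(path)
    on_complete.call(data)
    data
  end
end

file = Tempfile.new("aio-demo")
file.write("payload")
file.flush

completed = []
request = async_read(file.path) { |data| completed << data }

# The caller is not blocked: it keeps working while the read is in flight.
other_work = (1..1000).reduce(:+)

request.join   # wait for completion only when the result is finally needed
file.close!    # close and delete the temporary file

puts "callback received #{completed.inspect}, other work produced #{other_work}"
```

The point is the control flow: submit the request, carry on, and get notified when it finishes. A real AIO binding would hand the request to the kernel or library instead of a Ruby thread.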
Conclusion
In conclusion, you should use synchronous blocking I/O when you are writing small apps which won’t see much traffic. For more intense applications, you should definitely use one of the two asynchronous models. If possible, avoid synchronous non-blocking I/O at all costs.
Remember that the goal is to increase I/O throughput to scale your application to withstand thousands of requests per second. Doing any sort of blocking I/O in your application can (depending on threading model) cause your entire application to block, increasing latency and slowing the user experience to a crawl.

