Ruby Hoedown Slides
Below are the slides for a talk that Aman Gupta and I gave at Ruby Hoedown.
Download the PDF here.
Thanks for reading and don’t forget to subscribe (via RSS or e-mail) and follow me on twitter.
Fixing Threads in Ruby 1.8: A 2-10x performance boost

Quick notes before things get crazy
OK, things might get a little crazy in this blog post so let’s clear a few things up before we get moving.
- I like the gritty details, and this article in particular has a lot of gritty info. To reduce the length of the article for the casual reader, I’ve put a portion of the really gritty stuff in the Epilogue below. Definitely check it out if that is your thing.
- This article, the code, and the patches below are for Linux and OS X on the x86 and x86_64 platforms only.
- Even though there are code paths for both x86 and x86_64, I’m going to use the 64-bit register names and (briefly) mention the 64-bit binary interface.
- Let’s assume the binary is built with -fno-omit-frame-pointer; the patches don’t care, but it’ll make the explanation a bit simpler later.
- If you don’t know what the above two things mean, don’t worry; I got your back chief.
How threads work in Ruby
Ruby 1.8 implements pre-emptible userland threads, also known as “green threads.” (Want to know more about threading models? See this post.) The major performance killer in Ruby’s implementation of green threads is that the entire thread stack is copied to and from the heap on every context switch. Let’s take a look, at a high level, at what happens when you run:
a = []
Thread.new { 10000.times { a << "a"; a.pop } }
1. A thread control block (tcb) is allocated in Ruby.
2. The infamous thread timer is initialized, either as a pthread or as an itimer.
3. Ruby scope information is copied to the heap.
4. The new thread is added to the list of threads.
5. The current thread is set as the new thread.
6. rb_thread_yield is called to yield to the block you passed in.
7. Your block starts executing.
8. The timer interrupts the executing thread.
9. The current thread’s state is stored:
   - memcpy() #1 (sometimes): If the stack has grown since the last save, realloc is called. If the allocator cannot extend the size of the current block in place, it may decide to move the data to a new block that is large enough. If that happens, memcpy() is called to move the data over.
   - memcpy() #2 (always): A copy of this thread’s entire stack (starting from the top of the interpreter’s stack) is put on the heap.
10. The next thread’s state is restored:
   - memcpy() #3 (always): A copy of this thread’s entire stack is placed on the stack.
Steps 9 and 10 crush performance when even small amounts of Ruby code are executed.
Many of the functions the interpreter uses to evaluate code are massive. They allocate a large number of local variables, creating stack frames of up to 4 kilobytes per function call. Those functions also call themselves recursively many times in a single expression. This leads to huge stacks, huge memcpy()s, and an incredible performance penalty.
If we can eliminate the memcpy()s we can get a lot of performance back. So, let’s do it.
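To make steps 9 and 10 concrete, here is a rough sketch of the save half of a green-thread context switch. The names here are hypothetical and almost all of the bookkeeping is omitted; Ruby 1.8’s real versions, rb_thread_save_context and rb_thread_restore_context, live in eval.c:

#include <setjmp.h>
#include <stdlib.h>
#include <string.h>

/* illustrative thread control block; not Ruby's actual struct */
typedef struct {
    jmp_buf ctx;      /* saved registers and program counter */
    void   *stk_ptr;  /* heap buffer holding this thread's saved stack */
    size_t  stk_len;  /* number of bytes currently saved */
} tcb_t;

/* address where the interpreter's stack began; stacks grow down */
static char *stk_start;

static void thread_save(tcb_t *th)
{
    char here; /* the address of this local approximates the stack pointer */
    size_t len = stk_start - &here;

    if (len > th->stk_len)
        th->stk_ptr = realloc(th->stk_ptr, len); /* may memcpy() internally (#1) */
    th->stk_len = len;

    memcpy(th->stk_ptr, &here, len); /* memcpy() #2: stack -> heap */

    /* restoring is the inverse: memcpy() #3 copies th->stk_ptr back onto
       the program stack, then a longjmp(th->ctx, 1) resumes the thread */
}

Both copies scale linearly with the size of the live stack, which is exactly why deep interpreter stacks hurt so much.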
Increase performance by putting thread stacks on the heap
[Remember: we are only talking about x86_64]
How stacks work – a refresher
Stacks grow downward from high addresses to low addresses. As data is pushed onto the stack, it grows downward; as data is popped off, it shrinks upward. The register %rsp serves as a pointer to the bottom of the stack. When it is decremented or incremented, the stack grows or shrinks, respectively. The special property of the program stack is that it will grow until you run out of memory (or are killed by the OS for being bad). The operating system handles the automatic growth. See the Epilogue for some more information about this.
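Here is a tiny standalone demo (not part of the patch) that makes the downward growth visible; each nested call pushes a new frame, so the address of a local variable drops:

#include <stdio.h>

static void probe(int depth)
{
    int local;
    /* compile without optimization so each call keeps its own frame */
    printf("depth %d: local at %p\n", depth, (void *)&local);
    if (depth < 3)
        probe(depth + 1);
}

int main(void)
{
    probe(0);
    return 0;
}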
How to actually switch stacks
The %rsp register can be (and is) changed and adjusted directly by user code. So all we have to do is put the address of our stack in %rsp, and we’ve switched stacks. Then we can just call our thread start function. Pretty easy. A small blob of inline assembly should do the trick:
__asm__ __volatile__ ("movq %0, %%rsp\n\t"
                      "callq *%1\n"
                      :: "r" (th->stk_base), "r" (rb_thread_start_2));
Two instructions, not too bad.
- movq %0, %%rsp moves a quad-word (th->stk_base) into %rsp. “Quad-word” is Intel speak for 4 words, where 1 Intel word is 2 bytes.
- callq *%1 calls the function at the address rb_thread_start_2. This has a side-effect or two, which I’ll mention in the Epilogue below, for those interested in a few more details.
The above code is called once per thread. Calling rb_thread_start_2 spins up your thread and never returns.
Where do we get stack space from?
When the tcb is created, we’ll allocate some space with mmap and set a pointer to it.
/* error checking omitted for brevity, but exists in the patch =] */
stack_area = mmap(NULL, total_size, PROT_READ | PROT_WRITE | PROT_EXEC,
                  MAP_PRIVATE | MAP_ANON, -1, 0);

th->stk_ptr  = th->stk_pos = stack_area;
th->stk_base = th->stk_ptr + (total_size - sizeof(int)) / sizeof(VALUE *);
Remember, stacks grow downward, so that last line (th->stk_base = ...) is necessary because the base of the stack is actually at the top of the memory region returned by mmap(). The ugly math in there is for alignment, to comply with the x86_64 binary interface. Those curious about more gritty details should see the Epilogue below.
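For illustration, here is one way to do that alignment math in C. This helper is hypothetical, not code from the patch; it rounds a stack base down to the 16-byte boundary the x86_64 ABI expects (down rather than up, so the usable stack stays inside the mmap’d region):

#include <stdint.h>

/* hypothetical helper: round an address down to a 16-byte boundary */
static inline void *align_down_16(void *p)
{
    return (void *)((uintptr_t)p & ~(uintptr_t)0xf);
}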
BUT WAIT, I thought stacks were supposed to grow automatically?
Yeah, the OS does that for the normal program stack. Not gonna happen for our mmap’d regions. The best we can do is pick a good default size and export a tuning lever so that advanced users can adjust the stack size as they see fit.
BUT WAIT, isn’t that dangerous? If you fall off your stack, wouldn’t you just overwrite memory below?
Yep, but there is a fix for that too. It’s called a guard page. We’ll create a guard page below each stack with its permission bits set to PROT_NONE. This means that if a thread falls off the bottom of its stack and tries to read, write, or execute the memory below the thread stack, a signal (usually SIGSEGV or SIGBUS) will be sent to the process.
The code for the guard page is pretty simple, too:
/* omit error checking for brevity */
mprotect(th->stk_ptr, getpagesize(), PROT_NONE);
Cool, let’s modify the SIGSEGV and SIGBUS signal handlers to check for stack overflow:
/* if the address which generated the fault is within the
   current thread's guard page... */
if (fault_addr <= (caddr_t)rb_curr_thread->guard &&
    fault_addr >= (caddr_t)rb_curr_thread->stk_ptr) {
    /* we hit the guard page, print out a warning to help app developers */
    rb_bug("Thread stack overflow! Try increasing it!");
}
See the epilogue for more details about this signal handler trick.
Patches
As always, this is super-alpha software.
Ruby 1.8.6 | github | raw .patch |
Ruby 1.8.7 | github | raw .patch |
Benchmarks
The computer language shootout has a thread test called thread-ring; let’s start with that.
require 'thread'

THREAD_NUM = 403
number = ARGV.first.to_i

threads = []
for i in 1..THREAD_NUM
  threads << Thread.new(i) do |thr_num|
    while true
      Thread.stop
      if number > 0
        number -= 1
      else
        puts thr_num
        exit 0
      end
    end
  end
end

prev_thread = threads.last
while true
  for thread in threads
    Thread.pass until prev_thread.stop?
    thread.run
    prev_thread = thread
  end
end
Results (ARGV[0] = 50000000):
Ruby 1.8.6 | 1389.52s |
Ruby 1.8.6 w/ heap stacks | 793.06s |
Ruby 1.9.1 | 752.44s |
A speed up of about 2.3x compared to Ruby 1.8.6. A bit slower than Ruby 1.9.1.
That is a pretty strong showing, for sure. Let’s modify the test slightly to illustrate the true power of this implementation.
Since our implementation does no memcpy()s, we expect the cost of context switching to stay constant regardless of thread stack size. Moreover, the unmodified Ruby 1.8.6 should perform worse as thread stack size increases (therefore increasing the amount of time the CPU spends doing memcpy()s).
Let’s test this hypothesis by modifying thread-ring slightly so that it increases the size of the stack after spawning threads.
def grow_stack n=0, &blk
  unless n > 100
    grow_stack n+1, &blk
  else
    yield
  end
end

require 'thread'

THREAD_NUM = 403
number = ARGV.first.to_i

threads = []
for i in 1..THREAD_NUM
  threads << Thread.new(i) do |thr_num|
    grow_stack do
      while true
        Thread.stop
        if number > 0
          number -= 1
        else
          puts thr_num
          exit 0
        end
      end
    end
  end
end

prev_thread = threads.last
while true
  for thread in threads
    Thread.pass until prev_thread.stop?
    thread.run
    prev_thread = thread
  end
end
Results (ARGV[0] = 50000000):
Ruby 1.8.6 | 7493.50s |
Ruby 1.8.6 w/ heap stacks | 799.52s |
Ruby 1.9.1 | 680.92s |
A speed up of about 9.4x compared to Ruby 1.8.6. A bit slower than Ruby 1.9.1.
Now, let’s benchmark mongrel+sinatra.
require 'rubygems'
require 'sinatra'

disable :reload
set :server, 'mongrel'

get '/' do
  'hi'
end
Results:
Ruby 1.8.6 | 1395.43 request/sec |
Ruby 1.8.6 w/ heap stacks | 1770.26 request/sec |
An increase of about 1.26x in the most naive case possible.
Of course, if the handler did anything more than simply write “hi” (like use memcache or make sql queries) there would be more function calls, more context switches, and a much greater savings.
Conclusion
A couple lessons learned this time:
- Hacking a VM like Ruby is kind of like hacking a kernel. Some subset of the tricks used in kernel hacking are useful in userland.
- The x86_64 ABI is a must read if you plan on doing any low-level hacking.
- Keep your CPU manuals close by; they come in handy even in userland.
- Installing your own signal handlers is really useful for debugging, even if they are just dumping architecture-specific information.
Hope everyone enjoyed this blog post. I’m always looking for things to blog about. If there is something you want explained or talked about, send me an email or a tweet!
Don’t forget to subscribe and follow me and Aman on twitter.
Epilogue
Automatic stack growth
This can be achieved pretty easily with a little help from virtual memory and the processor’s page-fault machinery. The idea is pretty simple. When you (or your shell on your behalf) call exec() to execute a binary, the OS will map a bunch of pages of memory for the stack and set the stack pointer of the process to the top of that memory. Once the stack space is exhausted and a push moves the stack pointer into un-mapped memory, a page fault will be generated.
The OS’s page fault handler (installed via the interrupt descriptor table, IDT) will fire. The OS can then check the address that generated the exception and see that you fell off the bottom of your stack. This works very similarly to the guard page idea we added to protect Ruby thread stacks. It can then just map more memory to that area, and tell your process to continue executing. Your process doesn’t know anything bad happened.
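You can watch the kernel do this with a small standalone demo (again, not from the patch). The recursion below touches several megabytes of stack, far more than is mapped at exec() time, and the kernel silently grows the mapping page by page until RLIMIT_STACK is reached:

#include <stdio.h>

/* each call carves out a 1 MB frame and touches it, forcing the
   kernel to fault in (and grow) the stack mapping on demand */
static void eat_stack(int mb)
{
    volatile char buf[1 << 20];
    buf[0] = (char)mb;
    buf[sizeof(buf) - 1] = (char)mb;
    if (mb > 1)
        eat_stack(mb - 1);
}

int main(void)
{
    eat_stack(6); /* ~6 MB, comfortably under the common 8 MB default */
    puts("survived: the OS grew the stack automatically");
    return 0;
}

Run the same recursion on one of our fixed-size mmap’d thread stacks and you would hit the guard page instead.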
I hope to chat a little bit about interrupt and exception handlers in an upcoming blog post. Stay tuned!
callq side-effects
When a callq instruction is executed, the CPU pushes the return address onto the stack and then begins executing the function that was called. This is important because when the function you are calling executes a ret instruction, a quad-word is popped from the stack and put into the instruction pointer (%rip). In our stack-switching snippet above, that return address is pushed onto the brand-new thread stack; since rb_thread_start_2 never returns, no ret ever tries to pop it.
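A tiny demo (not from the patch) makes that pushed quad-word visible; GCC’s __builtin_return_address(0) reads the return address that ret will eventually pop into %rip:

#include <stdio.h>

/* noinline so the call really is compiled as a callq */
__attribute__((noinline))
static void callee(void)
{
    printf("return address: %p\n", __builtin_return_address(0));
}

int main(void)
{
    callee(); /* callq pushes the address of the next instruction */
    return 0;
}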
x86_64 Application Binary Interface
The x86_64 ABI is an extension of the x86 ABI. It specifies architecture programming information such as the fundamental types, caller and callee saved registers, alignment considerations and more. It is a really important document for any programmer messing with x86_64 architecture specific code.
The particular piece of information relevant for this blog post is found buried in section 3.2.2:
The end of the input argument area shall be aligned on a 16 … byte boundary.
This is important to keep in mind when constructing thread stacks. We decided to avoid messing with alignment issues; as such, we did not pass any arguments to rb_thread_start_2. That let us avoid the mathematical errors that could creep in if we tried to align the memory ourselves after pushing data onto the new stack. We also wanted to avoid writing more assembly than we had to, so we avoided passing the arguments in registers, too.
Signal handler trick
The signal handler “trick” to check if you have hit the guard page is made possible by the sigaltstack() system call and the POSIX sa_sigaction interface.
sigaltstack() lets us specify a memory region to be used as the stack when a signal is delivered. This is extremely important for the signal handler trick, because once we fall off our thread stack, we certainly cannot expect to handle a signal using that stack space.
POSIX provides two ways for signals to be handled:
- sa_handler interface: calls your handler and passes in the signal number.
- sa_sigaction interface: calls your handler and passes in the signal number, a siginfo_t struct, and a ucontext_t. The siginfo_t struct contains (among other things) the address which generated the fault. Simply check this address to see if it’s in the guard page and, if so, let the user know they just overflowed their stack.
Another useful, but extremely non-portable, modification added to Ruby’s signal handlers was a dump of the contents of ucontext_t to provide useful debugging information. This structure contains the register state at the time of the signal. Dumping it can help debugging by showing which values are in what registers.
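Putting the two together: below is a minimal, self-contained sketch of the setup (a hypothetical handler, not Ruby’s actual signal code) that runs a SIGSEGV handler on its own stack and inspects the faulting address:

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* runs on the alternate stack; info->si_addr is the faulting address
   (fprintf is not strictly async-signal-safe, but fine for a demo) */
static void segv_handler(int sig, siginfo_t *info, void *ctx)
{
    fprintf(stderr, "fault at address %p\n", info->si_addr);
    /* a real handler would compare si_addr against the guard page
       bounds here, as in the snippet earlier in the post */
    _exit(1);
}

int main(void)
{
    /* give signal handlers their own stack; mandatory if the fault
       happened because we ran out of normal stack */
    stack_t ss = { 0 };
    ss.ss_sp   = malloc(SIGSTKSZ);
    ss.ss_size = SIGSTKSZ;
    sigaltstack(&ss, NULL);

    struct sigaction sa = { 0 };
    sa.sa_sigaction = segv_handler;
    sa.sa_flags     = SA_SIGINFO | SA_ONSTACK; /* want siginfo + altstack */
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    *(volatile int *)0 = 42; /* force a fault to demonstrate */
    return 0;
}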
Fibers implemented for Ruby 1.8.{6,7}

At Kickball Labs, Aman Gupta (http://github.com/tmm1) and I (http://github.com/ice799) have been working on an implementation of Fibers for Ruby 1.8.{6,7}. It is API-compatible with Fibers in Ruby 1.9, except for the “transfer” method, which is currently unimplemented. This patch will allow you to use fibers with mysqlplus and neverblock.
Raw patches
Patch against ruby-1.8.7_p72: HERE.
Patch against ruby-1.8.6_p287: HERE.
To use the patch:
Download the Ruby source: Ruby 1.8.7_p72 or, if you prefer, Ruby 1.8.6_p287
Then, perform the following:
wget http://timetobleed.com/files/fibers-RUBY_VERSION.patch
patch -p1 < fibers-RUBY_VERSION.patch
./configure --disable-pthread --prefix=/tmp/ruby-with-fibers/ && make && sudo make install
/tmp/ruby-with-fibers/bin/ruby test/test_fiber.rb
This will patch ruby and install it to a custom location: /tmp/ruby-with-fibers so you can test and play around with it without overwriting your existing Ruby installation.
Github
I am currently working on getting the patched Ruby 1.8.6 code up on github, but Aman has a branch of Ruby 1.8.7_p72 called “fibers” with the code at http://github.com/tmm1/ruby187/tree/fibers
What are fibers?
Fibers are (usually) non-preemptible lightweight user-land threads.
But I thought Ruby 1.8.{6,7} already had green threads?
You are right; it does. Fibers are simply Ruby green threads without preemption. The programmer (you) gets to decide when to pause and resume execution of a fiber, instead of a timer deciding for you.
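For the low-level-minded, here is roughly what a fiber boils down to, sketched with POSIX ucontext(3) in C. This is illustrative only (our patch reuses Ruby’s green-thread machinery rather than ucontext): execution moves between contexts only when the code explicitly asks it to.

#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, fiber_ctx;

static void fiber_body(void)
{
    puts("fiber: started");
    swapcontext(&fiber_ctx, &main_ctx); /* morally, Fiber.yield */
    puts("fiber: resumed");
}                                       /* falls off the end: uc_link fires */

int main(void)
{
    static char stack[64 * 1024];       /* the fiber's private stack */

    getcontext(&fiber_ctx);
    fiber_ctx.uc_stack.ss_sp   = stack;
    fiber_ctx.uc_stack.ss_size = sizeof(stack);
    fiber_ctx.uc_link          = &main_ctx; /* where to go when body returns */
    makecontext(&fiber_ctx, fiber_body, 0);

    swapcontext(&main_ctx, &fiber_ctx); /* morally, fiber.resume */
    puts("main: fiber yielded");
    swapcontext(&main_ctx, &fiber_ctx); /* resume it again */
    puts("main: fiber finished");
    return 0;
}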
Why would I use fibers?
Bottom line: Your I/O should be asynchronous whenever possible, but sometimes re-writing your entire code base to be asynch and have callbacks can be difficult or painful. A simple solution to this problem is to create or use (see: NeverBlock) some middleware that wraps code paths which make I/O requests in a fiber.
The middleware can issue the asynch I/O operation in a fiber, and yield. Once the middleware’s asynch callback is hit, the Fiber can be resumed. Using NeverBlock (or rolling something similar yourself), should require only minimal code changes to your application, and will essentially make all of your I/O requests asynchronous without much pain at all.
How do I use fibers?
There are already lots of great tutorials about fibers basics here and here.
Let’s take a look at something that drives home the point about being able to drop in some middleware to make synchronous code act asynchronous with minimal changes.
Consider the following code snippet:
require 'sinatra'

# eventmachine/thin
require 'eventmachine'
require 'thin'

# mysql
require 'mysqlplus'

# single threaded
DB = Mysql.connect

disable :reload

get '/' do
  4.times do
    DB.query('select sleep(0.25)')
  end
  'done'
end
This code snippet creates a simple webservice which connects to a mysql database and issues long running queries (in this case, 4 queries which execute for a total of 1 second).
In this implementation, only one request can be handled at a time; the DB.query blocks, so the other users have to wait to have their queries executed.
This sucks because certainly mysql can handle more than just 4 sleep(0.25) queries a second! But, what are our options?
Well, we could rewrite the code to be asynchronous and string together some callbacks. For my contrived example, doing that would be pretty easy, and it’d be only slightly harder to read. But let’s use our imaginations: pretend the code snippet I just showed you was some huge, ugly, scary blob of code, and that rewriting it to be asynchronous would not only take a long time, it would also make the code very ugly and difficult to read.
Now, let’s drop in fibers:
require 'sinatra'

# eventmachine/thin
require 'eventmachine'
require 'thin'

# mysql
require 'mysqlplus'

# fibered
require 'neverblock'
require 'never_block/servers/thin'
require 'neverblock-mysql'

class Thin::Server
  def fiber_pool() @fiber_pool ||= NB::Pool::FiberPool.new(20) end
end

DB = NB::DB::PooledDBConnection.new(20) { NB::DB::FMysql.connect }

disable :reload

get '/' do
  4.times do
    DB.query('select sleep(0.25)')
  end
  'done'
end
NOTICE: The application code hasn’t changed; we simply monkey-patched Thin to use a pool of fibers.
Suddenly, our application can handle 20 connections. This is all handled by NeverBlock and mysqlplus.
- NeverBlock uses the fiber pool to issue an asynch DB query via mysqlplus.
- After the asynch query is issued, NeverBlock pauses the executing fiber.
- At this point, other requests can be serviced.
- When the data comes back from the mysql server, a callback in NeverBlock is executed.
- The callback resumes the paused fiber, which continues executing.
Pretty sick, right?
Memory consumption, context switches, cooperative multi-threading, oh my!
In our implementation, fibers are ruby green threads, but with no scheduler or preemption. In fact, our fiber implementation shares many code-paths with the existing green thread implementation. As a result, there is very little difference in memory consumption between green threads and our fiber implementation.
Context switches are a different matter altogether. The whole point of building a fiber implementation is to allow the programmer to decide when context switching is appropriate. In most circumstances, the application should undergo many fewer context switches with fibers, and the context switches that do happen occur precisely when needed. As a result, the application tends to run faster (fewer context switches ==> fewer stack copies ==> fewer CPU cycles).
The major advantage of fibers over green threads is that you get to control when execution starts and stops. The major disadvantage is that you have to code carefully to ensure that you are starting and stopping your fibers appropriately.
Future Directions
Next stop will be “stackless” fibers. I have a fork of the fibers implementation in the works that pre-allocates fiber stacks on the ruby process’ heap. I am hoping to eliminate the overhead associated with switching between fibers by simply shuffling pointers around.
A preliminary version seems to work, although a few bugs that crop up when you use fibers and threads together need to be squashed before the code can be considered “alpha” stage. When it’s done, you’ll find it right here.