time to bleed by Joe Damato

technical ramblings from a wanna-be unix dinosaur

Archive for the ‘kernel’ tag

Fixing Threads in Ruby 1.8: A 2-10x performance boost


Quick notes before things get crazy

OK, things might get a little crazy in this blog post so let’s clear a few things up before we get moving.

  • I like the gritty details, and this article in particular has a lot of gritty info. To reduce the length of the article for the casual reader, I’ve put a portion of the really gritty stuff in the Epilogue below. Definitely check it out if that is your thing.
  • This article, the code, and the patches below apply only to Linux and OS X on the x86 and x86_64 platforms.
  • Even though there are code paths for both x86 and x86_64, I’m going to use the 64-bit register names and (briefly) mention the 64-bit binary interface.
  • Let’s assume the binary is built with -fno-omit-frame-pointer. The patches don’t care, but it’ll make the explanation a bit simpler later.
  • If you don’t know what the above two things mean, don’t worry; I got your back chief.

How threads work in Ruby

Ruby 1.8 implements pre-emptible userland threads, also known as “green threads.” (Want to know more about threading models? See this post.) The major performance killer in Ruby’s implementation of green threads is that the entire thread stack is copied to and from the heap on every context switch. Let’s take a look, at a high level, at what happens when you write:

Thread.new{
	a = []
	10000.times {
		a << "a"
		a.pop
	}
}

  1. A thread control block (tcb) is allocated in Ruby.
  2. The infamous thread timer is initialized, either as a pthread or as an itimer.
  3. Ruby scope information is copied to the heap.
  4. The new thread is added to the list of threads.
  5. The current thread is set as the new thread.
  6. rb_thread_yield is called to yield to the block you passed in.
  7. Your block starts executing.
  8. The timer interrupts the executing thread.
  9. The current thread’s state is stored:
    • memcpy() #1 (sometimes): If the stack has grown since the last save, realloc is called. If the allocator cannot extend the size of the current block in place, it may decide to move the data to a new block that is large enough. If that happens memcpy() is called to move the data over.
    • memcpy() #2 (always): A copy of this thread’s entire stack (starting from the top of the interpreter’s stack) is put on the heap.
  10. The next thread’s state is restored.
    • memcpy() #3 (always): A copy of this thread’s entire stack is placed on the stack.

Steps 9 and 10 crush performance when even small amounts of Ruby code are executed.

Many of the functions the interpreter uses to evaluate code are massive. They allocate a large number of local variables, creating stack frames of up to 4 kilobytes per function call. Those functions also call themselves recursively many times in a single expression. This leads to huge stacks, huge memcpy()s, and an incredible performance penalty.
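To make the memcpy() cost concrete, here is a minimal sketch (my own illustration, not Ruby’s actual code; the tcb_t struct and function names are made up) of what a copying green-thread implementation has to do around every switch:

#include <string.h>
#include <stdlib.h>

/* hypothetical thread control block for a copying green-thread scheduler */
typedef struct {
    void   *stk_copy;   /* heap buffer holding the saved stack (starts NULL) */
    size_t  stk_len;    /* number of live bytes at save time                 */
} tcb_t;

/* stack_top is the top of the interpreter's stack, stack_bottom is the
 * current stack pointer; everything in between is live and must be moved */
static void save_stack(tcb_t *t, char *stack_top, char *stack_bottom)
{
    t->stk_len  = stack_top - stack_bottom;          /* stacks grow down        */
    t->stk_copy = realloc(t->stk_copy, t->stk_len);  /* memcpy #1 may hide here */
    memcpy(t->stk_copy, stack_bottom, t->stk_len);   /* memcpy #2, always       */
}

static void restore_stack(const tcb_t *t, char *stack_top)
{
    memcpy(stack_top - t->stk_len, t->stk_copy, t->stk_len);  /* memcpy #3, always */
}

With 4 kilobyte frames and deep recursion, stk_len can easily reach hundreds of kilobytes, and it gets copied twice on every context switch.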

If we can eliminate the memcpy()s we can get a lot of performance back. So, let’s do it.

Increase performance by putting thread stacks on the heap

[Remember: we are only talking about x86_64]

How stacks work – a refresher

Stacks grow downward from high addresses to low addresses. As data is pushed on to the stack, it grows downward. As stuff is popped, it shrinks upward. The register %rsp serves as a pointer to the bottom of the stack. When it is decremented or incremented the stack grows or shrinks, respectively. The special property of the program stack is that it will grow until you run out of memory (or are killed by the OS for being bad). The operating system handles the automatic growth. See the Epilogue for some more information about this.
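If the direction of growth isn’t intuitive, a throwaway snippet (nothing Ruby-specific) makes it visible: take the address of a local variable in two nested frames and compare.

#include <stdio.h>

static void inner(char *outer_local)
{
    char inner_local;
    /* the deeper frame lives at a LOWER address: the stack grew downward */
    printf("outer frame local: %p\ninner frame local: %p\n",
           (void *)outer_local, (void *)&inner_local);
}

int main(void)
{
    char outer_local;
    inner(&outer_local);
    return 0;
}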

How to actually switch stacks

The %rsp register can be (and is) changed and adjusted directly by user code. So all we have to do is put the address of our stack in %rsp, and we’ve switched stacks. Then we can just call our thread start function. Pretty easy. A small blob of inline assembly should do the trick:

__asm__ __volatile__ ("movq %0, %%rsp\n\t"
                      "callq *%1\n"
                      :: "r" (th->stk_base),
                         "r" (rb_thread_start_2));

Two instructions, not too bad.

  1. movq %0, %%rsp moves a quad-word (th->stk_base) into %rsp. A quad-word is Intel-speak for 4 words, where 1 Intel word is 2 bytes (8 bytes in total).
  2. callq *%1 calls a function at the address “rb_thread_start_2.” This has a side-effect or two, which I’ll mention in the Epilogue below, for those interested in a few more details.

The above code is called once per thread. Calling rb_thread_start_2 spins up your thread and it never returns.

Where do we get stack space from?

When the tcb is created, we’ll allocate some space with mmap and set a pointer to it.

/* error checking omitted for brevity, but exists in the patch =] */
stack_area = mmap(NULL, total_size, PROT_READ | PROT_WRITE | PROT_EXEC,
			MAP_PRIVATE | MAP_ANON, -1, 0);

th->stk_ptr = th->stk_pos = stack_area;
th->stk_base = th->stk_ptr + (total_size - sizeof(int))/sizeof(VALUE *);

Remember, stacks grow downward, so that last line, th->stk_base = ..., is necessary because the base of the stack is actually at the top of the memory region returned by mmap(). The ugly math in there is for alignment, to comply with the x86_64 binary interface. Those curious about more gritty details should see the Epilogue below.

BUT WAIT, I thought stacks were supposed to grow automatically?

Yeah, the OS does that for the normal program stack. Not gonna happen for our mmap‘d regions. The best we can do is pick a good default size and export a tuning lever so that advanced users can adjust the stack size as they see fit.

BUT WAIT, isn’t that dangerous? If you fall off your stack, wouldn’t you just overwrite memory below?

Yep, but there is a fix for that too. It’s called a guard page. We’ll create a guard page below each stack that has its permission bits set to PROT_NONE. This means, if a thread falls off the bottom of its stack and tries to read, write, or execute the memory below the thread stack, a signal (usually SIGSEGV or SIGBUS) will be sent to the process.

The code for the guard page is pretty simple, too:

/* omit error checking for brevity */
mprotect(th->stk_ptr, getpagesize(), PROT_NONE);
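Putting the allocation and the guard page together, here is a standalone sketch (my own, outside of Ruby; stack_size, out_lowest, and the 16-byte alignment shortcut are placeholders, and error checking is again omitted):

#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* carve out a thread stack with a PROT_NONE guard page below it;
 * returns the aligned base address the new thread should start from */
static void *alloc_thread_stack(size_t stack_size, void **out_lowest)
{
    size_t page  = getpagesize();
    size_t total = stack_size + page;          /* extra room for the guard page */

    char *lowest = mmap(NULL, total, PROT_READ | PROT_WRITE | PROT_EXEC,
                        MAP_PRIVATE | MAP_ANON, -1, 0);

    mprotect(lowest, page, PROT_NONE);         /* falling off => SIGSEGV/SIGBUS */

    *out_lowest = lowest;
    /* stacks grow down: hand back the top of the region, 16-byte aligned
     * to keep the x86_64 ABI happy */
    return (void *)(((uintptr_t)(lowest + total)) & ~(uintptr_t)0xf);
}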

Cool, let’s modify the SIGSEGV and SIGBUS signal handlers to check for stack overflow:

/* if the address which generated the fault is within the current thread's guard page... */
if (fault_addr <= (caddr_t)rb_curr_thread->guard &&
    fault_addr >= (caddr_t)rb_curr_thread->stk_ptr) {
  /* we hit the guard page, print out a warning to help app developers */
  rb_bug("Thread stack overflow! Try increasing it!");
}

See the epilogue for more details about this signal handler trick.

Patches

As always, this is super-alpha software.

Ruby 1.8.6 github raw .patch
Ruby 1.8.7 github raw .patch

Benchmarks

The computer language shootout has a thread test called thread-ring; let’s start with that.

require 'thread'
THREAD_NUM = 403
number = ARGV.first.to_i

threads = []
for i in 1..THREAD_NUM
   threads << Thread.new(i) do |thr_num|
      while true
         Thread.stop
         if number > 0
            number -= 1
         else
            puts thr_num
            exit 0
         end
      end
   end
end

prev_thread = threads.last
while true
   for thread in threads
      Thread.pass until prev_thread.stop?
      thread.run
      prev_thread = thread
   end
end

Results (ARGV[0] = 50000000):

Ruby 1.8.6 1389.52s
Ruby 1.8.6 w/ heap stacks 793.06s
Ruby 1.9.1 752.44s

A speed up of about 2.3x compared to Ruby 1.8.6. A bit slower than Ruby 1.9.1.

That is a pretty strong showing, for sure. Let’s modify the test slightly to illustrate the true power of this implementation.

Since our implementation does no memcpy()s we expect the cost of context switching to stay constant regardless of thread stack size. Moreover, the unmodified Ruby 1.8.6 should perform worse as thread stack size increases (since a bigger stack means more time spent in memcpy()).

Let’s test this hypothesis by modifying thread-ring slightly so that it increases the size of the stack after spawning threads.

def grow_stack n=0, &blk
  unless n > 100
    grow_stack n+1, &blk
  else
    yield
  end
end

require 'thread'
THREAD_NUM = 403
number = ARGV.first.to_i

threads = []
for i in 1..THREAD_NUM
  threads << Thread.new(i) do |thr_num|
    grow_stack do
      while true
        Thread.stop
        if number > 0
          number -= 1
        else
          puts thr_num
          exit 0
        end
      end
    end
  end
end

prev_thread = threads.last
while true
   for thread in threads
      Thread.pass until prev_thread.stop?
      thread.run
      prev_thread = thread
   end
end

Results (ARGV[0] = 50000000):

Ruby 1.8.6 7493.50s
Ruby 1.8.6 w/ heap stacks 799.52s
Ruby 1.9.1 680.92s

A speed up of about 9.4x compared to Ruby 1.8.6. A bit slower than Ruby 1.9.1.

Now, let’s benchmark mongrel+sinatra.

require 'rubygems'
require 'sinatra'

disable :reload

set :server, 'mongrel' 

get '/' do 
  'hi' 
end

Results:

Ruby 1.8.6 1395.43 request/sec
Ruby 1.8.6 w/ heap stacks 1770.26 request/sec

An increase of about 1.26x in the most naive case possible.

Of course, if the handler did anything more than simply write “hi” (like use memcache or make sql queries) there would be more function calls, more context switches, and a much greater savings.

Conclusion

A couple lessons learned this time:

  • Hacking a VM like Ruby is kind of like hacking a kernel. Some subset of the tricks used in kernel hacking are useful in userland.
  • The x86_64 ABI is a must read if you plan on doing any low-level hacking.
  • Keep your CPU manuals close by, they come in handy even in userland.
  • Installing your own signal handlers is really useful for debugging, even if they are dumping architecture specific information.

Hope everyone enjoyed this blog post. I’m always looking for things to blog about. If there is something you want explained or talked about, send me an email or a tweet!

Don’t forget to subscribe and follow me and Aman on twitter.

Epilogue

Automatic stack growth

This can be achieved pretty easily with a little help from virtual memory and the processor’s page fault mechanism. The idea is pretty simple. When you (or your shell on your behalf) call exec() to execute a binary, the OS will map a bunch of pages of memory for the stack and set the stack pointer of the process to the top of that memory. Once the stack space is exhausted and the stack pointer moves into un-mapped memory, a page fault will be generated.

The OS’s page fault handler (installed via the interrupt descriptor table) will fire. The OS can then check the address that generated the exception and see that you fell off the bottom of your stack. This works very similarly to the guard page idea we added to protect Ruby thread stacks. It can then just map more memory to that area and tell your process to continue executing. Your process doesn’t know anything bad happened.

I hope to chat a little bit about interrupt and exception handlers in an upcoming blog post. Stay tuned!

callq side-effects

When a callq instruction is executed, the CPU pushes the return address on to the stack and then begins executing the function that was called. This is important because when the function you are calling executes a ret instruction, a quad-word is popped from the stack and put into the instruction pointer (%rip).

x86_64 Application Binary Interface

The x86_64 ABI is an extension of the x86 ABI. It specifies architecture programming information such as the fundamental types, caller and callee saved registers, alignment considerations and more. It is a really important document for any programmer messing with x86_64 architecture specific code.
The particular piece of information relevant for this blog post is found buried in section 3.2.2:

The end of the input argument area shall be aligned on a 16 … byte boundary.

This is important to keep in mind when constructing thread stacks. We decided to avoid messing with alignment issues, so we did not pass any arguments to rb_thread_start_2. We wanted to avoid the arithmetic errors that could creep in if we tried to align the memory ourselves after pushing some data. We also wanted to avoid writing more assembly than we had to, so we avoided passing the arguments in registers, too.

Signal handler trick

The signal handler “trick” to check if you have hit the guard page is made possible by the sigaltstack() system call and the POSIX sa_sigaction interface.

sigaltstack() lets us specify a memory region to be used as the stack when a signal is delivered. This is extremely important for the signal handler trick because once we fall off our thread stack, we certainly cannot expect to handle a signal using that stack space.

POSIX provides two ways for signals to be handled:

  • sa_handler interface: calls your handler and passes in the signal number.
  • sa_sigaction interface: calls your handler and passes in the signal number, a siginfo_t struct, and a ucontext_t. The siginfo_t struct contains (among other things) the address which generated the fault. Simply check this address to see if it’s in the guard page and, if so, let the user know they just overflowed their stack. Another useful, but extremely non-portable, modification added to Ruby’s signal handlers was a dump of the contents of the ucontext_t to provide useful debugging information. This structure contains the register state at the time of the signal; dumping it helps debugging by showing which values are in which registers. A minimal sketch of this wiring follows below.
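Here is roughly what that wiring looks like in isolation (a sketch, not the actual patch; guard_lo and guard_hi stand in for the real per-thread guard page bounds):

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

static char  sigstack[64 * 1024];    /* separate stack for the handler to run on */
static char *guard_lo, *guard_hi;    /* filled in when the thread stack is built */

static void segv_handler(int sig, siginfo_t *info, void *ctx)
{
    char *fault_addr = (char *)info->si_addr;   /* the address that faulted */
    (void)sig;
    if (fault_addr >= guard_lo && fault_addr < guard_hi)
        fprintf(stderr, "Thread stack overflow! Try increasing it!\n");
    /* a real handler would also dump the register state found in ctx (a ucontext_t) */
    (void)ctx;
    abort();
}

static void install_handler(void)
{
    stack_t ss = { .ss_sp = sigstack, .ss_size = sizeof(sigstack), .ss_flags = 0 };
    sigaltstack(&ss, NULL);          /* don't try to handle the signal on the dead stack */

    struct sigaction sa = { 0 };
    sa.sa_sigaction = segv_handler;
    sa.sa_flags     = SA_SIGINFO | SA_ONSTACK;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);
    sigaction(SIGBUS, &sa, NULL);
}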

Written by Joe Damato

May 18th, 2009 at 5:00 am

Fix a bug in Ruby’s configure.in and get a ~30% performance boost.



Special thanks…

Going out to Jake Douglas for pushing the initial investigation and getting the ball rolling.

The whole --enable-pthread thing

Ask any Ruby hacker how to easily increase performance in a threaded Ruby application and they’ll probably tell you:

Yo dude… Everyone knows you need to configure Ruby with --disable-pthread.

And it’s true; configure Ruby with --disable-pthread and you get a ~30% performance boost. But… why?

For this, we’ll have to turn to our handy tool strace. We’ll also need a simple Ruby program to test with. How about something like this:

def make_thread
  Thread.new {
    a = []
    10_000_000.times {
      a << "a"
      a.pop
    }
  }
end

t = make_thread 
t1 = make_thread 

t.join
t1.join

Now, let's run strace on a version of Ruby configure'd with --enable-pthread and point it at our test script. The output from strace looks like this:

22:46:16.706136 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706177 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706218 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706259 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000005>
22:46:16.706301 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706342 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706383 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706425 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706466 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>

Pages and pages and pages of sigprocmask system calls (actually, running with strace -c, I get about 20,054,180 calls to sigprocmask, WOW). Run the same test script against a Ruby built with --disable-pthread and the output does not have pages and pages of sigprocmask calls (only 3 calls total, a HUGE reduction).

OK, so let's just set a breakpoint in GDB... right?

OK, so we should just be able to set a breakpoint on sigprocmask and figure out who is calling it.

Well, not exactly. You can try it, but the breakpoint won't trigger (we'll see why a little bit later).

Hrm, that kinda sucks and is confusing. This will make it harder to track down who is calling sigprocmask in the threaded case.

Well, we know that when you run configure the script creates a config.h with a bunch of defines that Ruby uses to decide which functions to use for what. So let's compare ./configure --enable-pthread with ./configure --disable-pthread:

[joe@mawu:/home/joe/ruby]% diff config.h config.h.pthread 
> #define _REENTRANT 1
> #define _THREAD_SAFE 1
> #define HAVE_LIBPTHREAD 1
> #define HAVE_NANOSLEEP 1
> #define HAVE_GETCONTEXT 1
> #define HAVE_SETCONTEXT 1


OK, now if we grep the Ruby source code, we see that whenever HAVE_[SG]ETCONTEXT are set, Ruby uses the library functions setcontext() and getcontext() to save and restore state for context switching and for exception handling (via the EXEC_TAG).

What about when HAVE_[SG]ETCONTEXT are not define'd? Well in that case, Ruby uses _setjmp/_longjmp.

Bingo!

That's what's going on! From the _setjmp/_longjmp man page:

... The _longjmp() and _setjmp() functions shall be equivalent to longjmp() and setjmp(), respectively, with the additional restriction that _longjmp() and _setjmp() shall not manipulate the signal mask...

And from the [sg]etcontext man page:

... uc_sigmask is the set of signals blocked in this context (see sigprocmask(2)) ...


The issue is that getcontext calls sigprocmask on every invocation but _setjmp does not.

BUT WAIT if that's true why didn't GDB hit a sigprocmask breakpoint before?

x86_64 assembly FTW, again

Let's fire up gdb and figure out this breakpoint-not-breaking thing. First, let's start by disassembling getcontext (snipped for brevity):

(gdb) p getcontext
$1 = {<text variable, no debug info>} 0x7ffff7825100 <getcontext>
(gdb) disas getcontext
...
0x00007ffff782517f <getcontext+127>: mov $0xe,%rax
0x00007ffff7825186 <getcontext+134>: syscall
...

Yeah, that's pretty weird. I'll explain why in a minute, but let's look at the disassembly of sigprocmask first:

(gdb) p sigprocmask
$2 = {<text variable, no debug info>} 0x7ffff7817340 <__sigprocmask>
(gdb) disas sigprocmask
...
0x00007ffff7817383 <__sigprocmask+67>: mov $0xe,%rax
0x00007ffff7817388 <__sigprocmask+72>: syscall
...

Yeah, this is a bit confusing, but here's the deal.

Recent Linux kernels provide a faster way to make system calls. The old way (int $0x80) turned out to be pretty slow, so newer x86 CPUs added dedicated instructions for entering the kernel without that overhead: sysenter/sysexit on 32-bit, and the syscall instruction you see above on x86_64.

All you need to know right now (I'll try to blog more about this in the future) is that the %rax register holds the system call number. The syscall instruction transfers control to the kernel and the kernel figures out which syscall you wanted by checking the value in %rax. Let's just make sure that sigprocmask is actually 0xe:

[joe@pluto:/usr/include]% grep -Hrn "sigprocmask" asm-x86_64/unistd.h 
asm-x86_64/unistd.h:44:#define __NR_rt_sigprocmask                     14


Bingo. It's calling sigprocmask (albeit a bit obscurely).

OK, so getcontext isn't calling sigprocmask directly; instead, it duplicates the code that sigprocmask has in its function body. That's why we didn't hit the sigprocmask breakpoint: GDB would have broken if you landed on the address 0x7ffff7817340, but you never did.

Instead, getcontext reimplements the wrapper code for sigprocmask itself and GDB is none the wiser.

Mystery solved.
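You can watch the same difference outside of Ruby with a tiny C program run under strace -c (a sketch; exact counts will vary with your libc):

#include <setjmp.h>
#include <ucontext.h>

int main(void)
{
    ucontext_t ctx;
    jmp_buf    env;

    /* each getcontext() shows up as an rt_sigprocmask syscall under strace */
    for (int i = 0; i < 100000; i++)
        getcontext(&ctx);

    /* the _setjmp() loop should add essentially none */
    for (int i = 0; i < 100000; i++)
        if (_setjmp(env))
            break;    /* never taken; we only save, we never longjmp */

    return 0;
}

Compile it with gcc -std=c99 and run it under strace -c: the rt_sigprocmask count tracks the getcontext loop, and drops to almost nothing if you comment that loop out.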

The patch

Get it HERE

The patch works by adding a new configure flag called --disable-ucontext which lets you specifically disable [sg]etcontext from being used. You use this in conjunction with --enable-pthread, like this:

./configure --disable-ucontext --enable-pthread


After you build Ruby configured like that, its performance is on par with (and sometimes slightly faster than) Ruby built with --disable-pthread: about a 30% performance boost when compared to plain --enable-pthread.

I added the switch because I wanted to preserve the original Ruby behavior: if you just pass --enable-pthread without --disable-ucontext, Ruby will do the old thing and generate piles of sigprocmask calls.

Conclusion

  1. Things aren't always what they seem - GDB may lie to you. Be careful.
  2. Use the source, Luke. Libraries can do unexpected things; debug builds of libc can help!
  3. I know I keep saying this, assembly is useful. Start learning it today!

If you enjoyed this blog post, consider subscribing (via RSS) or following (via twitter).

You'll want to stay tuned; tmm1 and I have been on a roll the past week. Lots of cool stuff coming out!

Written by Joe Damato

May 5th, 2009 at 3:20 am

MySQL Doesn’t Always Suck; This Time it’s AMD



UPDATED! Added explanation about how to check if you are affected below. Thanks to vicaya on hacker news for commenting about this.

This article is going to take a look at a rather nasty low-level hardware bug which could bite people writing applications which are multi-threaded. The bug manifests itself in MySQL as we’ll see a bit later. Part of the blame for the nastiness of this bug certainly lies with a third party who should have (and could have) publicly released a detailed errata, as we’ll see in a bit.

Consider this article a bit of a thought experiment; when you read this article ask yourself: “How would I have debugged this?”

A quick cautionary note: We’re going to look at a little bit of assembly. If you don’t know assembly, don’t worry.

Starting at the atom

An important concept when building complex systems with multiple threads or processes is the concept of atomicity. Very briefly, atomicity is the idea that there is some set of instructions that when executed appear as an indivisible unit to the rest of the system.

The Intel x86 architecture has two ways to execute atomic operations:

  • guaranteed atomic operations
  • system bus locking – a processor asserts a signal and other processors are unable to gain control of the system bus during this time

Guaranteed atomic operations

The less interesting of the two, but on the Intel 486 and above:

  • Reading or writing a byte
  • Reading or writing a word aligned on a 16-bit boundary
  • Reading or writing a doubleword aligned on a 32-bit boundary

are all guaranteed to be atomic.

[NOTE: There are a few more guaranteed atomic operations for the Pentium and P6 families, but I won't go into the details of them here as there is way too much bleeding involved. If you are the type of person who runs toward a burning building check out: Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3A Chapter 7 for more information about the other guaranteed atomic operations.]

This list is good, but it doesn’t give us enough power to actually implement a synchronization primitive. You can imagine two processes competing for a resource. One process reads a flag to check whether the resource is available, but before it has time to flip the flag, the other process reads the same value, and now both believe they own the resource.

We need a little more juice: we’d like to be able to read, write, and maybe compare as one indivisible operation. This is where bus locking and two x86 instructions step in to save the day.

Bus locking

There are two instructions on the x86 which are commonly used as atomic operations:

  • xchg

    exchange a source and destination.

  • cmpxchg

    compare the EAX register with the destination; if they are equal, load the source into the destination. Otherwise, load the destination into EAX.

And there are two flavors of bus locking:

  • automatic bus locking – bus is locked whenever xchg and other special instructions are executed
  • software controlled bus locking – a small set of instructions (cmpxchg, dec, and others) can be executed atomically if preceded with the lock prefix

OK, cool. Now we’re talking. We have enough firepower to build some synchronization primitives.

Let’s build a mutex

Let’s build a simple mutex using cmpxchg.

Assume a function exists:

/**
 * compare old with the destination memory. if they match, put new
 *  in the destination and return old. Otherwise, return new.
 */

int cmpxchg(void *memory, int old, int new);

This implementation is left as an exercise to the reader. Wow, I’ve always wanted to say that.
(Protip: If you don’t feel like bleeding, you can just: #include <asm/system.h> which will take care of it for you on Linux.)
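If you do feel like bleeding a little, here is one possible sketch built on a GCC/clang builtin (my own stand-in, not kernel or interpreter code; note that it follows the return convention described in the comment above rather than the raw instruction’s):

/* compare *memory with old; if they match, store new and return old,
 * otherwise return new -- the convention described above */
int cmpxchg(void *memory, int old, int new)
{
    /* __sync_val_compare_and_swap emits a lock cmpxchg under the hood and
     * returns the value that was actually in memory before the operation */
    int prev = __sync_val_compare_and_swap((int *)memory, old, new);
    return (prev == old) ? old : new;
}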

Then we can implement a stupid and simple spinlock like this:

typedef int mutex;
void mutex_init (mutex *m) { *m = 1; }
void mutex_lock (mutex *m) { while (!cmpxchg(m, 1, 0));  }
void mutex_unlock (mutex *m) { *m = 1; }

That was pretty easy! Admittedly, it is a pretty stupid implementation. You could imagine adding more features, and even building more complex synchronization primitives (semaphores, barriers, read-write locks, etc). Go crazy and build them out.

Synchronization bugs are nasty

So we’ve talked a little bit about how atomic operations work, and we’ve built a simple mutex. Complex multi-threaded applications can use lots of synchronization primitives to control access to a shared resource (like a piece of memory, or a file, etc). As you can imagine, weird bugs can crop up like deadlock, livelock, and memory corruption if synchronization primitives are used incorrectly.

A large multi-threaded application used by lots of people in the Web 2.0 world is MySQL. MySQL, like any other application, has bugs. Take a look at an interesting bug report filed on Feb 5, 2007 regarding a MySQL crash: http://bugs.mysql.com/bug.php?id=26081

So, it looks as if an assertion is being tripped in the source code, but the code in the second update to the bug looks very straightforward. Maybe someone else is writing all over the mutex and corrupting its state? Definitely possible, but there is something strange about the bug reports that start to roll in afterward.

It seems that everyone hitting this bug is using AMD Opteron CPUs. Maybe this is a hardware bug?

AMD Opteron bug causes a MySQL crash

It turns out that MySQL was crashing because of a hardware bug in AMD Opteron Revision E and pre-release Revision F processors.

There is no official errata published about this, but we can find clues in the OpenSolaris mutex implementation.

From OpenSolaris’ IA-32 lock_prim.S file (NOTE: AT&T syntax):

560 	ENTRY_NP(mutex_enter)
...
563 	lock
564 	cmpxchgq %rdx, (%rdi)
565 	jnz	mutex_vector_enter
566 	.mutex_enter_lockstat_patch_point:
567 	#if defined(OPTERON_WORKAROUND_6323525)
568 	.mutex_enter_6323525_patch_point:
569 	ret					/* nop space for lfence */
570 	nop
571 	nop
....

Note that on line 563 the lock prefix is used, so the cmpxchg instruction should be atomic. But something weird is going on here: notice how line 567 checks for a macro called OPTERON_WORKAROUND_6323525? That looks suspicious, as does the comment on line 569. The nop instruction is a no-op; it doesn’t do anything. In this case it looks as if the nop instructions are being used to set aside space that will be overwritten later with an lfence if you have an affected AMD CPU.

If you look up lfence and nop in the Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2A: Instruction Set Reference, A-M, you can see that nop instruction is 1 byte wide and the lfence instruction is 2 bytes wide.

Looks like the OpenSolaris guys left just enough room for one lfence.

If we dig a little deeper in lock_prim.S, we discover a helpful comment:

1133 /*
1134  * patch_workaround_6323525: provide workaround for 6323525
1135  *
1136  * The workaround is to place a fencing instruction (lfence) between the
1137  * mutex operation and the subsequent read-modify-write instruction.
1138  *
1139  * This routine hot patches the lfence instruction on top of the space
1140  * reserved by nops in the lock enter routines.
1141  */

Wow.

That is extremely painful. So it turns out that an lfence instruction is needed after the cmpxchg. lfence (load fence) forces every memory read that precedes it to complete before any read that follows it. Without the lfence, a read that appears after the cmpxchg can be executed before the cmpxchg, and that read can pick up stale/incomplete state information.

The workaround hot patches the kernel and writes an lfence instruction in the space set aside by the nop instructions on lines 570 and 571.

Thankfully, MySQL is well written and a runtime assertion caught this bug before it could do real damage or corrupt data.

But what about your multi-threaded code? Is it equally well-written? Would you have caught the same hardware bug with runtime assertions? Now is definitely a good time to go add those assertions you’ve been putting off.

Current state of affairs of this bug

This bug is fixed in recent versions of OpenSolaris and it is unclear whether or not this is currently an issue with Linux. Depending on which version of the Linux kernel you are using, Linux uses an atomic decrement (dec) in the fast path and an xchg in the slow path. With the lack of an official public errata about this issue, it is difficult to tell which other instructions could be affected.

How to check if you are affected

Special thanks to vicaya on hacker news for suggesting this!

If you are using Linux you can simply check /proc/cpuinfo:

$ cat /proc/cpuinfo

If your CPU family is 15 and your model is 32-63 inclusive, your processor is buggy!
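If you would rather ask the CPU directly, here is a small sketch using GCC’s <cpuid.h> (my own illustration; the family/model decoding follows the usual CPUID conventions, so double-check it against your vendor’s documentation):

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned eax, ebx, ecx, edx;
    __get_cpuid(1, &eax, &ebx, &ecx, &edx);   /* leaf 1: family/model/stepping */

    unsigned base_family = (eax >> 8)  & 0xf;
    unsigned base_model  = (eax >> 4)  & 0xf;
    unsigned ext_family  = (eax >> 20) & 0xff;
    unsigned ext_model   = (eax >> 16) & 0xf;

    /* family 0xf encodes the real values in the extended fields */
    unsigned family = base_family + (base_family == 0xf ? ext_family : 0);
    unsigned model  = base_model  | (base_family == 0xf ? (ext_model << 4) : 0);

    printf("cpu family: %u, model: %u\n", family, model);
    if (family == 15 && model >= 32 && model <= 63)
        printf("This is in the affected Opteron range!\n");
    return 0;
}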

Conclusion

There are a couple things to take away from this:

  1. It is pretty surprising how many different kinds of bugs exist in really important areas like atomic operations. It is truly amazing that computers actually boot and do useful things even some of the time.
  2. Staying sharp with your low level skills is still important, even in the Web 2.0 world.
  3. CPUs aren’t normally associated with such subtle bugs, but remember: CPUs are just like any other system component and they have bugs, too.
  4. Last, but not least: Stay away from AMD Opteron Revision E!

Written by Joe Damato

April 6th, 2009 at 12:31 am

I/O models: how you move your data matters


Above picture was shamelessly stolen from: http://computer-history.info/Page4.dir/pages/IBM.7030.Stretch.dir/

In this blog post I’m going to follow up on my threading models post (here) and talk about different types of I/O, how they work, and when you might want to consider using them. Much like threading models, I/O models come with terminology that can be confusing. The confusion leads to misconceptions which will hopefully be cleared up here.

Let’s start first by going over some operating system basics.

System Calls

A system call is a common interface which allows user applications and the operating system kernel to interact with one another. Some familiar functions which are system calls: open(), read(), and write(). These are system calls which ask the kernel to do I/O on behalf of the user process.

There is a cost associated with making system calls. In Linux, system calls are implemented via a software interrupt (or a dedicated fast-syscall instruction on newer CPUs) which causes a privilege level change in the processor – this switch from user to kernel mode is commonly (if a bit loosely) called a context switch.

User applications typically execute at the most restricted privilege level available where interaction with I/O devices (and other stuff) is not allowed. As a result user applications use system calls to get the kernel to complete privileged I/O (and other) operations.

Synchronous blocking I/O

This is the most familiar and most common type of I/O out there. When an I/O operation is initiated in this model (maybe by calling a system call such as read(), write(), ioctl(), …), the user application making the system call is put into a waiting state by the kernel. The application sleeps until the I/O operation has completed (or has generated an error) at which point it is scheduled to run again. Data is transferred from the device to memory and possibly into another buffer for the user-land application.

Pros:

  • Easy to use and well understood
  • Ubiquitous

Cons:

  • Does not maximize I/O throughput
  • Causes all threads in a process to block if that process uses green threads

This method of I/O is very straightforward and simple to use, but it has many downsides. In a previous post about threading models, I mentioned that doing blocking I/O in a green thread causes all green threads to stop executing until the I/O operation has completed.

This happens because there is only one kernel context that can be scheduled, so that context is put into a waiting state in the kernel until the I/O has been copied to the user buffer and the process can run again.

Synchronous non-blocking I/O

This model of I/O is not very well known compared to other models. This is good because this model isn’t very useful.

In this model, a file descriptor is created via open(), but a flag is passed in (O_NONBLOCK on most Linux kernels) to tell the kernel: If data is not available immediately, do not put me to sleep. Instead let me know so I can go on with my life. I’ll try back later.

Pros:

  • If no I/O is available other work can be completed in the meantime
  • When I/O is available, it does not block the thread (even in models with green threads)

Cons:

  • Does not maximize I/O throughput for the application
  • Lots of system call overhead – constantly making system calls to see if I/O is ready
  • Can be high latency if I/O arrives and a system call is not made for a while

This model of I/O is typically very inefficient because the I/O system call made by the application may return EAGAIN or EWOULDBLOCK repeatedly. The application can either:

  • wait around for the data to finish (repeatedly calling its I/O system call over and over)  — or
  • try to do other work for a bit, and retry the I/O system call later

At some point the I/O will either return an error or it will be able to complete.

If this type of I/O is used in a system with green threads, the entire process is not blocked, but the efficiency is very poor due to the constant polling with system calls from user-land. Each time a system call is invoked, a privilege level change occurs on the processor and the execution state of the application has to be saved out to memory (or disk!) so that the kernel can execute.
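In code, the pattern looks roughly like this (a sketch of the model itself, not of any particular library):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];

    /* tell the kernel: never put me to sleep in read() on this descriptor */
    fcntl(STDIN_FILENO, F_SETFL, fcntl(STDIN_FILENO, F_GETFL) | O_NONBLOCK);

    for (;;) {
        ssize_t n = read(STDIN_FILENO, buf, sizeof(buf));
        if (n >= 0) {
            printf("read %zd bytes\n", n);
            break;
        }
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            continue;   /* nothing ready yet -- this busy polling is the overhead above */
        perror("read"); /* a real error */
        break;
    }
    return 0;
}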

Asynchronous blocking I/O

This model of I/O is much better known. In fact, this is how Ruby implements I/O for its green threads.

In this model, non-blocking file descriptors are created (similar to the previous model) and they are monitored by calling either select() or poll(). The call to select()/poll() blocks the process (the process is put into a sleeping state in the kernel) and the system call returns when either an error has occurred or some of the file descriptors are ready to be read from or written to.

Pros:

  • When I/O is available, it does not block
  • Lots of I/O can be issued to execute in parallel
  • Notifications occur when one or more file descriptors are ready (helps to improve I/O throughput)

Cons:

  • Calling select(), poll(), or epoll_wait() blocks the calling thread (entire application if using green threads)
  • Lots of file descriptors for I/O means lots that have to be checked (can be avoided with epoll)

What is important to note here is that more than one file descriptor can be monitored and when select/poll returns, more than one of the file descriptors may be able to do non-blocking I/O. This is great because it increases the application’s I/O throughput by allowing many I/O operations to occur in parallel.

Of course there are two main drawbacks of using this model:

  • select()/poll() block – so if they are used in a system with green threads, all the threads are put to sleep while these system calls are executing.
  • You must check the entire set of file descriptors to determine which are ready. This can be bad if you have a lot of file descriptors, because you can potentially spend a lot of time checking file descriptors which aren’t ready (epoll() fixes this problem).

This model is important for all you Ruby programmers out there — this is the type of I/O that Ruby uses internally. The calls to select cause Ruby to block while they are being executed.

There are some work-arounds though:

  • Timeouts – select() and poll() let you set timeouts so your app doesn’t have to sleep endlessly if there is no I/O to process – it can continue executing other code in the meantime. This is what Ruby does.
  • epoll (or kqueue on BSD) – epoll allows you to register a set of file descriptors you are interested in. You then make blocking epoll_wait() calls (they accept timeouts) which return only the file descriptors that are ready for I/O. This lets you avoid searching through all your file descriptors every time.

At the very least you should set a timeout so that you can do other work if no I/O is ready. If possible though, use epoll().
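For comparison, a bare-bones epoll sketch (Linux-specific; error checking omitted):

#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

int main(void)
{
    int epfd = epoll_create1(0);

    /* register the descriptors we care about (just stdin here) */
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = STDIN_FILENO };
    epoll_ctl(epfd, EPOLL_CTL_ADD, STDIN_FILENO, &ev);

    /* only descriptors that are actually ready come back; the 1000ms timeout
     * lets us go do other work if nothing shows up */
    struct epoll_event ready[64];
    int n = epoll_wait(epfd, ready, 64, 1000);

    for (int i = 0; i < n; i++)
        printf("fd %d is ready for I/O\n", ready[i].data.fd);

    close(epfd);
    return 0;
}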

Asynchronous non-blocking I/O

This is probably the least widely known model of I/O out there. On Linux, this model is available through the POSIX AIO interface (the aio_* functions) and through the kernel’s native AIO library, libaio.

In this I/O model, you can initiate I/O using aio_read(), aio_write(), and a few others. Before using these functions, you must set up a struct aiocb including fields which indicate how you’d like to get notifications and where the data can be read from or written to. Notifications can be delivered in a couple different ways:

  • Signal – a SIGIO is delivered to the process when the I/O has completed
  • Callback – a callback function is called when the I/O has completed

Pros:

  • Helps maximize I/O throughput by allowing lots of I/O to be issued in parallel
  • Allows application to continue processing while I/O is executing, callback or POSIX signal when done

Cons:

  • Wrapper for libaio may not exist for your programming environment
  • Network I/O may not be supported

This method of I/O is really awesome because it does not block the calling application and allows multiple I/O operations to be executed in parallel, which increases the I/O throughput of the application.

The downsides to using libaio are:

  • Wrapper may not exist for your favorite programming language.
  • Unclear whether libaio supports network I/O on all systems — may only support disk I/O. When this happens, the library falls back to using normal synchronous blocking I/O.

You should try out this I/O model if your programming environment has support for it and it either supports network I/O or you don’t need network I/O.
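Here is a minimal sketch using the aio_* interface mentioned above, polling for completion instead of asking for a signal or callback (link with -lrt on older glibc; /etc/hostname is just a convenient file to read):

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    int fd = open("/etc/hostname", O_RDONLY);

    /* describe the read: which fd, where to put the data, how much, from what offset */
    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    aio_read(&cb);                      /* returns immediately */

    while (aio_error(&cb) == EINPROGRESS) {
        /* the I/O is in flight -- the application is free to do other work here */
    }

    ssize_t n = aio_return(&cb);        /* result of the completed read */
    printf("read %zd bytes asynchronously\n", n);
    close(fd);
    return 0;
}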

Conclusion

In conclusion, you should use synchronous blocking I/O when you are writing small apps which won’t see much traffic. For more intense applications, you should definitely use one of the two asynchronous models. If possible, avoid synchronous non-blocking I/O at all costs.

Remember that the goal is to increase I/O throughput to scale your application to withstand thousands of requests per second. Doing any sort of blocking I/O in your application can (depending on threading model) cause your entire application to block, increasing latency and slowing the user experience to a crawl.

Written by Joe Damato

October 27th, 2008 at 8:58 am

Posted in systems


Threading models: So many different ways to get stuff done.


multitasking

Why do I care?

Threading models are often very confusing; there are many different models with different trade-offs, and dissecting the details can be tough the first time around. It is important for any large-scale project to consider which threading model(s) a programming language supports and what implications those models have for the system design, so that the system can be built to perform as well as possible.

Probably the source of a lot of the confusion surrounding threading models is the terminology used to describe the different components. I am going to try to explain some terminology which, to my knowledge, is the most commonly used.

User-land? Kernel-land?

This could be a blog post in and of itself, but let’s try to stay high level here. When I write “user-land” I am referring to the context in which normal applications run, such as a web-browser, or email client. When I write “kernel-land” I am referring to the context in which the kernel executes, typically a more privileged execution context that allows interaction with memory, I/O ports, process scheduling, and other funky stuff.

What is a process?

A process is a collection of various pieces of state for an executable that includes things such as virtual address space, per process flags, file descriptors, and more.

What is a thread?

A thread is just a collection of execution state for a program. Depending on the implementation this can include register values, execution stack, and more. Each process has at least one thread, the main thread. Some processes will create more threads. How new threads are created is where we begin considering the trade-offs.

Let’s look at some different threading models. I’m going to list the Pros and Cons first in case you don’t feel like reading the full explanation :) Let’s get started.

1:1

The 1:1 model, or one kernel thread for each user thread, is a very widespread model that is seen in many operating system implementations like Linux. It is sometimes referred to as “native threads.”

Pros:

  • Threads can execute on different CPUs
  • Threads do not block each other
  • Shared memory

Cons:

  • Setup overhead
  • Linux kernel bug with lots of threads (read more here)
  • Low limits on the number of threads which can be created

What does this mean? It means that each user thread (execution state in user-land) is paired with a kernel thread (execution state in kernel-land). The two commonly interact via system calls and signals. Since state exists in the kernel, the scheduler can schedule threads created in the 1:1 model across different CPUs to execute in parallel. A side effect of this is that if a thread executes a system call that blocks, the other threads in the process can be scheduled and executed in the meantime. In this model, different threads can share the same virtual address space, but care must be taken to synchronize access to the same memory regions. Unfortunately, the kernel has to be notified whenever a new thread is created in user-land so that corresponding state can be created in the kernel. That setup cost is overhead paid on every thread creation, and there is an upper bound on the number of threads (and the amount of thread state) the kernel can track before performance begins to degrade.

You may be familiar with libpthread and the function pthread_create. On Linux, this creates user and kernel state.
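For completeness, here is the 1:1 flavor in its simplest form; each pthread_create() below asks the kernel for a new schedulable entity (a sketch, compile with -lpthread):

#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg)
{
    /* this runs on its own kernel-backed thread and may land on any CPU */
    printf("hello from thread %ld\n", (long)arg);
    return NULL;
}

int main(void)
{
    pthread_t tids[4];
    for (long i = 0; i < 4; i++)
        pthread_create(&tids[i], NULL, worker, (void *)i);
    for (int i = 0; i < 4; i++)
        pthread_join(tids[i], NULL);
    return 0;
}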

1:N

The 1:N model, or one kernel thread for N user threads, is a model that is commonly called “green threads” or “lightweight threads.”

Pros:

  • Thread creation, execution, and cleanup are cheap
  • Lots of threads can be created (10s of thousands or more)

Cons:

  • Kernel scheduler doesn’t know about threads so they can’t be scheduled across CPUs or take advantage of SMP
  • Blocking I/O operations can block all the green threads

In this model a process manages thread creation, termination, and scheduling completely on its own, without the help or knowledge of the kernel. The major upside of this model is that thread creation, termination, cleanup, and synchronization are extremely cheap, and it is possible to create huge numbers of threads in user-land. This model has several downsides, though. One of the major downsides is not being able to utilize the kernel’s scheduler. As a result, all the user-land threads execute on the same CPU and cannot take advantage of true parallel execution. One way to cope with this is to create multiple processes (perhaps via fork()) and then have the processes communicate with each other. A model like this begins to look very much like the M:N model described below.

MRI Ruby 1.8.7 has green threads. Early versions of Java also had green threads.
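To get a feel for the 1:N model, here is a toy cooperative switch built entirely in user-land with the ucontext API (a sketch only; a real green-thread implementation adds a scheduler, a timer, and non-blocking I/O hooks):

#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, green_ctx;
static char green_stack[64 * 1024];        /* the "thread" stack lives inside our process */

static void green_thread(void)
{
    printf("green thread: running\n");
    swapcontext(&green_ctx, &main_ctx);    /* yield back -- no kernel scheduling involved */
    printf("green thread: resumed, exiting\n");
}

int main(void)
{
    getcontext(&green_ctx);
    green_ctx.uc_stack.ss_sp   = green_stack;
    green_ctx.uc_stack.ss_size = sizeof(green_stack);
    green_ctx.uc_link          = &main_ctx;   /* where to go when the function returns */
    makecontext(&green_ctx, green_thread, 0);

    swapcontext(&main_ctx, &green_ctx);       /* "schedule" the green thread */
    printf("main: green thread yielded\n");
    swapcontext(&main_ctx, &green_ctx);       /* run it to completion */
    printf("main: done\n");
    return 0;
}

Note that on Linux the ucontext calls also fiddle with the signal mask, which is exactly the sigprocmask overhead dissected in the configure.in post above.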

M:N

The M:N model, or M kernel threads for N user threads, is a model that is a hybrid of the previous two models.

Pros:

  • Take advantage of multiple CPUs
  • Not all threads are blocked by blocking system calls
  • Cheap creation, execution, and cleanup

Cons:

  • Need scheduler in userland and kernel to work with each other
  • Green threads doing blocking I/O operations will block all other green threads sharing same kernel thread
  • Difficult to write, maintain, and debug code

This hybrid model appears to be a best-of-both-worlds solution that includes all the advantages of 1:1 and 1:N threading without any of the downsides. Unfortunately, the cost of the downsides outweighs many of the advantages, to the extent that in many cases it isn’t worth building or using an M:N threading model. In general, building a user-land scheduler and keeping it synchronized with the kernel scheduler makes programming in this model extremely difficult and error prone. Research comparing M:N and 1:1 threading was done for the Linux kernel to decide how its threading should evolve, and the performance results and use cases showed the 1:1 model to be superior in general. On the other hand, in specific, well-understood problem domains M:N may be the right choice.

Erlang has what many consider to be an M:N threading model. Prior to Solaris 9, Solaris supported an M:N threading model.

So which should I use?

Well, it is a tough call. You need to sit and think awhile about what your specific system needs and how intelligent your libraries are. In some implementations of the 1:N threading model I/O operations will all be abstracted away into a non-blocking I/O subsystem. If your library of choice does not (or cannot due to language design) hook into this non-blocking I/O subsystem, your library may block all your green threads clobbering your performance.

You should strongly consider the threading model(s) supported by the programming language(s) and libraries you choose because this decision will have impact on your performance, application execution time, and I/O operations.

Thanks for reading!

Written by Joe Damato

October 8th, 2008 at 1:00 pm

Posted in systems
