time to bleed by Joe Damato

technical ramblings from a wanna-be unix dinosaur

Archive for October, 2008

I/O models: how you move your data matters


Above picture was shamelessly stolen from: http://computer-history.info/Page4.dir/pages/IBM.7030.Stretch.dir/

In this blog post I’m going to follow up on my threading models post (here) and talk about different types of I/O: how they work and when you might want to consider using them. Much like threading models, I/O models come with terminology that can be confusing. The confusion leads to misconceptions which will hopefully be cleared up here.

Let’s start first by going over some operating system basics.

System Calls

A system call is a common interface which allows user applications and the operating system kernel to interact with one another. Some familiar functions which are system calls: open(), read(), and write(). These are system calls which ask the kernel to do I/O on behalf of the user process.

There is a cost associated with making system calls. In Linux, system calls are implemented via a software interrupt which causes a privilege level change in the processor; this switch from user mode to kernel mode is commonly (if somewhat loosely) called a context switch.

User applications typically execute at the most restricted privilege level available, where direct interaction with I/O devices (and other stuff) is not allowed. As a result, user applications use system calls to have the kernel perform privileged I/O (and other) operations on their behalf.

Synchronous blocking I/O

This is the most familiar and most common type of I/O out there. When an I/O operation is initiated in this model (maybe by calling a system call such as read(), write(), ioctl(), …), the user application making the system call is put into a waiting state by the kernel. The application sleeps until the I/O operation has completed (or has generated an error) at which point it is scheduled to run again. Data is transferred from the device to memory and possibly into another buffer for the user-land application.

Pros:

  • Easy to use and well understood
  • Ubiquitous

Cons:

  • Does not maximize I/O throughput
  • Causes all threads in a process to block if that process uses green threads

This method of I/O is very straightforward and simple to use, but it has many downsides. In a previous post about threading models, I mentioned that doing blocking I/O in a green thread causes all green threads to stop executing until the I/O operation has completed.

This happens because there is only one kernel context that can be scheduled, so that context is put into a waiting state in the kernel until the data has been copied to the user buffer and the process can run again.
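To make this concrete, here is a minimal sketch of the blocking model in C. The file path is made up, but the calls are the standard open()/read()/close(); the read() call is where the process is put to sleep until the data is available.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* open() and read() both block by default */
    int fd = open("/tmp/some-file", O_RDONLY);   /* hypothetical path */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    char buf[4096];
    /* the process sleeps in the kernel until the data has been copied into buf */
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n < 0)
        perror("read");
    else
        printf("read %zd bytes\n", n);

    close(fd);
    return 0;
}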

Synchronous non-blocking I/O

This model of I/O is not very well known compared to the other models. That’s just as well, because it isn’t very useful.

In this model, a file descriptor is created via open(), but a flag is passed in (O_NONBLOCK on most Linux kernels) to tell the kernel: If data is not available immediately, do not put me to sleep. Instead let me know so I can go on with my life. I’ll try back later.

Pros:

  • If no I/O is available, other work can be completed in the meantime
  • When I/O is available, it does not block the thread (even in models with green threads)

Cons:

  • Does not maximize I/O throughput for the application
  • Lots of system call overhead – constantly making system calls to see if I/O is ready
  • Can be high latency if I/O arrives and a system call is not made for a while

This model of I/O is typically very inefficient because the I/O system call made by the application may return EAGAIN or EWOULDBLOCK repeatedly. The application can either:

  • wait around for the I/O to become ready (repeatedly retrying the same system call over and over)  — or
  • try to do other work for a bit, and retry the I/O system call later

At some point the I/O will either return an error or it will be able to complete.

If this type of I/O is used in a system with green threads, the entire process is not blocked, but the efficiency is very poor due to the constant polling with system calls from user-land. Each time a system call is invoked, a privilege level change occurs on the processor and the execution state of the application has to be saved out to memory (or disk!) so that the kernel can execute.
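Here is a minimal sketch of that busy-polling pattern in C, reading from a hypothetical FIFO (the path is made up): the descriptor is opened with O_NONBLOCK and the read is retried whenever the kernel reports EAGAIN/EWOULDBLOCK.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* O_NONBLOCK tells the kernel: never put me to sleep on this descriptor */
    int fd = open("/tmp/some-fifo", O_RDONLY | O_NONBLOCK);  /* hypothetical path */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    char buf[4096];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n >= 0) {
            printf("read %zd bytes\n", n);
            break;
        }
        if (errno == EAGAIN || errno == EWOULDBLOCK) {
            /* nothing to read yet -- a real app would do other work here */
            usleep(1000);
            continue;
        }
        perror("read");
        break;
    }

    close(fd);
    return 0;
}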

Asynchronous blocking I/O

This model of I/O is much more well known. In fact, this is how Ruby implements I/O for its green threads.

In this model, non-blocking file descriptors are created (similar to the previous model) and they are monitored by calling either select() or poll(). The select()/poll() system call blocks the process (the process is put into a sleeping state in the kernel) and returns when either an error has occurred or when the file descriptors are ready to be read from or written to.

Pros:

  • When I/O is available, it does not block
  • Lots of I/O can be issued to execute in parallel
  • Notifications occur when one or more file descriptors are ready (helps to improve I/O throughput)

Cons:

  • Calling select(), poll(), or epoll_wait() blocks the calling thread (entire application if using green threads)
  • Lots of file descriptors for I/O means lots that have to be checked (can be avoided with epoll)

What is important to note here is that more than one file descriptor can be monitored and when select/poll returns, more than one of the file descriptors may be able to do non-blocking I/O. This is great because it increases the application’s I/O throughput by allowing many I/O operations to occur in parallel.
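As a rough sketch of this pattern in C, the (hypothetical) helper below waits on two descriptors that are assumed to already be open and non-blocking; when select() returns, either or both of them may be ready.

#include <stdio.h>
#include <sys/select.h>

/* Wait (with a timeout) until one or both descriptors are readable,
 * then report which ones are ready. */
void wait_for_input(int fd1, int fd2)
{
    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(fd1, &readfds);
    FD_SET(fd2, &readfds);

    struct timeval tv = { 1, 0 };   /* 1 second timeout */

    /* select() blocks the calling thread until a descriptor is ready,
     * an error occurs, or the timeout expires */
    int maxfd = (fd1 > fd2 ? fd1 : fd2) + 1;
    int ready = select(maxfd, &readfds, NULL, NULL, &tv);

    if (ready < 0)
        perror("select");
    else if (ready == 0)
        printf("timed out -- no I/O ready, go do other work\n");
    else {
        if (FD_ISSET(fd1, &readfds)) printf("fd1 is readable\n");
        if (FD_ISSET(fd2, &readfds)) printf("fd2 is readable\n");
    }
}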

Of course there are two main drawbacks of using this model:

  • select()/poll() block – so if they are used in a system with green threads, all the threads are put to sleep while these system calls are executing.
  • You must check the entire set of file descriptors to determine which are ready. This can be bad if you have a lot of file descriptors, because you can potentially spend a lot of time checking file descriptors which aren’t ready (epoll() fixes this problem).

This model is important for all you Ruby programmers out there — this is the type of I/O that Ruby uses internally. The calls to select cause Ruby to block while they are being executed.

There are some work-arounds though:

  • Timeouts – select() and poll() let you set timeouts so your app doesn’t have to sleep endlessly if there is no I/O to process – it can continue executing other code in the meantime. This is what Ruby does.
  • epoll() (or kqueue on BSD) – epoll() allows you to register a set of file descriptors you are interested in. You then make blocking epoll_wait() calls (which also accept timeouts) that return only the file descriptors which are ready for I/O. This lets you avoid searching through all your file descriptors every time.

At the very least you should set a timeout so that you can do other work if no I/O is ready. If possible though, use epoll().
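For comparison with the select() sketch above, here is a minimal epoll sketch (the helper name and the single descriptor are made up; on BSD the equivalent would be kqueue/kevent): the descriptor is registered once, and epoll_wait() returns only descriptors that are actually ready.

#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Register fd with an epoll instance and wait, with a timeout, until it is
 * readable. fd is assumed to be an already-open non-blocking descriptor. */
void wait_with_epoll(int fd)
{
    int epfd = epoll_create(16);   /* the size argument is just a hint */
    if (epfd < 0) {
        perror("epoll_create");
        return;
    }

    struct epoll_event ev;
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
        perror("epoll_ctl");
        close(epfd);
        return;
    }

    struct epoll_event events[16];
    /* blocks for at most 1000 ms and returns only the ready descriptors */
    int n = epoll_wait(epfd, events, 16, 1000);

    if (n < 0)
        perror("epoll_wait");
    else if (n == 0)
        printf("timed out -- no I/O ready\n");
    else
        for (int i = 0; i < n; i++)
            printf("fd %d is ready\n", events[i].data.fd);

    close(epfd);
}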

Asynchronous non-blocking I/O

This is probably the least widely known model of I/O out there. This model of I/O is implemented via the libaio library on Linux.

In this I/O model, you can initiate I/O using aio_read(), aio_write(), and a few others. Before using these functions, you must set up a struct aiocb, including fields which indicate how you’d like to get notifications and where the data can be read from or written to. Notifications can be delivered in a couple of different ways (a minimal sketch follows the list below):

  • Signal – a SIGIO is delivered to the process when the I/O has completed
  • Callback – a callback function is called when the I/O has completed
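Below is a minimal sketch using aio_read() and struct aiocb as described above (these are the POSIX AIO interfaces declared in <aio.h>; on glibc you link with -lrt). The file path is made up, and for simplicity the sketch polls for completion with aio_error() rather than requesting a signal or callback through the aio_sigevent field.

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/some-file", O_RDONLY);   /* hypothetical path */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    char buf[4096];
    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;            /* which descriptor to read from */
    cb.aio_buf    = buf;           /* where the data should end up  */
    cb.aio_nbytes = sizeof(buf);   /* how much to read              */
    cb.aio_offset = 0;             /* file offset to start at       */

    /* initiate the read and return immediately -- the I/O proceeds while
     * the application keeps doing other work */
    if (aio_read(&cb) < 0) {
        perror("aio_read");
        return 1;
    }

    /* ... other useful work would happen here ... */

    /* poll for completion; a real app might fill in cb.aio_sigevent to get
     * a signal or callback instead */
    while (aio_error(&cb) == EINPROGRESS)
        usleep(1000);

    printf("aio_read completed: %zd bytes\n", aio_return(&cb));

    close(fd);
    return 0;
}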

Pros:

  • Helps maximize I/O throughput by allowing lots of I/O to be issued in parallel
  • Allows application to continue processing while I/O is executing, callback or POSIX signal when done

Cons:

  • Wrapper for libaio may not exist for your programming environment
  • Network I/O may not be supported

This method of I/O is really awesome because it does not block the calling application and allows multiple I/O operations to be executed in parallel, which increases the I/O throughput of the application.

The downsides to using libaio are:

  • Wrapper may not exist for your favorite programming language.
  • Unclear whether libaio supports network I/O on all systems — may only support disk I/O. When this happens, the library falls back to using normal synchronous blocking I/O.

You should try out this I/O model if your programming environment has support for it, and if it either supports network I/O or you don’t need network I/O.

Conclusion

In conclusion, you should use synchronous blocking I/O when you are writing small apps which won’t see much traffic. For more intense applications, you should definitely use one of the two asynchronous models. If possible, avoid synchronous non-blocking I/O at all costs.

Remember that the goal is to increase I/O throughput to scale your application to withstand thousands of requests per second. Doing any sort of blocking I/O in your application can (depending on threading model) cause your entire application to block, increasing latency and slowing the user experience to a crawl.

Written by Joe Damato

October 27th, 2008 at 8:58 am

Posted in systems


Plugging Ruby Memory Leaks: heap/stack dump patches to help take out the trash.


EDIT: For all those tuning into my blog for the first time be sure to check out the Ruby threading bugfix from an earlier post: Ruby Threading Bugfix: Small fix goes a long way

In this blog post I’m releasing some patches to MRI Ruby 1.8.7p72 that add heap dumping, an object reference finder, stack dumping, object allocation/deallocation tracking, and some more goodies. These are some changes to Ruby I’ve been working on at the startup I am with (Kickball Labs!), but I think they are generally useful to other people who are looking to track down memory problems in Ruby. Let’s get started…

Memory leaks? In Ruby ?!

Many people working on large Ruby applications have complained about memory consumption and the lack of tools available to dump the object heap and search for which objects hold references to other objects. In a future blog post, I hope to talk about different methods of garbage collection and dissect some popular languages and their GC algorithms. Very briefly though, let’s take a quick look at how Ruby’s GC works and how leaks can occur.

The garbage collector in MRI Ruby marks objects as in use during the mark phase if a reference to the object is found. The traversal is started at a set of top level root objects which are marked as in use and recurses on the objects which are referenced from the current object. In this way all reachable objects are marked as in use. The sweep phase then hits every object on the heap and marks objects as free if they are not in use.

We all try to write code which is decoupled, clean, and well documented. Let’s face it though: we are human and we make mistakes. Imagine a pile of objects all holding references to each other. As long as this pile of objects has no reachable reference from a path originating at a root object, these objects will get garbage collected. Unfortunately, from time to time we forget to clear out all references to old objects we are no longer using. In our pile-of-objects example, all it takes is one reachable reference from a root object to keep the entire pile of objects in memory.

Here at Kickball Labs we thought we were encountering exactly such a situation, but we had no easy way to verify and track it down. I hacked up a heap dumper, stack dumper, reference finder, and some other stuff to help in our quest to take out the trash.

Memtrack APIs

To start, I did a bunch of Google searching to find out whether anyone else had gone down this path before. I stumbled across a company called Software Verify, which released a really cool set of APIs, the Ruby Memory Tracking API, for tracking object allocation and deallocation, getting the root heap objects, and getting references to objects. The API was really clean and allowed me to hook in my own callbacks for each of those events. One small downside though: the Memtrack APIs were written for Ruby 1.8.6, and the diff I created did not apply cleanly to 1.8.7. With a tiny bit of work I was able to get the APIs to apply to 1.8.7p72. The patch can be found below and applies to MRI Ruby 1.8.7-p72.

Heap/Stack dumper overview

I hooked some callbacks into the Memtrack API and added some additional code for heap_find and stack_dump. My algorithm for heap_dump and heap_find is pretty simple: I traverse each object on the heap and either dump it (for heap_dump) or check whether it holds a reference to the object passed into heap_find (for heap_find).

I also copied some code from the mark phase of Ruby’s GC to check the contents of the registers and stack frames. I did this because Ruby checks this information before marking an object as in use. You may be wondering how this can possibly work; couldn’t any FIXNUM that gets passed around in the Ruby internals look like a heap pointer? Well… not quite. Ruby FIXNUMs are 31 bits, so they can be bit-shifted left by 1 and have their LSB set to 1 without losing any data. Since heap pointers are word-aligned (their low bit is always 0), the resulting tagged value can never look like a pointer to the heap.
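Here is a rough sketch of that tagging scheme in C, modeled on (and simplified from) the INT2FIX/FIX2LONG/FIXNUM_P macros in MRI’s ruby.h:

#include <stdio.h>

typedef unsigned long VALUE;

/* A Fixnum is stored as the integer shifted left one bit with the low bit
 * set to 1. Heap pointers are word-aligned, so their low bit is always 0;
 * a tagged Fixnum can therefore never be mistaken for a heap pointer. */
#define FIXNUM_FLAG 0x01
#define INT2FIX(i)  ((VALUE)(((long)(i)) << 1 | FIXNUM_FLAG))
#define FIX2LONG(v) (((long)(v)) >> 1)
#define FIXNUM_P(v) (((long)(v)) & FIXNUM_FLAG)

int main(void)
{
    VALUE v = INT2FIX(42);

    printf("tagged value: 0x%lx\n", v);            /* 0x55 -- low bit is set */
    printf("is a fixnum?  %ld\n", (long)FIXNUM_P(v));
    printf("untagged:     %ld\n", FIX2LONG(v));    /* back to 42 */
    return 0;
}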

HOWTO use the stack/heap dumper

Four functions were added to GC: heap_dump, heap_find, heap_counts, and stack_dump. After applying the patches below and rebuilding Ruby, you can use these functions. There are a bunch of environment variables you can set to control how much output you get. WARNING: There can be a lot of output.

Environment vars and what they do:

RUBY_HEAP_LOG – setting this environment var will cause heap dumping output to be written to the specified file. If this is not set, data will be written to stderr.

RUBY_HEAP_ALLOC – setting this environment var to anything greater than or equal to 1 will cause output whenever a Ruby object is allocated or deallocated. This creates a LOT of output.

RUBY_HEAP_SETUP – setting this environment var to anything greater than or equal to 1 will cause output whenever a ruby object has its klass field set.

Ok – now let’s take a quick look at the four added functions and what they do when you use them:

GC.heap_dump – this function writes to stderr (unless RUBY_HEAP_LOG is set) the contents of the Ruby heap. It outputs each object that is not marked as free and information about that object and the objects it references. For strings, for example, the value and length of the string will be output. For classes, the name of the class (unless it is an anonymous class) and the superclass.

GC.heap_find – this function takes one argument and traverses the heap looking for objects which hold a reference to the argument. Once found, it outputs information about the object holding the reference and also information about the object you passed in.

GC.heap_counts – this function is poorly implemented and can only be called after you have called GC.heap_dump. This function outputs the number of each type of Ruby object that has been created (object, node, class, module, string, fixnum, float, etc).

GC.stack_dump – this function checks the registers and stack frames for values which look like pointers to the heap. If a value looks like it points into the heap, its address is output along with information about the object it references.

The good stuff: Patches for Ruby 1.8.7p72

Apply the patches below to the MRI Ruby 1.8.7p72 code base and rebuild Ruby. The code below is a bit ugly as it was written in a rush. Use with caution and be sure to email me with any bugfixes or cool additions you come up with!

Patch to apply Memtrack API to MRI Ruby 1.8.7-p72:
memtrack-api-187p72.patch

Patch to apply heap dumper, stack dumper, ref finder, alloc/dealloc tracker, and everything else which is awesome:

heapdumper.patch

Conclusion

Ruby is lacking tools for serious memory analysis. Using the Memtrack API and some other code, tracking down memory issues can be less painful. The tools included in the patches above are only a first cut at developing real, useful memory analysis tools for Ruby. I hope to see more work in creating tools for Ruby.

Thanks for reading! If you have problems using the patches or you create a cool memory analysis tool send me some mail.

Written by Joe Damato

October 15th, 2008 at 11:30 am

Threading models: So many different ways to get stuff done.



Why do I care?

Threading models are often very confusing; there are many different models with different trade-offs, and dissecting the details can be tough the first time around. It is important for any large-scale project to consider which threading model(s) a programming language supports and what implications the model(s) will have on the system design, so that the resulting software system performs as well as possible.

A lot of the confusion surrounding threading models probably comes from the terminology used to describe the different components. I am going to try to explain some terminology which, to my knowledge, is the most commonly used.

User-land? Kernel-land?

This could be a blog post in and of itself, but let’s try to stay high level here. When I write “user-land” I am referring to the context in which normal applications run, such as a web-browser, or email client. When I write “kernel-land” I am referring to the context in which the kernel executes, typically a more privileged execution context that allows interaction with memory, I/O ports, process scheduling, and other funky stuff.

What is a process?

A process is a collection of various pieces of state for an executable that includes things such as virtual address space, per process flags, file descriptors, and more.

What is a thread?

A thread is just a collection of execution state for a program. Depending on the implementation this can include register values, execution stack, and more. Each process has at least one thread, the main thread. Some processes will create more threads. How new threads are created is where we begin considering the trade-offs.

Let’s look at some different threading models. I’m going to list the Pros and Cons first in case you don’t feel like reading the full explanation :) Let’s get started.

1:1

The 1:1 model, or one kernel thread for each user thread, is a very widespread model that is seen in many operating system implementations like Linux. It is sometimes referred to as “native threads.”

Pros:

  • Threads can execute on different CPUs
  • Threads do not block each other
  • Shared memory

Cons:

  • Setup overhead
  • Linux kernel bug with lots of threads (read more here)
  • Low limits on the number of threads which can be created

What does this mean? This means that each user-thread (execution state in user-land) is paired with a kernel-thread (execution state in kernel-land). The two commonly interact via system calls and signals. Since state exists in the kernel, the scheduler can schedule threads created in the 1:1 model across different CPUs to execute in parallel. A side effect of this is that if a thread executes a system call that blocks, the other threads in the process can be scheduled and executed in the meantime. In this model, different threads can share the same virtual address space, but care must be taken to synchronize access to the same memory regions. Unfortunately, since the kernel has to be notified when a new thread is created in userland so that corresponding state can be created in the kernel, this setup cost is overhead that must be paid each time a thread is created. There is also an upper bound on the number of threads (and the amount of thread state) that the kernel can track before performance begins to degrade.

You may be familiar with libpthread and the function pthread_create. On Linux, this creates user and kernel state.
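As a point of reference, here is a minimal pthread_create sketch in C; each call like this sets up user-land state plus a corresponding kernel thread, which is exactly the per-thread setup cost discussed above.

#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* The thread body: it runs on its own kernel-schedulable thread, so it can
 * execute on another CPU in parallel with the main thread. */
static void *worker(void *arg)
{
    printf("hello from %s\n", (const char *)arg);
    return NULL;
}

int main(void)
{
    pthread_t tid;

    /* creates both the user-land thread state and the kernel thread */
    int err = pthread_create(&tid, NULL, worker, "worker-1");
    if (err != 0) {
        fprintf(stderr, "pthread_create: %s\n", strerror(err));
        return 1;
    }

    pthread_join(tid, NULL);   /* wait for the thread to finish */
    return 0;
}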

1:N

The 1:N model, or one kernel thread for N user threads, is a model that is commonly called “green threads” or “lightweight threads.”

Pros:

  • Thread creation, execution, and cleanup are cheap
  • Lots of threads can be created (10s of thousands or more)

Cons:

  • Kernel scheduler doesn’t know about threads so they can’t be scheduled across CPUs or take advantage of SMP
  • Blocking I/O operations can block all the green threads

In this model a process manages thread creation, termination, and scheduling completely on its own, without the help or knowledge of the kernel. The major upside of this model is that thread creation, termination, cleanup, and synchronization are extremely cheap, and it is possible to create huge numbers of threads in user-land. This model has several downsides, though. One of the major downsides is not being able to utilize the kernel’s scheduler. As a result, all the user-land threads execute on the same CPU and cannot take advantage of true parallel execution. One way to cope with this is to create multiple processes (perhaps via fork()) and then have the processes communicate with each other. A model like this begins to look very much like the M:N model described below.

MRI Ruby 1.8.7 has green threads. Early versions of Java also had green threads.

M:N

The M:N model, or M kernel threads for N user threads, is a model that is a hybrid of the previous two models.

Pros:

  • Take advantage of multiple CPUs
  • Not all threads are blocked by blocking system calls
  • Cheap creation, execution, and cleanup

Cons:

  • Need scheduler in userland and kernel to work with each other
  • Green threads doing blocking I/O operations will block all other green threads sharing same kernel thread
  • Difficult to write, maintain, and debug code

This hybrid model appears to be a best-of-both-worlds solution that includes all the advantages of 1:1 and 1:N threading without any of the downsides. Unfortunately, the cost of the downsides outweighs many of the advantages to such an extent that in many cases it isn’t worth it to build or use an M:N threading model. In general, building a user-land scheduler and keeping it coordinated with the kernel scheduler makes programming in this model extremely difficult and error prone. Research comparing M:N and 1:1 threading was done for the Linux kernel to determine how its threading support should evolve; the results on performance and typical use cases showed the 1:1 model to be superior in general. On the other hand, in specific problem domains that are well understood, M:N may be the right choice.

Erlang has what many consider to be an M:N threading model. Prior to Solaris 9, Solaris supported an M:N threading model.

So which should I use?

Well, it is a tough call. You need to sit and think for a while about what your specific system needs and how intelligent your libraries are. In some implementations of the 1:N threading model, I/O operations will all be abstracted away into a non-blocking I/O subsystem. If your library of choice does not (or cannot, due to language design) hook into this non-blocking I/O subsystem, your library may block all your green threads, clobbering your performance.

You should strongly consider the threading model(s) supported by the programming language(s) and libraries you choose, because this decision will have an impact on your performance, application execution time, and I/O operations.

Thanks for reading!

Written by Joe Damato

October 8th, 2008 at 1:00 pm

Posted in systems


Ruby threading bugfix: small fix goes a long way.



Quick Overview of Ruby Threads

Ruby 1.8.7 (MRI) implements threads completely in userland (these are commonly called "green threads"), even if it is built with pthreads. This means that the underlying OS kernel has no knowledge of any threads created in Ruby programs; from the kernel's point of view, there is just a process with one thread. That one thread is the Ruby interpreter, which has its own scheduler and threading implementation built in. What this means for the Ruby developer is that any thread which does I/O will cause the entire Ruby process (the Ruby interpreter and all Ruby green threads) to block.

Implementing threads in userland raises some interesting design questions, one of which is: how does the interpreter start and stop executing ruby threads? One way to implement this is to create a timer which interrupts the interpreter at some interval. Ruby (depending on your platform and build options) creates either:

  1. An interval timer with setitimer, which delivers a SIGVTALRM signal to the process at the specified interval, or
  2. A real native OS thread (via pthreads) which sleeps for the length of the interval

In either case, a flag called rb_thread_pending is set (for those of you following along with the Ruby source, the flag is checked with the CHECK_INTS macro). It is important to note, however, that the timer created with setitimer is of type ITIMER_VIRTUAL, which means time is measured only while the interpreter is executing (and not during system calls executed on behalf of ruby), whereas the sleeping OS thread measures time regardless of whether or not Ruby is executing.
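To illustrate the mechanism, here is a sketch of the same setitimer/SIGVTALRM pattern in C (this is not Ruby's actual code): a 10 ms ITIMER_VIRTUAL timer delivers SIGVTALRM, and the handler sets a pending flag that the main loop checks, much like rb_thread_pending and CHECK_INTS.

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

static volatile sig_atomic_t pending = 0;

/* Fires each time 10 ms of *virtual* (user-mode CPU) time elapses; it does
 * not tick while the process is blocked inside the kernel. */
static void sigvtalrm_handler(int sig)
{
    (void)sig;
    pending = 1;   /* analogous to rb_thread_pending in MRI */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sigvtalrm_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGVTALRM, &sa, NULL);

    /* the same 10 ms interval seen in the strace output below */
    struct itimerval tv;
    tv.it_interval.tv_sec  = 0;
    tv.it_interval.tv_usec = 10000;
    tv.it_value = tv.it_interval;
    setitimer(ITIMER_VIRTUAL, &tv, NULL);

    unsigned long fired = 0;
    for (;;) {            /* stand-in for the interpreter's main loop */
        if (pending) {    /* the moral equivalent of CHECK_INTS */
            pending = 0;
            if (++fired == 10)
                break;    /* "rescheduled" ten times, then stop */
        }
    }
    printf("timer fired %lu times\n", fired);
    return 0;
}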

strace saves the day

I am working on an event-based real-time distributed (insert more buzzwords) system built in ruby. As a result I am constantly trying to push ruby to its limits, like many other people out there. I noticed that the latency of my eventloop started to increase after I spawned threads to do short tasks (like sending an email, for example). The weird thing was that the latency didn’t go down even after the thread had finished executing! To debug this problem I attached strace to my running ruby process and I saw this:

[joe@mawu]% strace -ttTp `pidof ruby` 2>&1 | egrep '(sigret|setitimer|timer|exit_group)'
19:41:21.282700 setitimer(ITIMER_VIRTUAL, {it_interval={0, 10000}, itvalue={0, 10000}}, NULL) = 0 <0.000022>
19:41:26.778386 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.780578 sigreturn()             = ? (mask now []) <0.000022>
19:41:26.814172 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.823761 sigreturn()             = ? (mask now []) <0.000022>
19:41:26.888419 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.890691 sigreturn()             = ? (mask now []) <0.000041>
19:41:26.904949 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.907327 sigreturn()             = ? (mask now []) <0.000040>
19:41:26.995445 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.997699 sigreturn()             = ? (mask now []) <0.000041>
19:41:27.144428 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:27.147146 sigreturn()             = ? (mask now []) <0.000023>
19:41:27.303472 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:27.306825 sigreturn()             = ? (mask now []) <0.000021>
...

Weird! It looks like the timer is interrupting the executing Ruby process, causing it to enter the thread scheduler only to schedule the sole thread in the app and start executing again. This was really bad for our system because our main eventloop was being constantly interrupted, to the point where under high load the eventloop was unable to service connection requests fast enough and was timing out our test scripts. This is also a big problem if you use ruby gems piled on top of ruby gems, because the more layers of gem code executing in each short time quantum, the less of your actual app code gets to execute! Not cool. Before getting too excited, though, I decided to try to reproduce this on a smaller scale, so:

[joe@mawu]% strace -ttT ruby -e 't1 = Thread.new{ sleep(5) }; t1.join; 10000.times{"aaaaa" * 1000};' 2>&1 | egrep '(sigret|setitimer|timer|exit_group)'
19:41:21.282700 setitimer(ITIMER_VIRTUAL, {it_interval={0, 10000}, itvalue={0, 10000}}, NULL) = 0 <0.000022>
19:41:26.778386 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.780578 sigreturn()             = ? (mask now []) <0.000022>
19:41:26.814172 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.823761 sigreturn()             = ? (mask now []) <0.000022>
19:41:26.888419 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.890691 sigreturn()             = ? (mask now []) <0.000041>
19:41:26.904949 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.907327 sigreturn()             = ? (mask now []) <0.000040>
19:41:26.995445 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:26.997699 sigreturn()             = ? (mask now []) <0.000041>
19:41:27.144428 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:27.147146 sigreturn()             = ? (mask now []) <0.000023>
19:41:27.303472 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
19:41:27.306825 sigreturn()             = ? (mask now []) <0.000021>
19:41:27.314461 exit_group(0)           = ?

Definitely starting to look like a bug from the strace output.

I decided to dive into the ruby 1.8.7 MRI source code (eval.c for those following along in the source) and found that a timer is created whenever a thread is created, but the timer is not destroyed when the thread terminates! Definitely a bug. A quick fix to eval.c fixed the problem and my latency dropped like a rock!

Patch for ruby 1.8.7

I posted a patch to ruby-core and some code was added to fix pthread-enabled Ruby. NOTE: You should ALWAYS test new patches before applying them to your live site; this is no exception!
Ruby MRI 1.8.7p72 patch

Future directions

I’ve been asked a bunch of different questions about threads and threading models, so my next couple blog posts will be about different threading models. I’m going to dive into the details, go through the pros and cons, and try to clear things up a bit, so stay tuned and thanks for reading!

Written by Joe Damato

October 5th, 2008 at 10:17 pm