time to bleed by Joe Damato

technical ramblings from a wanna-be unix dinosaur

Archive for the ‘scaling’ Category

Fix a bug in Ruby’s configure.in and get a ~30% performance boost.


Special thanks…

Going out to Jake Douglas for pushing the initial investigation and getting the ball rolling.

The whole --enable-pthread thing

Ask any Ruby hacker how to easily increase performance in a threaded Ruby application and they’ll probably tell you:

Yo dude… Everyone knows you need to configure Ruby with --disable-pthread.

And it’s true; configure Ruby with --disable-pthread and you get a ~30% performance boost. But… why?

For this, we’ll have to turn to our handy tool strace. We’ll also need a simple Ruby program to test with. How about something like this:

def make_thread
  Thread.new {
    a = []
    10_000_000.times {
      a << "a"
      a.pop
    }
  }
end

t = make_thread 
t1 = make_thread 

t.join
t1.join

Now, let's run strace on a version of Ruby configure'd with --enable-pthread and point it at our test script. The output from strace looks like this:

22:46:16.706136 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706177 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706218 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706259 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000005>
22:46:16.706301 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706342 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706383 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706425 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706466 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>

Pages and pages and pages of sigprocmask system calls (actually, running with strace -c, I get about 20,054,180 calls to sigprocmask, WOW). Run the same test script against a Ruby built with --disable-pthread and the output does not have pages and pages of sigprocmask calls: just 3 calls in total, a HUGE reduction.

OK, so let's just set a breakpoint in GDB... right?

OK, so we should just be able to set a breakpoint on sigprocmask and figure out who is calling it.

Well, not exactly. You can try it, but the breakpoint won't trigger (we'll see why a little bit later).

Hrm, that kinda sucks and is confusing. This will make it harder to track down who is calling sigprocmask in the threaded case.

Well, we know that when you run configure the script creates a config.h with a bunch of defines that Ruby uses to decide which functions to use for what. So let's compare ./configure --enable-pthread with ./configure --disable-pthread:

[joe@mawu:/home/joe/ruby]% diff config.h config.h.pthread 
> #define _REENTRANT 1
> #define _THREAD_SAFE 1
> #define HAVE_LIBPTHREAD 1
> #define HAVE_NANOSLEEP 1
> #define HAVE_GETCONTEXT 1
> #define HAVE_SETCONTEXT 1


OK, now if we grep the Ruby source code, we see that whenever HAVE_[SG]ETCONTEXT are set, Ruby uses the libc functions setcontext() and getcontext() to save and restore state for context switching and for exception handling (via the EXEC_TAG).

What about when HAVE_[SG]ETCONTEXT are not define'd? Well in that case, Ruby uses _setjmp/_longjmp.

Bingo!

That's what's going on! From the _setjmp/_longjmp man page:

... The _longjmp() and _setjmp() functions shall be equivalent to longjmp() and setjmp(), respectively, with the additional restriction that _longjmp() and _setjmp() shall not manipulate the signal mask...

And from the [sg]etcontext man page:

... uc_sigmask is the set of signals blocked in this context (see sigprocmask(2)) ...


The issue is that getcontext calls sigprocmask on every invocation but _setjmp does not.
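
To see this for yourself, here's a minimal C sketch (my own illustration, not Ruby's code): compile it with gcc -o ctx_test ctx_test.c, run it under strace -c, and you should see roughly one rt_sigprocmask per getcontext call and none for _setjmp.

#include <setjmp.h>
#include <ucontext.h>

int main(void)
{
    ucontext_t uc;
    jmp_buf env;
    int i;

    /* each getcontext() saves the signal mask: one rt_sigprocmask per call */
    for (i = 0; i < 100000; i++)
        getcontext(&uc);

    /* _setjmp() saves registers only: no signal mask, no system call */
    for (i = 0; i < 100000; i++)
        _setjmp(env);

    return 0;
}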

BUT WAIT: if that's true, why didn't GDB hit a sigprocmask breakpoint before?

x86_64 assembly FTW, again

Let's fire up gdb and figure out this breakpoint-not-breaking thing. First, let's start by disassembling getcontext (snipped for brevity):

(gdb) p getcontext
$1 = {<text variable, no debug info>} 0x7ffff7825100 <getcontext>
(gdb) disas getcontext
...
0x00007ffff782517f <getcontext+127>: mov $0xe,%rax
0x00007ffff7825186 <getcontext+134>: syscall
...

Yeah, that's pretty weird. I'll explain why in a minute, but let's look at the disassembly of sigprocmask first:

(gdb) p sigprocmask
$2 = {<text variable, no debug info>} 0x7ffff7817340 <__sigprocmask>
(gdb) disas sigprocmask
...
0x00007ffff7817383 <__sigprocmask+67>: mov $0xe,%rax
0x00007ffff7817388 <__sigprocmask+72>: syscall
...

Yeah, this is a bit confusing, but here's the deal.

Recent Linux kernels implement a shiny new method for making system calls using dedicated CPU instructions (syscall/sysret on x86_64, sysenter/sysexit on 32-bit Intel). This new way was created because the old way (int $0x80) turned out to be pretty slow, so CPU vendors added instructions to enter the kernel without such huge overhead.

All you need to know right now (I'll try to blog more about this in the future) is that the %rax register holds the system call number. The syscall instruction transfers control to the kernel and the kernel figures out which syscall you wanted by checking the value in %rax. Let's just make sure that sigprocmask is actually 0xe:

[joe@pluto:/usr/include]% grep -Hrn "sigprocmask" asm-x86_64/unistd.h 
asm-x86_64/unistd.h:44:#define __NR_rt_sigprocmask                     14


Bingo. It's calling sigprocmask (albeit a bit obscurely).
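
If you want to poke at this yourself, here's a small example (my own, not from the original debugging session) that makes the same system call through glibc's syscall(2) wrapper, which loads the number into %rax exactly like the disassembly above:

#include <signal.h>
#include <stdio.h>
#include <sys/syscall.h>   /* defines SYS_rt_sigprocmask (14 on x86_64) */
#include <unistd.h>

int main(void)
{
    sigset_t old;

    /* syscall() puts 14 in %rax and executes the syscall instruction,
       just like the getcontext and __sigprocmask disassembly above;
       the final argument is the kernel's sigset size (8 bytes on x86_64) */
    long ret = syscall(SYS_rt_sigprocmask, SIG_BLOCK, NULL, &old, 8);

    printf("rt_sigprocmask is syscall #%d, returned %ld\n",
           (int)SYS_rt_sigprocmask, ret);
    return 0;
}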

OK, so getcontext isn't calling sigprocmask directly; instead it replicates a bunch of code from sigprocmask's function body. That's why we didn't hit the sigprocmask breakpoint: GDB would have stopped if execution landed on the address 0x7ffff7817340, but it never did.

Instead, getcontext reimplements the wrapper code for sigprocmask itself and GDB is none the wiser.

Mystery solved.

The patch

Get it HERE

The patch works by adding a new configure flag called --disable-ucontext that lets you specifically prevent [sg]etcontext from being called. You use it in conjunction with --enable-pthread, like this:

./configure --disable-ucontext --enable-pthread


After you build Ruby configured like that, its performance is on par with (and sometimes slightly faster than) Ruby built with --disable-pthread: about a 30% performance boost compared to a stock --enable-pthread build.

I added the switch because I wanted to preserve the original Ruby behavior; if you just pass --enable-pthread without --disable-ucontext, Ruby will do the old thing and generate piles of sigprocmask calls.

Conclusion

  1. Things aren't always what they seem - GDB may lie to you. Be careful.
  2. Use the source, Luke. Libraries can do unexpected things, debug builds of libc can help!
  3. I know I keep saying this: assembly is useful. Start learning it today!

If you enjoyed this blog post, consider subscribing (via RSS) or following (via twitter).

You'll want to stay tuned; tmm1 and I have been on a roll the past week. Lots of cool stuff coming out!

Written by Joe Damato

May 5th, 2009 at 3:20 am

6 Line EventMachine Bugfix = 2x faster GC, +1300% requests/sec


Nothing is possible without lunch

So Aman Gupta (tmm1) and I were eating lunch at the Oaxacan Kitchen on Tuesday and as usual, we were talking about scaling Ruby. We got into a small debate about which phase of garbage collection took the most CPU time.

Aman’s claim:

  • The mark phase, specifically the stack marking phase because of the huge stack frames created by rb_eval

My claim:

  • The sweep phase, because every single object has to be touched and some freeing happens.

I told Aman that I didn’t believe the stack frames were that large, and we bet on how big we thought they would be. Couldn’t be more than a couple kilobytes, could it? Little did we know how wrong our estimates were.

Quick note about Ruby’s GC

Ruby MRI has a mark-and-sweep garbage collector. As part of the mark phase, it scans the process stack. This is required because a pointer to a Ruby object can be passed to a C extension (like EventMachine, or Hpricot, or whatever). If that happens, it isn't safe to free the object yet. So Ruby does a simple scan, checking whether each word on the stack is a pointer into the Ruby heap; if so, that object cannot be freed.
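
To make that concrete, here's a rough sketch of what conservative stack scanning looks like (illustrative names only; this is not MRI's actual code):

#include <stddef.h>

/* illustrative stand-ins for MRI's heap bounds and mark function */
extern void *heap_lo, *heap_hi;
extern void mark_object(void *obj);

/* Walk every word between two stack addresses and mark anything that
   looks like a pointer into the object heap. The bigger the stack,
   the longer this loop runs on every single GC. */
static void mark_locations(void **lo, void **hi)
{
    void **p;
    for (p = lo; p < hi; p++) {
        if (*p >= heap_lo && *p < heap_hi)
            mark_object(*p);   /* might be a real reference; keep it alive */
    }
}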

GDB to the rescue

We get back from lunch, launch our application, attach GDB and set a breakpoint. The breakpoint gets triggered and we see this seemingly innocuous stack trace [Note: To help with debugging, we compiled the EventMachine gem with -fno-omit-frame-pointer]:

#0 0x00007ffff77629ac in epoll_wait () from /lib/libc.so.6
#1 0x00007ffff6c0b220 in EventMachine_t::_RunEpollOnce (this=0x158d7e0) at em.cpp:461
#2 0x00007ffff6c0b86c in EventMachine_t::_RunOnce (this=0x158d7e0) at em.cpp:423
#3 0x00007ffff6c0bbd6 in EventMachine_t::Run (this=0x158d7e0) at em.cpp:404
#4 0x00007ffff6c06638 in evma_run_machine () at cmain.cpp:83
#5 0x00007ffff6c1897f in t_run_machine_without_threads (self=26066936) at rubymain.cpp:154
#6 0x000000000041d598 in call_cfunc (func=0x7ffff6c1896e <t_run_machine_without_threads>, recv=26066936, len=0, argc=0, argv=0x0) at eval.c:5759
#7 0x000000000041c92f in rb_call0 (klass=26065816, recv=26066936, id=29417, oid=29417, argc=0, argv=0x0, body=0x18dba10, flags=0) at eval.c:5911
#8 0x000000000041e0ad in rb_call (klass=26065816, recv=26066936, mid=29417, argc=0, argv=0x0, scope=2, self=26066936) at eval.c:6158
#9 0x00000000004160d5 in rb_eval (self=26066936, n=0x1940330) at eval.c:3514
#10 0x00000000004150b7 in rb_eval (self=26066936, n=0x1941018) at eval.c:3357
#11 0x000000000041d196 in rb_call0 (klass=26065816, recv=26066936, id=5393, oid=5393, argc=0, argv=0x0, body=0x1941018, flags=0) at eval.c:6062
#12 0x000000000041e0ad in rb_call (klass=26065816, recv=26066936, mid=5393, argc=0, argv=0x0, scope=0, self=47127864) at eval.c:6158
#13 0x0000000000415d01 in rb_eval (self=47127864, n=0x2cf5298) at eval.c:3493
#14 0x00000000004148b2 in rb_eval (self=47127864, n=0x2cf4380) at eval.c:3223
#15 0x000000000041d196 in rb_call0 (klass=47127808, recv=47127864, id=5313, oid=5313, argc=0, argv=0x0, body=0x2cf4380, flags=0) at eval.c:6062
#16 0x000000000041e0ad in rb_call (klass=47127808, recv=47127864, mid=5313, argc=0, argv=0x0, scope=0, self=9606072) at eval.c:6158
#17 0x0000000000415d01 in rb_eval (self=9606072, n=0x194b2a0) at eval.c:3493
#18 0x00000000004148b2 in rb_eval (self=9606072, n=0x19587b0) at eval.c:3223
#19 0x000000000041072c in eval_node (self=9606072, node=0x19587b0) at eval.c:1437
#20 0x0000000000410dff in ruby_exec_internal () at eval.c:1642
#21 0x0000000000410e4f in ruby_exec () at eval.c:1662
#22 0x0000000000410e72 in ruby_run () at eval.c:1672
#23 0x000000000040e78a in main (argc=3, argv=0x7fffffffebd8, envp=0x7fffffffebf8) at main.c:48

Looks pretty normal, nothing to worry about, right?

We started checking the rb_eval frames because we assumed that those would be the largest stack frames. The rb_eval function inlines other functions and calls itself recursively. So how big is one of the rb_eval frames?

(gdb) frame 10
#10 0x00000000004150b7 in rb_eval (self=26066936, n=0x1941018) at eval.c:3357
3357 result = rb_eval(self, node->nd_head);
(gdb) p $rbp-$rsp
$2 = 1904

1,904 bytes – pretty large. If all the stack frames are that large, we are looking at around 47,600 bytes. Pretty serious. Let’s verify that Ruby thinks the stack is a sane size. There is a global in the Ruby interpreter called rb_gc_stack_start. It gets set when the Ruby stack is created in Init_stack(). When Ruby calculates the stack size it subtracts the current stack pointer from rb_gc_stack_start [remember on x86_64, the stack grows from high addresses to low addresses]. Let’s do that and see how big Ruby thinks the stack is.

(gdb) p (unsigned int)rb_gc_stack_start - (unsigned int)$rsp
$3 = 802688

Wait, wait, wait. 802,688 bytes with only 23 stack frames? WTF?! Something is wrong. We started at the top and checked all the rb_eval stack frames, but none of them are larger than 2kb. We did find something quite a bit larger than 2kb, though.

(gdb) frame 1
#1 0x00007ffff6c0b220 in EventMachine_t::_RunEpollOnce (this=0x158d7e0) at em.cpp:461
461 s = epoll_wait (epfd, ev, MaxEpollDescriptors, timeout == 0 ? 5 : timeout);
(gdb) p $rbp-$rsp
$28 = 786816

Uh, the RunEpollOnce stack frame is 786,816 bytes? That’s got to be wrong. WTF?

Time to bring out the big guns.

objdump + x86_64 asm FTW

I pumped EventMachine’s shared object into objdump and captured the assembly dump:

objdump -d rubyeventmachine.so > em.S

I headed down to the RunEpollOnce function and saw the following:

2f12b: 48 81 ec 78 01 0c 00 sub $0xc0178,%rsp

Interesting. So the code is moving %rsp down by 786,808 bytes to make room for something big. So, let’s see if the EventMachine code matches up with the assembly output.

struct epoll_event ev [MaxEpollDescriptors];

Where MaxEpollDescriptors = 64*1024 and sizeof(struct epoll_event) == 12, so the array alone is 64*1024*12 = 786,432 bytes; the remaining few hundred bytes of the 786,808-byte adjustment are other locals and alignment. That matches up with the assembly dump and the GDB output.

Usually, doing something like that in C/C++ is OK. Avoiding the heap whenever you can is a good idea: you avoid heap-lock contention, heap fragmentation, and the memory overhead of tracking the allocation. When writing Ruby extensions, though, this isn't necessarily true. Remember, Ruby's GC algorithm scans the entire process stack searching for references to Ruby objects. This EventMachine code causes Ruby to scan an extra ~800,000 bytes on every collection, drastically slowing down garbage collection.

The patch

Get the patch HERE

The patch simply moves the stack allocated struct epoll_event ev to the class definition so that it is allocated on the heap when an instance of the class is created with new. This does not change the memory usage of the process at all. It just moves the object off the stack. This makes all the difference because Ruby’s GC scans the process stack and not the process heap.
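
The shape of the fix, sketched in C with made-up names (the real patch is against EventMachine's C++ class, so this is illustrative only):

#include <stdlib.h>
#include <sys/epoll.h>

#define MAX_EPOLL_DESCRIPTORS (64 * 1024)

struct reactor {
    int epfd;
    struct epoll_event *ev;   /* allocated once on the heap, reused every tick */
};

static int reactor_init(struct reactor *r, int epfd)
{
    r->epfd = epfd;
    r->ev = calloc(MAX_EPOLL_DESCRIPTORS, sizeof(struct epoll_event));
    return r->ev ? 0 : -1;
}

static int reactor_run_once(struct reactor *r, int timeout_ms)
{
    /* before the fix, `struct epoll_event ev[MAX_EPOLL_DESCRIPTORS];`
       lived right here, adding ~768KB to this stack frame */
    return epoll_wait(r->epfd, r->ev, MAX_EPOLL_DESCRIPTORS, timeout_ms);
}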

On top of all that, this patch helps with Ruby’s green threads, too. If the epoll_wait causes a Ruby event to fire and that event creates a Ruby thread, that Ruby thread gets an entire copy of the existing stack. Each time that thread is switched into and out of, that thread stack has to be memcpy’d into and out of place. Reducing those memcpys by ~800,000 bytes is a HUGE performance win. Want to learn more about threading implementations? Check out my threading models post: here.
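
For the curious, here's a rough sketch of why big stacks hurt so much under a copying green-thread implementation (illustrative only; MRI's actual switching code is more involved):

#include <string.h>

/* illustrative stand-in for a copying green-thread implementation */
struct green_thread {
    char  *saved_stack;   /* heap buffer holding this thread's stack contents */
    size_t stack_len;     /* live stack bytes when the thread was switched out */
};

extern char *stack_base;  /* high end of the shared machine stack (stacks grow down) */

/* Every switch memcpys one stack out and another in, so an extra ~800,000-byte
   frame means an extra ~1.6MB of copying per context switch. */
static void switch_stacks(struct green_thread *out, struct green_thread *in, char *sp)
{
    out->stack_len = (size_t)(stack_base - sp);
    memcpy(out->saved_stack, sp, out->stack_len);                       /* copy out */
    memcpy(stack_base - in->stack_len, in->saved_stack, in->stack_len); /* copy in  */
}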

Fixing this turned out to be pretty simple. A six (6!!) line patch:

  • Speeds up GC by 2-3x because of the huge decrease in stack frame size.
  • Fixes an open bug in EventMachine where using threads with Epoll causes lots of slowness. The reason is that each thread will inherit an ~800,000 byte stack that gets copied in and out every context switch.
  • This results in an increase from 500 requests/sec to 7000 requests/sec when using Sinatra+Thin+Epoll+Threads. That is pretty ill.

Conclusion

All in all, a productive debugging session lasting about an hour. The result was a simple patch, with 2 big performance improvements.

A couple things to take away from this experience:

  • Spend time learning your debugging tools because it pays off, especially nm, objdump, and of course GDB.
  • Getting familiar with x86_64 assembly is crucial if you hope to debug complex software and optimize it correctly.

Keep your eyes open for upcoming blog posts about x86_64 assembly! Don’t forget to subscribe via RSS or follow me on twitter.

Written by Joe Damato

April 29th, 2009 at 1:36 am

Yo Dawg: Using a package management system to install a package management system


Consider the following scenario: You would like to run a common Linux distro (Debian Etch, Centos/RHEL, whatever) for stability, the large community surrounding it, and maybe even for third-party support.

There’s a catch though.

You also want to easily use and deploy a small number of custom packages. Why? Maybe you want to apply a patch for a library, compiler, interpreter, or something else you use. Sure, you could build a .deb or .rpm, but there is a bit of a learning curve; is that learning curve worth it just so you can apply a handful of patches?

At Kickball Labs, we wanted to use the “stable” versions of packages that come bundled with Debian for the base system, but we also wanted to be able to use new packages that have features we are interested in. We decided to layer pacman on top of apt and install a small number of custom packages to a /custom directory on the filesystem. This enables us to use stable packages by default, but lets us override them when we feel it is necessary.

What sucks about RPM and APT (imho)

  1. Getting other people to use them – OK, so you’ve bought in to RPM or APT and you don’t mind reading all the docs and cuddling up with the man pages. But what about the rest of your team? Unless there is only one person constantly cranking out custom packages, everyone is going to have to learn RPM or APT. Do you really want to waste valuable engineer brain cycles reading and debugging busted packages when instead you could be writing code?
  2. Too much work to add 1 patch – Let’s say I want to add one patch to fix a memory leak to libX. Here’s what I have to do for debian packages:
    1. Download and unpack the library source.
    2. Add a debian/ sub-directory.
    3. Create a changelog, control, and files file.
    4. Create a file with a list of the patches that are being applied.
    5. Drop in the patch.
    6. Test the package.

    Wow. Extremely painful. Especially for just one patch. Hell, you might even throw the deb away afterward if you decide you don’t like the patch.

  3. Source control – So you don’t mind the previous points. They don’t bother you all that much. But what about source control? How do you keep track of your Debian package files? You could keep an entire copy of the library’s source with your debian/ sub-directory in your git/svn/whatever. That kind of sucks, though. What if you got your source code from the git/svn of the project instead of via a tarball? Yeah, I guess you could put all that into source control too. You could also check in your debian/ sub-dirs into a repository and then symlink them into the source for the library…. What a pain.

pacman and the almighty PKGBUILD

This is where pacman saves the day.

  1. pacman is simple – It doesn’t try to solve Global Warming. It just provides a dead simple set of command line switches for installing, removing, upgrading, and syncing packages (see the example invocations after this list). Not many options, but that is exactly what I want. You can just put a bunch of packages in a directory, point a webserver at it, and it’s a pacman package server.
  2. PKGBUILD files are simple – PKGBUILD files are just plain text files with a few fields. The fields are easy to understand and you can learn how to write your first PKGBUILD in 5 minutes.
  3. Easily use with source control – Since the actual PKGBUILD file is plain text, your source control system should be able to easily keep track of changes. You don’t need to check in all the source, either. You can just point the PKGBUILD at a URL and it will automagically run wget and unpack the source. You can include a source tarball if you really want to, of course.
  4. Quickly create a new PKGBUILD or add a patch to an existing one – To add a new patch to an existing PKGBUILD, I just add the filename to the source = line, add a patch -p N < file line, and I'm done. If the PKGBUILD doesn't exist, I can easily create a new one because the file format is dead simple.
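
For reference, day-to-day pacman usage boils down to a handful of switches (the package name here is made up):

pacman -Sy                      # sync the package list from your repository
pacman -S mypackage             # install a package from the repository
pacman -U mypackage.pkg.tar.gz  # install a package directly from a file
pacman -R mypackage             # remove it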

Getting it on Debian

This part is kind of weird. We want to get pacman on Debian. There isn't an apt package, so what now? Well, we can build a .deb file that installs pacman so we can use PKGBUILDs. Basically, we use a package management system to install a package management system.

There's gotta be a "Yo Dawg" in there somewhere.

Get it here and be sure to get its dependency (libdownload) here.

A look at some PKGBUILDs

Let's take a look at some PKGBUILDs that we use at Kickball Labs.

The first is a simple PKGBUILD for ltrace, a program like strace but for library calls. It just downloads the source, passes in some custom options to configure, builds the binary, and then installs to the package directory.

pkgname=ltrace
pkgver=0.5.1
pkgrel=1
pkgdesc="ltrace is a debugging program which runs a specified command until it exits"
url="http://packages.debian.org/unstable/utils/ltrace"
arch=('x86_64')
source=(http://ftp.debian.org/debian/pool/main/l/ltrace/${pkgname}_${pkgver}.orig.tar.gz)

build()
{
  cd $startdir/src/$pkgname-$pkgver

  ./configure --prefix=/custom --sysconfdir=/custom/etc
  make || return 1
  make DESTDIR=$startdir/pkg install
}

Download it here.

This next PKGBUILD is a bit more intense. It is our PKGBUILD for Ruby, with a bunch of extra patches (fibers, ruby GC patches, and ruby thread bugfixes).

pkgname=ruby
pkgver=1.8.7_p72
_pkgver=1.8.7-p72
pkgrel=27
pkgdesc="An object-oriented language for quick and easy programming"
arch=(i686 x86_64)
license=('custom')
url="http://www.ruby-lang.org/en/"
depends=(google-perftools)
provides=(ruby)
conflicts=(ruby)
source=(ftp://ftp.ruby-lang.org/pub/ruby/stable/ruby-${_pkgver}.tar.bz2 thread_timer.patch fibers.patch ruby-186-gc-new.patch dump_heap.patch)

options=('!emptydirs' 'force')

build() {
  sudo apt-get install libreadline5-dev zlib1g-dev libncurses5-dev libssl-dev libgdbm-dev libdb4.4-dev

  cd ${startdir}/src/${pkgname}-${_pkgver}

  patch -p1 < ${startdir}/src/fibers.patch || return 1
  patch -p0 < ${startdir}/src/thread_timer.patch || return 1
  patch -p1 < ${startdir}/src/ruby-186-gc-new.patch || return 1
  patch -p1 < ${startdir}/src/dump_heap.patch || return 1

  # include /custom in cflags/ldflags so extensions compile
  export CFLAGS="-I/custom/include -g3 -gdwarf-2 -ggdb -O0"
  export LDFLAGS="-L/custom/lib"
  export LIBS="-L/custom/lib -ltcmalloc_minimal"

  ./configure --prefix=/custom --enable-shared --disable-pthread
  make || return 1
  make DESTDIR=${startdir}/pkg install
}

Download it here.

Conclusion

Package management is painful. If you have any plans on building a service that scales to multiple machines, you had better have a good solution for creating and distributing packages. pacman is good for this because:

  1. It's easy to learn and use, encouraging you to make everything (from libraries to configuration files and more) a PKGBUILD.
  2. The simple plain text file format works great with your source control system of choice.
  3. Applied a patch you didn't like? Just roll the PKGBUILD file back with your package manager.
  4. Create a PKGBUILD repository by just putting the tarballs generated from your PKGBUILD files in a directory and pointing a web server at it. This is great for bringing up new hardware in a datacenter - just install pacman, point it at your repository, and install your base package which sets up all your passwd, host, or other config files.

Written by Joe Damato

April 27th, 2009 at 12:21 am

a/b test mallocs against your memory footprint


The other day at Kickball Labs we were discussing whether linking Ruby against tcmalloc (or ptmalloc3, nedmalloc, or any other malloc) would have any noticeable effect on application latency. After taking a side in the argument, I started wondering how we could test this scenario.

We had a couple different ideas about testing:

  • Look at other people’s benchmarks
    BUT do the memory workloads tested in the benchmarks actually match our own workload at all?
  • Run different allocators on different Ruby backends
    BUT different backends will get different users who will use the system differently and cause different allocation patterns
  • Try to recreate our applications memory footprint and test that against different mallocs
    BUT how?

I decided to explore the last option and came up with an interesting solution. Let’s dive into how to do this.

Get the code:

http://github.com/ice799/malloc_wrap/tree/master

Step 1: We need to get a memory footprint of our process

So we have some random binary (in this case it happens to be a Ruby interpreter, but it could be anything) and we’d like to track when it calls malloc/realloc/calloc and free (from now on I’ll refer to all of these as the malloc-family, for brevity). There are two ways to do this: the right way and the wrong/hacky/unsafe way.

  • The “right” way to do this, with libc malloc hooks:

    Edit your application code to use the malloc debugging hooks provided by libc. When a malloc-family function is called, your hook executes and outputs to a file which function was called and what arguments were passed to it.

  • The “wrong/hacky/unsafe” way to do this, with LD_PRELOAD:

    Create a shim library and point LD_PRELOAD at it. The shim exports the malloc-family symbols, and when your application calls one of those functions, the shim code gets executed. The shim logs which function was called and with what arguments. The shim then calls the libc version of the function (so that memory is actually allocated/freed) and returns control to the application.

I chose to do it the second way, because I like living on the edge. The second way is unsafe because you can’t call any function that uses a malloc-family function before your hooks are set up. If you do, you can end up in an infinite loop and crash the application.

You can check out my implementation for the shim library here: malloc_wrap.c
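
If you'd rather see the idea in miniature first, here's a stripped-down sketch of an LD_PRELOAD shim (my own illustration; the real malloc_wrap.c logs to a file and covers the whole malloc-family):

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

static void *(*real_malloc)(size_t);

/* our exported malloc shadows libc's; log the call, then forward it */
void *malloc(size_t size)
{
    if (!real_malloc)
        real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");

    void *ptr = real_malloc(size);

    /* danger zone: anything that allocates in here can recurse forever,
       which is exactly why this approach is "wrong/hacky/unsafe" */
    fprintf(stderr, "malloc(%zu) = %p\n", size, ptr);
    return ptr;
}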

Why does your shim output such weirdly formatted data?

The answer is sort of complicated, but let’s keep it simple: I originally had a different idea about how I was going to use the output. When that first try failed, I tried something else and translated the data into the format I needed instead of re-writing the shim. What can I say, I’m a lazy programmer.

OK, so once you’ve built the shim (gcc -O2 -Wall -ldl -fPIC -o malloc_wrap.so -shared malloc_wrap.c), you can launch your binary like this:

% LD_PRELOAD=/path/to/shim/malloc_wrap.so /path/to/your/binary -your -args

You should now see output in /tmp/malloc-footprint.pid

Step 2: Translate the data into a more usable format

Yeah, I should have gone back and re-written the shim, but nothing happens exactly as planned. So I wrote a quick Ruby script to convert the output into a more usable format. The script sorts through the output and renames memory addresses to unique integer IDs starting at 1 (0 is hardcoded to NULL).

The format is pretty simple. The first line of the file has the number of calls to malloc-family functions, followed by a blank line, and then the memory footprint. Each line of the memory footprint has one character representing the function called, followed by a few arguments. For free(), there is only one argument: the ID of the memory block to free. For malloc/calloc/realloc, the first argument after the function character is always the ID of the return value; the remaining arguments are the ones usually passed to malloc/calloc/realloc, in the same order.
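
As a purely hypothetical illustration (the single-character codes below are my guesses; check build_trace_file.rb for the real ones), a tiny footprint file might look like this:

5

m 1 128
c 2 10 32
r 3 1 256
f 2
f 3

Reading top to bottom: malloc returned block 1 (128 bytes), calloc returned block 2 (10 elements of 32 bytes), realloc turned block 1 into block 3 (256 bytes), then blocks 2 and 3 were freed.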

Have a look at my ruby script here: build_trace_file.rb

It might take a while to convert your data to this format; I suggest running it in a screen session, especially if your memory footprint data is large. Just as a warning: we collected 15 *gigabytes* of data over a 10 hour period, and this script took *10 hours* to convert it. We ended up with a 7.8 gigabyte file.

% ruby /path/to/script/build_trace_file.rb /path/to/raw/malloc-footprint.PID /path/to/converted/my-memory-footprint

Step 3: Replay the allocation data with different allocators and measure time, memory usage.

OK, so we now have a file which represents the memory footprint of our application. It’s time to build the replayer, link against your malloc implementation of choice, fire it up and start measuring time spent in allocator functions and memory usage.

Have a look at the replayer here: alloc_tester.c
Build the replayer: gcc -ggdb -Wall -ldl -fPIC -o tester alloc_tester.c

Use ltrace

ltrace is similar to strace, but for library calls. You can use ltrace -c to sum the amount of time spent in different library calls and output a cool table at the end; it will look something like this:

% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
 86.70   37.305797          62    600003 fscanf
 10.64    4.578968          33    138532 malloc
  2.36    1.014294          18     55263 free
  0.25    0.109550          18      5948 realloc
  0.03    0.011407          45       253 printf
  0.02    0.010665          42       252 puts
  0.00    0.000167          20         8 calloc
  0.00    0.000048          48         1 fopen
------ ----------- ----------- --------- --------------------
100.00   43.030896                800260 total

Conclusion

Using a different malloc implementation can provide speed and memory improvements depending on your allocation patterns. Hopefully the code provided will help you test different allocators to determine whether or not swapping out the default libc allocator is the right choice for you. Our results are still pending; we had a lot of allocator data (15GB!) and it takes several hours to replay the data with just one malloc implementation. Once we’ve gathered some data about the different implementations and their effects, I’ll post the results and some analysis. As always, stay tuned and thanks for reading!

Written by Joe Damato

March 16th, 2009 at 8:39 pm

Fibers implemented for Ruby 1.8.{6,7}

At Kickball Labs, Aman Gupta (http://github.com/tmm1) and I (http://github.com/ice799) have been working on an implementation of Fibers for Ruby 1.8.{6,7}. It is API compatible with Fibers in Ruby 1.9, except for the “transfer” method, which is currently unimplemented. This patch will allow you to use fibers with mysqlplus and neverblock.

THIS IS ALPHA SOFTWARE (we are using it in production, though), USE WITH CAUTION.

Raw patches

Patch against ruby-1.8.7_p72: HERE.

Patch against ruby-1.8.6_p287: HERE.

To use the patch:
Download the Ruby source: Ruby 1.8.7_p72, or if you prefer, Ruby 1.8.6-p287

Then, perform the following:

cd your-ruby-src-directory/
wget http://timetobleed.com/files/fibers-RUBY_VERSION.patch
patch -p1 < fibers.patch
./configure --disable-pthread --prefix=/tmp/ruby-with-fibers/ && make && sudo make install
/tmp/ruby-with-fibers/bin/ruby test/test_fiber.rb

This will patch ruby and install it to a custom location: /tmp/ruby-with-fibers so you can test and play around with it without overwriting your existing Ruby installation.

Github

I am currently working on getting the ruby 1.8.6 patched code up on github, but Aman has a branch of ruby 1.8.7_p72 called fibers with the code at http://github.com/tmm1/ruby187/tree/fibers

What are fibers?

Fibers are (usually) non-preemptible lightweight user-land threads.

But I thought Ruby 1.8.{6,7} already had green threads?

You are right; it does. Fibers are simply Ruby green threads without preemption. The programmer (you) gets to decide when to pause and resume execution of a fiber, instead of leaving it up to a timer.
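
If you want a feel for the mechanism, here's a minimal C sketch of cooperative switching built on the same ucontext primitives discussed in the configure.in post above (illustrative only; the real implementation lives in the patches):

#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, fiber_ctx;
static char fiber_stack[64 * 1024];

static void fiber_body(void)
{
    printf("fiber: first run\n");
    swapcontext(&fiber_ctx, &main_ctx);   /* "yield" back to main */
    printf("fiber: resumed\n");
}

int main(void)
{
    getcontext(&fiber_ctx);
    fiber_ctx.uc_stack.ss_sp = fiber_stack;
    fiber_ctx.uc_stack.ss_size = sizeof(fiber_stack);
    fiber_ctx.uc_link = &main_ctx;        /* return here when the fiber ends */
    makecontext(&fiber_ctx, fiber_body, 0);

    swapcontext(&main_ctx, &fiber_ctx);   /* "resume" the fiber */
    printf("main: fiber yielded\n");
    swapcontext(&main_ctx, &fiber_ctx);   /* resume it again */
    printf("main: fiber finished\n");
    return 0;
}

Note that nothing here is preemptive: control only moves when one side explicitly calls swapcontext, which is exactly the pause/resume behavior fibers expose.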

Why would I use fibers?

Bottom line: Your I/O should be asynchronous whenever possible, but sometimes re-writing your entire code base to be asynch and have callbacks can be difficult or painful. A simple solution to this problem is to create or use (see: NeverBlock) some middleware that wraps code paths which make I/O requests in a fiber.

The middleware can issue the asynch I/O operation in a fiber, and yield. Once the middleware’s asynch callback is hit, the fiber can be resumed. Using NeverBlock (or rolling something similar yourself) should require only minimal code changes to your application, and will essentially make all of your I/O requests asynchronous without much pain at all.

How do I use fibers?

There are already lots of great tutorials about fibers basics here and here.

Let’s take a look at something that drives home the point about being able to drop in some middleware to make synchronous code act asynchronous with minimal changes.

Consider the following code snippet:

require 'rubygems'
require 'sinatra'

# eventmachine/thin
require 'eventmachine'
require 'thin'

# mysql
require 'mysqlplus'

# single threaded
DB = Mysql.connect

disable :reload

get '/' do
  4.times do
    DB.query('select sleep(0.25)')
  end
  'done'
end
 

This code snippet creates a simple webservice which connects to a mysql database and issues long running queries (in this case, 4 queries which execute for a total of 1 second).

In this implementation, only one request can be handled at a time; the DB.query blocks, so the other users have to wait to have their queries executed.

This sucks because certainly mysql can handle more than just 4 sleep(0.25) queries a second! But, what are our options?

Well, we can rewrite the code to be asynchronous and string together some callbacks. For my contrived example, doing that would be pretty easy, and it’d be only slightly harder to read. But let’s use our imaginations: let’s pretend the code snippet I just showed you was some huge, ugly, scary blob of code, and rewriting it to be asynchronous would not only take a long time, it would also make the code very ugly and difficult to read.

Now, let’s drop in fibers:

require 'rubygems'
require 'sinatra'

# eventmachine/thin
require 'eventmachine'
require 'thin'

# mysql
require 'mysqlplus'

# fibered
require 'neverblock'
require 'never_block/servers/thin'
require 'neverblock-mysql'

class Thin::Server
  def fiber_pool() @fiber_pool ||= NB::Pool::FiberPool.new(20) end
end

DB = NB::DB::PooledDBConnection.new(20){ NB::DB::FMysql.connect }

disable :reload

get '/' do
  4.times do
    DB.query('select sleep(0.25)')
  end
  'done'
end
 

NOTICE: The application code hasn’t changed; we simply monkey-patched Thin to use a pool of fibers.

Suddenly, our application can handle 20 connections. This is all handled by NeverBlock and mysqlplus.

  • NeverBlock uses the fiber pool to issue an asynch DB query via mysqlplus.
  • After the asynch query is executed, NeverBlock pauses the executing fiber
  • At this point other requests can be serviced
  • When the data comes back from the mysql server, a callback in NeverBlock is executed.
  • The callback resumes the paused fiber, which continues executing.

Pretty sick, right?

Memory consumption, context switches, cooperative multi-threading, oh my!

In our implementation, fibers are ruby green threads, but with no scheduler or preemption. In fact, our fiber implementation shares many code-paths with the existing green thread implementation. As a result, there is very little difference in memory consumption between green threads and our fiber implementation.

Context switches are a different matter all together. The whole point of building a fiber implementation is to allow the programmer to decide when context switching is appropriate. In most circumstances, the application should be undergoing many fewer context switches with fibers and the context switches that do happen occur precisely when needed. As a result, the application can tend to run faster (fewer context switches ==> fewer stack copies ==> fewer CPU cycles).

The major advantage of fibers over green threads is that you get to control when execution starts and stops. The major disadvantage is that you have to code carefully to ensure that you are starting and stopping your fibers appropriately.

Future Directions

Next stop will be “stackless” fibers. I have a fork of the fibers implementation in the works that pre-allocates fiber stacks on the ruby process’ heap. I am hoping to eliminate the overhead associated with switching between fibers by simply shuffling pointers around.

A preliminary version seems to work, although a few bugs that crop up when you use fibers and threads together need to be squashed before the code can be considered “alpha” stage. When it’s done, you’ll find it right here.

Written by Joe Damato

February 5th, 2009 at 5:25 pm