The other day at Kickball Labs we were discussing whether linking Ruby against tcmalloc (or ptmalloc3, nedmalloc, or any other malloc) would have any noticeable effect on application latency. After taking a side in the argument, I started wondering how we could test this scenario.
We had a couple different ideas about testing:
- Look at other people’s benchmarks.
  BUT do the memory workloads tested in those benchmarks actually match our own workload at all?
- Run different allocators on different Ruby backends.
  BUT different backends will get different users who will use the system differently and cause different allocation patterns.
- Try to recreate our application’s memory footprint and test that against different mallocs.
I decided to explore the last option and came up with an interesting solution. Let’s dive into how to do this.
Get the code:
Step 1: We need to get a memory footprint of our process
So we have some random binary (in this case it happens to be a Ruby interpreter, but it could be anything) and we’d like to track when it calls malloc/realloc/calloc and free (from now on I’ll refer to all of these as malloc-family for brevity). There are two ways to do this, the right way and the wrong/hacky/unsafe way.
The “right” way to do this, with libc malloc hooks:
Edit your application code to use the malloc debugging hooks provided by libc (glibc’s __malloc_hook and friends; note that glibc has since deprecated these hooks). When a malloc-family function is called, your hook executes and writes to a file which function was called and what arguments were passed to it.
The “wrong/hacky/unsafe” way to do this, with LD_PRELOAD:
Create a shim library and point LD_PRELOAD at it. The shim exports the malloc-family symbols, and when your application calls one of those functions, the shim code gets executed. The shim logs which function was called and with what arguments. The shim then calls the libc version of the function (so that memory is actually allocated/freed) and returns control to the application.
I chose to do it the second way, because I like living on the edge. The second way is unsafe because you can’t call any function that itself uses a malloc-family function before your hooks are set up. If you do, you can end up in an infinite loop and crash the application.
You can check out my implementation for the shim library here: malloc_wrap.c
Why does your shim output such weirdly formatted data?
The answer is sort of complicated, but let’s keep it simple: I originally had a different idea about how I was going to use the output. When that first try failed, I tried something else and translated the data to the format I needed, instead of re-writing the shim. What can I say, I’m a lazy programmer.
OK, so once you’ve built the shim (gcc -O2 -Wall -ldl -fPIC -o malloc_wrap.so -shared malloc_wrap.c), you can launch your binary like this: LD_PRELOAD=/path/to/malloc_wrap.so /path/to/your/binary [args]
You should now see output in /tmp/malloc-footprint.pid
Step 2: Translate the data into a more usable format
Yeah, I should have gone back and re-written the shim, but nothing happens exactly as planned. So I wrote a quick Ruby script to convert my output into a more usable format. The script sorts through the output and renames memory addresses to unique integer IDs starting at 1 (0 is hardcoded to NULL).
The format is pretty simple. The first line of the file has the number of calls to malloc-family functions, followed by a blank line, and then the memory footprint. Each line of the memory footprint has one character representing the function called, followed by a few arguments. For free(), there is only one argument: the ID of the memory block to free. malloc/calloc/realloc take different arguments, but the first argument after the function character is always the ID of the returned block; the remaining arguments are the ones usually passed to malloc/calloc/realloc, in the same order.
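As an illustration (the opcode characters here are hypothetical; the real ones are whatever build_trace_file.rb emits), a tiny footprint file in this style might look like:

```
5

m 1 4096
c 2 32 8
r 3 1 8192
f 3
f 2
```

The first line says there are five calls; block 1 is malloc’d with 4096 bytes, block 2 comes from calloc(32, 8), block 3 is the result of realloc’ing block 1 to 8192 bytes, and blocks 3 and 2 are then freed.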
Have a look at my ruby script here: build_trace_file.rb
It might take a while to convert your data to this format; I suggest running it in a screen session, especially if your memory footprint data is large. Just as a warning, we collected 15 *gigabytes* of data over a 10 hour period. This script took *10 hours* to convert the data. We ended up with a 7.8 gigabyte file.
Step 3: Replay the allocation data with different allocators and measure time, memory usage.
OK, so we now have a file which represents the memory footprint of our application. It’s time to build the replayer, link against your malloc implementation of choice, fire it up and start measuring time spent in allocator functions and memory usage.
Have a look at the replayer here: alloc_tester.c
Build the replayer: gcc -ggdb -Wall -ldl -fPIC -o tester alloc_tester.c
ltrace is similar to strace, but for library calls. You can use ltrace -c to sum the amount of time spent in different library calls and output a cool table at the end; it will look something like this:
```
% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
 86.70   37.305797          62    600003 fscanf
 10.64    4.578968          33    138532 malloc
  2.36    1.014294          18     55263 free
  0.25    0.109550          18      5948 realloc
  0.03    0.011407          45       253 printf
  0.02    0.010665          42       252 puts
  0.00    0.000167          20         8 calloc
  0.00    0.000048          48         1 fopen
------ ----------- ----------- --------- --------------------
100.00   43.030896                800260 total
```
Using a different malloc implementation can provide speed and memory improvements depending on your allocation patterns. Hopefully the code provided will help you test different allocators to determine whether or not swapping out the default libc allocator is the right choice for you. Our results are still pending; we had a lot of allocator data (15 gigabytes!) and it takes several hours to replay the data with just one malloc implementation. Once we’ve gathered some data about the different implementations and their effects, I’ll post the results and some analysis. As always, stay tuned and thanks for reading!