Archive for the ‘allocator’ tag
Rewrite your Ruby VM at runtime to hot patch useful features

If you enjoy this article, subscribe (via RSS or e-mail) and follow me on twitter.
Some notes before the blood starts flowin’
- CAUTION: What you are about to read is dangerous, non-portable, and (in most cases) stupid.
- The code and article below refer only to the x86_64 architecture.
- Grab some gauze. This is going to get ugly.
TLDR
This article shows off a Ruby gem which has the power to overwrite a Ruby binary in memory while it is running to allow your code to execute in place of internal VM functions. This is useful if you’d like to hook all object allocation functions to build a memory profiler.
This gem is on GitHub
Yes, it’s on GitHub: http://github.com/ice799/memprof.
I want a memory profiler for Ruby
This whole science experiment started during RubyConf when Aman and I began brainstorming ways to build a memory profiling tool for Ruby.
The big problem in our minds was that for most tools we’d have to include patches to the Ruby VM. That process is long and somewhat difficult, so I started thinking about ways to do this without modifying the Ruby source code itself.
The memory profiler is NOT DONE just yet. I thought that the hack I wrote to let us build something without modifying Ruby source code was interesting enough that it warranted a blog post. So let’s get rolling.
What is a trampoline?
Let’s pretend you have 2 functions: functionA() and functionB(). Let’s assume that functionA() calls functionB().
Now also imagine that you’d like to insert a piece of code to execute in between the call to functionB(). You can imagine inserting a piece of code that diverts execution elsewhere, creating a flow: functionA() –> functionC() –> functionB()
You can accomplish this by inserting a trampoline.
A trampoline is a piece of code that program execution jumps into and then bounces out of and on to somewhere else1.
This hack relies on the use of multiple trampolines. We’ll see why shortly.
Two different kinds of trampolines
There are two different kinds of trampolines that I considered while writing this hack, let’s take a closer look at both.
Caller-side trampoline
A caller-side trampoline works by overwriting the opcodes in the .text segment of the program in the calling function causing it to call a different function at runtime.
The big pros of this method are:
- You aren’t overwriting any code, only the address operand of a
callqinstruction. - Since you are only changing an operand, you can hook any function. You don’t need to build custom trampolines for each function.
This method also has some big cons too:
- You’ll need to scan the entire binary in memory and find and overwrite all address operands of
callq. This is problematic because if you overwrite any false-positives you might break your application. - You have to deal with the implications of
callq, which can be painful as we’ll see soon.
Callee-side trampoline
A callee-side trampoline works by overwriting the opcodes in the .text segment of the program in the called function, causing it to call another function immediately
The big pro of this method is:
- You only need to overwrite code in one place and don’t need to worry about accidentally scribbling on bytes that you didn’t mean to.
this method has some big cons too:
- You’ll need to carefully construct your trampoline code to only overwrite as little of the function as possible (or some how restore opcodes), especially if you expect the original function to work as expected later.
- You’ll need to special case each trampoline you build for different optimization levels of the binary you are hooking into.
I went with a caller-side trampoline because I wanted to ensure that I can hook any function and not have to worry about different Ruby binaries causing problems when they are compiled with different optimization levels.
The stage 1 trampoline
To insert my trampolines I needed to insert some binary into the process and then overwrite callq instructions like this:
41150b: e8 cc 4e 02 00 callq 4363dc [rb_newobj] 411510: 48 89 45 f8 ....
In the above code snippet, the byte e8 is the callq opcode and the bytes cc 4e 02 00 are the distance to rb_newobj from the address of the next instruction, 0×411510
All I need to do is change the 4 bytes following e8 to equal the displacement between the next instruction, 0×411510 in this case, and my trampoline.
Problem.
My first cut at this code lead me to an important realization: the callq instructions used expect a 32bit displacement from the function I am calling and not absolute addresses. But, the 64bit address space is very large. The displacement between the code for the Ruby binary that lives in the .text segment is so far away from my Ruby gem that the displacement cannot be represented with only 32bits.
So what now?
Well, luckily mmap has a flag MAP_32BIT which maps a page in the first 2GB of the address space. If I map some code there, it should be well within the range of values whose displacement I can represent in 32bits.
So, why not map a second trampoline to that page which can contains code that can call an absolute address?
My stage 1 trampoline code looks something like this:
/* the struct below is just a sequence of bytes which represent the
* following bit of assembly code, including 3 nops for padding:
*
* mov $address, %rbx
* callq *%rbx
* ret
* nop
* nop
* nop
*/
struct tramp_tbl_entry ent = {
.mov = {'\x48','\xbb'},
.addr = (long long)&error_tramp,
.callq = {'\xff','\xd3'},
.ret = '\xc3',
.pad = {'\x90','\x90','\x90'},
};
tramp_table = mmap(NULL, 4096, PROT_WRITE|PROT_READ|PROT_EXEC,
MAP_32BIT|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
if (tramp_table != MAP_FAILED) {
for (; i < 4096/sizeof(struct tramp_tbl_entry); i ++ ) {
memcpy(tramp_table + i, &ent, sizeof(struct tramp_tbl_entry));
}
}
}
It mmaps a single page and writes a table of default trampolines (like a jump table) that all call an error trampoline by default. When a new trampoline is inserted, I just go to that entry in the table and insert the address that should be called.
To get around the displacement challenge described above, the addresses I insert into the stage 1 trampoline table are addresses for stage 2 trampolines.
The stage 2 trampoline
Setting up the stage 2 trampolines are pretty simple once the stage 1 trampoline table has been written to memory. All that needs to be done is update the address field in a free stage 1 trampoline to be the address of my stage 2 trampoline. These trampolines are written in C and live in my Ruby gem.
static void
insert_tramp(char *trampee, void *tramp) {
void *trampee_addr = find_symbol(trampee);
int entry = tramp_size;
tramp_table[tramp_size].addr = (long long)tramp;
tramp_size++;
update_image(entry, trampee_addr);
}
An example of a stage 2 trampoline for rb_newobj might be:
static VALUE
newobj_tramp() {
/* print the ruby source and line number where the allocation is occuring */
printf("source = %s, line = %d\n", ruby_sourcefile, ruby_sourceline);
/* call newobj like normal so the ruby app can continue */
return rb_newobj();
}
Programatically rewriting the Ruby binary in memory
Overwriting the Ruby binary to cause my stage 1 trampolines to get hit is pretty simple, too. I can just scan the .text segment of the binary looking for bytes which look like callq instructions. Then, I can sanity check by reading the next 4 bytes which should be the displacement to the original function. Doing that sanity check should prevent false positives.
static void
update_image(int entry, void *trampee_addr) {
char *byte = text_segment;
size_t count = 0;
int fn_addr = 0;
void *aligned_addr = NULL;
/* check each byte in the .text segment */
for(; count < text_segment_len; count++) {
/* if it looks like a callq instruction... */
if (*byte == '\xe8') {
/* the next 4 bytes SHOULD BE the original displacement */
fn_addr = *(int *)(byte+1);
/* do a sanity check to make sure the next few bytes are an accurate displacement.
* this helps to eliminate false positives.
*/
if (trampee_addr - (void *)(byte+5) == fn_addr) {
aligned_addr = (void*)(((long)byte+1)&~(0xffff));
/* mark the page in the .text segment as writable so it can be modified */
mprotect(aligned_addr, (void *)byte+1 - aligned_addr + 10,
PROT_READ|PROT_WRITE|PROT_EXEC);
/* calculate the new displacement and write it */
*(int *)(byte+1) = (uint32_t)((void *)(tramp_table + entry)
- (void *)(byte + 5));
/* disallow writing to this page of the .text segment again */
mprotect(aligned_addr, (((void *)byte+1) - aligned_addr) + 10,
PROT_READ|PROT_EXEC);
}
}
byte++;
}
}
Sample output
After requiring my ruby gem and running a test script which creates lots of objects, I see this output:
... source = test.rb, line = 8 source = test.rb, line = 8 source = test.rb, line = 8 source = test.rb, line = 8 source = test.rb, line = 8 source = test.rb, line = 8 source = test.rb, line = 8 ...
Showing the file name and line number for each object getting allocated. That should be a strong enough primitive to build a Ruby memory profiler without requiring end users to build a custom version of Ruby. It should also be possible to re-implement bleak_house by using this gem (and maybe another trick or two).
Awesome.
Conclusion
- One step closer to building a memory profiler without requiring end users to find and use patches floating around the internet.
- It is unclear whether cheap tricks like this are useful or harmful, but they are fun to write.
- If you understand how your system works at an intimate level, nearly anything is possible. The work required to make it happen might be difficult though.
Thanks for reading and don't forget to subscribe (via RSS or e-mail) and follow me on twitter.
References
a/b test mallocs against your memory footprint

The other day at Kickball Labs we were discussing whether linking Ruby against tcmalloc (or ptmalloc3, nedmalloc, or any other malloc) would have any noticeable effect on application latency. After taking a side in the argument, I started wondering how we could test this scenario.
We had a couple different ideas about testing:
- Look at other people’s benchmarks
BUT do the memory workloads tested in the benchmarks actually match our own workload at all? - Run different allocators on different Ruby backends
BUT different backends will get different users who will use the system differently and cause different allocation patterns - Try to recreate our applications memory footprint and test that against different mallocs
BUT how?
I decided to explore the last option and came up with an interesting solution. Let’s dive into how to do this.
Get the code:
http://github.com/ice799/malloc_wrap/tree/master
Step 1: We need to get a memory footprint of our process
So we have some random binary (in this case it happens to be a Ruby interpreter, but it could be anything) and we’d like to track when it calls malloc/realloc/calloc and free (from now on I’ll refer to all of these as malloc-family for brevity). There are two ways to do this, the right way and the wrong/hacky/unsafe way.
-
The “right” way to do this, with libc malloc hooks:
Edit your application code to use the malloc debugging hooks provided by libc. When a malloc-family function is called, your hook executes and outputs to a file which function was called and what arguments were passed to it.
-
The “wrong/hacky/unsafe” way to do this, with LD_PRELOAD:
Create a shim library and point LD_PRELOAD at it. The shim exports the malloc-family symbols, and when your application calls one of those functions, the shim code gets executed. The shim logs which function was called and with what arguments. The shim then calls the libc version of the function (so that memory is actually allocated/freed) and returns control to the application.
I chose to do it the second way, because I like living on the edge. The second way is unsafe because you can’t call any functions which use a malloc-family function before your hooks are setup. If you do, you can end up in an infinite loop and crash the application.
You can check out my implementation for the shim library here: malloc_wrap.c
Why does your shim output such weirdly formatted data?
Answer is sort of complicated, but let’s keep it simple: I originally had a different idea about how I was going to use the output. When that first try failed, I tried something else and translated the data to the format I needed it in, instead of re-writing the shim. What can I say, I’m a lazy programmer.
OK, so once you’ve built the shim (gcc -O2 -Wall -ldl -fPIC -o malloc_wrap.so -shared malloc_wrap.c), you can launch your binary like this:
You should now see output in /tmp/malloc-footprint.pid
Step 2: Translate the data into a more usable format
Yeah, I should have went back and re-written the shim, but nothing happens exactly as planned. So, I wrote a quick ruby script to convert my output into a more usable format. The script sorts through the output and renames memory addresses to unique integer ids starting at 1 (0 is hardcoded to NULL).
The format is pretty simple. The first line of the file has the number of calls to malloc-family functions, followed by a blank line, and then the memory footprint. Each line of the memory footprint has 1 character which represents the function called followed by a few arguments. For the free() function, there is only one argument, the ID of the memory block to free. malloc/calloc/realloc have different arguments, but the first argument following the one character is always the ID of the return value. The next arguments are the arguments usually passed to malloc/calloc/realloc in the same order.
Have a look at my ruby script here: build_trace_file.rb
It might take a while to convert your data to this format, I suggest running this in a screen session, especially if your memory footprint data is large. Just as a warning, we collected 15 *gigabytes* of data over a 10 hour period. This script took *10 hours* to convert the data. We ended up with a 7.8 gigabyte file.
Step 3: Replay the allocation data with different allocators and measure time, memory usage.
OK, so we now have a file which represents the memory footprint of our application. It’s time to build the replayer, link against your malloc implementation of choice, fire it up and start measuring time spent in allocator functions and memory usage.
Have a look at the replayer here: alloc_tester.c
Build the replayer: gcc -ggdb -Wall -ldl -fPIC -o tester alloc_tester.c
Use ltrace
ltrace is similar to strace, but for library calls. You can use ltrace -c to sum the amount of time spent in different library calls and output a cool table at the end, it will look something like this:
% time seconds usecs/call calls function ------ ----------- ----------- --------- -------------------- 86.70 37.305797 62 600003 fscanf 10.64 4.578968 33 138532 malloc 2.36 1.014294 18 55263 free 0.25 0.109550 18 5948 realloc 0.03 0.011407 45 253 printf 0.02 0.010665 42 252 puts 0.00 0.000167 20 8 calloc 0.00 0.000048 48 1 fopen ------ ----------- ----------- --------- -------------------- 100.00 43.030896 800260 total
Conclusion
Using a different malloc implementation can provide a speed/memory increases depending on your allocation patterns. Hopefully the code provided will help you test different allocators to determine whether or not swapping out the default libc allocator is the right choice for you. Our results are still pending; we had a lot of allocator data (15g!) and it takes several hours to replay the data with just one malloc implementation. Once we’ve gathered some data about the different implementations and their effects, I’ll post the results and some analysis. As always, stay tuned and thanks for reading!

