Archive for the ‘linux’ tag
Dynamic Linking: ELF vs. Mach-O

If you enjoy this article, subscribe (via RSS or e-mail) and follow me on twitter.
The intention of this post is to highlight some of the similarities and differences between ELF and Mach-O dynamic linking that I encountered while building memprof.
I hope to write more posts about similarities and differences in other aspects of Mach-O and ELF that I stumbled across to shed some light on what goes on down there and provide (in some cases) the only documentation.
Procedure Linkage Table
The procedure linkage table (PLT) is used to determine the absolute address of a function at runtime. Both Mach-O and ELF objects have PLTs that are generated when the object is built (by the link editor). The initial table simply invokes the dynamic linker which finds the symbol you want. The way this works is very similar at a high level in ELF and Mach-O, but there are some implementation differences that I thought were worth mentioning.
Mach-O PLT arrangement
Mach-O objects have several different sections across different segments that are all involved to create a PLT entry for a specific symbol.
Consider the following assembly stub which calls out to the PLT entry for malloc:
# MACH-O calling a PLT entry (ELF is nearly identical)
0x000000010008c504 [str_new+52]: callq 0x10009ebbc [dyld_stub_malloc]
The dyld_stub prefix is added by GDB to let the user know that the callq instruction is calling a PLT entry and not malloc itself. The address 0x10009ebbc is the first instruction of malloc’s PLT entry in this Mach-O object. In Mach-O terminology, the instruction at 0x10009ebbc is called a symbol stub. Symbol stubs in Mach-O objects are found in the __TEXT segment in the __symbol_stub1 section.
Let’s examine some instructions at the symbol stub address above:
# MACH-O "symbol stubs" for malloc and other functions
0x10009ebbc [dyld_stub_malloc]: jmpq *0x3ae46(%rip) # 0x1000d9a08
0x10009ebc2 [dyld_stub_realloc]: jmpq *0x3ae48(%rip) # 0x1000d9a10
0x10009ebc8 [dyld_stub_seekdir$INODE64]: jmpq *0x3ae4c(%rip) # 0x1000d9a20
. . . .
Each Mach-O symbol stub is just a single jmpq instruction. Via an entry in a table, that jmpq instruction either:
- Invokes the dynamic linker to find the symbol and transfer execution there, or
- Transfers execution directly to the function.
In the example above, GDB is telling us that the address of the table entry for malloc is 0x1000d9a08. This table entry is stored in a section called __la_symbol_ptr within the __DATA segment.
Before malloc has been resolved, the address in that table entry points to a helper function which (eventually) invokes the dynamic linker to find malloc and fill in its address in the table entry.
Let’s take a look at what a few entries of the helper functions look like:
# MACH-O stub helpers
0x1000a08d4 [stub helpers+6986]: pushq $0x3b73
0x1000a08d9 [stub helpers+6991]: jmpq 0x10009ed8a [stub helpers]
0x1000a08de [stub helpers+6996]: pushq $0x3b88
0x1000a08e3 [stub helpers+7001]: jmpq 0x10009ed8a [stub helpers]
0x1000a08e8 [stub helpers+7006]: pushq $0x3b9e
0x1000a08ed [stub helpers+7011]: jmpq 0x10009ed8a [stub helpers]
. . . .
Each symbol that has a PLT entry has 2 instructions above; a pair of pushq and jmpq. This instruction sequence sets an ID for the desired function and then invokes the dynamic linker. The dynamic linker looks up this ID so it knows which function it should be looking for.
ELF PLT arrangement
ELF objects have the same mechanism, but organize each PLT entry into chunks instead of splicing them out across different sections. Let’s take a look at a PLT entry for malloc in an ELF object:
# ELF complete PLT entry for malloc
0x40f3d0 [malloc@plt]: jmpq *0x2c91fa(%rip) # 0x6d85d0
0x40f3d6 [malloc@plt+6]: pushq $0x2f
0x40f3db [malloc@plt+11]: jmpq 0x40f0d0
. . . .
Much like a Mach-O object, an ELF object uses a table entry to direct the flow of execution to either invoke the dynamic linker or transfer directly to the desired function if it has already been resolved.
Two differences to point out here:
- ELF puts the entire PLT entry together in a nicely named section called .plt instead of splicing it out across multiple sections.
- The table entries indirected through with the initial jmpq instruction are stored in a section named .got.plt.
Both invoke an assembly trampoline…
Both Mach-O and ELF objects are set up to invoke the runtime dynamic linker. Both need an assembly trampoline to bridge the gap between the application and the linker. On 64-bit Intel based systems, linkers in both systems must comply with the same Application Binary Interface (ABI).
Strangely enough, the two linkers have slightly different assembly trampolines even though they share the same calling convention.
Both trampolines ensure that the program stack is 16-byte aligned to comply with the amd64 ABI’s calling convention. Both trampolines also take care to save the “general purpose” caller-saved registers prior to invoking the dynamic linker, but it turns out that the trampoline in Linux does not save or restore the SSE registers. This “shouldn’t” matter, so long as glibc takes care not to use any of those registers in the dynamic linker. OSX takes a more conservative approach and saves and restores the SSE registers before and after calling out to the dynamic linker.
I’ve included a snippet from the two trampolines below and some comments so you can see the differences up close.
Different trampolines for the same ABI
The OSX trampoline:
dyld_stub_binder:
  pushq %rbp
  movq %rsp,%rbp
  subq $STACK_SIZE,%rsp # at this point stack is 16-byte aligned because two meta-parameters were pushed
  movq %rdi,RDI_SAVE(%rsp) # save registers that might be used as parameters
  movq %rsi,RSI_SAVE(%rsp)
  movq %rdx,RDX_SAVE(%rsp)
  movq %rcx,RCX_SAVE(%rsp)
  movq %r8,R8_SAVE(%rsp)
  movq %r9,R9_SAVE(%rsp)
  movq %rax,RAX_SAVE(%rsp)
  movdqa %xmm0,XMMM0_SAVE(%rsp)
  movdqa %xmm1,XMMM1_SAVE(%rsp)
  movdqa %xmm2,XMMM2_SAVE(%rsp)
  movdqa %xmm3,XMMM3_SAVE(%rsp)
  movdqa %xmm4,XMMM4_SAVE(%rsp)
  movdqa %xmm5,XMMM5_SAVE(%rsp)
  movdqa %xmm6,XMMM6_SAVE(%rsp)
  movdqa %xmm7,XMMM7_SAVE(%rsp)
  movq MH_PARAM_BP(%rbp),%rdi # call fastBindLazySymbol(loadercache, lazyinfo)
  movq LP_PARAM_BP(%rbp),%rsi
  call __Z21_dyld_fast_stub_entryPvl
The OSX trampoline saves all the caller-saved registers as well as the %xmm0 - %xmm7 registers prior to invoking the dynamic linker with that last call instruction. These registers are all restored after the call instruction, but I left that out for the sake of brevity.
The Linux trampoline:
  subq $56,%rsp
  cfi_adjust_cfa_offset(72) # Incorporate PLT
  movq %rax,(%rsp) # Preserve registers otherwise clobbered.
  movq %rcx, 8(%rsp)
  movq %rdx, 16(%rsp)
  movq %rsi, 24(%rsp)
  movq %rdi, 32(%rsp)
  movq %r8, 40(%rsp)
  movq %r9, 48(%rsp)
  movq 64(%rsp), %rsi # Copy args pushed by PLT in register.
  movq %rsi, %r11 # Multiply by 24
  addq %r11, %rsi
  addq %r11, %rsi
  shlq $3, %rsi
  movq 56(%rsp), %rdi # %rdi: link_map, %rsi: reloc_offset
  call _dl_fixup # Call resolver.
The Linux trampoline doesn’t touch the SSE registers because it assumes that the dynamic linker will not modify them thus avoiding a save and restore.
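The "Multiply by 24" sequence in the Linux trampoline turns the index pushed by the PLT entry into a byte offset into the relocation table; 24 is the size of one Elf64_Rela record on x86_64. A minimal sketch of the same arithmetic in C follows. The struct `rela64` is a local stand-in for `Elf64_Rela`, and `reloc_byte_offset` is a hypothetical helper name, not something taken from glibc.

```c
#include <stdint.h>

/* Stand-in for Elf64_Rela from <elf.h>: three 8-byte fields, 24 bytes total. */
struct rela64 {
    uint64_t r_offset;
    uint64_t r_info;
    int64_t  r_addend;
};

/* Mirrors the trampoline's add/add/shift sequence: (i + i + i) << 3 == i * 24. */
static uint64_t reloc_byte_offset(uint64_t index)
{
    uint64_t off = index;   /* movq %rsi, %r11 (keep a copy of the index) */
    off += index;           /* addq %r11, %rsi */
    off += index;           /* addq %r11, %rsi */
    return off << 3;        /* shlq $3, %rsi   */
}
```

For the `pushq $0x2f` in the malloc@plt entry shown earlier, this works out to 47 * 24 = 1128 bytes into the relocation table.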
Conclusion
- Tracing program execution from call site to the dynamic linker is pretty interesting and there is a lot to learn along the way.
- glibc not saving and restoring %xmm0-%xmm7 kind of scares me, but there is a unit test included that disassembles the built ld.so, searching it to make sure that those registers are never touched. It is still a bit frightening.
- Stay tuned for more posts explaining other interesting similarities and differences between Mach-O and ELF coming soon.
Thanks for reading and don’t forget to subscribe (via RSS or e-mail) and follow me on twitter.
String together global offset tables to build a Ruby memory profiler

Disclaimer
The tricks, techniques, and ugly hacks in this article are PLATFORM SPECIFIC, DANGEROUS, and NOT PORTABLE.
This is the third article in a series of articles describing a set of low level hacks that I used to create memprof, a Ruby-level memory profiler. You should be able to survive without reading the other articles in this series, but you can check them out here and here.
How is this different from the other hooking articles/techniques?
The previous articles explained how to insert trampolines in the .text segment of a binary. This article explains a cool technique for hooking functions in the .text segment of shared libraries, allowing your handler to run, and then resuming execution. Hooking shared libraries turns out to be less work than hooking the binary (in the case of Ruby, that is), but making it all happen was a bit tricky. Read on to learn more.
The “problem” with shared libraries
The problem is that if a trampoline is inserted into the code of the shared library, the trampoline will need to invoke the dynamic linker to resolve the function that is being hooked, call the function, do whatever additional logic is desired, and then resume execution.
In other words you need to (somehow) insert a trampoline for a function that will call the function being trampolined without ending up in an infinite loop.
The additional complexity occurs because when shared libraries are loaded, the kernel decides at runtime where exactly in memory the library should be loaded. Since the exact location of symbols is not known at link time, a procedure linkage table (.plt) is created so that the program and the dynamic linker can work together to resolve symbol addresses.
I explained how .plts work in a previous article, but looking at this again is worthwhile. I’ve simplified the explanation a bit, but at a high level:
- The program calls a function in a shared object; the link editor makes sure that the program jumps to a stub function in the .plt.
- The program sets some data up for the dynamic linker and then hands control over to it.
- The dynamic linker looks at the info set up by the program and fills in the absolute address of the called function in the stub’s entry in the global offset table (.got).
- Then the dynamic linker calls the function.
- Subsequent calls to the same function jump to the same stub in the .plt, but every time after the first call the absolute address is already in the .got (because when the dynamic linker is invoked the first time, it fills in the absolute address in the .got).
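The steps above can be modeled in plain C with a function pointer standing in for the .got entry. This is only a toy model of lazy binding under simplified assumptions; every name in it (`got_slot`, `lazy_stub`, `call_via_plt`, `real_double`) is made up for illustration, and a real dynamic linker does far more work.

```c
typedef int (*func_t)(int);

static int resolver_calls = 0;

/* The "real" function living in some shared object. */
static int real_double(int x) { return 2 * x; }

static int lazy_stub(int x);

/* Stand-in for a .got entry: starts out pointing at the lazy stub. */
static func_t got_slot = lazy_stub;

/* Stand-in for the dynamic linker: resolve, patch the slot, then call. */
static int lazy_stub(int x)
{
    resolver_calls++;
    got_slot = real_double;   /* fill in the absolute address */
    return got_slot(x);       /* transfer to the resolved function */
}

/* Stand-in for the .plt stub: always an indirect jump through the slot. */
static int call_via_plt(int x) { return got_slot(x); }
```

The first `call_via_plt` goes through the resolver once; every later call lands directly in `real_double` because the slot has already been patched.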
Disassembling a short Ruby VM function that calls rb_newobj (a memory allocation routine that we’d like to hook), shows the calls to the .plt:
000000000001af10:
. . . .
1af14: e8 e7 c6 ff ff callq 17600 [rb_newobj@plt]
. . . .
Let’s take a look at the corresponding .plt stub:
0000000000017600:
17600: ff 25 6a 9c 2c 00 jmpq *0x2c9c6a(%rip) # 2e1270 [_GLOBAL_OFFSET_TABLE_+0x288]
17606: 68 4e 00 00 00 pushq $0x4e
1760b: e9 00 fb ff ff jmpq 17110 <_init+0x18>
Important fact: The program and each shared library have their own .plt and .got sections (amongst other sections). Keep this in mind as it’ll be handy very shortly.
That is a lot of stub code to reproduce in the trampoline. Reproducing that stuff in the trampoline shouldn’t be hard, but invites a large number of bugs over to play. Is there a better way?
What is a global offset table (.got)?
The global offset table (.got) is a table of absolute addresses that can be filled in at runtime. In the assembly dump above, the .got entry for rb_newobj is referenced in the .plt stub code.
Intercepting a function call
It would be awesome if it were possible to overwrite the .got entry for rb_newobj and insert the address of a trampoline. But how would the intercepting function call rb_newobj itself without ending up in an infinite loop?
The important fact above comes in to save the day.
Since each shared object has its own .plt and .got sections, it is possible to overwrite the .got entry for rb_newobj in every shared object except for the object where the trampoline lives. Then, when rb_newobj is called, the .plt entry will redirect execution to the trampoline. The trampoline then calls out to its .plt entry for rb_newobj which is left untouched allowing rb_newobj to be resolved and called out to successfully.
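To make the "every object except the hook's own" idea concrete, here is a small simulation in C. It does not touch any real .got; each `struct object` simply holds one function pointer that plays the role of that object's .got entry for rb_newobj. The object names and all identifiers are hypothetical.

```c
#include <stddef.h>

typedef int (*alloc_fn)(void);

static int real_calls = 0;
static int hook_calls = 0;

/* Plays the role of rb_newobj. */
static int real_newobj(void) { return ++real_calls; }

/* One slot per "shared object", since each object has its own .got. */
struct object { const char *name; alloc_fn got_newobj; };

static struct object objects[] = {
    { "ruby",       real_newobj },
    { "libfoo.so",  real_newobj },
    { "memprof.so", real_newobj },   /* where the hook lives */
};
enum { HOOK_OBJECT = 2 };

/* The hook calls through its own, untouched slot, so it cannot recurse. */
static int hook_newobj(void)
{
    hook_calls++;
    return objects[HOOK_OBJECT].got_newobj();
}

/* Poison every slot except the one belonging to the hook's own object. */
static void install_hook(void)
{
    for (size_t i = 0; i < sizeof(objects) / sizeof(objects[0]); i++)
        if (i != HOOK_OBJECT)
            objects[i].got_newobj = hook_newobj;
}
```

After `install_hook()`, a call through the "ruby" slot runs the hook and then the real function, while a call through the hook object's own slot goes straight to the real function.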
Not as easy as it sounds, though
This solution is less work than the other hooking methods, but it has its own particular details as well:
- You’ll need to walk the link map at runtime to determine the base address for the shared library you are hooking (it could be anywhere).
- Next, you’ll need to parse the .rela.plt section which contains information on the location of each .plt stub, relative to the base address of the shared object.
- Once you have the address of the .plt stub, you’ll need to determine the absolute address of the .got entry by parsing the first instruction of the .plt stub (a jmp) as seen in the disassembly above.
- Finally, you can write to the .got entry the address of your trampoline, as long as the trampoline lives in a different shared library.
You’ve now successfully managed to poison the .got entry of a symbol in one shared library to direct execution to your own function which can then call the intercepted function itself without getting stuck in an infinite loop.
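The third step (parsing the jmp at the start of a .plt stub) can be sketched as follows. This is an illustration rather than memprof's actual code: `plt_stub_got_entry` is a hypothetical name, it assumes a little-endian host, and it assumes the stub begins with the `ff 25` form of jmpq seen in the disassembly above.

```c
#include <stdint.h>
#include <string.h>

/*
 * Decode `jmpq *disp32(%rip)` (opcode bytes ff 25) at the start of a .plt
 * stub. stub_addr is the address of the stub itself; on success *got_entry
 * receives the address of the .got slot and 0 is returned.
 */
static int plt_stub_got_entry(const uint8_t *stub, uint64_t stub_addr,
                              uint64_t *got_entry)
{
    int32_t disp;

    if (stub[0] != 0xff || stub[1] != 0x25)
        return -1;                          /* not the expected jmp */

    memcpy(&disp, stub + 2, sizeof(disp));  /* disp32, little-endian host */

    /* RIP-relative: the base is the address of the next instruction,
     * which is 6 bytes past the start of this one. */
    *got_entry = stub_addr + 6 + (int64_t)disp;
    return 0;
}
```

Fed the six bytes of the rb_newobj stub at 17600, this yields 2e1270, matching the address objdump printed. At runtime the stub address would be the object's base address plus that offset.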
Conclusion
- There are lots of sections in each ELF object. Each section is special and important.
- ELF documentation can be difficult to obtain and understand.
- Got pretty lucky this time around. I was getting a little worried that it would get complicated. Made it out alive, though.
memprof: A Ruby level memory profiler

What is memprof and why do I care?
memprof is a Ruby gem which supplies memory profiler functionality similar to bleak_house without patching the Ruby VM. You just install the gem, call a function or two, and off you go.
Where do I get it?
memprof is available on gemcutter, so you can just:
gem install memprof
Feel free to browse the source code at: http://github.com/ice799/memprof.
How do I use it?
Using memprof is simple. Before we look at some examples, let me explain more precisely what memprof is measuring.
memprof is measuring the number of objects created and not destroyed during a segment of Ruby code. The ideal use case for memprof is to show you where objects that do not get destroyed are being created:
- Objects are created and not destroyed when you create new classes. This is a good thing.
- Sometimes garbage objects sit around until garbage_collect has had a chance to run. These objects will go away.
- Yet in other cases you might be holding a reference to a large chain of objects without knowing it. Until you remove this reference, the entire chain of objects will remain in memory taking up space.
memprof will show objects created in all cases listed above.
OK, now let’s take a look at two examples and their output.
A simple program with an obvious memory “leak”:
require 'memprof'
@blah = Hash.new([])
Memprof.start
100.times {
  @blah[1] << "aaaaa"
}
1000.times {
  @blah[2] << "bbbbb"
}
Memprof.stats
Memprof.stop
This program creates 1100 objects which are not destroyed between the start and stop calls because references are held for each object created.
Let's look at the output from memprof:
1000 test.rb:11:String
100 test.rb:7:String
In this example memprof shows the 1100 objects created, broken up by file, line number, and type.
Let's take a look at another example:
require 'memprof'
Memprof.start
require "stringio"
StringIO.new
Memprof.stats
This simple program is measuring the number of objects created when requiring stringio.
Let's take a look at the output:
108 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:__node__
14 test2.rb:3:String
2 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:Class
1 test2.rb:4:StringIO
1 test2.rb:4:String
1 test2.rb:3:Array
1 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:Enumerable
This output shows an internal Ruby interpreter type __node__ was created (these represent code), as well as a few Strings and other objects. Some of these objects are just garbage objects which haven't had a chance to be recycled yet.
What if we nudge the garbage collector along a little bit just for our example? Let's add the following two lines of code to our previous example:
GC.start
Memprof.stats
We're now nudging the garbage collector and outputting memprof stats information again. This should show fewer objects, as the garbage collector will recycle some of the garbage objects:
108 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:__node__
2 test2.rb:3:String
2 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:Class
1 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:Enumerable
As you can see above, a few Strings and other objects went away after the garbage collector ran.
Which Rubies and systems are supported?
- Only unstripped binaries are supported. To determine if your Ruby binary is stripped, simply run: file `which ruby`. If it is stripped, consult your package manager's documentation. Most Linux distributions offer a package with an unstripped Ruby binary.
- Only x86_64 is supported at this time. Hopefully, I'll have time to add support for i386/i686 in the immediate future.
- Linux Ruby Enterprise Edition (1.8.6 and 1.8.7) is supported.
- Linux MRI Ruby 1.8.6 and 1.8.7 built with --disable-shared are supported. Support for --enable-shared binaries is coming soon.
- Snow Leopard support is experimental at this time.
- Ruby 1.9 support coming soon.
How does it work?
If you've been reading my blog over the last week or so, you'd have noticed two previous blog posts (here and here) that describe some tricks I came up with for modifying a running binary image in memory.
memprof is a combination of all those tricks and other hacks to allow memory profiling in Ruby without the need for custom patches to the Ruby VM. You simply require the gem and off you go.
memprof works by inserting trampolines on object allocation and deallocation routines. It gathers metadata about the objects and outputs this information when the stats method is called.
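As a rough picture of the bookkeeping this implies, the sketch below keeps a table of live objects keyed by address and tagged with the file and line that created them. It is not memprof's actual data structure; the fixed-size array, the linear scans, and the names `track_alloc`, `track_free`, and `live_count` are all assumptions made to keep the example short.

```c
#include <stddef.h>
#include <string.h>

#define MAX_LIVE 1024

struct live_obj { const void *addr; const char *file; int line; };

static struct live_obj live[MAX_LIVE];
static size_t live_used = 0;

/* Called from the allocation hook: remember where the object was created. */
static int track_alloc(const void *addr, const char *file, int line)
{
    if (live_used == MAX_LIVE)
        return -1;                         /* table full */
    live[live_used++] = (struct live_obj){ addr, file, line };
    return 0;
}

/* Called from the deallocation hook: forget the object. */
static void track_free(const void *addr)
{
    for (size_t i = 0; i < live_used; i++) {
        if (live[i].addr == addr) {
            live[i] = live[--live_used];   /* swap-remove */
            return;
        }
    }
}

/* Count objects created at file:line that have not been freed yet. */
static size_t live_count(const char *file, int line)
{
    size_t n = 0;
    for (size_t i = 0; i < live_used; i++)
        if (live[i].line == line && strcmp(live[i].file, file) == 0)
            n++;
    return n;
}
```

A stats routine would then walk the table and print one count per file, line, and type, much like the output shown in the examples above.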
What else is planned?
Jake Douglas, Aman Gupta, and I have lots of interesting ideas for new features. We don't want to ruin the surprise, but stay tuned. More cool stuff coming really soon :)
Hot patching inlined functions with x86_64 asm metaprogramming

Disclaimer
The tricks, techniques, and ugly hacks in this article are PLATFORM SPECIFIC, DANGEROUS, and NOT PORTABLE.
This article will make reference to information in my previous article Rewrite your Ruby VM at runtime to hot patch useful features so be sure to check it out if you find yourself lost during this article.
Also, this might not qualify as metaprogramming in the traditional definition, but this article will show how to generate assembly at runtime that works well with the particular instructions generated for a binary. In other words, the assembly is constructed based on data collected from the binary at runtime. When I explained this to Aman, he called it assembly metaprogramming.
TLDR
This article expands on a previous article by showing how to hook functions which are inlined by the compiler. This technique can be applied to other binaries, but the binary in question is Ruby Enterprise Edition 1.8.7. The use case is to build a memory profiler without requiring patches to the VM, but just a Ruby gem.
It’s on GitHub
The memory profiler is NOT DONE, yet. It will be soon. Stay tuned.
The code described here is incorporated into a Ruby Gem which can be found on github: http://github.com/ice799/memprof specifically at: http://github.com/ice799/memprof/blob/master/ext/memprof.c#L202-318
Overview of the plan of attack
The plan of attack is relatively straightforward:
- Find the inlined code.
- Overwrite part of it to redirect to a stub.
- Call out to a handler from the stub.
- Make sure the return path is sane.
As simple as this seems, implementing these steps is actually a bit tricky.
Finding pieces of inlined code
Before finding pieces of inlined code, let’s first examine the C code we want to hook. I’m going to be showing how to hook the inline function add_freelist.
The code for add_freelist is short:
static inline void
add_freelist(p)
    RVALUE *p;
{
    if (p->as.free.flags != 0)
        p->as.free.flags = 0;
    if (p->as.free.next != freelist)
        p->as.free.next = freelist;
    freelist = p;
}
There is one really important feature of this code which stands out almost immediately. freelist has (at least) compilation unit scope. This is awesome because freelist serves as a marker when searching for assembly instructions to overwrite. Since the freelist has compilation unit scope, it’ll live at some static memory location.
If we find writes to this static memory location, we find our inline function code.
Let’s take a look at the instructions generated from this C code (unrelated instructions snipped out):
437f21: 48 c7 00 00 00 00 00 movq $0x0,(%rax)
. . . . .
437f2c: 48 8b 05 65 de 2d 00 mov 0x2dde65(%rip),%rax # 715d98 [freelist]
. . . . .
437f48: 48 89 05 49 de 2d 00 mov %rax,0x2dde49(%rip) # 715d98 [freelist]
The last instruction above updates freelist; it is the instruction generated for the C statement freelist = p;.
As you can see from the instruction, the destination is freelist. This makes it insanely easy to locate instances of this inline function. Just need to write a piece of C code which scans the binary image in memory, searching for mov instructions where the destination is freelist and I’ve found the inlined instances of add_freelist.
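A self-contained sketch of such a scanner is shown below, working on a byte buffer plus the address it was loaded at. The function names are hypothetical, a little-endian host is assumed, and the ModRM check is an extra bit of strictness added for this example. Scanning byte by byte can in principle match in the middle of some other instruction, so treat this as an illustration of the idea only.

```c
#include <stdint.h>
#include <string.h>

/*
 * Return 1 if the 7 bytes at insn encode `mov %reg, disp32(%rip)` and the
 * RIP-relative destination equals target; insn_addr is the address of insn.
 */
static int is_mov_to(const uint8_t *insn, uint64_t insn_addr, uint64_t target)
{
    int32_t disp;

    if (insn[0] != 0x48 && insn[0] != 0x4c)  /* REX.W, optionally with REX.R */
        return 0;
    if (insn[1] != 0x89)                     /* mov r64 -> r/m64 */
        return 0;
    if ((insn[2] & 0xc7) != 0x05)            /* ModRM mod=00, rm=101: RIP+disp32 */
        return 0;

    memcpy(&disp, insn + 3, sizeof(disp));
    /* The displacement is relative to the next instruction (7 bytes on). */
    return insn_addr + 7 + (int64_t)disp == target;
}

/* Scan [code, code+len) and return the offset of the first match, or -1. */
static long find_mov_to(const uint8_t *code, size_t len,
                        uint64_t base_addr, uint64_t target)
{
    for (size_t i = 0; i + 7 <= len; i++)
        if (is_mov_to(code + i, base_addr + i, target))
            return (long)i;
    return -1;
}
```

With the bytes `48 89 05 49 de 2d 00` placed at 437f48 and a target of 715d98, the check succeeds, which agrees with the disassembly above.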
Why not insert a trampoline by overwriting that last mov instruction?
Overwriting with a jmp
The mov instruction above is 7 bytes wide. As long as the instruction we’re going to implant is 7 bytes or thinner, everything is good to go. Using a callq is out of the question because we can’t ensure the stack is 16-byte aligned as per the x86_64 ABI. As it turns out, a jmp instruction that uses a 32-bit displacement from the instruction pointer only requires 5 bytes. We’ll be able to implant the instruction that’s needed, and even have room to spare.
I created a struct to encapsulate this short 7 byte trampoline. 5 bytes for the jmp, 2 bytes for NOPs. Let’s take a look:
struct tramp_inline tramp = {
  .jmp = {'\xe9'},
  .displacement = 0,
  .pad = {'\x90', '\x90'},
};
Let’s fill in the displacement later, after actually finding the instruction that’s going to get overwritten.
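The definition of struct tramp_inline isn't shown here, so the layout below is a guess: a packed struct that is exactly 7 bytes, using the GCC/Clang `__attribute__((__packed__))` extension. The helper `build_stage1` is hypothetical too. It also adds a range check that the snippets in this article do not show, since a jmp with a 32-bit displacement can only reach targets within roughly 2GB.

```c
#include <stdint.h>

/* Assumed layout; packed so the struct is exactly the 7 bytes written. */
struct tramp_inline {
    unsigned char jmp[1];
    uint32_t displacement;
    unsigned char pad[2];
} __attribute__((__packed__));

_Static_assert(sizeof(struct tramp_inline) == 7, "must match the 7-byte mov");

/*
 * Fill in a stage 1 trampoline that jumps from insn_addr to dest.
 * Returns -1 if dest is out of reach of a 32-bit displacement.
 */
static int build_stage1(struct tramp_inline *t, uint64_t insn_addr,
                        uint64_t dest)
{
    /* jmp rel32 is relative to the next instruction, 5 bytes in. */
    int64_t delta = (int64_t)(dest - (insn_addr + 5));

    if (delta < INT32_MIN || delta > INT32_MAX)
        return -1;

    t->jmp[0] = 0xe9;                         /* jmp rel32 */
    t->displacement = (uint32_t)(int32_t)delta;
    t->pad[0] = 0x90;                         /* nop */
    t->pad[1] = 0x90;                         /* nop */
    return 0;
}
```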
So, to find the instruction that’ll be overwritten, just look for a mov opcode and check that the destination is freelist:
/* make sure it is a mov instruction */
if (byte[1] == '\x89') {
  /* Read the REX byte to make sure it is a mov that we care about */
  if ( (byte[0] == '\x48') ||
       (byte[0] == '\x4c') ) {
    /* Grab the target of the mov. REMEMBER: in this case the target is
     * a 32-bit displacement that gets added to RIP (where RIP is the address of
     * the next instruction).
     */
    mov_target = *(uint32_t *)(byte + 3);
    /* Sanity check. Ensure that the displacement from freelist to the next
     * instruction matches the mov_target. If so, we know this mov is
     * updating freelist.
     */
    if ( (freelist - (void *)(byte+7) ) == mov_target) {
At this point we’ve definitely found a mov instruction with freelist as the destination. Let’s calculate the displacement to the stage 2 trampoline for our jmp instruction and write the instruction into memory.
/* Setup the stage 1 trampoline. Calculate the displacement to
 * the stage 2 trampoline from the next instruction.
 *
 * REMEMBER!!!! The next instruction will be NOP after our stage 1
 * trampoline is written. This is 5 bytes into the structure, even
 * though the original instruction we overwrote was 7 bytes.
 */
tramp.displacement = (uint32_t)(destination - (void *)(byte+5));
/* Figure out what page the stage 1 tramp is gonna be written to, mark
 * it WRITE, write the trampoline in, and then remove WRITE permission.
 */
aligned_addr = page_align(byte);
mprotect(aligned_addr, (void *)byte - aligned_addr + 10,
         PROT_READ|PROT_WRITE|PROT_EXEC);
memcpy(byte, &tramp, sizeof(struct tramp_inline));
mprotect(aligned_addr, (void *)byte - aligned_addr + 10,
         PROT_READ|PROT_EXEC);
Cool, all that’s left is to build the stage 2 trampoline which will set everything up for the C level handler.
An assembly stub to set the stage for our C handler
So, what does the assembly need to do to call the C handler? Quite a bit actually so let’s map it out, step by step:
- Replicate the instruction which was overwritten so that the object is actually added to the freelist.
- Save the value of the rdi register. This register is where the first argument to a function lives and will store the obj that was added to the freelist for the C handler to do analysis on.
- Load the object being added to the freelist into rdi.
- Save the value of rbx so that we can use the register as an operand for an absolute indirect callq instruction.
- Save rbp and rsp to allow a way to undo the stack alignment later.
- Align the stack to a 16-byte boundary to comply with the x86_64 ABI.
- Move the address of the handler into rbx.
- Call the handler through rbx.
- Restore rbp, rsp, rdi, rbx.
- Jump back to the instruction after the instruction which was overwritten.
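The alignment step in the list above is just a bit mask: clearing the low four bits of the stack pointer rounds it down to a multiple of 16. Rounding down is safe because the stack grows toward lower addresses, and the old value can be recovered later because rsp was copied into rbp first. A tiny C illustration, with a hypothetical function name:

```c
#include <stdint.h>

/* Same effect on a stack pointer value as `and $-16, %rsp`. */
static uint64_t align_stack_16(uint64_t rsp)
{
    return rsp & ~(uint64_t)0xf;
}
```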
To accomplish this let’s build out a structure with as much set up as possible and fill in the displacement fields later. This “base” struct looks like this:
struct inline_tramp_tbl_entry inline_ent = {
  .rex = {'\x48'},
  .mov = {'\x89'},
  .src_reg = {'\x05'},
  .mov_displacement = 0,
  .frame = {
    .push_rdi = {'\x57'},
    .mov_rdi = {'\x48', '\x8b', '\x3d'},
    .rdi_source_displacement = 0,
    .push_rbx = {'\x53'},
    .push_rbp = {'\x55'},
    .save_rsp = {'\x48', '\x89', '\xe5'},
    .align_rsp = {'\x48', '\x83', '\xe4', '\xf0'},
    .mov = {'\x48', '\xbb'},
    .addr = error_tramp,
    .callq = {'\xff', '\xd3'},
    .leave = {'\xc9'},
    .rbx_restore = {'\x5b'},
    .rdi_restore = {'\x5f'},
  },
  .jmp = {'\xe9'},
  .jmp_displacement = 0,
};
So, what’s left to do:
- Copy the REX and source register bytes of the instruction which was overwritten to replicate it.
- Calculate the displacement to freelist to fully generate the overwritten mov.
- Calculate the displacement to freelist so that it can be stored in rdi as an argument to the C handler.
- Fill in the absolute address for the handler.
- Calculate the displacement to the instruction after the stage 1 trampoline in order to jmp back to resume execution as normal.
Doing that is relatively straightforward. Let’s take a look at the C snippets that make this happen:
/* Before the stage 1 trampoline gets written, we need to generate
 * the code for the stage 2 trampoline. Let's copy over the REX byte
 * and the byte which mentions the source register into the stage 2
 * trampoline.
 */
inl_tramp_st2 = inline_tramp_table + entry;
inl_tramp_st2->rex[0] = byte[0];
inl_tramp_st2->src_reg[0] = byte[2];
. . . . .
/* Finish setting up the stage 2 trampoline. */
/* calculate the displacement to freelist from the next instruction.
 *
 * This is used to replicate the original instruction we overwrote.
 */
inl_tramp_st2->mov_displacement = freelist - (void *)&(inl_tramp_st2->frame);
/* fill in the displacement to freelist from the next instruction.
 *
 * This is to arrange for the new value in freelist to be in %rdi, and as such
 * be the first argument to the C handler. As per the amd64 ABI.
 */
inl_tramp_st2->frame.rdi_source_displacement = freelist -
                                 (void *)&(inl_tramp_st2->frame.push_rbx);
/* jmp back to the instruction after stage 1 trampoline was inserted
 *
 * This can be 5 or 7, it doesn't matter. If it's 5, we'll hit our 2
 * NOPs. If it's 7, we'll land directly on the next instruction.
 */
inl_tramp_st2->jmp_displacement = (uint32_t)((void *)(byte + 7) -
                                 (void *)(inline_tramp_table + entry + 1));
/* write the address of our C level trampoline in to the structure */
inl_tramp_st2->frame.addr = freelist_tramp;
Awesome.
We’ve successfully patched the binary in memory, inserted an assembly stub which was generated at runtime, called a hook function, and ensured that execution can resume normally.
So, what’s the status on that memory profiler?
Almost done, stay tuned for more updates coming SOON.
Conclusion
- Hackery like this is unmaintainable, unstable, stupid, but also fun to work on and think about.
- Being able to hook add_freelist like this provides the last tool needed to implement a version of bleak_house (a Ruby memory profiler) without patching the Ruby VM.
- x86_64 instruction set is a painful instruction set.
- Use the GNU assembler (gas) instead of trying to generate opcodes by reading the Intel instruction set PDFs if you value your sanity.
Debugging Ruby: Understanding and Troubleshooting the VM and your Application
Download the PDF here.

