Hot patching inlined functions with x86_64 asm metaprogramming

Disclaimer
The tricks, techniques, and ugly hacks in this article are PLATFORM SPECIFIC, DANGEROUS, and NOT PORTABLE.
This article will make reference to information in my previous article Rewrite your Ruby VM at runtime to hot patch useful features so be sure to check it out if you find yourself lost during this article.
Also, this might not qualify as metaprogramming in the traditional definition [1], but this article will show how to generate assembly at runtime that works well with the particular instructions generated for a binary. In other words, the assembly is constructed based on data collected from the binary at runtime. When I explained this to Aman, he called it assembly metaprogramming.
TLDR
This article expands on a previous article by showing how to hook functions which are inlined by the compiler. This technique can be applied to other binaries, but the binary in question is Ruby Enterprise Edition 1.8.7. The use case is to build a memory profiler without requiring patches to the VM, but just a Ruby gem.
It’s on GitHub
The memory profiler is NOT DONE, yet. It will be soon. Stay tuned.
The code described here is incorporated into a Ruby Gem which can be found on github: http://github.com/ice799/memprof specifically at: http://github.com/ice799/memprof/blob/master/ext/memprof.c#L202-318
Overview of the plan of attack
The plan of attack is relatively straightforward:
- Find the inlined code.
- Overwrite part of it to redirect to a stub.
- Call out to a handler from the stub.
- Make sure the return path is sane.
As simple as this seems, implementing these steps is actually a bit tricky.
Finding pieces of inlined code
Before finding pieces of inlined code, let’s first examine the C code we want to hook. I’m going to be showing how to hook the inline function add_freelist.
The code for add_freelist is short:
static inline void
add_freelist(p)
    RVALUE *p;
{
    if (p->as.free.flags != 0)
        p->as.free.flags = 0;
    if (p->as.free.next != freelist)
        p->as.free.next = freelist;
    freelist = p;
}
There is one really important feature of this code which stands out almost immediately. freelist has (at least) compilation unit scope. This is awesome because freelist serves as a marker when searching for assembly instructions to overwrite. Since the freelist has compilation unit scope, it’ll live at some static memory location.
If we find writes to this static memory location, we find our inline function code.
Let’s take a look at the instructions generated from this C code (unrelated instructions snipped out):
437f21: 48 c7 00 00 00 00 00    movq $0x0,(%rax)
. . . . .
437f2c: 48 8b 05 65 de 2d 00    mov 0x2dde65(%rip),%rax # 715d98 [freelist]
. . . . .
437f48: 48 89 05 49 de 2d 00    mov %rax,0x2dde49(%rip) # 715d98 [freelist]
The last instruction above updates freelist; it is the instruction generated for the C statement freelist = p;.
As you can see from the instruction, the destination is freelist. This makes it insanely easy to locate instances of this inline function: write a piece of C code which scans the binary image in memory for mov instructions whose destination is freelist, and each hit is an inlined instance of add_freelist.
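To make the scan concrete, here is a minimal sketch of such a search over a buffer of code bytes. The helper name find_store_to and its parameters are hypothetical (they are not part of memprof), and the sketch assumes the store is always encoded as a REX.W-prefixed 0x89 mov with RIP-relative addressing, as in the disassembly above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not from memprof): scan `len` bytes of code that is
 * mapped at address `base` and return the offset of the first
 * `mov %reg, disp32(%rip)` whose effective address equals `target`.
 * Returns -1 when no such instruction is found.
 */
static long find_store_to(const uint8_t *code, size_t len,
                          uintptr_t base, uintptr_t target)
{
    assert(code != NULL);

    for (size_t i = 0; i + 7 <= len; i++) {
        const uint8_t *p = code + i;
        int32_t disp;

        /* REX.W (0x48) or REX.W+R (0x4c), then opcode 0x89 (mov reg -> r/m) */
        if ((p[0] != 0x48 && p[0] != 0x4c) || p[1] != 0x89)
            continue;
        /* ModRM with mod=00 and rm=101 selects RIP-relative addressing */
        if ((p[2] & 0xc7) != 0x05)
            continue;

        /* The displacement is a signed little-endian 32-bit value */
        memcpy(&disp, p + 3, sizeof(disp));

        /* RIP points at the next instruction, 7 bytes past the start */
        uintptr_t next = base + i + 7;
        if (next + (uintptr_t)(intptr_t)disp == target)
            return (long)i;
    }
    return -1;
}
```

Plugging in the bytes from the listing above, the store at 0x437f48 gives 0x437f4f + 0x2dde49 = 0x715d98, which is freelist. Note that a byte-by-byte scan like this can in principle match in the middle of an unrelated instruction, so the address comparison is what keeps false positives unlikely.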
Why not insert a trampoline by overwriting that last mov instruction?
Overwriting with a jmp
The mov instruction above is 7 bytes wide. As long as the instruction we’re going to implant is 7 bytes or shorter, everything is good to go. Using a callq is out of the question because we can’t ensure the stack is 16-byte aligned as per the x86_64 ABI [2]. As it turns out, a jmp instruction that uses a 32-bit displacement from the instruction pointer only requires 5 bytes. We’ll be able to implant the instruction that’s needed, and even have room to spare.
I created a struct to encapsulate this short 7 byte trampoline. 5 bytes for the jmp, 2 bytes for NOPs. Let’s take a look:
struct tramp_inline tramp = {
  .jmp          = {'\xe9'},
  .displacement = 0,
  .pad          = {'\x90', '\x90'},
};
Let’s fill in the displacement later, after actually finding the instruction that’s going to get overwritten.
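The struct definition itself is not shown here, so the following is a sketch of what it could look like, inferred from the field names in the initializer. The packed attribute (GCC/Clang), the signed displacement type, and the make_tramp helper are my assumptions, not necessarily what memprof does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Assumed layout for the stage 1 trampoline. Packing keeps the compiler
 * from inserting padding before `displacement`, which would break the
 * instruction encoding.
 */
struct tramp_inline {
    unsigned char jmp[1];        /* 0xe9: jmp rel32 */
    int32_t       displacement;  /* signed, relative to the next instruction */
    unsigned char pad[2];        /* two 1-byte NOPs (0x90) */
} __attribute__((__packed__));

/* The trampoline must be exactly as wide as the mov it replaces. */
static_assert(sizeof(struct tramp_inline) == 7, "trampoline must be 7 bytes");

/*
 * Hypothetical helper: build a trampoline that will live at `site` and
 * jump to `dest`. Returns 0 on success, -1 if `dest` is too far away to
 * be reached with a 32-bit displacement.
 */
static int make_tramp(struct tramp_inline *t, uintptr_t site, uintptr_t dest)
{
    /* The jmp is 5 bytes long, so RIP is site + 5 when it executes. */
    int64_t delta = (int64_t)(dest - (site + 5));

    assert(t != NULL);
    if (delta < INT32_MIN || delta > INT32_MAX)
        return -1;

    t->jmp[0] = 0xe9;
    t->displacement = (int32_t)delta;
    t->pad[0] = 0x90;
    t->pad[1] = 0x90;
    return 0;
}
```

The range check matters because a rel32 jmp can only reach targets within roughly 2GB of the patched instruction, so the stage 2 stub has to be allocated close to the code being patched.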
So, to find the instruction that’ll be overwritten, just look for a mov opcode and check that the destination is freelist:
/* make sure it is a mov instruction */
if (byte[1] == '\x89') {
  /* Read the REX byte to make sure it is a mov that we care about */
  if ( (byte[0] == '\x48') ||
       (byte[0] == '\x4c') ) {
    /* Grab the target of the mov. REMEMBER: in this case the target is
     * a 32-bit displacement that gets added to RIP (where RIP is the address
     * of the next instruction).
     */
    mov_target = *(uint32_t *)(byte + 3);
    /* Sanity check. Ensure that the displacement from freelist to the next
     * instruction matches the mov_target. If so, we know this mov is
     * updating freelist.
     */
    if ( (freelist - (void *)(byte+7) ) == mov_target) {
At this point we’ve definitely found a mov instruction with freelist as the destination. Let’s calculate the displacement to the stage 2 trampoline for our jmp instruction and write the instruction into memory.
/* Setup the stage 1 trampoline. Calculate the displacement to
 * the stage 2 trampoline from the next instruction.
 *
 * REMEMBER!!!! The next instruction will be NOP after our stage 1
 * trampoline is written. This is 5 bytes into the structure, even
 * though the original instruction we overwrote was 7 bytes.
 */
tramp.displacement = (uint32_t)(destination - (void *)(byte+5));
/* Figure out what page the stage 1 tramp is gonna be written to, mark
 * it WRITE, write the trampoline in, and then remove WRITE permission.
 */
aligned_addr = page_align(byte);
mprotect(aligned_addr, (void *)byte - aligned_addr + 10,
         PROT_READ|PROT_WRITE|PROT_EXEC);
memcpy(byte, &tramp, sizeof(struct tramp_inline));
mprotect(aligned_addr, (void *)byte - aligned_addr + 10,
         PROT_READ|PROT_EXEC);
Cool, all that’s left is to build the stage 2 trampoline which will set everything up for the C level handler.
An assembly stub to set the stage for our C handler
So, what does the assembly need to do to call the C handler? Quite a bit actually so let’s map it out, step by step:
- Replicate the instruction which was overwritten so that the object is actually added to the freelist.
- Save the value of the rdi register. This register is where the first argument to a function lives and will store the obj that was added to the freelist for the C handler to do analysis on.
- Load the object being added to the freelist into rdi.
- Save the value of rbx so that we can use the register as an operand for an absolute indirect callq instruction.
- Save rbp and rsp to allow a way to undo the stack alignment later.
- Align the stack to a 16-byte boundary to comply with the x86_64 ABI.
- Move the address of the handler into rbx.
- Call the handler through rbx.
- Restore rbp, rsp, rdi, rbx.
- Jump back to the instruction after the instruction which was overwritten.
To accomplish this let’s build out a structure with as much set up as possible and fill in the displacement fields later. This “base” struct looks like this:
struct inline_tramp_tbl_entry inline_ent = {
  .rex     = {'\x48'},
  .mov     = {'\x89'},
  .src_reg = {'\x05'},
  .mov_displacement = 0,
  .frame = {
    .push_rdi = {'\x57'},
    .mov_rdi = {'\x48', '\x8b', '\x3d'},
    .rdi_source_displacement = 0,
    .push_rbx = {'\x53'},
    .push_rbp = {'\x55'},
    .save_rsp = {'\x48', '\x89', '\xe5'},
    .align_rsp = {'\x48', '\x83', '\xe4', '\xf0'},
    .mov = {'\x48', '\xbb'},
    .addr = error_tramp,
    .callq = {'\xff', '\xd3'},
    .leave = {'\xc9'},
    .rbx_restore = {'\x5b'},
    .rdi_restore = {'\x5f'},
  },
  .jmp  = {'\xe9'},
  .jmp_displacement = 0,
};
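To connect those opcode bytes back to the steps listed earlier, here is a sketch of a struct layout that would match the initializer, with each field annotated with the instruction it encodes. The layout is inferred from the field names and is an assumption (as is the packed attribute and the name stub_template); the handler address is left as NULL as a placeholder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed packed layout of one stage 2 stub (44 bytes in total). */
struct inline_tramp_tbl_entry {
    unsigned char rex[1];            /* REX prefix copied from the original mov */
    unsigned char mov[1];            /* 0x89: mov reg -> r/m                    */
    unsigned char src_reg[1];        /* ModRM: source register, RIP-relative    */
    int32_t mov_displacement;        /* ... freelist(%rip)                      */
    struct {
        unsigned char push_rdi[1];   /* push %rdi                  */
        unsigned char mov_rdi[3];    /* mov disp32(%rip), %rdi     */
        int32_t rdi_source_displacement;
        unsigned char push_rbx[1];   /* push %rbx                  */
        unsigned char push_rbp[1];   /* push %rbp                  */
        unsigned char save_rsp[3];   /* mov %rsp, %rbp             */
        unsigned char align_rsp[4];  /* and $-16, %rsp             */
        unsigned char mov[2];        /* movabs $imm64, %rbx        */
        void *addr;                  /* imm64: handler address     */
        unsigned char callq[2];      /* callq *%rbx                */
        unsigned char leave[1];      /* leave (restores rsp, rbp)  */
        unsigned char rbx_restore[1];/* pop %rbx                   */
        unsigned char rdi_restore[1];/* pop %rdi                   */
    } __attribute__((__packed__)) frame;
    unsigned char jmp[1];            /* 0xe9: jmp rel32            */
    int32_t jmp_displacement;
} __attribute__((__packed__));

/* Any padding would corrupt the instruction stream, so check the layout. */
static_assert(sizeof(struct inline_tramp_tbl_entry) == 44, "unexpected padding");
static_assert(offsetof(struct inline_tramp_tbl_entry, frame) == 7, "mov is 7 bytes");
static_assert(offsetof(struct inline_tramp_tbl_entry, frame.addr) == 26, "imm64 offset");

/* Template with the displacements and the handler address still unset. */
static const struct inline_tramp_tbl_entry stub_template = {
    .rex = {0x48}, .mov = {0x89}, .src_reg = {0x05},
    .frame = {
        .push_rdi = {0x57}, .mov_rdi = {0x48, 0x8b, 0x3d},
        .push_rbx = {0x53}, .push_rbp = {0x55},
        .save_rsp = {0x48, 0x89, 0xe5},
        .align_rsp = {0x48, 0x83, 0xe4, 0xf0},
        .mov = {0x48, 0xbb}, .addr = NULL,
        .callq = {0xff, 0xd3}, .leave = {0xc9},
        .rbx_restore = {0x5b}, .rdi_restore = {0x5f},
    },
    .jmp = {0xe9},
};
```

The leave instruction is what undoes the stack alignment: it copies rbp (the saved, unaligned rsp) back into rsp and then pops the saved rbp, so the two pops that follow find rbx and rdi where they were pushed.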
So, what’s left to do:
- Copy the REX and source register bytes of the instruction which was overwritten to replicate it.
- Calculate the displacement to freelist to fully generate the overwritten mov.
- Calculate the displacement to freelist so that it can be stored in rdi as an argument to the C handler.
- Fill in the absolute address for the handler.
- Calculate the displacement to the instruction after the stage 1 trampoline in order to jmp back to resume execution as normal.
Doing that is relatively straightforward. Let’s take a look at the C snippets that make this happen:
/* Before the stage 1 trampoline gets written, we need to generate
 * the code for the stage 2 trampoline. Let's copy over the REX byte
 * and the byte which mentions the source register into the stage 2
 * trampoline.
 */
inl_tramp_st2 = inline_tramp_table + entry;
inl_tramp_st2->rex[0] = byte[0];
inl_tramp_st2->src_reg[0] = byte[2];
. . . . .
/* Finish setting up the stage 2 trampoline. */
/* calculate the displacement to freelist from the next instruction.
 *
 * This is used to replicate the original instruction we overwrote.
 */
inl_tramp_st2->mov_displacement = freelist - (void *)&(inl_tramp_st2->frame);
/* fill in the displacement to freelist from the next instruction.
 *
 * This is to arrange for the new value in freelist to be in %rdi, and as such
 * be the first argument to the C handler. As per the amd64 ABI.
 */
inl_tramp_st2->frame.rdi_source_displacement = freelist -
                                 (void *)&(inl_tramp_st2->frame.push_rbx);
/* jmp back to the instruction after stage 1 trampoline was inserted
 *
 * This can be 5 or 7, it doesn't matter. If it's 5, we'll hit our 2
 * NOPs. If it's 7, we'll land directly on the next instruction.
 */
inl_tramp_st2->jmp_displacement = (uint32_t)((void *)(byte + 7) -
                                 (void *)(inline_tramp_table + entry + 1));
/* write the address of our C level trampoline in to the structure */
inl_tramp_st2->frame.addr = freelist_tramp;
Awesome.
We’ve successfully patched the binary in memory, inserted an assembly stub which was generated at runtime, called a hook function, and ensured that execution can resume normally.
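As a worked example of the displacement arithmetic, the sketch below computes the three stage 2 displacements from plain addresses. Every displacement follows the same rule: target minus the address of the instruction that comes after the one using it. The offsets 7, 15 and 44 assume a packed 44-byte stub laid out as the initializer suggests, and the helper names and the stub address used in the example are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* End offsets of the three RIP-relative instructions inside the stub,
 * assuming a packed 44-byte layout. */
enum {
    END_OF_MOV     = 7,   /* replicated mov %reg, freelist(%rip)      */
    END_OF_MOV_RDI = 15,  /* after push %rdi; mov freelist(%rip),%rdi */
    END_OF_STUB    = 44   /* after the final jmp rel32                */
};

struct stub_disps {
    int32_t mov;      /* goes into mov_displacement              */
    int32_t mov_rdi;  /* goes into frame.rdi_source_displacement */
    int32_t jmp;      /* goes into jmp_displacement              */
};

/* displacement = target - address of the following instruction */
static int32_t rel32(uintptr_t target, uintptr_t next_insn)
{
    int64_t d = (int64_t)(target - next_insn);
    assert(d >= INT32_MIN && d <= INT32_MAX);  /* must fit in 32 bits */
    return (int32_t)d;
}

/*
 * stub:     address of the stage 2 stub
 * freelist: address of the freelist variable
 * patched:  address of the 7-byte mov that was overwritten
 */
static struct stub_disps stub_displacements(uintptr_t stub,
                                            uintptr_t freelist,
                                            uintptr_t patched)
{
    struct stub_disps d;
    d.mov     = rel32(freelist, stub + END_OF_MOV);
    d.mov_rdi = rel32(freelist, stub + END_OF_MOV_RDI);
    d.jmp     = rel32(patched + 7, stub + END_OF_STUB);
    return d;
}
```

With a made-up stub address of 0x500000, freelist at 0x715d98 and the patched mov at 0x437f48 (the latter two taken from the disassembly earlier), this yields 0x215d91 for the replicated mov, 0x215d89 for the load into rdi, and -0xc80dd for the jmp back, which is negative because the original code sits below the stub in memory.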
So, what’s the status on that memory profiler?
Almost done, stay tuned for more updates coming SOON.
Conclusion
- Hackery like this is unmaintainable, unstable, stupid, but also fun to work on and think about.
- Being able to hook add_freelist like this provides the last tool needed to implement a version of bleak_house (a Ruby memory profiler) without patching the Ruby VM.
- x86_64 instruction set is a painful instruction set.
- Use the GNU assembler (gas) instead of trying to generate opcodes by reading the Intel instruction set PDFs if you value your sanity.

