Archive for the ‘debugging’ Category
The Broken Promises of MRI/REE/YARV

If you enjoy this article, subscribe (via RSS or e-mail) and follow me on twitter.
tl;dr
This post is going to explain a serious design flaw of the object system used in MRI/REE/YARV. This flaw causes seemingly random segfaults and other hard to track corruption. One popular incarnation of this bug is the “rake aborted! not in gzip format.”
theme song
This blog post was inspired by one of my favorite Papoose verses. If you don’t listen to this while reading, you probably won’t understand what I’m talking about: get in the zone.
rake aborted! not in gzip format
[BUG] Segmentation fault
If you’ve seen either of these error messages you are hitting a fundamental flaw of the object model in MRI/YARV. An example of a fix for a single instance of this bug can be seen in this patch. Let’s examine this specific patch so that we can gain some understanding of the general case.
FACT: What you are about to read is absolutely not a compiler bug.
A small, but important piece of background information
The amd64 ABI [1] states that some registers are caller saved, while others are callee saved. In particular, the register rax is caller saved. The callee overwrites rax with its return value, so if the caller cares about what is stored in this register, it must copy the value somewhere else before making a function call.
stare into the abyss part 1
Let’s look at the C code for gzfile_read_raw_ensure WITHOUT the fix from above:
#define zstream_append_input2(z,v)\
    zstream_append_input((z), (Bytef*)RSTRING_PTR(v), RSTRING_LEN(v))
static int
gzfile_read_raw_ensure(struct gzfile *gz, int size)
{
    VALUE str;
    while (NIL_P(gz->z.input) || RSTRING_LEN(gz->z.input) < size) {
        str = gzfile_read_raw(gz);
        if (NIL_P(str)) return Qfalse;
        zstream_append_input2(&gz->z, str);
    }
    return Qtrue;
}
It looks relatively sane at first glance, but to understand this bug we’ll need to examine the assembly generated for this thing. I’m going to rearrange the assembly a bit to make it easier to follow and add a few comments along the way.
First, the code begins by setting the stage:
push %rbp
movslq %esi,%rbp # sign extend "size" into rbp
push %rbx
mov %rdi,%rbx # rbx = gz
sub $0x8,%rsp # make room on the stack for "str"
The above is pretty basic. It is your typical amd64 prologue. After things are all set up, it is time to enter the while loop in the C code above:
jmp 1180 # JUMP IN to the loop
Next comes the NIL_P(gz->z.input) portion of the while-loop condition:
mov 0x18(%rbx),%rax # rax = gz->z.input
cmp $0x4,%rax # in Ruby, nil is represented as 4.
je 1190 [gzfile_read_raw_ensure+0x30] # if gz->z.input is nil, enter the loop
Now the RSTRING_LEN(gz->z.input) < size portion:
cmp %rbp,0x10(%rax) # compare size and gz->z.input->len
jge 11b0 [gzfile_read_raw_ensure+0x50] # jump out of loop
# if gz->z.input->len is >= size
Next comes the call to gzfile_read_raw and the NIL_P(str) check. If str is nil, the code just falls through and exits the loop:
mov %rbx,%rdi # rdi = gz, rdi holds the first argument to a function.
callq 1090 [gzfile_read_raw] # call gzfile_read_raw
cmp $0x4,%rax # compare return value (%rax) to nil
jne 1170 [gzfile_read_raw_ensure+0x10] # if it is NOT nil jump to the good stuff
The return value of gzfile_read_raw (an address of a ruby object) is stored in rax.
And finally, the good stuff. The call to zstream_append_input:
mov 0x10(%rax),%rdx # RSTRING_LEN(v) as 3rd arg
mov 0x18(%rax),%rsi # RSTRING_PTR(v) as 2nd arg
mov %rbx,%rdi # set gz->z as the 1st arg
callq 10e0 [zstream_append_input] # let it rip
Note that the arguments to zstream_append_input are moved into registers by offsetting from rax. When the call to zstream_append_input occurs, the ruby object returned from gzfile_read_raw is still stored only in rax; it is not written to its slot on the stack because, as far as the compiler can tell, the extra write is unnecessary.
stare into the abyss part 2
Alright, so the patch changes the zstream_append_input2 macro to this:
#define zstream_append_input2(z,v)\
RB_GC_GUARD(v),\
zstream_append_input((z), (Bytef*)RSTRING_PTR(v), RSTRING_LEN(v))
And, RB_GC_GUARD is defined as:
#define RB_GC_GUARD_PTR(ptr) \
__extension__ ({volatile VALUE *rb_gc_guarded_ptr = (ptr); rb_gc_guarded_ptr;})
#define RB_GC_GUARD(v) (*RB_GC_GUARD_PTR(&(v)))
That code is just a hack to mark the memory location holding v with the volatile type qualifier. This tells the compiler that the memory backing v acts in ways that the compiler is too stupid to understand, so the compiler must ensure that reads and writes to this location are not optimized out.
A common usage of this qualifier is for memory mapped registers. Reads from memory mapped registers should not be optimized away since a hardware device may update the value stored at that location. The compiler wouldn't know when these updates could happen so it must make sure to re-read the value from this memory location when it is needed. Similarly, writes to memory mapped registers may modify the state of a hardware device and should not be optimized away.
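To make the volatile trick concrete, here is a minimal sketch of the same idea in plain C, without the GCC statement-expression extension. The names value_t, GC_GUARD_SKETCH, and guarded_store_and_load are made up for illustration and are not part of Ruby's API. The sketch assumes a compiler that, like GCC, treats an access through a volatile-qualified lvalue as a real memory access.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for Ruby's VALUE: an integer wide enough for a pointer. */
typedef uintptr_t value_t;

/* Hypothetical guard macro: the variable is accessed through a volatile lvalue,
 * so the compiler is expected to keep it in memory instead of only in a register. */
#define GC_GUARD_SKETCH(v) (*(volatile value_t *)&(v))

/* Stores `next` into a local slot and reads it back, both through the guard. */
value_t guarded_store_and_load(value_t initial, value_t next)
{
    value_t slot = initial;
    assert(sizeof(value_t) == sizeof(void *));
    GC_GUARD_SKETCH(slot) = next;   /* store is expected to reach the stack slot */
    return GC_GUARD_SKETCH(slot);   /* load is expected to come from the stack slot */
}
```

Compiling this with optimizations on and inspecting the output with objdump should show a store to the stack followed by a load, much like the pair of mov instructions the patch produces.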
Most of the code generated with the patch applied is the same as without except for a few slight differences before zstream_append_input is called. Let's take a look:
mov %rax,-0x18(%rbp) # write str to the stack
mov -0x18(%rbp),%rax # read the value in str back to rax
mov 0x10(%rcx),%rdx # RSTRING_LEN(v)
mov 0x18(%rcx),%rsi # RSTRING_PTR(v)
mov %rbx,%rdi # z
callq 1f60 [_zstream_append_input]
The key difference is that the return value of gzfile_read_raw is written back to its memory location (which, in this case, happens to be on the stack and is called str).
the bug
The bug is triggered because:
- The address of the ruby object str is stored in a caller saved register, rax.
- The callee (zstream_append_input) does not save the value of rax (it is not required to) and rax is overwritten in the function, leaving no references to the ruby object returned by gzfile_read_raw.
- The callee (zstream_append_input) eventually calls rb_newobj. rb_newobj may trigger a GC run, if there are no available objects on the freelist.
- The GC run finds the object returned by gzfile_read_raw but sees no references to it and frees the memory associated with it.
- The freed object is used as if it were valid, and memory corruption occurs causing the VM to explode.
The patch prevents this bug from happening because:
- The address of the ruby object str is stored in a caller saved register, rax.
- The volatile type qualifier causes the compiler to generate code which writes the return value back into its memory location on the stack.
- The callee (zstream_append_input) eventually calls rb_newobj. rb_newobj may trigger a GC run, if there are no available objects on the freelist.
- The GC run finds the object returned by gzfile_read_raw and finds a reference to it and therefore does not free it.
- Everyone is happy.
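The two scenarios above can be modeled with a toy mark-and-sweep collector. This is only a sketch under loud assumptions: every name here (toy_obj, toy_gc, toy_survives_gc, and so on) is hypothetical, the "stack" is a plain array, and the "register" is a local variable the collector never looks at. It is not how MRI's GC is implemented; it only mirrors the reasoning in the two lists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define HEAP_SLOTS  4
#define STACK_SLOTS 4

/* Hypothetical heap object; not Ruby's RVALUE. */
struct toy_obj {
    bool in_use;
    bool marked;
};

static struct toy_obj heap[HEAP_SLOTS];
/* The only roots this toy collector scans: simulated stack slots. */
static struct toy_obj *stack_slots[STACK_SLOTS];

void toy_reset(void)
{
    for (size_t i = 0; i < HEAP_SLOTS; i++) {
        heap[i].in_use = false;
        heap[i].marked = false;
    }
    for (size_t i = 0; i < STACK_SLOTS; i++)
        stack_slots[i] = NULL;
}

struct toy_obj *toy_alloc(void)
{
    for (size_t i = 0; i < HEAP_SLOTS; i++) {
        if (!heap[i].in_use) {
            heap[i].in_use = true;
            return &heap[i];
        }
    }
    return NULL; /* heap exhausted */
}

/* Mark whatever the simulated stack points at, then sweep everything else. */
void toy_gc(void)
{
    for (size_t i = 0; i < STACK_SLOTS; i++)
        if (stack_slots[i] != NULL)
            stack_slots[i]->marked = true;

    for (size_t i = 0; i < HEAP_SLOTS; i++) {
        if (heap[i].in_use && !heap[i].marked)
            heap[i].in_use = false; /* "freed" */
        heap[i].marked = false;     /* clear marks for the next run */
    }
}

/* guard == false: the reference lives only in a "register" the GC cannot see.
 * guard == true:  the reference is also written to a stack slot, as the
 *                 volatile write in the patch does. */
bool toy_survives_gc(bool guard)
{
    toy_reset();
    struct toy_obj *reg = toy_alloc(); /* plays the role of rax */
    assert(reg != NULL);
    if (guard)
        stack_slots[0] = reg;
    toy_gc();
    return reg->in_use;
}
```

Calling toy_survives_gc(false) reports that the object was swept even though the caller still holds a pointer to it, which is the use-after-free from the first list. With toy_survives_gc(true) the object survives.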
The general case
Given valid C code, gcc will generate machine instructions that correctly do what you want. Of course, there are bugs in gcc just like any other piece of software. The problem in this case is not gcc. The problem is that the object and garbage collection implementations in REE/MRI/YARV are not valid C code, so it is not possible for gcc to generate machine instructions that do the right thing. In other words, Ruby's object and GC implementations are breaking their contract with gcc.
The end result is the need for shit like RB_GC_GUARD in REE/MRI/YARV and also in Ruby gems to selectively paper over valid gcc optimizations. Having an API that might cause the Ruby VM to fucking explode unless you proactively mark things with RB_GC_GUARD is not on the path of least resistance toward building a maintainable, safe, and performant system. Very few people out there know that the volatile type qualifier exists, let alone what it does. Essentially, this means that authors of Ruby gems must understand how GC works in the VM to prevent their gems from causing GC to break the universe.
That is fucking beyond stupid.
How to detect this bug class
This could be detected by building a simple static analysis tool. You won't catch 100% of cases, and you will definitely have false positives, but it is better than nothing. Something like this should work:
- Build a call digraph of the VM and/or the set of gems you care about.
- Find all paths leading to the rb_newobj sink.
- Find all paths which call rb_newobj, but do not save rax prior to making another function call which is also on a path to rb_newobj.
- The functions found are very likely to be causing corruption. A human will need to examine the found cases to weed out false positives and to fix the code.
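The first two steps can be sketched as a reachability query over a call graph. The graph below is a hand-written, simplified assumption (three functions from this post plus one unrelated function, with a direct edge standing in for "eventually calls"). A real tool would have to extract the edges from the binary or the source, and the check for whether rax is saved needs disassembly, which is not shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_FUNCS 8

/* Hypothetical function IDs for the example graph. */
enum {
    FN_GZFILE_READ_RAW_ENSURE,
    FN_ZSTREAM_APPEND_INPUT,
    FN_RB_NEWOBJ,
    FN_UNRELATED,
    FN_COUNT
};

/* calls[i][j] is true when function i calls function j. */
struct call_graph {
    size_t n;
    bool calls[MAX_FUNCS][MAX_FUNCS];
};

/* Depth-first search from `from`, looking for `sink`. */
static bool reaches(const struct call_graph *g, size_t from, size_t sink,
                    bool *visited)
{
    if (from == sink)
        return true;
    if (visited[from])
        return false;
    visited[from] = true;
    for (size_t j = 0; j < g->n; j++)
        if (g->calls[from][j] && reaches(g, j, sink, visited))
            return true;
    return false;
}

/* True when some chain of calls starting at `from` ends up in `sink`. */
bool can_reach(const struct call_graph *g, size_t from, size_t sink)
{
    bool visited[MAX_FUNCS] = { false };
    assert(from < g->n && sink < g->n);
    return reaches(g, from, sink, visited);
}

/* Simplified, assumed edges based on the functions discussed in this post. */
struct call_graph example_graph(void)
{
    struct call_graph g = { .n = FN_COUNT };
    g.calls[FN_GZFILE_READ_RAW_ENSURE][FN_ZSTREAM_APPEND_INPUT] = true;
    g.calls[FN_ZSTREAM_APPEND_INPUT][FN_RB_NEWOBJ] = true;
    return g;
}
```

Every function for which can_reach(..., FN_RB_NEWOBJ) is true is a candidate whose call sites need the register check from step three.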
If you have found yourself wondering "who the fuck would write such a test?", it is important for you to note that rtld in Linux does not save the SSE registers (which are supposed to be caller saved) prior to entering the fixup function. To ensure that such an optimization does not cause the fucking universe to come crashing down, a test ships with the code to run objdump after building the binary. The objdump output is then grepped for any instructions which might modify the SSE registers. As long as no one touches the SSE registers, there is no need to save and restore them.
If Ruby's object and GC subsystems want to prevent the universe from exploding, they must supply an equivalent test to ensure that corruption is impossible.
Conclusion
- MRI/YARV/REE are inherently fatally flawed.
- I'm never writing another Ruby-related blog post.
- I'm not a Ruby programmer.
No comments
I'm taking a page from the book of coda and disabling comments. If you got something to say, write a blog post.
References
Slides from Defcon 18: Function hooking for OSX and Linux
Video from Def Con 18
Defcon 18: Function hooking for OSX and Linux from Daniel Hückmann on Vimeo.
Slides
GCC optimization flag makes your 64bit binary fatter and slower

The intention of this post is to highlight a subtle GCC optimization bug that leads to slower and larger code being generated than would have been generated without the optimization flag.
UPDATED: Graphs are now 0 based on the y axis. The tidbits section (below the conclusion) now has links to my ugly test harness and to a terminal session showing the build of the test case from the bug report, the objdump output, and the corresponding system information.
Hold the #gccfail tweets, son.
Everyone fucks up. The point of this post is not to rag on GCC. If writing a C compiler was easy then every asshole with a keyboard would write one for fun.
WARNING: THERE IS MATH, SCIENCE, AND GRAPHS BELOW.
Watch yourself.
The original bug report for -fomit-frame-pointer.
I stumbled across a bug report for GCC that was very interesting. It points out a very subtle bug that occurs when the -fomit-frame-pointer flag is passed to GCC. The bug report is for 32bit code, however after some testing I found that this bug also rears its head in 64bit code.
What is -fomit-frame-pointer supposed to do?
The -fomit-frame-pointer flag is intended to direct GCC to avoid saving and restoring the frame pointer (%ebp or %rbp). This is supposed to make function calls faster, since the function is doing less work each invocation. It should also make function code take fewer bytes since there are fewer instructions being executed.
A caveat of using -fomit-frame-pointer is that it may make debugging impossible on certain systems. To combat this on Linux, .debug_frame and .eh_frame sections are added to ELF binaries to assist in the stack unwinding process when the frame pointer is omitted.
What is the bug?
The bug is that when -fomit-frame-pointer is used, GCC erroneously uses the frame pointer register as a general purpose register when a different register could be used instead.
wat.
The amd64 and i386 ABIs [1][2] specify a list of caller and callee saved registers.
- The frame pointer register is callee saved. That means that if a function is going to use the frame pointer register, it must save and restore the value in the register.
- The test case provided in the bug report shows that other caller saved registers were available for use.
- Had the function used a caller saved register instead, there would be no need for the additional save and restore instructions in the function.
- Removing those instructions would take fewer bytes and execute faster.
What are the consequences?
Let’s take a look at two potential pieces of code.
The first piece is the code that would be generated if -fomit-frame-pointer is not used:
test1:
pushq %rbp ; save frame pointer
movq %rsp,%rbp ; update frame pointer to the current stack pointer
; here is where your function would do work
leave ; restore the stack pointer and frame pointer
ret ; return
Size: 6 bytes.
The above assembly sequence uses the frame pointer.
Let’s take a look at the code that is generated by GCC when -fomit-frame-pointer is used:
sub $0x8, %rsp ; make room on the stack
movq %rbp, (%rsp) ; store rbp on the stack
; here is where your function would modify and use %rbp as needed
movq (%rsp), %rbp ; restore %rbp
add $0x8, %rsp ; get rid of the extra stack space
ret ; return
Size: 17 bytes.
The above assembly sequence is what is generated when GCC decides to use the frame pointer register as a general purpose register. Since it is callee saved, it must be saved before being modified and restored after being modified.
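The two sizes can be checked by writing out the machine code bytes. The encodings below are my own assumption of what an assembler emits for these exact instructions on amd64 (they are not copied from the bug report or from objdump output); the array and function names are hypothetical.

```c
#include <assert.h>

/* test1: prologue and epilogue that use the frame pointer. */
static const unsigned char test1_code[] = {
    0x55,                   /* push %rbp        */
    0x48, 0x89, 0xe5,       /* mov  %rsp,%rbp   */
    0xc9,                   /* leave            */
    0xc3                    /* ret              */
};

/* test2: %rbp used as a general purpose register, saved and restored by hand. */
static const unsigned char test2_code[] = {
    0x48, 0x83, 0xec, 0x08, /* sub  $0x8,%rsp   */
    0x48, 0x89, 0x2c, 0x24, /* mov  %rbp,(%rsp) */
    0x48, 0x8b, 0x2c, 0x24, /* mov  (%rsp),%rbp */
    0x48, 0x83, 0xc4, 0x08, /* add  $0x8,%rsp   */
    0xc3                    /* ret              */
};

/* Extra bytes paid per function when %rbp is saved and restored by hand. */
unsigned long extra_bytes(void)
{
    assert(sizeof(test2_code) > sizeof(test1_code));
    return (unsigned long)(sizeof(test2_code) - sizeof(test1_code));
}
```

Under these encodings the sequences come out at 6 and 17 bytes, so the second form costs 11 extra bytes in every function that uses it.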
So -fomit-frame-pointer makes your binary fatter, but does it make it slower?
Only one way to find out: do science.
I built a simple (and very ugly) testing harness to test the above pieces of code to determine which piece of code is faster. Before we get into the benchmark results, I want to tell you why my benchmark is bullshit.
Yes, bullshit.
You see, it makes me sad when people post benchmarks and neglect to tell others why their benchmark may be inaccurate. So, lemme start the trend.
This benchmark is useless because:
- Reading the CPU cycle counter is unreliable (more on this below the conclusion), so I also tracked wall clock time.
- I don’t have the ideal test environment. I ran this on bare metal hardware, and set the CPU affinity to keep the process pinned to a single CPU… BUT
- I could have done better if I had pinned init to CPU0 (thereby forcing all children of init to be pinned to CPU0 – remember child processes inherit the affinity mask). I would have then had an entire CPU for nothing but my benchmark.
- I could have done better if I forced the CPU running my benchmark program to not handle any IRQs.
- I only tested one version of GCC: (Debian 4.3.2-1.1) 4.3.2
- I could have taken more samples.
You can find more testing harness tidbits below the conclusion.
Benchmark Results
test 1 — Code sequence simulating using the frame pointer.
test 2 — Code sequence simulating using the frame pointer as a general purpose register.
64bit results
Using -fomit-frame-pointer is SLOWER (contrary to what you’d expect) than not using it!
| | cycles test 1 | cycles test 2 | microsecs test 1 | microsecs test 2 |
| mean | 3514422987.92 | 4559685515.66 | 1882707.27 | 2442663.94 |
| median | 3507007423.5 | 4562511684.5 | 1878721.5 | 2444171.5 |
| max | 3922780211 | 4672066854 | 2101457 | 2502869 |
| min | 3502194976 | 4327782795 | 1876113 | 2318452 |
| std dev | 31927179.5632 | 15449507.8196 | 17103.7755 | 8275.49788 |
| variance | 1.02E+15 | 238687291867021 | 292539135.936 | 68483865.11835 |
32bit results
Using -fomit-frame-pointer is FASTER (as it should be) than not using it! The binary is still fatter, though.
| | cycles test 1 | cycles test 2 | microsecs test 1 | microsecs test 2 |
| mean | 3502932799.49 | 3491263364.89 | 1876553.08 | 1870301.35 |
| median | 3501486586.5 | 3492013955.5 | 1875778 | 1870702.5 |
| max | 3905163528 | 3731985243 | 2092032 | 1999259 |
| min | 3500916510 | 3408834436 | 1875472 | 1826144 |
| std dev | 10066939.1113 | 7992367.6913 | 5393.0412 | 4281.5466 |
| variance | 101343263071403 | 63877941312996.4 | 29084893.2588 | 18331640.9459 |
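To put the mean cycle counts in proportion, the relative difference between the two tests can be computed directly from the tables. The helper names below are hypothetical. The per-call figure assumes the 500,000,000 calls per run described in the testing harness tidbits.

```c
#include <assert.h>

/* Relative change of test 2 versus test 1, in percent. */
double percent_change(double test1, double test2)
{
    assert(test1 != 0.0);
    return (test2 - test1) / test1 * 100.0;
}

/* Average cycles spent per call, given the total cycles for one run. */
double cycles_per_call(double total_cycles, double calls)
{
    assert(calls > 0.0);
    return total_cycles / calls;
}
```

Using the mean cycle counts, test 2 takes roughly 29.7% more cycles than test 1 in the 64bit case and about 0.33% fewer in the 32bit case. On 64bit that works out to about 7.0 cycles per call for test 1 and about 9.1 for test 2.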
Conclusion
- GCC is a really complex piece of software; this bug is very subtle and may have existed for a while.
- I’ve said this a few times, but knowing and understanding your system’s ABI is crucial for catching bugs like these.
- Math and science are cool now, much like computers. You should use both.
Thanks for reading and don’t forget to subscribe (via RSS or e-mail) and follow me on twitter.
Testing harness tidbits
Each run of the benchmark executes either test1 or test2 (from above) 500,000,000 times. I did around 2500 runs for each test function.
You can get the testing harness, a build script, and a test script here: http://gist.github.com/483524
You can look at the terminal session where I build the test from the original bug report on my system: http://gist.github.com/483494
The code I used to read the CPU cycle counter looks like this:
static __inline__ unsigned long long rdtsc(void)
{
    unsigned long hi = 0, lo = 0;
    __asm__ __volatile__ ("lfence\n\trdtsc" : "=a"(lo), "=d"(hi));
    return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}
The lfence instruction acts as a load fence: it ensures that all load instructions issued before the lfence have completed before execution proceeds. I did this to make sure that the cycle counter was being read after all operations in the test functions were executed.
The values returned by this function are misleading because CPU frequency may be scaled at any time. This is why I also measured wall clock time.
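For the wall clock side, here is a minimal sketch of how the microsecond figures could be collected. It is an assumption, not the code from the harness: it uses POSIX clock_gettime with CLOCK_MONOTONIC, and the names elapsed_us and time_calls_us are made up for illustration.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Microseconds elapsed between two timestamps; `end` must not precede `start`. */
int64_t elapsed_us(struct timespec start, struct timespec end)
{
    int64_t ns = (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL
               + (int64_t)(end.tv_nsec - start.tv_nsec);
    assert(ns >= 0);
    return ns / 1000;
}

/* Wall clock time, in microseconds, taken by `iterations` calls of `fn`. */
int64_t time_calls_us(void (*fn)(void), unsigned long iterations)
{
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (unsigned long i = 0; i < iterations; i++)
        fn();
    clock_gettime(CLOCK_MONOTONIC, &end);
    return elapsed_us(start, end);
}
```

A monotonic clock is used here because it is not affected by someone adjusting the system time in the middle of a run.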
References
Garbage Collection and the Ruby Heap (from railsconf)
Dynamic Linking: ELF vs. Mach-O

The intention of this post is to highlight some of the similarities and differences between ELF and Mach-O dynamic linking that I encountered while building memprof.
I hope to write more posts about similarities and differences in other aspects of Mach-O and ELF that I stumbled across to shed some light on what goes on down there and provide (in some cases) the only documentation.
Procedure Linkage Table
The procedure linkage table (PLT) is used to determine the absolute address of a function at runtime. Both Mach-O and ELF objects have PLTs that are generated at compile time. The initial table simply invokes the dynamic linker which finds the symbol you want. The way this works is very similar at a high level in ELF and Mach-O, but there are some implementation differences that I thought were worth mentioning.
Mach-O PLT arrangement
Mach-O objects have several different sections across different segments that all work together to create a PLT entry for a specific symbol.
Consider the following assembly stub which calls out to the PLT entry for malloc:
# MACH-O calling a PLT entry (ELF is nearly identical)
0x000000010008c504 [str_new+52]: callq 0x10009ebbc [dyld_stub_malloc]
The dyld_stub prefix is added by GDB to let the user know that the callq instruction is calling a PLT entry and not malloc itself. The address 0x10009ebbc is the first instruction of malloc's PLT entry in this Mach-O object. In Mach-O terminology, the instruction at 0x10009ebbc is called a symbol stub. Symbol stubs in Mach-O objects are found in the __TEXT segment in the __symbol_stub1 section.
Let’s examine some instructions at the symbol stub address above:
# MACH-O "symbol stubs" for malloc and other functions
0x10009ebbc [dyld_stub_malloc]: jmpq *0x3ae46(%rip) # 0x1000d9a08
0x10009ebc2 [dyld_stub_realloc]: jmpq *0x3ae48(%rip) # 0x1000d9a10
0x10009ebc8 [dyld_stub_seekdir$INODE64]: jmpq *0x3ae4c(%rip) # 0x1000d9a20
. . . .
Each Mach-O symbol stub is just a single jmpq instruction. That jmpq instruction goes through an entry in a table and either:
- Invokes the dynamic linker to find the symbol and transfer execution there
OR
- Transfers execution directly to the function.
In the example above, GDB is telling us that the address of the table entry for malloc is 0x1000d9a08. This table entry is stored in a section called __la_symbol_ptr within the __DATA segment.
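The address GDB prints in the trailing comment can be recomputed by hand: a RIP-relative operand is resolved against the address of the next instruction. A small sketch, assuming the usual 6-byte encoding of jmpq *disp32(%rip) (which matches the 6-byte spacing of the stubs above); the function and macro names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed length in bytes of `jmpq *disp32(%rip)`: 2 opcode bytes + 4 displacement bytes. */
#define JMPQ_RIP_REL_LEN 6u

/* Address referenced by a RIP-relative operand: next instruction + displacement. */
uint64_t rip_relative_target(uint64_t insn_addr, uint32_t insn_len, int32_t disp)
{
    assert(insn_len > 0);
    return insn_addr + insn_len + (uint64_t)(int64_t)disp;
}
```

For the malloc stub, rip_relative_target(0x10009ebbc, JMPQ_RIP_REL_LEN, 0x3ae46) yields 0x1000d9a08, the table entry address shown by GDB.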
Before malloc has been resolved, the address in that table entry points to a helper function which (eventually) invokes the dynamic linker to find malloc and fill in its address in the table entry.
Let’s take a look at what a few entries of the helper functions look like:
# MACH-O stub helpers
0x1000a08d4 [stub helpers+6986]: pushq $0x3b73
0x1000a08d9 [stub helpers+6991]: jmpq 0x10009ed8a [stub helpers]
0x1000a08de [stub helpers+6996]: pushq $0x3b88
0x1000a08e3 [stub helpers+7001]: jmpq 0x10009ed8a [stub helpers]
0x1000a08e8 [stub helpers+7006]: pushq $0x3b9e
0x1000a08ed [stub helpers+7011]: jmpq 0x10009ed8a [stub helpers]
. . . .
Each symbol that has a PLT entry has 2 instructions above: a pair of pushq and jmpq. This instruction sequence sets an ID for the desired function and then invokes the dynamic linker. The dynamic linker looks up this ID so it knows which function it should be looking for.
ELF PLT arrangement
ELF objects have the same mechanism, but organize each PLT entry into chunks instead of splicing them out across different sections. Let’s take a look at a PLT entry for malloc in an ELF object:
# ELF complete PLT entry for malloc
0x40f3d0 [malloc@plt]: jmpq *0x2c91fa(%rip) # 0x6d85d0
0x40f3d6 [malloc@plt+6]: pushq $0x2f
0x40f3db [malloc@plt+11]: jmpq 0x40f0d0
. . . .
Much like a Mach-O object, an ELF object uses a table entry to direct the flow of execution to either invoke the dynamic linker or transfer directly to the desired function if it has already been resolved.
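The table-entry trick shared by both formats can be imitated in ordinary C with a function pointer that starts out pointing at a helper and gets patched on first use. This is only an analogy: the names below are hypothetical, and the real mechanism works on machine code addresses inside the loaded image, not on C function pointers.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*fn_t)(int);

/* Counts how many times the "dynamic linker" had to resolve the symbol. */
int resolver_calls = 0;

/* The function the caller actually wants (stands in for malloc). */
static int real_function(int x)
{
    return x + 1;
}

static int lazy_helper(int x);

/* One-entry stand-in for __la_symbol_ptr / .got.plt: initially points at the helper. */
static fn_t table_entry = lazy_helper;

/* Stand-in for the stub helper plus dynamic linker: resolve, patch, then call. */
static int lazy_helper(int x)
{
    resolver_calls++;
    table_entry = real_function; /* later calls skip the helper entirely */
    return table_entry(x);
}

/* Stand-in for the symbol stub: always an indirect jump through the table. */
int call_via_stub(int x)
{
    assert(table_entry != NULL);
    return table_entry(x);
}
```

The first call_via_stub goes through lazy_helper and bumps resolver_calls to 1. Every later call lands directly in real_function and the counter stays at 1.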
Two differences to point out here:
- ELF puts the entire PLT entry together in a nicely named section called .plt instead of splicing it out across multiple sections.
- The table entries indirected through with the initial jmpq instruction are stored in a section named .got.plt.
Both invoke an assembly trampoline…
Both Mach-O and ELF objects are set up to invoke the runtime dynamic linker. Both need an assembly trampoline to bridge the gap between the application and the linker. On 64bit Intel based systems, linkers in both systems must comply with the same Application Binary Interface (ABI).
Strangely enough, the two linkers have slightly different assembly trampolines even though they share the same calling convention [1][2].
Both trampolines ensure that the program stack is 16-byte aligned to comply with the amd64 ABI’s calling convention. Both trampolines also take care to save the “general purpose” caller-saved registers prior to invoking the dynamic linker, but it turns out that the trampoline in Linux does not save or restore the SSE registers. This “shouldn’t” matter, so long as glibc takes care not to use any of those registers in the dynamic linker. OSX takes a more conservative approach and saves and restores the SSE registers before and after calling out to the dynamic linker.
I’ve included a snippet from the two trampolines below and some comments so you can see the differences up close.
Different trampolines for the same ABI
The OSX trampoline:
dyld_stub_binder:
    pushq %rbp
    movq %rsp,%rbp
    subq $STACK_SIZE,%rsp # at this point stack is 16-byte aligned because two meta-parameters where pushed
    movq %rdi,RDI_SAVE(%rsp) # save registers that might be used as parameters
    movq %rsi,RSI_SAVE(%rsp)
    movq %rdx,RDX_SAVE(%rsp)
    movq %rcx,RCX_SAVE(%rsp)
    movq %r8,R8_SAVE(%rsp)
    movq %r9,R9_SAVE(%rsp)
    movq %rax,RAX_SAVE(%rsp)
    movdqa %xmm0,XMMM0_SAVE(%rsp)
    movdqa %xmm1,XMMM1_SAVE(%rsp)
    movdqa %xmm2,XMMM2_SAVE(%rsp)
    movdqa %xmm3,XMMM3_SAVE(%rsp)
    movdqa %xmm4,XMMM4_SAVE(%rsp)
    movdqa %xmm5,XMMM5_SAVE(%rsp)
    movdqa %xmm6,XMMM6_SAVE(%rsp)
    movdqa %xmm7,XMMM7_SAVE(%rsp)
    movq MH_PARAM_BP(%rbp),%rdi # call fastBindLazySymbol(loadercache, lazyinfo)
    movq LP_PARAM_BP(%rbp),%rsi
    call __Z21_dyld_fast_stub_entryPvl
The OSX trampoline saves all the caller saved registers as well as the %xmm0 - %xmm7 registers prior to invoking the dynamic linker with that last call instruction. These registers are all restored after the call instruction, but I left that out for the sake of brevity.
The Linux trampoline:
    subq $56,%rsp
    cfi_adjust_cfa_offset(72) # Incorporate PLT
    movq %rax,(%rsp) # Preserve registers otherwise clobbered.
    movq %rcx, 8(%rsp)
    movq %rdx, 16(%rsp)
    movq %rsi, 24(%rsp)
    movq %rdi, 32(%rsp)
    movq %r8, 40(%rsp)
    movq %r9, 48(%rsp)
    movq 64(%rsp), %rsi # Copy args pushed by PLT in register.
    movq %rsi, %r11 # Multiply by 24
    addq %r11, %rsi
    addq %r11, %rsi
    shlq $3, %rsi
    movq 56(%rsp), %rdi # %rdi: link_map, %rsi: reloc_offset
    call _dl_fixup # Call resolver.
The Linux trampoline doesn’t touch the SSE registers because it assumes that the dynamic linker will not modify them thus avoiding a save and restore.
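The "Multiply by 24" step in the Linux trampoline turns the index pushed by the PLT entry into a byte offset. A sketch of that arithmetic follows, with two assumptions labeled in the code: that 24 is the size of one Elf64_Rela relocation record (three 8-byte fields), and that PLT entries are 16 bytes each with the shared header occupying slot 0. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PLT_ENTRY_SIZE  16u /* assumed size of one x86-64 PLT entry */
#define RELA_ENTRY_SIZE 24u /* assumed sizeof(Elf64_Rela): three 8-byte fields */

/* Mirrors the movq/addq/addq/shlq sequence: index * 3, then * 8. */
uint64_t reloc_offset_from_index(uint64_t index)
{
    uint64_t rsi = index; /* movq 64(%rsp), %rsi */
    uint64_t r11 = rsi;   /* movq %rsi, %r11     */
    rsi += r11;           /* addq %r11, %rsi     */
    rsi += r11;           /* addq %r11, %rsi     */
    rsi <<= 3;            /* shlq $3, %rsi       */
    return rsi;
}

/* Index a PLT entry is expected to push, given the address of the shared header. */
uint64_t plt_index(uint64_t plt0_addr, uint64_t entry_addr)
{
    assert(entry_addr > plt0_addr);
    assert((entry_addr - plt0_addr) % PLT_ENTRY_SIZE == 0);
    return (entry_addr - plt0_addr) / PLT_ENTRY_SIZE - 1;
}
```

With the malloc entry from the ELF example (entry at 0x40f3d0, final jmpq to 0x40f0d0), plt_index gives 0x2f, which matches the pushq $0x2f in that entry, and reloc_offset_from_index(0x2f) gives 1128 bytes.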
Conclusion
- Tracing program execution from call site to the dynamic linker is pretty interesting and there is a lot to learn along the way.
- glibc not saving and restoring %xmm0-%xmm7 kind of scares me, but there is a unit test included that disassembles the built ld.so, searching it to make sure that those registers are never touched. It is still a bit frightening.
- Stay tuned for more posts explaining other interesting similarities and differences between Mach-O and ELF coming soon.

