GCC optimization flag makes your 64bit binary fatter and slower

If you enjoy this article, subscribe (via RSS or e-mail) and follow me on twitter.
The intention of this post is to highlight a subtle GCC optimization bug that causes GCC to generate slower and larger code with the optimization flag than without it.
UPDATED: Graphs are now 0 based on the y axis. The tidbits section (below the conclusion) now links to my ugly test harness and to a terminal session showing the build of the test case from the bug report, its objdump output, and the corresponding system information.
Hold the #gccfail tweets, son.
Everyone fucks up. The point of this post is not to rag on GCC. If writing a C compiler was easy then every asshole with a keyboard would write one for fun.
WARNING: THERE IS MATH, SCIENCE, AND GRAPHS BELOW.
Watch yourself.
The original bug report for -fomit-frame-pointer.
I stumbled across a bug report for GCC that was very interesting. It points out a very subtle bug that occurs when the -fomit-frame-pointer flag is passed to GCC. The bug report is for 32bit code, however after some testing I found that this bug also rears its head in 64bit code.
What is -fomit-frame-pointer supposed to do?
The -fomit-frame-pointer flag is intended to direct GCC to avoid saving and restoring the frame pointer (%ebp on i386, %rbp on amd64) in functions that don't need one. This is supposed to make function calls faster, since the function does less work on each invocation. It should also make function code smaller, since fewer instructions are emitted.
A caveat of using -fomit-frame-pointer is that it may make debugging impossible on certain systems. To combat this on Linux, .debug_frame and .eh_frame sections are added to ELF binaries to assist in the stack unwinding process when the frame pointer is omitted.
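If you want to see the effect of the flag on your own system, a minimal sketch is to compile a tiny function both ways and diff the assembly. The file name leaf.c and the function add_one are made up for illustration, and what you see will depend on your GCC version and optimization level.

```c
/*
 * leaf.c: a tiny function whose prologue and epilogue are easy to read.
 *
 * Compile it twice and compare the generated assembly:
 *
 *   gcc -S -fno-omit-frame-pointer leaf.c -o with_fp.s
 *   gcc -S -fomit-frame-pointer    leaf.c -o without_fp.s
 *   diff with_fp.s without_fp.s
 */
int add_one(int x)
{
    return x + 1;
}
```

Look for the push/mov of the frame pointer register at the top of the function in one output and check whether it is gone in the other.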
What is the bug?
The bug is that when -fomit-frame-pointer is used, GCC erroneously uses the frame pointer register as a general purpose register when a different register could be used instead.
wat.
The amd64 and i386 ABIs [1][2] specify a list of caller and callee saved registers.
- The frame pointer register is callee saved. That means that if a function is going to use the frame pointer register, it must save and restore the value in the register.
- The test case provided in the bug report shows that other caller saved registers were available for use.
- Had the function used a caller saved register instead, there would be no need for the additional save and restore instructions in the function.
- Removing those instructions would take fewer bytes and execute faster.
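To make the caller saved versus callee saved split concrete, here is a small sketch that encodes the callee saved general purpose registers of the System V amd64 ABI as a lookup table. The helper name is_callee_saved is hypothetical and only covers amd64; i386 has its own list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * General purpose registers a callee must preserve under the System V
 * amd64 ABI. (%rsp is preserved too, but it is the stack pointer itself.)
 * Everything else (rax, rcx, rdx, rsi, rdi, r8-r11) is caller saved.
 */
static const char *const callee_saved[] = {
    "rbx", "rbp", "r12", "r13", "r14", "r15",
};

/* Hypothetical helper: true if a function must save/restore reg to use it. */
static bool is_callee_saved(const char *reg)
{
    assert(reg != NULL);
    for (size_t i = 0; i < sizeof callee_saved / sizeof callee_saved[0]; i++) {
        if (strcmp(reg, callee_saved[i]) == 0)
            return true;
    }
    return false;
}
```

So is_callee_saved("rbp") is true while is_callee_saved("rax") is false, which is exactly why picking %rbp as scratch space costs extra save and restore instructions when a caller saved register is free.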
What are the consequences?
Let’s take a look at two potential pieces of code.
The first piece is the code that would be generated if -fomit-frame-pointer is not used:
test1:
pushq %rbp # save frame pointer
movq %rsp,%rbp # update frame pointer to the current stack pointer
# here is where your function would do work
leave # restore the stack pointer and frame pointer
ret # return
Size: 6 bytes.
The above assembly sequence uses the frame pointer.
Let’s take a look at the code that is generated by GCC when -fomit-frame-pointer is used:
test2:
sub $0x8, %rsp # make room on the stack
movq %rbp, (%rsp) # store %rbp on the stack
# here is where your function would modify and use %rbp as needed
movq (%rsp), %rbp # restore %rbp
add $0x8, %rsp # get rid of the extra stack space
ret # return
Size: 17 bytes.
The above assembly sequence is what is generated when GCC decides to use the frame pointer register as a general purpose register. Since it is callee saved, it must be saved before being modified and restored after being modified.
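If you want to double check the two sizes without running objdump, here is a sketch that writes out the amd64 machine code for both sequences as byte arrays and lets the compiler count them. The array names are mine; the encodings are the standard ones for these instructions.

```c
#include <assert.h>
#include <stddef.h>

/* Sequence that uses the frame pointer the normal way. */
static const unsigned char test1_code[] = {
    0x55,                   /* pushq %rbp       */
    0x48, 0x89, 0xe5,       /* movq  %rsp,%rbp  */
    0xc9,                   /* leave            */
    0xc3,                   /* ret              */
};

/* Sequence that saves/restores %rbp to use it as a scratch register. */
static const unsigned char test2_code[] = {
    0x48, 0x83, 0xec, 0x08, /* sub  $0x8,%rsp   */
    0x48, 0x89, 0x2c, 0x24, /* movq %rbp,(%rsp) */
    0x48, 0x8b, 0x2c, 0x24, /* movq (%rsp),%rbp */
    0x48, 0x83, 0xc4, 0x08, /* add  $0x8,%rsp   */
    0xc3,                   /* ret              */
};

/* Compile-time check of the sizes quoted in the text. */
static_assert(sizeof test1_code == 6, "frame pointer sequence is 6 bytes");
static_assert(sizeof test2_code == 17, "scratch %rbp sequence is 17 bytes");

static size_t test1_size(void) { return sizeof test1_code; }
static size_t test2_size(void) { return sizeof test2_code; }
```

Every instruction in the second sequence needs a REX prefix plus an immediate or a SIB byte, which is how four "simple" instructions end up costing 16 bytes before the ret.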
So -fomit-frame-pointer makes your binary fatter, but does it make it slower?
Only one way to find out: do science.
I built a simple (and very ugly) testing harness to test the above pieces of code to determine which piece of code is faster. Before we get into the benchmark results, I want to tell you why my benchmark is bullshit.
Yes, bullshit.
You see, it makes me sad when people post benchmarks and neglect to tell others why their benchmark may be inaccurate. So, lemme start the trend.
This benchmark is useless because:
- Reading the CPU cycle counter is unreliable (more on this below the conclusion). I also tracked wall clock time.
- I don’t have the ideal test environment. I ran this on bare metal hardware, and set the CPU affinity to keep the process pinned to a single CPU… BUT
  - I could have done better if I had pinned init to CPU0 (thereby forcing all children of init to be pinned to CPU0; remember, child processes inherit the affinity mask). I would then have had an entire CPU for nothing but my benchmark.
  - I could have done better if I forced the CPU running my benchmark program to not handle any IRQs.
- I only tested one version of GCC: (Debian 4.3.2-1.1) 4.3.2
- I could have taken more samples.
You can find more testing harness tidbits below the conclusion.
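For the curious, pinning a process to a CPU on Linux can be done from inside the program. This is a minimal sketch, assuming Linux and glibc; pin_to_cpu is a hypothetical helper and not necessarily how the harness sets its affinity (a tool like taskset does the same job from the shell).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Pin the calling thread to a single CPU. Returns 0 on success, -1 on error. */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    CPU_ZERO(&set);       /* start with an empty affinity mask */
    CPU_SET(cpu, &set);   /* allow exactly one CPU */

    /* A pid of 0 means "the calling thread". */
    return sched_setaffinity(0, sizeof set, &set);
}
```

Because child processes inherit the affinity mask, calling this before fork/exec pins everything the process launches as well. It does nothing about IRQs or about other processes being scheduled on the same CPU.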
Benchmark Results
test 1: code sequence simulating normal use of the frame pointer.
test 2: code sequence simulating use of the frame pointer register as a general purpose register.
64bit results
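Using -fomit-frame-pointer is SLOWER (contrary to what you’d expect) than not using it! Going by the means, test 2 takes about 30% more cycles and about 30% more wall clock time than test 1.
| | cycles test 1 | cycles test 2 | microsecs test 1 | microsecs test 2 |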
| mean | 3514422987.92 | 4559685515.66 | 1882707.27 | 2442663.94 |
| median | 3507007423.5 | 4562511684.5 | 1878721.5 | 2444171.5 |
| max | 3922780211 | 4672066854 | 2101457 | 2502869 |
| min | 3502194976 | 4327782795 | 1876113 | 2318452 |
| std dev | 31927179.5632 | 15449507.8196 | 17103.7755 | 8275.49788 |
| variance | 1.02E+15 | 238687291867021 | 292539135.936 | 68483865.11835 |
32bit results
Using -fomit-frame-pointer is FASTER (as it should be) than not using it, but only by about 0.3% going by the means. The binary is still fatter, though.
| | cycles test 1 | cycles test 2 | microsecs test 1 | microsecs test 2 |
| mean | 3502932799.49 | 3491263364.89 | 1876553.08 | 1870301.35 |
| median | 3501486586.5 | 3492013955.5 | 1875778 | 1870702.5 |
| max | 3905163528 | 3731985243 | 2092032 | 1999259 |
| min | 3500916510 | 3408834436 | 1875472 | 1826144 |
| std dev | 10066939.1113 | 7992367.6913 | 5393.0412 | 4281.5466 |
| variance | 101343263071403 | 63877941312996.4 | 29084893.2588 | 18331640.9459 |
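To put the two tables side by side, it helps to turn the means into a relative difference. A small sketch (the function name relative_change is mine):

```c
#include <assert.h>

/*
 * Relative change from baseline to candidate.
 * 0.10 means the candidate is 10% larger; negative means it is smaller.
 */
static double relative_change(double baseline, double candidate)
{
    assert(baseline != 0.0);
    return (candidate - baseline) / baseline;
}
```

Plugging in the mean cycle counts from the tables, relative_change(3514422987.92, 4559685515.66) comes out a bit under 0.30 for 64bit, while relative_change(3502932799.49, 3491263364.89) is roughly -0.003 for 32bit. In other words, test 2 is about 30% slower on 64bit and about 0.3% faster on 32bit.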
Conclusion
- GCC is a really complex piece of software; this bug is very subtle and may have existed for a while.
- I’ve said this a few times, but knowing and understanding your system’s ABI is crucial for catching bugs like these.
- Math and science are cool now, much like computers. You should use both.
Thanks for reading and don’t forget to subscribe (via RSS or e-mail) and follow me on twitter.
Testing harness tidbits
Each run of the benchmark executes either test1 or test2 (from above) 500,000,000 times. I did around 2500 runs for each test function.
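The shape of one run looks roughly like the sketch below. This is not the code from the gist, just an illustration of the idea under my own assumptions: time_calls_us, empty_test, and calls_made are hypothetical names, and wall clock time is taken with clock_gettime.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Call fn() `iterations` times; return the elapsed wall clock microseconds. */
static uint64_t time_calls_us(void (*fn)(void), uint64_t iterations)
{
    struct timespec start, end;

    assert(fn != NULL);
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (uint64_t i = 0; i < iterations; i++)
        fn();
    clock_gettime(CLOCK_MONOTONIC, &end);

    int64_t ns = (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL
               + (int64_t)(end.tv_nsec - start.tv_nsec);
    return (uint64_t)(ns / 1000);
}

/* Stand-in for test1/test2, which would be assembly routines in practice. */
static unsigned long calls_made;
static void empty_test(void) { calls_made++; }
```

A single run in the style of the benchmark would then be time_calls_us(test1, 500000000ULL), repeated a couple thousand times to collect samples. Keep the functions under test in a separate assembly file so the compiler cannot inline or rewrite them.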
You can get the testing harness, a build script, and a test script here: http://gist.github.com/483524
You can look at the terminal session where I build the test from the original bug report on my system: http://gist.github.com/483494
The code I used to read the CPU cycle counter looks like this:
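static __inline__ unsigned long long rdtsc(void)
{
    unsigned long hi = 0, lo = 0;
    /* lfence: wait for earlier loads to finish before reading the counter */
    __asm__ __volatile__ ("lfence\n\trdtsc" : "=a"(lo), "=d"(hi));
    /* rdtsc puts the low 32 bits in %eax and the high 32 bits in %edx */
    return ((unsigned long long)lo) | (((unsigned long long)hi) << 32);
}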
The lfence instruction is a load fence: it ensures that all load instructions issued before it have completed before execution proceeds past it. I used it to make sure that the cycle counter is read only after the operations in the test functions have executed.
The values returned by this function are misleading because CPU frequency may be scaled at any time. This is why I also measured wall clock time.
References

