<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>time to bleed by Joe Damato &#187; monitoring</title>
	<atom:link href="http://timetobleed.com/category/monitoring/feed/" rel="self" type="application/rss+xml" />
	<link>http://timetobleed.com</link>
	<description>technical ramblings from a wanna-be unix dinosaur</description>
	<lastBuildDate>Tue, 20 Jul 2010 21:03:03 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>memprof: A Ruby level memory profiler</title>
		<link>http://timetobleed.com/memprof-a-ruby-level-memory-profiler/</link>
		<comments>http://timetobleed.com/memprof-a-ruby-level-memory-profiler/#comments</comments>
		<pubDate>Fri, 11 Dec 2009 12:59:43 +0000</pubDate>
		<dc:creator>Joe Damato</dc:creator>
				<category><![CDATA[bugfix]]></category>
		<category><![CDATA[debugging]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[monitoring]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[systems]]></category>
		<category><![CDATA[x86]]></category>
		<category><![CDATA[debug]]></category>
		<category><![CDATA[garbage collection]]></category>
		<category><![CDATA[GC]]></category>
		<category><![CDATA[memory]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[profiling]]></category>
		<category><![CDATA[system health]]></category>
		<category><![CDATA[x86_64]]></category>

		<guid isPermaLink="false">http://timetobleed.com/?p=1398</guid>
		<description><![CDATA[If you enjoy this article, subscribe (via RSS or e-mail) and follow me on twitter. What is memprof and why do I care? memprof is a Ruby gem which supplies memory profiler functionality similar to bleak_house without patching the Ruby VM. You just install the gem, call a function or two, and off you go. [...]]]></description>
			<content:encoded><![CDATA[<p><center><img src="http://timetobleed.com/images/memory.jpg" alt="" width="300" height="200" /></center><br />
If you enjoy this article, <a rel="alternate" type="application/rss+xml" href="http://feeds.feedburner.com/TimeToBleed">subscribe (via RSS or e-mail)</a> and <a href="http://twitter.com/joedamato">follow me on twitter.</a></p>
<h2>What is memprof and why do I care?</h2>
<p>memprof is a Ruby gem which supplies memory profiler functionality similar to bleak_house <b>without</b> patching the Ruby VM. You just install the gem, call a function or two, and off you go.</p>
<h2>Where do I get it?</h2>
<p>memprof is available on gemcutter, so you can just:</p>
<p><b><code>gem install memprof</code></b></p>
<p>Feel free to browse the source code at: <a href="http://github.com/ice799/memprof">http://github.com/ice799/memprof</a>.</p>
<h2>How do I use it?</h2>
<p>Using memprof is simple. Before we look at some examples, let me explain more precisely what memprof is measuring.</p>
<p>memprof is measuring the number of objects created and not destroyed during a segment of Ruby code. The ideal use case for memprof is to show you where objects that do not get destroyed are being created: </p>
<ul>
<li>Objects are created and not destroyed when you create new classes. This is a good thing.</li>
<li>Sometimes garbage objects sit around until <code>garbage_collect</code> has had a chance to run. These objects will go away.</li>
<li>Yet in other cases you might be holding a reference to a large chain of objects without knowing it. Until you remove this reference, the entire chain of objects will remain in memory taking up space.</li>
</ul>
<p>memprof will show objects created in all cases listed above.</p>
<p>OK, now let&#8217;s take a look at two examples and their output.</p>
<p>A simple program with an obvious memory &#8220;leak&#8221;:</p>
<pre class="prettyprint">
require 'memprof'

@blah = Hash.new([])

Memprof.start
100.times {
  @blah[1] << "aaaaa"
}

1000.times {
   @blah[2] << "bbbbb"
}
Memprof.stats
Memprof.stop
</pre>
<p>
<p>
This program creates 1100 objects which are not destroyed between the calls to <code>start</code> and <code>stop</code>, because a reference is still held to each object created.</p>
<p>Let's look at the output from memprof:</p>
<pre>
   1000 test.rb:11:String
    100 test.rb:7:String
</pre>
<p>
<p>In this example memprof shows the 1100 objects created, broken down by file, line number, and type.</p>
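<p>To see why every one of those strings stays reachable, here is a small sketch in plain Ruby (memprof is not required). It relies on standard <code>Hash.new(obj)</code> behaviour: every missing key returns the same default object, so each <code>&lt;&lt;</code> in the program above appends to one shared array. The local variable names are made up for illustration.</p>

```ruby
# Hash.new([]) hands back the SAME default array for every missing key,
# and << mutates that array in place without ever storing the key.
blah = Hash.new([])

100.times  { blah[1] << "aaaaa" }
1000.times { blah[2] << "bbbbb" }

shared = blah.default          # the one array that received every string

puts shared.size               # 1100
puts blah.keys.inspect         # [] because no key was ever assigned
puts blah[1].equal?(blah[2])   # true: both lookups return the same array
```

<p>As long as the hash (and therefore its default array) is reachable, all 1100 strings are reachable too, which matches the counts memprof reports.</p>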
<p>Let's take a look at another example:</p>
<pre class="prettyprint">
require 'memprof'
Memprof.start
require "stringio"
StringIO.new
Memprof.stats
</pre>
<p>
<p>This simple program is measuring the number of objects created when requiring <code>stringio</code>.</p>
<p>Let's take a look at the output:</p>
<pre>
    108 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:__node__
     14 test2.rb:3:String
      2 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:Class
      1 test2.rb:4:StringIO
      1 test2.rb:4:String
      1 test2.rb:3:Array
      1 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:Enumerable
</pre>
<p>
<p>This output shows an internal Ruby interpreter type <code>__node__</code> was created (these represent code), as well as a few <code>String</code>s and other objects. Some of these objects are just garbage objects which haven't had a chance to be recycled yet.</p>
<p>What if we nudge the garbage collector along a little bit just for our example? Let's add the following two lines of code to our previous example:</p>
<pre class="prettyprint">
GC.start
Memprof.stats
</pre>
<p>
<p>We're now nudging the garbage collector and outputting memprof stats information again. This should show fewer objects, as the garbage collector will recycle some of the garbage objects:</p>
<pre>
    108 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:__node__
      2 test2.rb:3:String
      2 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:Class
      1 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:Enumerable
</pre>
<p></p>
<p>As you can see above, a few <code>String</code>s and other objects went away after the garbage collector ran.</p>
<h2>Which Rubies and systems are supported?</h2>
<ul>
<li>Only <b>unstripped</b> binaries are supported. To determine if your Ruby binary is stripped, simply run: <code>file `which ruby`</code>. If it is, consult your package manager's documentation. Most Linux distributions offer a package with an unstripped Ruby binary.</li>
<li>Only <b>x86_64</b> is supported at this time. Hopefully, I'll have time to add support for i386/i686 in the immediate future.</li>
<li>Linux Ruby Enterprise Edition (1.8.6 and 1.8.7) is supported.</li>
<li>Linux MRI Ruby 1.8.6 and 1.8.7 built with --disable-shared are supported. Support for --enable-shared binaries is <b>coming soon.</b></li>
<li>Snow Leopard support is <b>experimental</b> at this time.</li>
<li>Ruby 1.9 support <b>coming soon</b>.</li>
</ul>
<h2>How does it work?</h2>
<p>If you've been reading my blog over the last week or so, you'll have noticed two previous blog posts (<a href="http://timetobleed.com/rewrite-your-ruby-vm-at-runtime-to-hot-patch-useful-features/">here</a> and <a href="http://timetobleed.com/hot-patching-inlined-functions-with-x86_64-asm-metaprogramming/">here</a>) that describe some tricks I came up with for modifying a running binary image in memory.</p>
<p>memprof is a combination of all those tricks and other hacks to allow memory profiling in Ruby without the need for custom patches to the Ruby VM. You simply require the gem and off you go.</p>
<p>memprof works by inserting trampolines on object allocation and deallocation routines. It gathers metadata about the objects and outputs this information when the <code>stats</code> method is called.</p>
<h2>What else is planned?</h2>
<p><a href="http://twitter.com/jakedouglas">Jake Douglas</a>, <a href="http://www.twitter.com/tmm1">Aman Gupta</a>, and <a href="http://twitter.com/joedamato">I</a> have lots of interesting ideas for new features. We don't want to ruin the surprise, but stay tuned. More cool stuff coming really soon :)</p>
<p>Thanks for reading and don't forget to <a rel="alternate" type="application/rss+xml" href="http://feeds.feedburner.com/TimeToBleed">subscribe (via RSS or e-mail)</a> and <a href="http://twitter.com/joedamato">follow me on twitter.</a></p>
]]></content:encoded>
			<wfw:commentRss>http://timetobleed.com/memprof-a-ruby-level-memory-profiler/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Rewrite your Ruby VM at runtime to hot patch useful features</title>
		<link>http://timetobleed.com/rewrite-your-ruby-vm-at-runtime-to-hot-patch-useful-features/</link>
		<comments>http://timetobleed.com/rewrite-your-ruby-vm-at-runtime-to-hot-patch-useful-features/#comments</comments>
		<pubDate>Mon, 23 Nov 2009 12:59:53 +0000</pubDate>
		<dc:creator>Joe Damato</dc:creator>
				<category><![CDATA[bugfix]]></category>
		<category><![CDATA[debugging]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[monitoring]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[systems]]></category>
		<category><![CDATA[testing]]></category>
		<category><![CDATA[x86]]></category>
		<category><![CDATA[allocator]]></category>
		<category><![CDATA[debug]]></category>
		<category><![CDATA[garbage collection]]></category>
		<category><![CDATA[GC]]></category>
		<category><![CDATA[memory]]></category>
		<category><![CDATA[x86_64]]></category>

		<guid isPermaLink="false">http://timetobleed.com/?p=1253</guid>
		<description><![CDATA[If you enjoy this article, subscribe (via RSS or e-mail) and follow me on twitter. Some notes before the blood starts flowin&#8217; CAUTION: What you are about to read is dangerous, non-portable, and (in most cases) stupid. The code and article below refer only to the x86_64 architecture. Grab some gauze. This is going to [...]]]></description>
			<content:encoded><![CDATA[<p><center><img src="http://timetobleed.com/images/tramp.png" alt="" width="400" height="300" /></center><br />
If you enjoy this article, <a rel="alternate" type="application/rss+xml" href="http://feeds.feedburner.com/TimeToBleed">subscribe (via RSS or e-mail)</a> and <a href="http://twitter.com/joedamato">follow me on twitter.</a></p>
<h2>Some notes before the blood starts flowin&#8217;</h2>
<ul>
<li><strong>CAUTION:</strong> What you are about to read is dangerous, non-portable, and (in most cases) stupid.</li>
<li>The code and article below refer only to the <strong>x86_64</strong> architecture.</li>
<li>Grab some gauze. This is going to get ugly.</li>
</ul>
<h2>TLDR</h2>
<p>This article shows off a Ruby gem which has the power to overwrite a Ruby binary <em>in memory</em> while <em>it is running</em> to allow your code to execute in place of internal VM functions. This is useful if you&#8217;d like to hook all object allocation functions to build a memory profiler.</p>
<h2>This gem is on GitHub</h2>
<p>Yes, it&#8217;s on GitHub: <a href="http://github.com/ice799/memprof">http://github.com/ice799/memprof</a>.</p>
<h2>I want a memory profiler for Ruby</h2>
<p>This whole science experiment started during <a href="http://rubyconf.org/">RubyConf</a> when <a href="http://twitter.com/tmm1">Aman</a> and I began brainstorming ways to build a memory profiling tool for Ruby.</p>
<p>The big problem in our minds was that for most tools we&#8217;d have to include patches to the Ruby VM. That process is <b>long and somewhat difficult</b>, so I started thinking about ways to do this without modifying the Ruby source code itself.</p>
<p> The memory profiler is <b>NOT DONE</b> just yet. I thought that the hack I wrote to let us build something without modifying Ruby source code was interesting enough that it warranted a blog post. So let&#8217;s get rolling.</p>
<h2>What is a trampoline?</h2>
<p>Let&#8217;s pretend you have 2 functions: <code>functionA()</code> and <code>functionB()</code>. Let&#8217;s assume that <code>functionA()</code> calls <code>functionB()</code>.</p>
<p>Now also imagine that you&#8217;d like to insert a piece of code that executes before <code>functionB()</code> whenever <code>functionA()</code> calls it. You can imagine inserting a piece of code that <i>diverts execution</i> elsewhere, creating a flow: <code>functionA()</code> -&gt; <code>functionC()</code> -&gt; <code>functionB()</code></p>
<p>You can accomplish this by <i>inserting a trampoline</i>.</p>
<p>A trampoline is a piece of code that program execution jumps into and then <i>bounces</i> out of and on to somewhere else<sup>1</sup>.</p>
<p>This hack relies on the use of multiple trampolines. We&#8217;ll see why shortly.</p>
<h2>Two different kinds of trampolines</h2>
<p>There are two different kinds of trampolines that I considered while writing this hack. Let&#8217;s take a closer look at both.</p>
<p>
<h3>Caller-side trampoline</h3>
<p>A <i>caller-side</i> trampoline works by overwriting the <a href="http://en.wikipedia.org/wiki/Opcodes">opcodes</a> in the <i>.text</i> segment of the program in the calling function causing it to call a different function <i>at runtime</i>.</p>
</p>
<p>The <b>big pros</b> of this method are:
<ul>
<li>You aren&#8217;t overwriting any code, only the address operand of a <code>callq</code> instruction.</li>
<li>Since you are only changing an operand, you can hook any function. You don&#8217;t need to build custom trampolines for each function.</li>
</ul>
<p>This method has some <b>big cons</b> too:
<ul>
<li>You&#8217;ll need to scan <i>the entire binary in memory</i> and find and <i>overwrite</i> all address operands of <code>callq</code>. This is problematic because if you overwrite any false-positives you might break your application.</li>
<li>You have to deal with the implications of <code>callq</code>, which can be painful as we&#8217;ll see soon.</li>
</ul>
<p><h3>Callee-side trampoline</h3>
<p>A <i>callee-side</i> trampoline works by overwriting the opcodes in the <i>.text</i> segment of the program in the called function, causing it to call another function immediately.</p>
<p>The <b>big pro</b> of this method is:
<ul>
<li>You only need to overwrite code in <i>one</i> place and don&#8217;t need to worry about accidentally scribbling on bytes that you didn&#8217;t mean to.</li>
</ul>
<p>This method has some <b>big cons</b> too:
<ul>
<li>You&#8217;ll need to carefully construct your trampoline code to overwrite as little of the function as possible (or somehow restore the opcodes), especially if you expect the original function to work as expected later.</li>
<li>You&#8217;ll need to special case each trampoline you build for different optimization levels of the binary you are hooking into.</li></ul>
<p>I went with a <i>caller-side</i> trampoline because I wanted to ensure that I can hook any function and not have to worry about different Ruby binaries causing problems when they are compiled with different optimization levels.</p>
<h2>The stage 1 trampoline</h2>
<p>To insert my trampolines I needed to <i>insert some binary into the process</i> and then overwrite <code>callq</code> instructions like this:</p>
<p><pre class="prettyprint">
  41150b:       e8 cc 4e 02 00         callq  4363dc [rb_newobj]
  411510:       48 89 45 f8             ....
</pre>
</p>
<p></p>
<p>In the above code snippet, the byte <code>e8</code> is the <code>callq</code> opcode and the bytes <code>cc 4e 02 00</code> are the distance (a little-endian 32-bit value, 0x24ecc) to <code>rb_newobj</code> from the address of the next instruction, 0x411510.</p>
<p>All I need to do is change the 4 bytes following <code>e8</code> to equal the displacement between the next instruction, 0x411510 in this case, and my trampoline.</p>
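<p>The arithmetic can be checked with a short sketch. It decodes the operand bytes from the disassembly above and then re-encodes a displacement for a new target. The helper name <code>encode_rel32</code> and the address <code>0x40000000</code> are assumptions made for illustration; they are not taken from the gem.</p>

```ruby
# Bytes of the call instruction at 0x41150b, copied from the disassembly.
insn      = [0xe8, 0xcc, 0x4e, 0x02, 0x00].pack("C*")
insn_addr = 0x41150b
next_insn = insn_addr + insn.bytesize          # 0x411510

# The operand is a signed 32-bit little-endian displacement.
disp   = insn.byteslice(1, 4).unpack1("l<")
target = next_insn + disp

printf("disp = 0x%x, target = 0x%x\n", disp, target)
# prints "disp = 0x24ecc, target = 0x4363dc"

# Hypothetical helper: the 4 operand bytes that make the call land on new_target.
def encode_rel32(next_insn, new_target)
  [new_target - next_insn].pack("l<")
end

tramp   = 0x40000000                           # made-up trampoline address
patched = insn.byteslice(0, 1) + encode_rel32(next_insn, tramp)
puts patched.unpack("C*").map { |b| format("%02x", b) }.join(" ")
# prints "e8 f0 ea be 3f"
```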
<p><b>Problem.</b></p>
<p>My first cut at this code led me to an important realization: the <code>callq</code> instructions used expect a <i>32-bit displacement</i> to the function I am calling and <i>not</i> an absolute address. <b>But</b>, the 64-bit address space is <i>very</i> large. The code for the Ruby binary that lives in the <code>.text</code> segment is so far away from my Ruby gem that the displacement <b>cannot be represented with only 32 bits</b>.</p>
<p><b>So what now?</b></p>
<p>Well, luckily <code>mmap</code> has a flag <code>MAP_32BIT</code> which maps a page in the first 2GB of the address space. If I map some code there, it should be well within the range of values whose displacement I can represent in 32bits.</p>
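<p>One way to reason about the problem is to check whether a displacement fits in a signed 32-bit integer. In the sketch below, <code>fits_rel32?</code> is a hypothetical helper and both target addresses are made up: one stands in for code mapped high in the address space, the other for a page in the low 2GB.</p>

```ruby
# True when the distance from next_insn to target fits in a signed 32-bit operand.
def fits_rel32?(next_insn, target)
  ((-(2**31))..(2**31 - 1)).cover?(target - next_insn)
end

next_insn  = 0x411510          # address of the instruction after the callq shown earlier
far_target = 0x7f3a1c000000    # made-up address high in the 64-bit address space
low_target = 0x40000000        # made-up address below 2GB

puts fits_rel32?(next_insn, far_target)   # false
puts fits_rel32?(next_insn, low_target)   # true
```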
<p>So, why not map a <b>second trampoline</b> to that page which contains code that can call an <i>absolute address</i>?</p>
<p>My stage 1 trampoline code looks something like this:</p>
<p>
<pre class="prettyprint">
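static void
create_tramp_table(void) { /* function name is assumed; the excerpt starts mid-function */
  int i = 0;

  /* the struct below is just a sequence of bytes which represent the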
    *  following bit of assembly code, including 3 nops for padding:
    *
    *  mov $address, %rbx
    *  callq *%rbx
    *  ret
    *  nop
    *  nop
    *  nop
    */
  struct tramp_tbl_entry ent = {
    .mov = {'\x48','\xbb'},
    .addr = (long long)&#038;error_tramp,
    .callq = {'\xff','\xd3'},
    .ret = '\xc3',
    .pad =  {'\x90','\x90','\x90'},
  };

  tramp_table = mmap(NULL, 4096, PROT_WRITE|PROT_READ|PROT_EXEC,
                                   MAP_32BIT|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
  if (tramp_table != MAP_FAILED) {
    for (; i < 4096/sizeof(struct tramp_tbl_entry); i ++ ) {
      memcpy(tramp_table + i, &#038;ent, sizeof(struct tramp_tbl_entry));
    }
  }
}
</pre>
<p>
<p>It <code>mmap</code>s a single page and writes a table of default trampolines (like a jump table) that all call an error trampoline by default. When a new trampoline is inserted, I just go to that entry in the table and insert the address that should be called.</p>
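<p>The byte layout of one table entry can be sketched outside of C as well. This sketch assumes the struct is packed with no padding between fields, which gives 16 bytes per entry and 256 entries per 4096-byte page. The helper names and the two addresses are hypothetical.</p>

```ruby
# Build one entry: mov $addr,%rbx ; callq *%rbx ; ret ; nop ; nop ; nop
def tramp_entry(addr)
  [0x48, 0xbb].pack("C2") +         # mov imm64 into %rbx
    [addr].pack("Q<") +             # 8-byte absolute address, little-endian
    [0xff, 0xd3].pack("C2") +       # callq *%rbx
    [0xc3].pack("C") +              # ret
    [0x90, 0x90, 0x90].pack("C3")   # nop padding
end

# Overwrite only the address field (bytes 2..9) of entry number index.
def set_entry_addr(table, index, addr)
  entry_size = 16
  table[index * entry_size + 2, 8] = [addr].pack("Q<")
end

error_tramp = 0x7f00deadb000                 # made-up address
entry = tramp_entry(error_tramp)
table = entry * (4096 / entry.bytesize)      # fill one page with default entries

stage2 = 0x7f00cafe1000                      # made-up stage 2 trampoline address
set_entry_addr(table, 0, stage2)

puts entry.bytesize                          # 16
puts table.bytesize / entry.bytesize         # 256
printf("0x%x\n", table.byteslice(2, 8).unpack1("Q<"))
```

<p>Only the address field of an entry changes when a trampoline is inserted; the surrounding instruction bytes stay the same.</p>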
<p>To get around the displacement challenge described above, the addresses I insert into the stage 1 trampoline table are addresses for stage 2 trampolines.</p>
<h2>The stage 2 trampoline</h2>
<p>Setting up the stage 2 trampolines is pretty simple once the stage 1 trampoline table has been written to memory. All that needs to be done is update the address field in a free stage 1 trampoline to be the address of my stage 2 trampoline. These trampolines are written in C and live in my Ruby gem.</p>
<p>
<pre class="prettyprint">
static void
insert_tramp(char *trampee, void *tramp) {
  void *trampee_addr = find_symbol(trampee);
  int entry = tramp_size;
  tramp_table[tramp_size].addr = (long long)tramp;
  tramp_size++;
  update_image(entry, trampee_addr);
}
</pre>
</p>
<p>
<p>An example of a stage 2 trampoline for <code>rb_newobj</code> might be:</p>
<p>
<pre class="prettyprint">
static VALUE
newobj_tramp() {
  /* print the ruby source and line number where the allocation is occurring */
  printf("source = %s, line = %d\n", ruby_sourcefile, ruby_sourceline);

  /* call newobj like normal so the ruby app can continue */
  return rb_newobj();
}
</pre>
</p>
<h2>Programmatically rewriting the Ruby binary in memory</h2>
<p>Overwriting the Ruby binary to cause my stage 1 trampolines to get hit is pretty simple, too. I can just scan the <code>.text</code> segment of the binary looking for bytes which look like <code>callq</code> instructions. Then, I can sanity check by reading the next 4 bytes which should be the displacement to the original function. Doing that sanity check should prevent false positives.</p>
<pre class="prettyprint">
static void
update_image(int entry, void *trampee_addr) {
  char *byte = text_segment;
  size_t count = 0;
  int fn_addr = 0;
  void *aligned_addr = NULL;

 /* check each byte in the .text segment */
  for(; count < text_segment_len; count++) {

    /* if it looks like a callq instruction... */
    if (*byte == '\xe8') {

      /* the next 4 bytes SHOULD BE the original displacement */
      fn_addr = *(int *)(byte+1);

      /* do a sanity check to make sure the next few bytes are an accurate displacement.
        * this helps to eliminate false positives.
        */
      if (trampee_addr - (void *)(byte+5) == fn_addr) {
        aligned_addr = (void*)(((long)byte+1)&#038;~(0xffff));

        /* mark the page in the .text segment as writable so it can be modified */
        mprotect(aligned_addr, (void *)byte+1 - aligned_addr + 10,
                       PROT_READ|PROT_WRITE|PROT_EXEC);

        /* calculate the new displacement and write it */
        *(int  *)(byte+1) = (uint32_t)((void *)(tramp_table + entry)
                                     - (void *)(byte + 5));

        /* disallow writing to this page of the .text segment again  */
        mprotect(aligned_addr, (((void *)byte+1) - aligned_addr) + 10,
                      PROT_READ|PROT_EXEC);
      }
    }
    byte++;
  }
}
</pre>
<p></p>
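<p>The same scan-and-check logic can be tried safely on a plain byte string instead of a live process. The sketch below is a simulation only: there is no <code>mprotect</code>, the method name <code>patch_calls</code> is hypothetical, and the trampoline address is made up. The first <code>e8</code> is the call from the disassembly shown earlier; the second is a deliberate false positive that the sanity check should skip.</p>

```ruby
# Scan text (a binary String loaded at address base) for e8 bytes whose
# displacement resolves to trampee, and redirect those calls to tramp.
def patch_calls(text, base, trampee, tramp)
  patched = 0
  (0..text.bytesize - 5).each do |off|
    next unless text.getbyte(off) == 0xe8          # looks like a callq

    next_insn = base + off + 5
    disp = text.byteslice(off + 1, 4).unpack1("l<")
    next unless next_insn + disp == trampee        # sanity check against false positives

    text[off + 1, 4] = [tramp - next_insn].pack("l<")
    patched += 1
  end
  patched
end

text = [0x90,                                      # nop
        0xe8, 0xcc, 0x4e, 0x02, 0x00,              # callq rb_newobj (at 0x41150b)
        0x48, 0x89, 0x45, 0xf8,                    # unrelated bytes
        0xe8, 0x11, 0x22, 0x33, 0x44].pack("C*")   # e8 that does not target rb_newobj

count = patch_calls(text, 0x41150a, 0x4363dc, 0x40000000)
puts count
puts text.unpack("C*").map { |b| format("%02x", b) }.join(" ")
```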
<h2>Sample output</h2>
<p>After requiring my ruby gem and running a test script which creates lots of objects, I see this output:</p>
<pre class="prettify">
...
source = test.rb, line = 8
source = test.rb, line = 8
source = test.rb, line = 8
source = test.rb, line = 8
source = test.rb, line = 8
source = test.rb, line = 8
source = test.rb, line = 8
...
</pre>
<p>
<p><b>Showing the file name and line number for each object getting allocated.</b> That should be a strong enough primitive to build a Ruby memory profiler without requiring end users to build a custom version of Ruby. It should also be possible to re-implement <a href="http://blog.evanweaver.com/articles/2007/04/28/bleak_house/">bleak_house</a> by using this gem (and maybe another trick or two).</p>
<p><b>Awesome.</b></p>
<h2>Conclusion</h2>
<ul>
<li>One step closer to building a memory profiler without requiring end users to find and use patches floating around the internet.</li>
<li>It is unclear whether cheap tricks like this are useful or harmful, but they are <b>fun</b> to write.</li>
<li>If you understand how your system works at an intimate level, nearly anything is possible. The work required to make it happen might be difficult though.</li>
</ul>
<p>
Thanks for reading and don't forget to <a rel="alternate" type="application/rss+xml" href="http://feeds.feedburner.com/TimeToBleed">subscribe (via RSS or e-mail)</a> and <a href="http://twitter.com/joedamato">follow me on twitter.</a></p>
<h2>References</h2>
<ol class="footnotes"><li id="footnote_0_1253" class="footnote"><a href="http://en.wikipedia.org/wiki/Trampoline_%28computers%29">http://en.wikipedia.org/wiki/Trampoline_(computers)</a></li></ol>]]></content:encoded>
			<wfw:commentRss>http://timetobleed.com/rewrite-your-ruby-vm-at-runtime-to-hot-patch-useful-features/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Extending ltrace to make your Ruby/Python/Perl/PHP apps faster</title>
		<link>http://timetobleed.com/extending-ltrace-to-make-your-rubypythonperlphp-apps-faster/</link>
		<comments>http://timetobleed.com/extending-ltrace-to-make-your-rubypythonperlphp-apps-faster/#comments</comments>
		<pubDate>Thu, 08 Oct 2009 11:59:56 +0000</pubDate>
		<dc:creator>Joe Damato</dc:creator>
				<category><![CDATA[debugging]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[monitoring]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[systems]]></category>
		<category><![CDATA[x86]]></category>
		<category><![CDATA[debug]]></category>
		<category><![CDATA[ltrace]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[profiling]]></category>
		<category><![CDATA[strace]]></category>
		<category><![CDATA[system health]]></category>
		<category><![CDATA[x86_64]]></category>

		<guid isPermaLink="false">http://timetobleed.com/?p=1058</guid>
		<description><![CDATA[If you enjoy this article, subscribe (via RSS or e-mail) and follow me on twitter. A few days ago, Aman (@tmm1) was complaining to me about a slow running process: I want to see what is happening in userland and trace calls to extensions. Why doesn&#8217;t ltrace work for Ruby processes? I want to figure [...]]]></description>
			<content:encoded><![CDATA[<p><center><img src="http://timetobleed.com/images/trace.jpg" alt="" width="400" height="300" /></center><br />

<p>If you enjoy this article, <a rel="alternate" type="application/rss+xml" href="http://feeds.feedburner.com/TimeToBleed">subscribe (via RSS or e-mail)</a> and <a href="http://twitter.com/joedamato">follow me on twitter.</a></p>
<p>A few days ago, Aman (<a href="http://twitter.com/tmm1">@tmm1</a>) was complaining to me about a slow-running process:</p>
<blockquote><p>I want to see what is happening in userland and trace calls to extensions. Why doesn&#8217;t ltrace work for Ruby processes? I want to figure out which MySQL queries are causing my app to be slow.</p></blockquote>
<p>It turns out that <b>ltrace did not have support for libraries loaded with libdl</b>. This is a problem for languages like Ruby, Python, PHP, Perl, and others because in many cases extensions, libraries, and plugins for these languages are loaded by the VM using libdl. This means that ltrace is somewhat useless for tracking down performance issues in dynamic languages.</p>
<p>After a couple of late nights of hacking, I <b>managed to finagle libdl support into ltrace.</b> Since most people probably don&#8217;t care about the technical details of how it was implemented, I&#8217;ll start with showing how to use the patch I wrote and what sort of output you can expect. <b>This patch has made tracking down slow queries (among other things) really easy and I hope others will find this useful.</b></p>
<h2>How to use ltrace:</h2>
<p>After you&#8217;ve applied my patch (below) and rebuilt ltrace, let&#8217;s say you&#8217;d like to trace MySQL queries and have ltrace tell you when the query was executed and how long it took. There are two steps:</p>
<ol>
<li>Give ltrace the function prototype so it can pretty-print the arguments: <code>echo "int mysql_real_query(addr,string,ulong);" &gt; custom.conf</code></li>
<li>Tell ltrace you want to hear about <code>mysql_real_query</code>: <b><code>ltrace -F custom.conf -ttTgx mysql_real_query -p &lt;pid&gt;</code></b></li>
</ol>
<p>Here&#8217;s what those arguments mean:</p>
<ul>
<li><b>-F</b>  use a custom config file when pretty-printing (default: /etc/ltrace.conf, add your stuff there to avoid -F if you wish).</li>
<li><b>-tt</b>  print the time (including microseconds) when the call was executed</li>
<li><b>-T</b>   time the call and print how long it took</li>
<li><b>-x</b>   tells ltrace the name of the function you care about</li>
<li><b>-g</b>  <i>avoid</i> placing breakpoints on all library calls except the ones you specify with -x. <b>This is optional</b>, but it makes ltrace produce much less output, which is a lot easier to read if you only care about your one function.</li>
</ul>
<h2>PHP</h2>
<h3>Test script</h3>
<pre class="prettyprint">
mysql_connect("localhost", "root");
while(true){
    mysql_query("SELECT sleep(1)");
}
</pre>
<p></p>
<h3>ltrace output</h3>
<pre class="prettyprint">
22:31:50.507523 zend_hash_find(0x025dc3a0, "mysql_query", 12) = 0 <0.000029>
22:31:50.507781 mysql_real_query(0x027bc540, "SELECT sleep(1)", 15) = 0 <1.000600>
22:31:51.508531 zend_hash_find(0x025dc3a0, "mysql_query", 12) = 0 <0.000025>
22:31:51.508675 mysql_real_query(0x027bc540, "SELECT sleep(1)", 15) = 0 <1.000926>
</pre>
<p></p>
<h3><code>ltrace</code> command</h3>
<pre class="prettyprint">
ltrace -ttTg -x zend_hash_find -x mysql_real_query -p [pid of script above]</pre>
<h2>Python</h2>
<h3>Test script</h3>
<pre class="prettyprint">
import MySQLdb
db = MySQLdb.connect("localhost", "root", "", "test")
cursor = db.cursor()
sql = """SELECT sleep(1)"""
while True:
	cursor.execute(sql)
	data = cursor.fetchone()
db.close()
</pre>
<p></p>
<h3><code>ltrace</code> output</h3>
<pre class="prettyprint">
22:24:39.104786 PyEval_SaveThread() = 0x21222e0 <0.000029>
22:24:39.105020 PyEval_SaveThread() = 0x21222e0 <0.000024>
22:24:39.105210 PyEval_SaveThread() = 0x21222e0 <0.000024>
22:24:39.105303 mysql_real_query(0x021d01d0, "SELECT sleep(1)", 15) = 0 <1.002083>
22:24:40.107553 PyEval_SaveThread() = 0x21222e0 <0.000026>
22:24:40.107713 PyEval_SaveThread()= 0x21222e0 <0.000024>
22:24:40.107909 PyEval_SaveThread() = 0x21222e0 <0.000025>
22:24:40.108013 mysql_real_query(0x021d01d0, "SELECT sleep(1)", 15) = 0 <1.001821>
</pre>
<p></p>
<h3><code>ltrace</code> command</h3>
<pre class="prettyprint">
ltrace -ttTg -x PyEval_SaveThread -x mysql_real_query -p [pid of script above]</pre>
<h2>Perl</h2>
<h3>Test script</h3>
<pre class="prettyprint">
#!/usr/bin/perl
use DBI;

$dsn = "DBI:mysql:database=test;host=localhost";
$dbh = DBI->connect($dsn, "root", "");
$drh = DBI->install_driver("mysql");
@databases = DBI->data_sources("mysql");
$sth = $dbh->prepare("SELECT SLEEP(1)");

while (1) {
  $sth->execute;
}
</pre>
<p></p>
<h3><code>ltrace</code> output</h3>
<pre class="prettyprint">
22:42:11.194073 Perl_push_scope(0x01bd3010) = &lt;void&gt; <0.000028>
22:42:11.194299 mysql_real_query(0x01bfbf40, "SELECT SLEEP(1)", 15) = 0 <1.000876>
22:42:12.195302 Perl_push_scope(0x01bd3010) = &lt;void&gt; <0.000024>
22:42:12.195408 mysql_real_query(0x01bfbf40, "SELECT SLEEP(1)", 15) = 0 <1.000967>
</pre>
<p></p>
<h3><code>ltrace</code> command</h3>
<pre class="prettyprint">
ltrace -ttTg -x mysql_real_query -x Perl_push_scope -p [pid of script above]</pre>
<h2>Ruby</h2>
<h3>Test script</h3>
<pre class="prettyprint">
require 'rubygems'
require 'sequel'

DB = Sequel.connect('mysql://root@localhost/test')

while true
  p DB['select sleep(1)'].select.first
  GC.start
end
</pre>
<p></p>
<h3>snip of <code>ltrace</code> output</h3>
<pre class="prettyprint">
22:10:00.195814 garbage_collect()  = 0 <0.022194>
22:10:00.218438 mysql_real_query(0x02740000, "select sleep(1)", 15) = 0 <1.001100>
22:10:01.219884 garbage_collect() = 0 <0.021401>
22:10:01.241679 mysql_real_query(0x02740000, "select sleep(1)", 15) = 0 <1.000812>
</pre>
<p></p>
<h3><code>ltrace</code> command used:</h3>
<pre class="prettyprint">
ltrace -ttTg -x garbage_collect -x mysql_real_query -p [pid of script above]</pre>
<h2>Where to get it</h2>
<ul>
<li>On github: <a href="http://github.com/ice799/ltrace/tree/libdl">http://github.com/ice799/ltrace/tree/libdl</a></li>
<li>Raw patch (<strong>NOTE:</strong> This should apply cleanly against ltrace 0.5.3): <a href="http://timetobleed.com/files/ltrace.patch">ltrace.patch</a></li>
</ul>
<h2>How ltrace works normally</h2>
<p><code>ltrace</code> works by setting <b>software breakpoints</b> on entries in a process&#8217; <b>Procedure Linkage Table</b> (PLT).</p>
<h2>What is a software breakpoint</h2>
<p>A software breakpoint is just a single-byte instruction (<code>0xcc</code> on the x86 and x86_64) that raises a breakpoint interrupt (interrupt 3 on the x86 and x86_64). When interrupt 3 is raised, the CPU executes a handler installed by the kernel. The kernel then sends a signal (<code>SIGTRAP</code> on Linux) to the process that generated the interrupt. (Want to know more about how signals and interrupts work? Check out an earlier blog post: <a href="http://timetobleed.com/a-few-things-you-didnt-know-about-signals-in-linux-part-1/">here</a>.)</p>
<h2>What is a PLT and how does it work?</h2>
<p>A PLT is a table of absolute addresses to functions. It is used because the link editor doesn&#8217;t know where functions in shared objects will be located. Instead, a table is created so that the program and the dynamic linker can work together to find and execute functions in shared objects. I&#8217;ve simplified the explanation a bit<sup>1</sup>, but at a high level:</p>
<ol>
<li>When the program calls a function in a shared object, the link editor has made sure that the call jumps to a slot in the PLT.</li>
<li>The program sets some data up for the dynamic linker and then hands control over to it.</li>
<li>The dynamic linker looks at the info set up by the program and fills in the absolute address of the function that was called in the PLT.</li>
<li>Then the dynamic linker calls the function.</li>
<li>Subsequent calls to the same function jump to the same slot in the PLT, but after the first call the absolute address is already in the PLT (because the dynamic linker fills it in the first time it is invoked).</li>
</ol>
<p>Since all calls to library functions occur via the PLT, <code>ltrace</code> sets breakpoints on each PLT entry in a program.</p>
<h2>Why ltrace didn&#8217;t work with libdl loaded libraries</h2>
<p>Libraries loaded with libdl are loaded at run time and functions (and other symbols) are accessed by querying the dynamic linker (by calling <code>dlsym()</code>). The compiler and link editor don&#8217;t know anything about libraries loaded this way (they may not even exist!) and as such no PLT entries are created for them.</p>
<p>Since no PLT entries exist, <code>ltrace</code> can&#8217;t trace these functions.</p>
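<p>For reference, this is roughly what loading a function through libdl looks like. It is a sketch that assumes a glibc system where the math library is installed as <code>libm.so.6</code>; <code>lookup_unary</code> is a hypothetical helper name, and older glibc versions need <code>-ldl</code> at link time. Note that the program never names <code>cos</code> at link time, so no PLT slot is created for it.</p>
<pre class="prettyprint lang-c">
#include <dlfcn.h>
#include <stdio.h>

typedef double (*unary_fn)(double);

/* Look up a function by name in a library loaded at run time.
 * The handle is deliberately left open so the pointer stays valid. */
unary_fn lookup_unary(const char *library, const char *symbol)
{
    unary_fn fn = NULL;
    void *handle = dlopen(library, RTLD_NOW);

    if (handle == NULL) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return NULL;
    }
    /* POSIX idiom for storing the void * from dlsym() in a function pointer. */
    *(void **)(&fn) = dlsym(handle, symbol);
    return fn;
}

int main(void)
{
    /* "libm.so.6" is an assumption about the system this runs on. */
    unary_fn cosine = lookup_unary("libm.so.6", "cos");

    if (cosine == NULL)
        return 1;
    printf("cos(0.0) = %f\n", cosine(0.0));
    return 0;
}
</pre>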
<h2>What needed to be done to make ltrace libdl-aware</h2>
<p>OK, so we understand the problem. <code>ltrace</code> only sets breakpoints on PLT entries and libdl loaded libraries don&#8217;t have PLT entries. How can this be fixed?</p>
<p><b>Luckily, the dynamic linker and ELF all work together to save your ass.</b></p>
<p>Executable and Linking Format (ELF) is a file format for executables, shared libraries, and more<sup>2</sup>. The file format can get a bit complicated, but all you really need to know is: ELF consists of different sections which hold different types of entries. There is a section called <code>.dynamic</code> which has an entry named <code>DT_DEBUG</code>. This entry stores the address of a debugging structure <i>in the address space of the process</i>. In Linux, this struct has type <code>struct r_debug</code>.</p>
<h2>How to use struct r_debug to win the game</h2>
<p>The debug structure is <b>updated by the dynamic linker at runtime</b> to reflect the current state of shared object loading. The structure contains 3 things that will help us in our quest:</p>
<ol>
<li>state &#8211; the current state of the mapping change taking place (begin add, begin delete, consistent)</li>
<li>brk &#8211; the address of a function <i>internal to the dynamic linker</i> that will be called when the linker maps, unmaps, or has completed mapping a shared object.</li>
<li>link map &#8211; Pointer to the start of a list of currently loaded objects. This list is called the <b>link map</b> and is represented as a <code>struct link_map</code> in Linux.</li>
</ol>
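<p>A process can inspect its own debug structure the same way a tracer would inspect a target. The sketch below assumes Linux with glibc and a dynamically linked executable; it finds <code>DT_DEBUG</code> by scanning <code>_DYNAMIC</code> (declared in <code>link.h</code>) and then walks the link map. The helper names <code>find_r_debug</code> and <code>print_link_map</code> are made up for illustration, and a tracer would have to read the same data out of another process (for example with <code>ptrace</code>) rather than dereference pointers directly.</p>
<pre class="prettyprint lang-c">
#define _GNU_SOURCE
#include <link.h>
#include <stdio.h>

/* Find the r_debug structure through the DT_DEBUG entry of .dynamic. */
struct r_debug *find_r_debug(void)
{
    ElfW(Dyn) *entry;

    for (entry = _DYNAMIC; entry->d_tag != DT_NULL; entry++) {
        if (entry->d_tag == DT_DEBUG)
            return (struct r_debug *)entry->d_un.d_ptr;
    }
    return NULL;
}

/* Walk the link map, printing each object; returns how many were seen. */
int print_link_map(const struct r_debug *debug)
{
    const struct link_map *map;
    int count = 0;

    for (map = debug->r_map; map != NULL; map = map->l_next) {
        printf("%#lx  %s\n", (unsigned long)map->l_addr,
               map->l_name[0] ? map->l_name : "(no name)");
        count++;
    }
    return count;
}

int main(void)
{
    struct r_debug *debug = find_r_debug();

    if (debug == NULL) {
        fprintf(stderr, "no DT_DEBUG entry (statically linked?)\n");
        return 1;
    }
    /* r_brk is the address a tracer would set its breakpoint on. */
    printf("r_brk = %#lx, consistent = %d\n",
           (unsigned long)debug->r_brk, debug->r_state == RT_CONSISTENT);
    printf("%d objects loaded\n", print_link_map(debug));
    return 0;
}
</pre>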
<h2>Tie it all together and bring it home</h2>
<p>To add support for libdl loaded libraries to <code>ltrace</code>, the steps are:</p>
<ol>
<li>Find the address of the debug structure in the <code>.dynamic</code> section of the program.</li>
<li>Set a software breakpoint on <code>brk</code>.</li>
<li>When the dynamic linker updates the link map, it will trigger the software breakpoint.</li>
<li>When the breakpoint is triggered, check <code>state</code> in the debug structure.</li>
<li>If a new library has been added, walk the link map and figure out what was added.</li>
<li>Search the added library&#8217;s symbol table for the symbols we care about.</li>
<li>Set software breakpoints on whatever is found.</li>
<li>Steps 3-7 repeat.</li>
</ol>
<p>That isn&#8217;t too hard, all thanks to the dynamic linker providing a way for us to hook into its internal events.</p>
<h2>Conclusion</h2>
<ul>
<li>Read the System V ABI for your CPU. It is filled with insanely useful information that can help you be a better programmer.</li>
<li>Use the source. A few times while hacking on this patch I looked through the source for GDB and glibc to help me figure out what was going on.</li>
<li>Understanding how things work at a low-level can help you build tools to solve your high-level problems.</li>
</ul>
<p>Thanks for reading and don&#8217;t forget to <a rel="alternate" type="application/rss+xml" href="http://feeds.feedburner.com/TimeToBleed">subscribe (via RSS or e-mail)</a> and <a href="http://twitter.com/joedamato">follow me on twitter.</a></p>
<h2>References</h2>
<ol class="footnotes"><li id="footnote_0_1058" class="footnote"><a href="http://www.x86-64.org/documentation/abi.pdf">System V Application Binary Interface AMD64 Architecture Processor Supplement, p 78</a></li><li id="footnote_1_1058" class="footnote"><a href="http://refspecs.freestandards.org/elf/elf.pdf">Executable and Linking Format (ELF) Specification</a></li></ol>]]></content:encoded>
			<wfw:commentRss>http://timetobleed.com/extending-ltrace-to-make-your-rubypythonperlphp-apps-faster/feed/</wfw:commentRss>
		<slash:comments>19</slash:comments>
		</item>
		<item>
		<title>Useful kernel and driver performance tweaks for your Linux server</title>
		<link>http://timetobleed.com/useful-kernel-and-driver-performance-tweaks-for-your-linux-server/</link>
		<comments>http://timetobleed.com/useful-kernel-and-driver-performance-tweaks-for-your-linux-server/#comments</comments>
		<pubDate>Tue, 28 Jul 2009 10:20:21 +0000</pubDate>
		<dc:creator>Joe Damato</dc:creator>
				<category><![CDATA[linux]]></category>
		<category><![CDATA[monitoring]]></category>
		<category><![CDATA[systems]]></category>
		<category><![CDATA[BIOS]]></category>
		<category><![CDATA[kernel]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[scaling]]></category>
		<category><![CDATA[system health]]></category>
		<category><![CDATA[x86]]></category>
		<category><![CDATA[x86_64]]></category>

		<guid isPermaLink="false">http://timetobleed.com/?p=1000</guid>
		<description><![CDATA[This article is going to address some kernel and driver tweaks that are interesting and useful. We use several of these in production with excellent performance, but you should proceed with caution and do research prior to trying anything listed below. Tickless System The tickless kernel feature allows for on-demand timer interrupts. This means that [...]]]></description>
			<content:encoded><![CDATA[<p><center><img src="http://timetobleed.com/images/kernel.jpg" width="400" height="300"/></center><br />

<p>This article is going to address some kernel and driver tweaks that are interesting and useful. We use several of these in production with <i>excellent</i> performance, but you should <b>proceed with caution</b> and do research <b>prior to trying anything listed below.</b></p>
<h2>Tickless System</h2>
<p>The tickless kernel feature allows for on-demand timer interrupts. This means that during idle periods, fewer timer interrupts will fire, which should lead to power savings, cooler running systems, and fewer useless context switches.</p>
<p><b>Kernel option:</b> CONFIG_NO_HZ=y</p>
<h2>Timer Frequency</h2>
<p>You can select the rate at which timer interrupts in the kernel will fire. When a timer interrupt fires on a CPU, the process running on that CPU is interrupted while the timer interrupt is handled. Reducing the rate at which the timer fires allows for fewer interruptions of your running processes.  This option is particularly useful for servers with multiple CPUs where processes are not running interactively.</p>
<p><b>Kernel options:</b> CONFIG_HZ_100=y  and CONFIG_HZ=100</p>
<h2>Connector</h2>
<p>The connector module is a kernel module which reports process events such as <code>fork</code>, <code>exec</code>, and <code>exit</code> to userland. This is <b>extremely</b> useful for process monitoring. You can build a simple system (or use an existing one like <a href="http://god.rubyforge.org/">god</a>) to watch mission-critical processes. If the processes die due to a signal (like <code>SIGSEGV</code> or <code>SIGBUS</code>) or exit unexpectedly, you&#8217;ll get an asynchronous notification from the kernel. The processes can then be restarted by your monitor, keeping downtime to a minimum when unexpected events occur.</p>
<p><b>Kernel options:</b> CONFIG_CONNECTOR=y and CONFIG_PROC_EVENTS=y</p>
<h2>TCP segmentation offload (TSO)</h2>
<p>A popular feature among newer NICs is TCP segmentation offload (TSO). This feature allows the kernel to offload the work of dividing large packets into smaller packets to the NIC. This frees up the CPU to do more useful work and reduces the amount of overhead that the CPU passes along the bus. If your NIC supports this feature, you can enable it with <code>ethtool</code>:</p>
<p><pre class="prettyprint lang-sh">
[joe@timetobleed]% sudo ethtool -K eth1 tso on
</pre>
</p>
<p></p>
<p>Let&#8217;s quickly verify that this worked:</p>
<pre class="prettyprint lang-sh">
[joe@timetobleed]% sudo ethtool -k eth1
Offload parameters for eth1:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: on
large receive offload: off

[joe@timetobleed]% dmesg | tail -1
[892528.450378] 0000:04:00.1: eth1: TSO is Enabled
</pre>
</p>
<p></p>
<h2>Intel I/OAT DMA Engine</h2>
<p>This kernel option enables the Intel I/OAT DMA engine that is present in recent Xeon CPUs. This option increases network throughput as the DMA engine allows the kernel to offload network data copying from the CPU to the DMA engine. This frees up the CPU to do more useful work.</p>
<p>Check to see if it&#8217;s enabled:</p>
<p>
<pre class="prettyprint lang-sh">
[joe@timetobleed]% dmesg | grep ioat
ioatdma 0000:00:08.0: setting latency timer to 64
ioatdma 0000:00:08.0: Intel(R) I/OAT DMA Engine found, 4 channels, device version 0x12, driver version 3.64
ioatdma 0000:00:08.0: irq 56 for MSI/MSI-X
</pre>
<p></p>
<p>There&#8217;s also a sysfs interface where you can get some statistics about the DMA engine. Check the directories under <code>/sys/class/dma/</code>.</p>
<p><b>Kernel options</b>: CONFIG_DMADEVICES=y and CONFIG_INTEL_IOATDMA=y and CONFIG_DMA_ENGINE=y and CONFIG_NET_DMA=y and CONFIG_ASYNC_TX_DMA=y</p>
</p>
<h2>Direct Cache Access (DCA)</h2>
<p>Intel&#8217;s I/OAT also includes a feature called Direct Cache Access (DCA). DCA allows a driver to warm a CPU cache. A few NIC drivers support DCA; the most popular (to my knowledge) is the Intel 10GbE driver (<code>ixgbe</code>). Refer to your NIC driver documentation to see if your NIC supports DCA. To enable DCA, a switch in the BIOS must be flipped. Some vendors supply machines that support DCA, but don&#8217;t expose a switch for DCA. If that is the case, see my last blog post for how to <a href="http://timetobleed.com/enabling-bios-options-on-a-live-server-with-no-rebooting/">enable DCA manually</a>.</p>
<p>You can check if DCA is enabled:</p>
<p>
<pre class="prettyprint lang-sh">
[joe@timetobleed]% dmesg | grep dca
dca service started, version 1.8
</pre>
<p></p>
<p>If DCA is possible on your system but disabled you&#8217;ll see:</p>
<p>
<pre class="prettyprint lang-sh">
ioatdma 0000:00:08.0: DCA is disabled in BIOS
</pre>
<p>This means you&#8217;ll need to enable it in the BIOS or manually.</p>
<p><b>Kernel option:</b> CONFIG_DCA=y</p>
<h2>NAPI</h2>
<p>The &#8220;New API&#8221; (NAPI) is a rework of the packet processing code in the kernel to improve performance for high speed networking. NAPI provides two major features<sup>1</sup>:</p>
<blockquote><p><b>Interrupt mitigation:</b> High-speed networking can create thousands of interrupts per second, all of which tell the system something it already knew: it has lots of packets to process. NAPI allows drivers to run with (some) interrupts disabled during times of high traffic, with a corresponding decrease in system load.</p>
<p><b>Packet throttling:</b> When the system is overwhelmed and must drop packets, it&#8217;s better if those packets are disposed of before much effort goes into processing them. NAPI-compliant drivers can often cause packets to be dropped in the network adaptor itself, before the kernel sees them at all. </p></blockquote>
<p>Many recent NIC drivers automatically support NAPI, so you don&#8217;t need to do anything. Some drivers need you to explicitly specify NAPI in the kernel config or on the command line when compiling the driver. If you are unsure, check your driver documentation. A good place to look for docs is in your kernel source under Documentation, available on the web here: <a href="http://lxr.linux.no/linux+v2.6.30/Documentation/networking/">http://lxr.linux.no/linux+v2.6.30/Documentation/networking/</a> but <b>be sure to select the correct kernel version, first!</b></p>
<p><b>Older e1000 drivers (newer drivers, do nothing)</b>: <code>make CFLAGS_EXTRA=-DE1000_NAPI install</code></p>
<h2>Throttle NIC Interrupts</h2>
<p>Some drivers allow the user to specify the rate at which the NIC will generate interrupts. The <code>e1000e</code> driver allows you to pass a command line option <code>InterruptThrottleRate</code> when loading the module with <code>insmod</code>. For the <code>e1000e</code> there are two dynamic interrupt throttle mechanisms, specified on the command line as 1 (dynamic) and 3 (dynamic conservative). The adaptive algorithm classifies traffic into different classes and adjusts the interrupt rate appropriately. The difference between dynamic and dynamic conservative is the rate for the &#8220;Lowest Latency&#8221; traffic class: dynamic (1) has a much more aggressive interrupt rate for this traffic class.</p>
<p>As always, check your driver documentation for more information.</p>
<p><b>With insmod:</b><code> insmod e1000e.ko InterruptThrottleRate=1</code></p>
<h2>Process and IRQ affinity</h2>
<p>Linux allows the user to specify which CPUs processes and interrupt handlers are bound to.</p>
<ul>
<li><b>Processes</b> You can use <code>taskset</code> to specify which CPUs a process can run on</li>
<li><b>Interrupt Handlers</b> The interrupt map can be found in /proc/interrupts, and the affinity for each interrupt can be set in the file smp_affinity in the directory for each interrupt under /proc/irq/</li>
</ul>
<p>This is useful because you can pin the interrupt handlers for your NICs to specific CPUs so that when a shared resource is touched (a lock in the network stack) and loaded to a CPU cache, the next time the handler runs, it will be put on the <i>same</i> CPU avoiding costly cache invalidations that can occur if the handler is put on a different CPU.</p>
<p>Better still, improvements of up to <b>24%</b> have been reported<sup>2</sup> when processes and the IRQs for the NICs the processes get data from are pinned to the same CPUs. Doing this ensures that the data loaded into the CPU cache by the interrupt handler can be used (without invalidation) by the process; extremely high cache locality is achieved.</p>
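<p>Both halves of this can be sketched in a few lines of C. The example below is only an illustration with made-up helper names: <code>smp_affinity_mask</code> builds the hexadecimal bitmask you would write to an interrupt&#8217;s <code>smp_affinity</code> file (shown here for CPUs 0 to 31 only), and <code>pin_self_to_first_allowed_cpu</code> does for the calling process what <code>taskset</code> does from the shell, using <code>sched_setaffinity</code>.</p>
<pre class="prettyprint lang-c">
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Build the bitmask for a list of CPU numbers (valid for CPUs 0-31). */
unsigned long smp_affinity_mask(const int *cpus, int n)
{
    unsigned long mask = 0;
    int i;

    for (i = 0; i < n; i++)
        mask |= 1UL << cpus[i];
    return mask;
}

/* Pin the calling process to the first CPU it is already allowed to
 * run on. Returns that CPU number, or -1 on failure. */
int pin_self_to_first_allowed_cpu(void)
{
    cpu_set_t allowed, target;
    int cpu;

    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;
    for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &allowed))
            break;
    }
    if (cpu == CPU_SETSIZE)
        return -1;

    CPU_ZERO(&target);
    CPU_SET(cpu, &target);
    if (sched_setaffinity(0, sizeof(target), &target) != 0)
        return -1;
    return cpu;
}

int main(void)
{
    int cpus[] = {0, 1};

    printf("mask for CPUs 0 and 1: %lx\n", smp_affinity_mask(cpus, 2));
    printf("pinned to CPU %d\n", pin_self_to_first_allowed_cpu());
    return 0;
}
</pre>
<p>Picking a CPU from the current affinity set (rather than hard-coding CPU 0) keeps the sketch working inside containers or cpusets where CPU 0 may be off limits.</p>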
<h2>oprofile</h2>
<p>oprofile is a system wide profiler that can profile both kernel and application level code. There is a kernel driver for oprofile which collects data from the x86&#8217;s Model Specific Registers (MSRs) to give very detailed information about the performance of running code. oprofile can also <b>annotate source code</b> with performance information to make fixing bottlenecks easy. See oprofile&#8217;s <a href="http://oprofile.sourceforge.net/examples/">homepage</a> for more information.</p>
<p><b>Kernel options:</b> CONFIG_OPROFILE=y and CONFIG_HAVE_OPROFILE=y</p>
<h2><code>epoll</code></h2>
<p><code>epoll(7)</code> is useful for applications which must watch for events on large numbers of file descriptors. The <code>epoll</code> interface is designed to easily scale to large numbers of file descriptors. <code>epoll</code> is <b>already enabled in most recent kernels</b>, but some strange distributions (which will remain nameless) have this feature disabled.</p>
<p><b>Kernel option:</b> CONFIG_EPOLL=y</p>
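<p>If you have never used the interface, the basic calls look like this. This is a minimal sketch, not a server: it watches a single pipe instead of thousands of sockets, and <code>epoll_pipe_demo</code> is a made-up name. It assumes a kernel new enough to have <code>epoll_create1</code>.</p>
<pre class="prettyprint lang-c">
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Register the read end of a pipe with epoll, make it readable, and
 * return the number of ready descriptors epoll_wait() reports
 * (or -1 if any step fails). */
int epoll_pipe_demo(void)
{
    int pipefd[2];
    int epfd;
    int nready = -1;
    struct epoll_event ev = { .events = EPOLLIN };
    struct epoll_event ready[8];

    if (pipe(pipefd) != 0)
        return -1;

    epfd = epoll_create1(0);
    if (epfd >= 0) {
        ev.data.fd = pipefd[0];
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == 0 &&
            write(pipefd[1], "x", 1) == 1) {
            /* Wait up to one second for the pipe to become readable. */
            nready = epoll_wait(epfd, ready, 8, 1000);
        }
        close(epfd);
    }
    close(pipefd[0]);
    close(pipefd[1]);
    return nready;
}

int main(void)
{
    printf("ready descriptors: %d\n", epoll_pipe_demo());
    return 0;
}
</pre>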
<h2>Conclusion</h2>
<ul>
<li>There are a lot of useful levers that can be pulled when trying to squeeze every last bit of performance out of your system</li>
<li>It is <b>extremely</b> important to read and understand your hardware documentation if you hope to reach the maximum throughput your system can achieve</li>
<li>You can find documentation for your kernel online at the <a href="http://lxr.linux.no/linux+v2.6.30/Documentation/">Linux LXR</a>. <b>Make sure to select the correct kernel version</b> because docs change as the source changes!</li>
</ul>
<p>Thanks for reading and don&#8217;t forget to <a href="http://feeds.feedburner.com/TimeToBleed" rel="alternate" type="application/rss+xml">subscribe (via RSS or e-mail)</a> and <a href="http://twitter.com/joedamato">follow me on twitter.</a></p>
<h2>References</h2>
<ol class="footnotes"><li id="footnote_0_1000" class="footnote"><a href="http://www.linuxfoundation.org/en/Net:NAPI">http://www.linuxfoundation.org/en/Net:NAPI</a></li><li id="footnote_1_1000" class="footnote"><a href="http://software.intel.com/en-us/articles/improved-linux-smp-scaling-user-directed-processor-affinity/">http://software.intel.com/en-us/articles/improved-linux-smp-scaling-user-directed-processor-affinity/</a></li></ol>]]></content:encoded>
			<wfw:commentRss>http://timetobleed.com/useful-kernel-and-driver-performance-tweaks-for-your-linux-server/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Fix a bug in Ruby&#8217;s configure.in and get a ~30% performance boost.</title>
		<link>http://timetobleed.com/fix-a-bug-in-rubys-configurein-and-get-a-30-performance-boost/</link>
		<comments>http://timetobleed.com/fix-a-bug-in-rubys-configurein-and-get-a-30-performance-boost/#comments</comments>
		<pubDate>Tue, 05 May 2009 08:20:29 +0000</pubDate>
		<dc:creator>Joe Damato</dc:creator>
				<category><![CDATA[bugfix]]></category>
		<category><![CDATA[debugging]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[monitoring]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[scaling]]></category>
		<category><![CDATA[testing]]></category>
		<category><![CDATA[x86]]></category>
		<category><![CDATA[debug]]></category>
		<category><![CDATA[kernel]]></category>
		<category><![CDATA[patch]]></category>
		<category><![CDATA[patches]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[strace]]></category>
		<category><![CDATA[syscall]]></category>
		<category><![CDATA[systems]]></category>
		<category><![CDATA[threading]]></category>
		<category><![CDATA[threads]]></category>

		<guid isPermaLink="false">http://timetobleed.com/?p=615</guid>
		<description><![CDATA[Special thanks&#8230; Going out to Jake Douglas for pushing the initial investigation and getting the ball rolling. The whole --enable-pthread thing Ask any Ruby hacker how to easily increase performance in a threaded Ruby application and they&#8217;ll probably tell you: Yo dude&#8230; Everyone knows you need to configure Ruby with --disable-pthread. And it&#8217;s true; configure [...]]]></description>
			<content:encoded><![CDATA[<p><center><img src="http://timetobleed.com/images/ruby_bug.jpg"/></center><br />
</p>
<p>
<h2>Special thanks&#8230;</h2>
<p>Going out to <a href="http://twitter.com/jakedouglas">Jake Douglas</a> for pushing the initial investigation and getting the ball rolling.</p>
<p><h2>The whole <code>--enable-pthread</code> thing</h2>
<p>Ask any Ruby hacker how to easily increase performance in a threaded Ruby application and they&#8217;ll probably tell you:<br />
<b><br />
Yo dude&#8230; <i>Everyone</i> knows you need to <code>configure</code> Ruby with <code>--disable-pthread</code>.<br />
</b><br />
And it&#8217;s true; <code>configure</code> Ruby with <code>--disable-pthread</code> and you get a ~30% performance boost. But&#8230; <b><i>why?</i></b></p>
<p> For this, we&#8217;ll have to turn to our handy tool <a href="http://timetobleed.com/hello-world/">strace</a>. We&#8217;ll also need a simple Ruby program to test this. How about something like this:</p>
<p>
<pre class="prettyprint lang-rb">
def make_thread
  Thread.new {
    a = []
    10_000_000.times {
      a << "a"
      a.pop
    }
  }
end

t = make_thread
t1 = make_thread 

t.join
t1.join</pre>
<p></p>
<p>Now, let's run <code>strace</code> on a version of Ruby <code>configure</code>'d with <code>--enable-pthread</code> and point it at our test script. The output from <code>strace</code> looks like this:</p>
<p>
<pre class="prettyprint lang-c">
22:46:16.706136 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706177 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706218 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706259 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000005>
22:46:16.706301 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706342 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706383 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706425 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706466 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004></pre>
<p></p>
<p><b>Pages and pages and pages</b> of <code>sigprocmask</code> system calls (actually, running with <code>strace -c</code>, I get about <b>20,054,180</b> calls to <code>sigprocmask</code>, <b>WOW</b>). Run the <i>same test script</i> against a Ruby built with <code>--disable-pthread</code> and the output does <b>not</b> have pages and pages of <code>sigprocmask</code> calls (only <b>3</b> calls, a <b>HUGE</b> reduction).
</p>
<p><h2>OK, so let's just set a breakpoint in GDB... right?</h2>
<p>OK, so we should just be able to set a <code>breakpoint</code> on <code>sigprocmask</code> and figure out who is calling it.</p>
<p><b>Well, not exactly.</b> You can try it, but the breakpoint <b>won't trigger</b> (we'll see why a little bit later).</p>
<p>Hrm, that kinda sucks and is confusing. This will make it harder to track down who is calling <code>sigprocmask</code> in the threaded case.</p>
<p> Well, we know that when you run <code>configure</code> the script creates a <code>config.h</code> with a bunch of <code>define</code>s that Ruby uses to decide which functions to use for what. So let's compare <code>./configure --enable-pthread</code> with <code>./configure --disable-pthread</code>:</p>
<pre class="prettyprint lang-bsh">
[joe@mawu:/home/joe/ruby]% diff config.h config.h.pthread
> #define _REENTRANT 1
> #define _THREAD_SAFE 1
> #define HAVE_LIBPTHREAD 1
> #define HAVE_NANOSLEEP 1
> #define HAVE_GETCONTEXT 1
> #define HAVE_SETCONTEXT 1</pre>
</p>
<p>
<br />
OK, now if we <code>grep</code> the Ruby source code, we see that whenever <code>HAVE_[SG]ETCONTEXT</code> are set, Ruby uses the library functions <code>setcontext()</code> and <code>getcontext()</code> to save and restore state for context switching and for exception handling (via the <code>EXEC_TAG</code>). </p>
<p>What about when <code>HAVE_[SG]ETCONTEXT</code> are <b>not</b> <code>define</code>'d? Well in that case, Ruby uses <code>_setjmp/_longjmp</code>.</p>
<p><b>Bingo!</b></p>
<p>That's what's going on! From the <code>_setjmp/_longjmp</code> man page:</p>
<blockquote><p>... The _longjmp()  and  _setjmp()  functions  shall  be  equivalent  to  longjmp() and setjmp(), respectively, with the additional restriction that _longjmp() and _setjmp() shall not manipulate the signal mask...</p></blockquote>
<p>And from the <code>[sg]etcontext</code> man page:</p>
<blockquote><p>... uc_sigmask is the set of signals blocked in this context (see sigprocmask(2)) ...</p></blockquote>
<p>
<br />The issue is that <code>getcontext</code> calls <code>sigprocmask</code> on <b>every invocation</b> but <code>_setjmp</code> does not.</p>
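<p>You can watch <code>getcontext</code> capture the signal mask with a few lines of C. This is a minimal sketch assuming Linux with glibc; <code>context_records_signal_mask</code> is a made-up helper name. It blocks <code>SIGUSR1</code>, takes a context snapshot, and checks whether the snapshot's <code>uc_sigmask</code> knows about the blocked signal, which it can only do by asking the kernel for the current mask.</p>
<pre class="prettyprint lang-c">
#include <signal.h>
#include <stdio.h>
#include <ucontext.h>

/* Block SIGUSR1, take a context snapshot, and report whether the
 * snapshot recorded SIGUSR1 as blocked (1), not blocked (0), or an
 * error occurred (-1). The original mask is restored before returning. */
int context_records_signal_mask(void)
{
    sigset_t block, previous;
    ucontext_t uc;
    int recorded = -1;

    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, &previous) != 0)
        return -1;

    if (getcontext(&uc) == 0)
        recorded = sigismember(&uc.uc_sigmask, SIGUSR1);

    sigprocmask(SIG_SETMASK, &previous, NULL);
    return recorded;
}

int main(void)
{
    printf("uc_sigmask contains SIGUSR1: %d\n",
           context_records_signal_mask());
    return 0;
}
</pre>
<p>Run it under <code>strace</code> and you should see a mask query for the <code>getcontext</code> call in addition to the two explicit <code>sigprocmask</code> calls.</p>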
<p><b>BUT WAIT</b> if that's true why didn't <code>GDB</code> hit a <code>sigprocmask</code> breakpoint before?</p>
<p><h2>x86_64 assembly FTW, again</h2>
</p>
<p>
Let's fire up <code>gdb</code> and figure out this breakpoint-not-breaking thing. First, let's start by disassembling <code>getcontext</code> (snipped for brevity):<br />
<code><br />
(gdb) p getcontext<br />
$1 = {<text variable, no debug info>} 0x7ffff7825100 <getcontext><br />
(gdb) disas getcontext<br />
...<br />
0x00007ffff782517f <getcontext+127>:	mov    $0xe,%rax<br />
0x00007ffff7825186 <getcontext+134>:	syscall<br />
...<br />
</code></p>
<p>Yeah, that's pretty weird. I'll explain why in a minute, but let's look at the disassembly of <code>sigprocmask</code> first:<br />
<code><br />
(gdb) p sigprocmask<br />
$2 = {<text variable, no debug info>} 0x7ffff7817340 <__sigprocmask><br />
(gdb) disas sigprocmask<br />
...<br />
0x00007ffff7817383 <__sigprocmask+67>:	mov    $0xe,%rax<br />
0x00007ffff7817388 <__sigprocmask+72>:	syscall<br />
...<br />
</code><br />
Yeah, this is a bit confusing, but here's the deal.</p>
<p>
Recent Linux kernels use a shiny new method for making system calls: the <code>syscall/sysret</code> instructions on x86_64 (32-bit x86 has the similar <code>sysenter/sysexit</code>). This new way was created because the old way (<code>int $0x80</code>) turned out to be pretty slow. So the CPU vendors created some new instructions to execute system calls without such huge overhead.</p>
<p> All you need to know right now (I'll try to blog more about this in the future) is that the <code>%rax</code> register holds the system call number. The <code>syscall</code> instruction transfers control to the kernel and the kernel figures out which syscall you wanted by checking the value in <code>%rax</code>. Let's just make sure that <code>sigprocmask</code> is actually 0xe:</p>
<pre class="prettyprint lang-c">
[joe@pluto:/usr/include]% grep -Hrn "sigprocmask" asm-x86_64/unistd.h
asm-x86_64/unistd.h:44:#define __NR_rt_sigprocmask                     14</pre>
<p>
<br />
<b>Bingo. It's calling <code>sigprocmask</code> (albeit a bit obscurely).</b></p>
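<p>The same trick is available from C without any assembly. The sketch below (assuming Linux with glibc; <code>read_signal_mask_raw</code> is a made-up name) uses glibc's generic <code>syscall()</code> function to issue <code>rt_sigprocmask</code> by number, with the same arguments that showed up in the <code>strace</code> output. The <code>sigprocmask</code> function is never called, so a breakpoint on it would not fire here either.</p>
<pre class="prettyprint lang-c">
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Read the current signal mask with a raw rt_sigprocmask system call.
 * SIG_BLOCK with a NULL new set only queries the mask; the last
 * argument is the size of the kernel's signal set in bytes
 * (8 on x86_64, which is what strace printed). */
long read_signal_mask_raw(unsigned long long *mask_out)
{
    return syscall(SYS_rt_sigprocmask, SIG_BLOCK, NULL,
                   mask_out, sizeof(*mask_out));
}

int main(void)
{
    unsigned long long mask = 0;
    long rv = read_signal_mask_raw(&mask);

    /* On x86_64 the number printed here should match unistd.h above. */
    printf("SYS_rt_sigprocmask = %d\n", (int)SYS_rt_sigprocmask);
    printf("return value %ld, mask %#llx\n", rv, mask);
    return 0;
}
</pre>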
<p>
OK, so <code>getcontext</code> isn't calling <code>sigprocmask</code> directly, instead it replicates a bunch of code that <code>sigprocmask</code> has in its function body. That's why we didn't hit the <code>sigprocmask</code> breakpoint; <code>GDB</code> was going to break if you landed on the address <code>0x7ffff7817340</code> but <b>you didn't</b>. </p>
<p>Instead, <code>getcontext</code> reimplements the wrapper code for <code>sigprocmask</code> itself and <code>GDB</code> is none the wiser. </p>
<p><b>Mystery solved</b>.</p>
<p><h2>The patch</h2>
</p>
<p>
Get it <b><a href="http://github.com/ice799/matzruby/commit/0b9b69f9653782a33aee2b8937d405eae245b60c">HERE</a></b></p>
<p>
The patch works by adding a new configure flag called <code>--disable-ucontext</code> that lets you specifically prevent <code>[sg]etcontext</code> from being called. You <b>use this in conjunction with</b> <code>--enable-pthread</code>, like this:<br />
<code><br />
./configure --disable-ucontext --enable-pthread</code><br />
<br />
After you build Ruby configured like that, its performance is on par with (and sometimes slightly faster than) Ruby built with <code>--disable-pthread</code>, for about a 30% performance boost when compared to <code>--enable-pthread</code>.</p>
<p>I added the switch because I wanted to preserve the original Ruby behavior: if you just pass <code>--enable-pthread</code> <b>without</b> <code>--disable-ucontext</code>, Ruby will do the old thing and generate piles of <code>sigprocmask</code> calls.</p>
<h2>Conclusion</h2>
<ol>
<li> Things aren't always what they seem - GDB may lie to you. Be careful. </li>
<li> Use the source, Luke. Libraries can do unexpected things, debug builds of libc can help!</li>
<li> I know I keep saying this, assembly is useful. Start learning it today!</li>
</ol>
<p>
If you enjoyed this blog post, consider <a href="http://feeds.feedburner.com/TimeToBleed" rel="alternate" type="application/rss+xml">subscribing (via RSS)</a> or <a href="http://twitter.com/joedamato">following (via twitter)</a>.</p>
<p><b>You'll want to stay tuned; <a href="http://twitter.com/tmm1">tmm1</a> and I have been on a roll the past week. Lots of cool stuff coming out!</b></p>
]]></content:encoded>
			<wfw:commentRss>http://timetobleed.com/fix-a-bug-in-rubys-configurein-and-get-a-30-performance-boost/feed/</wfw:commentRss>
		<slash:comments>43</slash:comments>
		</item>
		<item>
		<title>5 Things You Don&#8217;t Know About User IDs That Will Destroy You</title>
		<link>http://timetobleed.com/5-things-you-dont-know-about-user-ids-that-will-destroy-you/</link>
		<comments>http://timetobleed.com/5-things-you-dont-know-about-user-ids-that-will-destroy-you/#comments</comments>
		<pubDate>Mon, 13 Apr 2009 17:06:33 +0000</pubDate>
		<dc:creator>Joe Damato</dc:creator>
				<category><![CDATA[monitoring]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[systems]]></category>
		<category><![CDATA[testing]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[privilege escalation]]></category>
		<category><![CDATA[privileges]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[vulnerability]]></category>

		<guid isPermaLink="false">http://timetobleed.com/?p=419</guid>
		<description><![CDATA[*nix user and group IDs are complicated, confusing, and often misused. Look at this code snippet from the popular Ruby project, Starling: def drop_privileges &#160; Process.egid = options&#91;:group&#93; if options&#91;:group&#93; &#160; Process.euid = options&#91;:user&#93; if options&#91;:user&#93; end At quick first glance, you might think this code looks OK. But you&#8217;d be wrong. Let&#8217;s take a [...]]]></description>
			<content:encoded><![CDATA[<p><center><img src="http://timetobleed.com/images/jail.jpg"/></center><br />
</p>
<p>
*nix user and group IDs are complicated, confusing, and often misused. Look at this code snippet from the popular Ruby project, <a href="http://github.com/starling/starling/blob/e958961c4d92e8c30d23ee4c6759021748d7578c/lib/starling/server_runner.rb#L184">Starling</a>: </p>
<div class="dean_ch" style="white-space: wrap;">
<span class="kw1">def</span> drop_privileges<br />
&nbsp; <span class="kw4">Process</span>.<span class="me1">egid</span> = options<span class="br0">&#91;</span><span class="re3">:group</span><span class="br0">&#93;</span> <span class="kw1">if</span> options<span class="br0">&#91;</span><span class="re3">:group</span><span class="br0">&#93;</span><br />
&nbsp; <span class="kw4">Process</span>.<span class="me1">euid</span> = options<span class="br0">&#91;</span><span class="re3">:user</span><span class="br0">&#93;</span> <span class="kw1">if</span> options<span class="br0">&#91;</span><span class="re3">:user</span><span class="br0">&#93;</span><br />
<span class="kw1">end</span></div>
<p></p>
<p>At quick first glance, you might think this code looks OK. But you&#8217;d be <b>wrong</b>.</p>
<p>Let&#8217;s take a look at 5 things you <i>probably</i> don&#8217;t know about user and group IDs that can lead you to your downfall.</p>
<ol>
<h2>
<li>The difference between real, effective, and saved IDs</li>
</h2>
<p>This is always a bit confusing, but without a solid understanding of this concept you are doomed later.</p>
<ul>
<li>Real ID &#8211; The real ID identifies the user who started the process; a new process inherits it from the process that created it. So, let&#8217;s say you log in to your box as <i>joe</i>; your shell is then launched with its real ID set to <i>joe</i>. All processes you start from your shell will inherit the real ID <i>joe</i> as their real ID.</li>
<li>Effective ID &#8211; The effective ID is the ID that the system uses to determine whether a process can take a particular action. There are two popular ways to change your effective ID:</li>
<ul>
<li><code>su</code> &#8211; the <code>su</code> program changes your effective, real, and saved IDs to the ID of the user you are switching to.</li>
<li>set ID upon execute (abbreviated setuid) &#8211; You can mark a program&#8217;s <i>set uid upon execute bit</i> so that the program runs with its effective and saved ID set to the <i>owner</i> of the program (which may not necessarily be you). The real ID will remain untouched. For example, if you have a program:
<p>
<div class="dean_ch" style="white-space: wrap;">
&#8230;<br />
<span class="me1">rv</span> = getresuid<span class="br0">&#40;</span>&amp;ruid, &amp;euid, &amp;suid<span class="br0">&#41;</span>;<br />
&#8230;<br />
<a href="http://www.opengroup.org/onlinepubs/009695399/functions/printf.html"><span class="kw3">printf</span></a><span class="br0">&#40;</span><span class="st0">&quot;ruid %d, euid %d, suid %d<span class="es0">\n</span>&quot;</span>, ruid, euid, suid<span class="br0">&#41;</span>;</div>
</p>
<p>
<p>If you then <code>chown</code> the program to <code>root</code> and <code>chmod +s</code> it (which turns on the setuid bit), the program will print:</p>
<pre>ruid 1000, euid 0, suid 0</pre>
</p>
<p>
<p>
when it is run (assuming your user ID is 1000).</p>
</li>
</ul>
<li>Saved ID &#8211; The saved ID is set to the effective ID when the program starts. This exists so that a program can regain its original effective ID after it drops its effective ID to an unprivileged ID. This use-case can cause problems (as we&#8217;ll see soon) if it is not correctly managed.</li>
</ul>
<ul>
<li>If you start a program as yourself, and it does not have its <i>set ID upon execute bit</i> set, then the program will start running with its real, effective, and saved IDs set to your user ID.</li>
<li>If you run a setuid program, your real ID remains unchanged, but your effective and saved IDs are set to the owner of the file.</li>
<li><code>su</code> does the same as running a setuid program, but it also changes your real ID.</li>
</ul>
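<p>To see all three IDs for a running Ruby process, here is a minimal sketch. Ruby exposes <code>Process.uid</code> and <code>Process.euid</code>, but as far as I know it has no wrapper for <code>getresuid</code>, so the sketch assumes Linux and reads the <code>Uid:</code> line of <code>/proc/self/status</code>, which lists the real, effective, saved, and filesystem IDs in that order. The helper name <code>parse_uid_line</code> is made up for this example.</p>

```ruby
# Sketch only: assumes Linux, where /proc/self/status has a line like
#   "Uid:\t1000\t1000\t1000\t1000"  (real, effective, saved, filesystem)
# parse_uid_line is a hypothetical helper, not part of Ruby.
def parse_uid_line(status_text)
  line = status_text.lines.find { |l| l.start_with?("Uid:") }
  return nil unless line

  # Fields 1..3 are the real, effective, and saved user IDs.
  real, effective, saved = line.split[1, 3].map { |field| Integer(field) }
  { real: real, effective: effective, saved: saved }
end

if File.readable?("/proc/self/status")
  ids = parse_uid_line(File.read("/proc/self/status"))
  puts "ruid #{ids[:real]}, euid #{ids[:effective]}, suid #{ids[:saved]}" if ids
else
  # No /proc available: Ruby can still report the real and effective IDs.
  puts "ruid #{Process.uid}, euid #{Process.euid}"
end
```

<p>Run from a normal shell, all three values should match your own user ID; run from a setuid binary, the real ID should differ from the other two.</p>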
<h2>
<li>Don&#8217;t use Process.euid= in Ruby; stay as far away as possible</li>
</h2>
<ul>
<li>Process.euid= is <b>EXTREMELY platform specific</b>. It might do any of the following:</p>
<ul>
<li>Set just your effective ID</li>
<li>Set your effective, real, and saved ID.</li>
</ul>
<p>On most recent Linux kernels, Process.euid= changes <b><u>ONLY</u></b> the Effective ID. In most cases, this is <u><b>NOT</b></u> what you want. Check out <a href="https://gist.github.com/4acbe14306e86001e193">this sample Ruby script</a>. What would happen if you ran this script as root? </li>
<div class="dean_ch" style="white-space: wrap;">
<span class="kw3">require</span> <span class="st0">&#39;etc&#39;</span><br />
&nbsp;<br />
<span class="kw1">def</span> write_file<br />
&nbsp; <span class="kw1">begin</span><br />
&nbsp; &nbsp; <span class="kw4">File</span>.<span class="kw3">open</span><span class="br0">&#40;</span><span class="st0">&quot;/test&quot;</span>, <span class="st0">&quot;w+&quot;</span><span class="br0">&#41;</span> <span class="kw1">do</span> |f|<br />
&nbsp; &nbsp; &nbsp; f.<span class="me1">write</span><span class="br0">&#40;</span><span class="st0">&quot;hello!<span class="es0">\n</span>&quot;</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; f.<span class="me1">close</span><br />
&nbsp; &nbsp; <span class="kw1">end</span><br />
&nbsp; &nbsp; <span class="kw3">puts</span> <span class="st0">&quot;wrote test file&quot;</span><br />
&nbsp; <span class="kw1">rescue</span> <span class="re2">Errno::EACCES</span><br />
&nbsp; &nbsp; <span class="kw3">puts</span> <span class="st0">&quot;could not write test file&quot;</span><br />
&nbsp; <span class="kw1">end</span><br />
<span class="kw1">end</span><br />
&nbsp;<br />
<span class="kw3">puts</span> <span class="st0">&quot;ok, set uid to nobody&quot;</span><br />
<span class="kw4">Process</span>.<span class="me1">euid</span> = Etc.<span class="me1">getpwnam</span><span class="br0">&#40;</span><span class="st0">&quot;nobody&quot;</span><span class="br0">&#41;</span>.<span class="me1">uid</span><br />
&nbsp;<br />
<span class="kw3">puts</span> <span class="st0">&quot;going to try to write to / now&#8230;&quot;</span><br />
&nbsp;<br />
write_file<br />
&nbsp;<br />
<span class="kw3">puts</span> <span class="st0">&quot;restoring back to root&quot;</span><br />
&nbsp;<br />
<span class="kw4">Process</span>.<span class="me1">euid</span> = <span class="nu0">0</span><br />
&nbsp;<br />
<span class="kw3">puts</span> <span class="st0">&quot;now writing file&quot;</span><br />
&nbsp;<br />
write_file</div>
<p></p>
<p>This might surprise you, but the script <b>regains <code>root</code>&#8217;s ID</b> after it has dropped itself down to <code>nobody</code>.
</p>
<li> Why does this work? </li>
<p>Well, as we just said, Process.euid= doesn&#8217;t touch the Saved ID, only the Effective ID. <b>As a result, the effective ID can be set back to the saved ID at any time. The only way to avoid this is to call a different Ruby function, as we&#8217;ll see in #4 below.</b>
</ul>
<h2>
<li>Buggy native code running as <code>nobody</code> can execute arbitrary code as <code>root</code> in 8 bytes</li>
</h2>
<ul>
<li>Imagine a Ruby script much like the one above. The script is run as <code>root</code> to do something special (maybe bind to port 80).</li>
<li>The process then drops privileges to <code>nobody</code>.</li>
<li>Afterward, your application interacts with buggy native code in the Ruby interpreter, a Ruby extension, or a Ruby gem.</li>
<li>If that buggy native code can be &#8220;tricked&#8221; into executing arbitrary code, a malicious user can elevate the process from <code>nobody</code> to <code>root</code> <b>in just 8 bytes.</b> Those 8 bytes are: <i>\x31\xdb\x8d\x43\x17\x99\xcd\x80</i> &#8211; which is x86_32 Linux machine code for setuid(0).</li>
<li>At this point, a malicious user can execute <i>arbitrary code</i> as the <i><code>root</code> user</i>.</li>
</ul>
<p>Let&#8217;s take a look at an (abbreviated) code snippet (full version <a href="http://gist.github.com/92980">here</a>):</p>
<div class="dean_ch" style="white-space: wrap;">
<span class="co1">## we&#8217;re using a buggy gem</span><br />
<span class="kw3">require</span> <span class="st0">&#8216;badgem&#8217;</span></p>
<p><span class="co1"># do some special operations here as the privileged user</span><br />
&#8230;</p>
<p><span class="co1"># ok, now let&#8217;s (incorrectly) drop to nobody</span><br />
<span class="kw4">Process</span>.<span class="me1">euid</span> = Etc.<span class="me1">getpwnam</span><span class="br0">&#40;</span><span class="st0">&quot;nobody&quot;</span><span class="br0">&#41;</span>.<span class="me1">uid</span></p>
<p><span class="co1"># let&#8217;s take some user input</span><br />
s = <span class="re2">MyModule::GetUserInput</span></p>
<p><span class="co1"># let&#8217;s assume the user is malicious and supplies something like:</span><br />
<span class="co1"># &quot;\x6a\x17\x58\x31\xdb\xcd\x80\x6a\x0b\x58\x99\x52&quot; +</span><br />
<span class="co1"># &quot;\x68//sh\x68/bin\x89\xe3\x52\x53\x89\xe1\xcd\x80&quot;</span><br />
<span class="co1"># as the string. </span><br />
<span class="co1"># That string is x86_32 linux shellcode for running</span><br />
<span class="co1"># setuid(0); and execve(&quot;/bin/sh&quot;, 0, 0) !</span></p>
<p><span class="co1"># pass that to a buggy Ruby Gem</span><br />
BadGem::bad<span class="br0">&#40;</span>s<span class="br0">&#41;</span></p>
<p><span class="co1"># the user is now sitting in a root shell!!</span></div>
<p>
<p>
	This is obviously <b><u>NOT GOOD.</u></b>
</p>
<h2>
<li> How to change the real, effective, and saved IDs</li>
</h2>
<p>In the list below, I&#8217;m going to list the functions as <code>syscall - RubyFunction</code></p>
<ul>
<li><code>setuid(uid_t uid) - Process::Sys.setuid(integer)</code></li>
<p>When called with a privileged effective ID, this pair of functions sets the real, effective, and saved user IDs to the value passed in; an unprivileged caller can only change its effective ID. This is a useful function for permanently dropping privileges, as we&#8217;ll see soon. <b>This is a POSIX function. Use this when possible.</b></p>
<li><code>setresuid(uid_t ruid, uid_t euid, uid_t suid) - Process::Sys.setresuid(rid, eid, sid)</code></li>
<p>This pair of functions allows you to set the real, effective, saved User IDs to arbitrary values, assuming you have a privileged effective ID. Unfortunately, this function <b>is NOT POSIX</b> and is <b>not portable</b>. It does exist on Linux and some BSDs, though.</p>
<li><code>setreuid(uid_t ruid, uid_t eid) - Process::Sys.setreuid(rid, eid)</code></li>
<p>This pair of functions allows you to set the real and effective user IDs to the values passed in. On Linux:
<ul>
<li>A process running with an unprivileged effective ID will only have the ability to set the real ID to the real ID or to the effective ID.</li>
<li>The saved ID is set to the new effective ID <b>if</b> the real ID is set at all, or the effective ID is set to a value other than the previous real ID.</li>
</ul>
<p><b>This is a POSIX function, but has lots of cases with undefined behavior. Be careful.</b></p>
<li><code>seteuid(uid_t eid) - Process::Sys.seteuid(eid)</code></li>
<p>This pair of functions sets the effective ID of the process but leaves the <b>real and saved IDs unchanged. IMPORTANT: Any process (including those with unprivileged effective IDs) may change their effective ID to their real or saved ID.</b> This is exactly the behavior we saw with the Ruby script in #2 above.<b> This is a POSIX function.</b>
</ul>
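<p>To make the difference between <code>seteuid</code> and <code>setuid</code> concrete, here is a toy model of the rules described above. It only tracks three integers and never touches the real process, so it runs without root. The <code>Creds</code> struct and the UID 65534 for <code>nobody</code> are assumptions made for the example, and the model deliberately ignores <code>setreuid</code>, <code>setresuid</code>, and Linux capabilities.</p>

```ruby
# Simplified model of Linux user ID rules, for illustration only.
# Creds is a hypothetical struct; nothing here changes real process IDs.
Creds = Struct.new(:ruid, :euid, :suid) do
  def privileged?
    euid == 0
  end

  # seteuid(2): only the effective ID changes. An unprivileged process may
  # pick its real, effective, or saved ID.
  def seteuid(uid)
    unless privileged? || [ruid, euid, suid].include?(uid)
      raise Errno::EPERM
    end
    self.euid = uid
  end

  # setuid(2): a privileged process sets all three IDs. An unprivileged
  # process may only change its effective ID, to its real or saved ID.
  def setuid(uid)
    if privileged?
      self.ruid = self.euid = self.suid = uid
    elsif [ruid, suid].include?(uid)
      self.euid = uid
    else
      raise Errno::EPERM
    end
  end
end

nobody = 65_534 # assumed UID for "nobody"; it varies between systems

temporary = Creds.new(0, 0, 0)
temporary.seteuid(nobody)   # the saved ID is still 0 ...
temporary.seteuid(0)        # ... so root comes straight back
puts "after seteuid round trip: #{temporary.to_a.inspect}"

permanent = Creds.new(0, 0, 0)
permanent.setuid(nobody)    # real, effective, and saved are all nobody now
begin
  permanent.setuid(0)
rescue Errno::EPERM
  puts "after setuid: no way back to root"
end
```

<p>The first half mirrors the script in #2: because the saved ID never changes, the drop is only temporary. The second half is the behavior you want from a permanent drop.</p>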
<h2>
<li>How to correctly and permanently drop privileges</li>
</h2>
<p>You should use either the:</p>
<ul>
<li><code>setuid(uid_t uid) - Process::Sys.setuid(integer)</code></li>
<p>
or</p>
<li><code>setresuid(uid_t ruid, uid_t euid, uid_t suid) - Process::Sys.setresuid(rid, eid, sid)</code></li>
</ul>
<p>pair of functions to set the real, effective, and saved IDs to the lowest privileged ID possible. On many systems, this is the ID of the user <code>nobody</code>. </p>
<p>For the truly paranoid, it is recommended to check that dropping privileges <i>was actually</i> successful before continuing. For example:</p>
<div class="dean_ch" style="white-space: wrap;">
<span class="kw3">require</span> <span class="st0">&#8216;etc&#8217;</span></p>
<p><span class="kw1">def</span> test_drop<br />
&nbsp; <span class="kw1">begin</span><br />
&nbsp; &nbsp; <span class="re2">Process::Sys</span>.<span class="me1">setuid</span><span class="br0">&#40;</span><span class="nu0">0</span><span class="br0">&#41;</span><br />
&nbsp; <span class="kw1">rescue</span> <span class="re2">Errno::EPERM</span><br />
&nbsp; &nbsp; <span class="kw2">true</span><br />
&nbsp; <span class="kw1">else</span><br />
&nbsp; &nbsp; <span class="kw2">false</span><br />
&nbsp; <span class="kw1">end</span><br />
<span class="kw1">end</span></p>
<p>uid = Etc.<span class="me1">getpwnam</span><span class="br0">&#40;</span><span class="st0">&quot;nobody&quot;</span><span class="br0">&#41;</span>.<span class="me1">uid</span><br />
<span class="re2">Process::Sys</span>.<span class="me1">setuid</span><span class="br0">&#40;</span>uid<span class="br0">&#41;</span></p>
<p><span class="kw1">if</span> !test_drop<br />
&nbsp; <span class="kw3">puts</span> <span class="st0">&quot;Failed!&quot;</span><br />
&nbsp; <span class="co1">#handle error</span><br />
<span class="kw1">end</span></div>
</ol>
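<p>The snippet at the very top of this article also changed the group ID, so here is a hedged sketch of a permanent drop that handles both. The one ordering rule it illustrates: change the group <i>before</i> the user, because once the user ID is unprivileged the group change is no longer allowed. <code>drop_privileges</code> is a made-up helper name, and the optional <code>sys</code> argument exists only so the call order can be exercised with a stand-in object instead of a real root process. The sketch does not reset supplementary groups, which a real daemon would likely also need to do.</p>

```ruby
require 'etc'

# Sketch only: drop_privileges is a hypothetical helper, not a Ruby API.
# sys defaults to Process::Sys; passing another object makes the call order
# testable without root.
def drop_privileges(uid, gid, sys = Process::Sys)
  # Group first: once the user ID is unprivileged, setgid would fail.
  sys.setgid(gid)
  sys.setuid(uid)

  # Same idea as test_drop: trying to get root back must fail with EPERM.
  begin
    sys.setuid(0)
  rescue Errno::EPERM
    return true
  end
  raise "privileges were not dropped"
end

# Usage, early in a process that was started as root:
#   nobody = Etc.getpwnam("nobody")
#   drop_privileges(nobody.uid, nobody.gid)
```

<p>Raising when the check fails, instead of returning <code>false</code>, means a caller that forgets to look at the return value still cannot carry on running as <code>root</code>.</p>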
<h2>Conclusion</h2>
<p>*nix user and group ID management is confusing and extremely error prone, with many nuances, gotchas, and caveats. It is no wonder so many people make mistakes when trying to write secure code. The major things to keep in mind from this article are:</p>
<ul>
<li>Avoid Process.euid= at <b>all costs.</b></li>
<li>Drop privileges as soon as possible in your application.</li>
<li>Drop those privileges <i>permanently</i>.</li>
<li>Ensure that privileges were correctly dropped.</li>
<li><b>Carefully</b> read and re-read <code>man</code> pages when using the functions listed above.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://timetobleed.com/5-things-you-dont-know-about-user-ids-that-will-destroy-you/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
		<item>
		<title>a/b test mallocs against your memory footprint</title>
		<link>http://timetobleed.com/ab-test-mallocs-against-your-memory-footprint/</link>
		<comments>http://timetobleed.com/ab-test-mallocs-against-your-memory-footprint/#comments</comments>
		<pubDate>Tue, 17 Mar 2009 01:39:42 +0000</pubDate>
		<dc:creator>Joe Damato</dc:creator>
				<category><![CDATA[debugging]]></category>
		<category><![CDATA[monitoring]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[scaling]]></category>
		<category><![CDATA[systems]]></category>
		<category><![CDATA[testing]]></category>
		<category><![CDATA[allocator]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[malloc]]></category>
		<category><![CDATA[memory]]></category>
		<category><![CDATA[profiling]]></category>
		<category><![CDATA[system health]]></category>

		<guid isPermaLink="false">http://timetobleed.com/?p=317</guid>
		<description><![CDATA[The other day at Kickball Labs we were discussing whether linking Ruby against tcmalloc (or ptmalloc3, nedmalloc, or any other malloc) would have any noticeable effect on application latency. After taking a side in the argument, I started wondering how we could test this scenario. We had a couple different ideas about testing: Look at [...]]]></description>
			<content:encoded><![CDATA[<p><center><img src="http://timetobleed.com/images/brain.jpg"/></center><br />
</p>
<p>The other day at Kickball Labs we were discussing whether linking Ruby against tcmalloc (or ptmalloc3, nedmalloc, or any other malloc) would have any noticeable effect on application latency. After taking a side in the argument, I started wondering how we could test this scenario.</p>
<p>We had a couple different ideas about testing:</p>
<ul>
<li><b>Look at other people&#8217;s benchmarks</b><br />BUT do the memory workloads tested in the benchmarks actually match our own workload at all?</li>
<li><b>Run different allocators on different Ruby backends</b><br />BUT different backends will get different users who will use the system differently and cause different allocation patterns</li>
<li><b>Try to recreate our application&#8217;s memory footprint and test that against different mallocs</b><br />
BUT how?</li>
</ul>
<p>I decided to explore <strong>the last option</strong> and came up with an interesting solution. Let&#8217;s dive into how to do this.</p>
<h2>Get the code:</h2>
<p><a href="http://github.com/ice799/malloc_wrap/tree/master">http://github.com/ice799/malloc_wrap/tree/master</a><br />
</p>
<h2>Step 1: We need to get a memory footprint of our process</h2>
<p>So we have some random binary  (in this case it happens to be a Ruby interpreter, but it could be anything) and we&#8217;d like to track when it calls malloc/realloc/calloc and free (from now on I&#8217;ll refer to all of these as malloc-family for brevity). There are two ways to do this, the right way and the wrong/hacky/unsafe way.</p>
<ul>
<li>
<h3>The &#8220;right&#8221; way to do this, with libc malloc hooks:</h3>
<p>Edit your application code to use the malloc debugging hooks provided by libc. When a malloc-family function is called, your hook executes and outputs to a file which function was called and what arguments were passed to it.</li>
<li>
<h3>The &#8220;wrong/hacky/unsafe&#8221; way to do this, with LD_PRELOAD:</h3>
<p>Create a shim library and point LD_PRELOAD at it. The shim exports the malloc-family symbols, and when your application calls one of those functions, the shim code gets executed. The shim logs which function was called and with what arguments. The shim then calls the libc version of the function (so that memory is actually allocated/freed) and returns control to the application.</li>
</ul>
<p>I chose to do it <strong>the second way</strong>, because I like living on the edge. <strong>The second way is unsafe because you can&#8217;t call any functions which use a malloc-family function before your hooks are set up. If you do, the shim can recurse infinitely and crash the application.</strong></p>
<p>You can check out my implementation for the shim library here: <a href="http://github.com/ice799/malloc_wrap/blob/master/malloc_wrap.c">malloc_wrap.c</a></p>
<h3> Why does your shim output such weirdly formatted data?</h3>
<p>The answer is sort of complicated, but let&#8217;s keep it simple: I originally had a different idea about how I was going to use the output. When that first try failed, I tried something else and translated the data to the format I needed, instead of rewriting the shim. What can I say, I&#8217;m a lazy programmer.</p>
<p>OK, so once you&#8217;ve built the shim (<b>gcc -O2 -Wall -fPIC -shared -o malloc_wrap.so malloc_wrap.c -ldl</b>), you can launch your binary like this:</p>
<div class="dean_ch" style="white-space: wrap;">% <span class="re2">LD_PRELOAD=</span>/path/to/shim/malloc_wrap.so /path/to/your/binary -your -args</div>
<p>You should now see output in /tmp/malloc-footprint.PID (PID being the process ID of your binary).</p>
<h2>Step 2: Translate the data into a more usable format</h2>
<p>Yeah, I should have gone back and rewritten the shim, but nothing happens exactly as planned. So, I wrote a quick Ruby script to convert my output into a more usable format. The script sorts through the output and renames memory addresses to unique integer IDs starting at 1 (0 is hardcoded to NULL).</p>
<p>The format is pretty simple. The first line of the file has the number of calls to malloc-family functions, followed by a blank line, and then the memory footprint. Each line of the memory footprint has 1 character which represents the function called followed by a few arguments. For the free() function, there is only one argument, the ID of the memory block to free. malloc/calloc/realloc have different arguments, but the first argument following the one character is always the ID of the return value. The next arguments are the arguments usually passed to malloc/calloc/realloc in the same order.</p>
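<p>The renaming idea can be sketched in a few lines of Ruby. This is <b>not</b> the actual conversion script; it assumes the raw data has already been parsed into arrays of the form <code>[function, returned address, arguments...]</code> (or <code>[:free, address]</code>), and the one-letter codes <code>m</code>, <code>c</code>, <code>r</code>, and <code>f</code> are made up for the example.</p>

```ruby
# Sketch only: the record layout and the one-letter codes (m, c, r, f) are
# assumptions for illustration, not the format of the real converter.
def build_trace(records)
  ids = { 0 => 0 }                  # address 0 (NULL) is always ID 0
  next_id = 0
  # Hand out 1, 2, 3, ... the first time each address is seen.
  id_for = lambda { |addr| ids[addr] ||= (next_id += 1) }

  lines = records.map do |func, *args|
    case func
    when :free    then "f #{id_for.call(args[0])}"
    when :malloc  then "m #{id_for.call(args[0])} #{args[1]}"
    when :calloc  then "c #{id_for.call(args[0])} #{args[1]} #{args[2]}"
    when :realloc then "r #{id_for.call(args[0])} #{id_for.call(args[1])} #{args[2]}"
    else raise ArgumentError, "unknown function #{func}"
    end
  end

  # Header: number of calls, then a blank line, then the footprint itself.
  [records.length.to_s, "", *lines].join("\n") + "\n"
end

records = [
  [:malloc, 0x7f00_1000, 64],                 # returned address, size
  [:calloc, 0x7f00_2000, 4, 16],              # returned address, nmemb, size
  [:realloc, 0x7f00_3000, 0x7f00_1000, 128],  # returned address, old pointer, size
  [:free, 0x7f00_2000],
]
print build_trace(records)
```

<p>Replacing addresses with small integers means a replayer can keep its live blocks in a plain array indexed by ID, whatever addresses the allocator under test happens to return.</p>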
<p>Have a look at my ruby script here: <a href="http://github.com/ice799/malloc_wrap/blob/master/build_trace_file.rb">build_trace_file.rb</a></p>
<p>It might take a while to convert your data to this format, so I suggest running the script in a <a href="http://www.gnu.org/software/screen/">screen</a> session, especially if your memory footprint data is large. Just as a warning, we collected 15 *gigabytes* of data over a 10 hour period. This script took *10 hours* to convert the data. We ended up with a 7.8 gigabyte file.</p>
<div class="dean_ch" style="white-space: wrap;">% ruby /path/to/script/build_trace_file.rb /path/to/raw/malloc-footprint.PID /path/to/converted/my-memory-footprint</div>
<h2>Step 3: Replay the allocation data with different allocators and measure time and memory usage</h2>
<p>OK, so we now have a file which represents the memory footprint of our application. It&#8217;s time to build the replayer, link against your malloc implementation of choice, fire it up and start measuring time spent in allocator functions and memory usage.</p>
<p>Have a look at the replayer here: <a href="http://github.com/ice799/malloc_wrap/blob/master/alloc_tester.c">alloc_tester.c</a><br />
Build the replayer: <b>gcc -ggdb -Wall -fPIC -o tester alloc_tester.c -ldl</b></p>
<h3>Use ltrace</h3>
<p>ltrace is similar to <a href="http://timetobleed.com/hello-world/">strace</a>, but for library calls. You can use ltrace -c to sum the amount of time spent in different library calls and output a cool table at the end. It will look something like this:</p>
<pre>
% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
 86.70   37.305797          62    600003 fscanf
 10.64    4.578968          33    138532 malloc
  2.36    1.014294          18     55263 free
  0.25    0.109550          18      5948 realloc
  0.03    0.011407          45       253 printf
  0.02    0.010665          42       252 puts
  0.00    0.000167          20         8 calloc
  0.00    0.000048          48         1 fopen
------ ----------- ----------- --------- --------------------
100.00   43.030896                800260 total</pre>
<h2>Conclusion</h2>
<p>Using a different malloc implementation can improve speed or memory usage, depending on your allocation patterns. Hopefully the code provided will help you test different allocators to determine whether or not swapping out the default libc allocator is the right choice for you. Our results are still pending; we had a lot of allocator data (15g!) and it takes several hours to replay the data with just one malloc implementation. Once we&#8217;ve gathered some data about the different implementations and their effects, I&#8217;ll post the results and some analysis. As always, stay tuned and thanks for reading!</p>
]]></content:encoded>
			<wfw:commentRss>http://timetobleed.com/ab-test-mallocs-against-your-memory-footprint/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>It&#8217;s 10PM: Do you know your RAID/BBU/consistency status?</title>
		<link>http://timetobleed.com/its-10pm-do-you-know-your-raid-status/</link>
		<comments>http://timetobleed.com/its-10pm-do-you-know-your-raid-status/#comments</comments>
		<pubDate>Mon, 12 Jan 2009 01:28:42 +0000</pubDate>
		<dc:creator>Joe Damato</dc:creator>
				<category><![CDATA[monitoring]]></category>
		<category><![CDATA[scaling]]></category>
		<category><![CDATA[systems]]></category>
		<category><![CDATA[failure recovery]]></category>
		<category><![CDATA[RAID]]></category>
		<category><![CDATA[storage]]></category>
		<category><![CDATA[system health]]></category>

		<guid isPermaLink="false">http://timetobleed.com/?p=192</guid>
		<description><![CDATA[Huh? RAID status? Consistency status? The status of your RAID array tells you if your RAID array has degraded and which disk(s) are the culprit. Most RAID statuses will include more information like temperature, installed memory amount, and more. You also need to run consistency checks to ensure that data on bad blocks will either [...]]]></description>
			<content:encoded><![CDATA[<p><center><img src="http://timetobleed.com/images/raid.gif" alt="" width="400" height="300" /></center></p>
<h2>Huh? RAID status? Consistency status?</h2>
<p>The status of your RAID array tells you if your RAID array has degraded and which disk(s) are the culprit. Most RAID statuses will include more information like temperature, installed memory amount, and more.</p>
<p>You also need to run consistency checks to ensure that data on bad blocks will either be moved or rewritten to good blocks. Why is this important? Consider the following scenario: You have a RAID 10 array. One disk dies, say disk A of stripe set 1. You now replace that disk and start a rebuild of the array. You never ran a consistency check and it turns out that there were bad blocks on disk B of stripe set 1 that were never reallocated to good blocks. When data is written to the replacement disk, disk B may not be able to read data from its bad blocks. Corrupt data then gets written to the replacement disk and <strong>you likely won&#8217;t notice a problem until the box crashes or you are missing data due to corruption.</strong></p>
<h3>Whoa that is pretty serious. How can I keep track of all that?</h3>
<p>The two common failure notifications for a logical failure I&#8217;ve seen are alarms and RAID status changes.</p>
<p>In my opinion, alarms are generally useless unless you are sitting near your server. What good is an alarm if you don&#8217;t hear it? While I wouldn&#8217;t rely on an alarm as the first line of defense against a RAID failure, it can definitely grab the attention of a nearby tech in the data center when a problem arises.</p>
<p>RAID status changes are probably the most useful way to determine when a RAID array degrades.</p>
<p>For physical disk failures, you&#8217;ll only find out when a consistency check is run, when you lose data, or when the box dies. Some RAID adapters can be set up to run consistency checks automatically; on others, each check needs to be invoked by hand.</p>
<h2> Speaking of consistency, don&#8217;t forget about that battery backup unit (BBU)!</h2>
<p>A battery backup unit is necessary for a RAID array which has its write cache enabled. If write requests are sitting in the cache when the system loses power, the BBU keeps the cache powered so that the outstanding writes can be synced to the array once power returns. If you have the write cache enabled but don&#8217;t have a BBU when power is lost, the data on the system could be corrupt because the writes in the cache may never have been written to disk.</p>
<h2>How do I check my RAID/BBU status?</h2>
<p>Checking your RAID/BBU status is very vendor specific. Each vendor has their own method, but the most common method by far is to expose a management interface (in the form of a character device) which listens for different queries from userspace via an ioctl interface.</p>
<p>Most hardware RAID vendors include a small binary or script which will send ioctls to the management interface and give you detailed information about the status of your device. I&#8217;ve listed the names of the management apps for Adaptec and 3ware RAID devices below and included a sample output from an aacraid device at the bottom of this post.</p>
<p>Adaptec aacraid &#8211; /usr/StorMan/arcconf</p>
<p>3ware RAID &#8211; /usr/bin/tw_cli</p>
<p>You can write a script that runs as a cron job, parses the output of the management binary, and sends an email/page when a status change occurs.</p>
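<p>As a rough sketch of the parsing half of such a script, the Ruby below turns <code>Key : Value</code> lines, like the ones in the aacraid sample at the end of this post, into a hash and flags anything that is not healthy. The helper names <code>parse_status</code> and <code>problems</code> are made up, the set of fields checked is my own guess at what matters, and how the alert gets mailed or paged is left out entirely.</p>

```ruby
# Sketch only: parse_status and problems are hypothetical helpers. The parser
# assumes "Key : Value" lines like the arcconf sample at the end of this post.
def parse_status(text)
  text.each_line.with_object({}) do |line, fields|
    key, value = line.split(" : ", 2)
    next unless value                   # separators and section titles
    fields[key.strip] ||= value.strip   # keep the first occurrence of a key
  end
end

# Returns a list of human-readable problems; empty means healthy.
# A missing field counts as a problem, so a format change is not silent.
def problems(fields)
  found = []
  controller = fields["Controller Status"]
  battery = fields["Status"]
  defunct = fields["Defunct disk drive count"]
  _total, failed, degraded = fields.fetch("Logical devices/Failed/Degraded", "").split("/")

  found.push("controller: #{controller}") unless controller == "Optimal"
  found.push("battery: #{battery}") unless battery == "Optimal"
  found.push("defunct drives: #{defunct}") unless defunct == "0"
  unless [failed, degraded] == ["0", "0"]
    found.push("logical devices failed/degraded: #{failed}/#{degraded}")
  end
  found
end

# In a cron job the text would come from the management binary; a few lines
# of the sample output are inlined here so the sketch runs anywhere.
sample = [
  "   Controller Status                        : Optimal",
  "   Defunct disk drive count                 : 0",
  "   Logical devices/Failed/Degraded          : 1/0/0",
  "   Status                                   : Optimal",
].join("\n")

found = problems(parse_status(sample))
if found.empty?
  puts "RAID OK"
else
  puts "RAID ALERT: #{found.join(', ')}" # hand this to your mailer or pager
end
```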
<h2>How can I run consistency checks?</h2>
<p>This is also incredibly vendor specific. The consistency check can usually be run/scheduled via the CLI. You should check the documentation for the CLI tool. With an aacraid controller, a consistency check can be run by using the datascrub command:</p>
<p>/usr/StorMan/arcconf datascrub 1 period 10</p>
<p>This will perform a consistency check in the background that has 10 days to complete.</p>
<h2>How can I protect myself from a single disk failure?</h2>
<p>There are <strong>many </strong>different RAID configurations, but the most common ones which can protect you from a single disk failure are:</p>
<ul>
<li>RAID 1</li>
<li>RAID 5</li>
<li>RAID 6</li>
<li>RAID 10</li>
</ul>
<h2>What about a multiple disk failure?</h2>
<p>Again, there are many different RAID configurations, but there are two major ways to survive multiple disk failure. Unfortunately, one way involves being <em>really</em> lucky.</p>
<ul>
<li>RAID 10 &#8211; You have to be pretty lucky here. As long as there is one working disk on each stripe set, you should be OK.</li>
<li>Double Parity RAID 6 &#8211; This configuration can survive a failure of any two disks.</li>
</ul>
<h2>Conclusion</h2>
<p>Read your RAID device documentation carefully and follow any relevant suggestions. If you don&#8217;t have RAID status monitoring set up, do it now. The minimal time investment to set this up can save you down the road when a hardware failure occurs.</p>
<p>You should also run a consistency check as soon as possible and schedule further checks at regular intervals. Check your RAID docs for more info about how to run a consistency check.</p>
<p><strong>Sample output from an aacraid device that doesn&#8217;t have consistency checks running:</strong></p>
<p>sudo /usr/StorMan/arcconf getconfig 1 AD</p>
<pre>----------------------------------------------------------------------
Controller information
----------------------------------------------------------------------
   Controller Status                        : Optimal
   Channel description                      : SAS/SATA
   Controller Model                         : Adaptec 3405
   Controller Serial Number                 : 7C391118F8E
   Physical Slot                            : 2
   Temperature                              : 43 C/ 109 F (Normal)
   Installed memory                         : 128 MB
   Copyback                                 : Disabled
   Background consistency check             : Disabled
   Automatic Failover                       : Enabled
   Defunct disk drive count                 : 0
   Logical devices/Failed/Degraded          : 1/0/0
   --------------------------------------------------------
   Controller Version Information
   --------------------------------------------------------
   BIOS                                     : 5.2-0 (15753)
   Firmware                                 : 5.2-0 (15753)
   Driver                                   : 1.1-5 (2456)
   Boot Flash                               : 5.2-0 (15753)
   --------------------------------------------------------
   Controller Battery Information
   --------------------------------------------------------
   Status                                   : Optimal
   Over temperature                         : No
   Capacity remaining                       : 100 percent
   Time remaining (at current draw)         : 3 days, 1 hours, 31 minutes</pre>
]]></content:encoded>
			<wfw:commentRss>http://timetobleed.com/its-10pm-do-you-know-your-raid-status/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
