<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>time to bleed by Joe Damato &#187; scaling</title>
	<atom:link href="http://timetobleed.com/category/scaling/feed/" rel="self" type="application/rss+xml" />
	<link>http://timetobleed.com</link>
	<description>technical ramblings from a wanna-be unix dinosaur</description>
	<lastBuildDate>Tue, 20 Jul 2010 21:03:03 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Garbage Collection and the Ruby Heap (from railsconf)</title>
		<link>http://timetobleed.com/garbage-collection-and-the-ruby-heap-from-railsconf/</link>
		<comments>http://timetobleed.com/garbage-collection-and-the-ruby-heap-from-railsconf/#comments</comments>
		<pubDate>Tue, 08 Jun 2010 16:38:20 +0000</pubDate>
		<dc:creator>Joe Damato</dc:creator>
				<category><![CDATA[debugging]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[scaling]]></category>
		<category><![CDATA[systems]]></category>
		<category><![CDATA[x86]]></category>
		<category><![CDATA[debug]]></category>
		<category><![CDATA[garbage collection]]></category>
		<category><![CDATA[GC]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[ltrace]]></category>
		<category><![CDATA[memory]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[profiling]]></category>
		<category><![CDATA[x86_64]]></category>

		<guid isPermaLink="false">http://timetobleed.com/?p=1787</guid>
		<description><![CDATA[Download as PDF (15mb) Garbage Collection and the Ruby Heap]]></description>
			<content:encoded><![CDATA[<p><a style="float:right" href="http://dl.dropbox.com/u/1681973/gc-railsconf.pdf">Download as PDF (15mb)</a><br />
<a title="View Garbage Collection and the Ruby Heap on Scribd" href="http://www.scribd.com/doc/32718051/Garbage-Collection-and-the-Ruby-Heap" style="margin: 12px auto 6px auto; font-family: Helvetica,Arial,Sans-serif; font-style: normal; font-variant: normal; font-weight: normal; font-size: 14px; line-height: normal; font-size-adjust: none; font-stretch: normal; -x-system-font: none; display: block; text-decoration: underline;">Garbage Collection and the Ruby Heap</a> <object id="doc_179903367382288" name="doc_179903367382288" height="600" width="100%" type="application/x-shockwave-flash" data="http://d1.scribdassets.com/ScribdViewer.swf" style="outline:none;" ><param name="movie" value="http://d1.scribdassets.com/ScribdViewer.swf"><param name="wmode" value="opaque"><param name="bgcolor" value="#ffffff"><param name="allowFullScreen" value="true"><param name="allowScriptAccess" value="always"><param name="FlashVars" value="document_id=32718051&#038;access_key=key-1hl4d18vocqmc9ilk9a&#038;page=1&#038;viewMode=slideshow"><embed id="doc_179903367382288" name="doc_179903367382288" src="http://d1.scribdassets.com/ScribdViewer.swf?document_id=32718051&#038;access_key=key-1hl4d18vocqmc9ilk9a&#038;page=1&#038;viewMode=slideshow" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" height="600" width="100%" wmode="opaque" bgcolor="#ffffff"></embed></object></p>
]]></content:encoded>
			<wfw:commentRss>http://timetobleed.com/garbage-collection-and-the-ruby-heap-from-railsconf/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Descent into Darkness: Understanding your system&#8217;s binary interface is the only way out</title>
		<link>http://timetobleed.com/descent-into-darkness-understanding-your-systems-binary-interface-is-the-only-way-out/</link>
		<comments>http://timetobleed.com/descent-into-darkness-understanding-your-systems-binary-interface-is-the-only-way-out/#comments</comments>
		<pubDate>Mon, 15 Mar 2010 19:11:19 +0000</pubDate>
		<dc:creator>Joe Damato</dc:creator>
				<category><![CDATA[bugfix]]></category>
		<category><![CDATA[debugging]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[scaling]]></category>
		<category><![CDATA[systems]]></category>
		<category><![CDATA[x86]]></category>
		<category><![CDATA[debug]]></category>
		<category><![CDATA[garbage collection]]></category>
		<category><![CDATA[GC]]></category>
		<category><![CDATA[memory]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[syscall]]></category>
		<category><![CDATA[x86_64]]></category>

		<guid isPermaLink="false">http://timetobleed.com/?p=1602</guid>
		<description><![CDATA[Download as PDF (3mb) Descent into Darkness: Understanding your system&#8217;s binary interface is the only way out.]]></description>
			<content:encoded><![CDATA[<p><a style="float:right" href="http://dl.dropbox.com/u/1681973/abi.pdf">Download as PDF (3mb)</a><br />
<a title="View Descent into Darkness: Understanding your system's binary interface is the only way out. on Scribd" href="http://www.scribd.com/doc/28264000/Descent-into-Darkness-Understanding-your-system-s-binary-interface-is-the-only-way-out" style="margin: 12px auto 6px auto; font-family: Helvetica,Arial,Sans-serif; font-style: normal; font-variant: normal; font-weight: normal; font-size: 14px; line-height: normal; font-size-adjust: none; font-stretch: normal; -x-system-font: none; display: block; text-decoration: underline;">Descent into Darkness: Understanding your system&#8217;s binary interface is the only way out.</a> <object id="doc_50009547124029" name="doc_50009547124029" height="600" width="100%" type="application/x-shockwave-flash" data="http://d1.scribdassets.com/ScribdViewer.swf" style="outline:none;" ><param name="movie" value="http://d1.scribdassets.com/ScribdViewer.swf"><param name="wmode" value="opaque"><param name="bgcolor" value="#ffffff"><param name="allowFullScreen" value="true"><param name="allowScriptAccess" value="always"><param name="FlashVars" value="document_id=28264000&#038;access_key=key-nywmlzldrcxb47d7tv9&#038;page=1&#038;viewMode=slideshow"><embed id="doc_50009547124029" name="doc_50009547124029" src="http://d1.scribdassets.com/ScribdViewer.swf?document_id=28264000&#038;access_key=key-nywmlzldrcxb47d7tv9&#038;page=1&#038;viewMode=slideshow" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" height="600" width="100%" wmode="opaque" bgcolor="#ffffff"></embed></object>	</p>
]]></content:encoded>
			<wfw:commentRss>http://timetobleed.com/descent-into-darkness-understanding-your-systems-binary-interface-is-the-only-way-out/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>EventMachine: scalable non-blocking i/o in ruby</title>
		<link>http://timetobleed.com/eventmachine-scalable-non-blocking-io-in-ruby/</link>
		<comments>http://timetobleed.com/eventmachine-scalable-non-blocking-io-in-ruby/#comments</comments>
		<pubDate>Fri, 12 Mar 2010 20:07:39 +0000</pubDate>
		<dc:creator>Aman Gupta</dc:creator>
				<category><![CDATA[linux]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[scaling]]></category>
		<category><![CDATA[systems]]></category>
		<category><![CDATA[x86]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[x86_64]]></category>

		<guid isPermaLink="false">http://timetobleed.com/?p=1574</guid>
		<description><![CDATA[Download as PDF (40mb) EventMachine: scalable non-blocking i/o in ruby]]></description>
			<content:encoded><![CDATA[<p><a style="float:right" href="http://dl.dropbox.com/u/635/em_export.pdf">Download as PDF (40mb)</a><br />
<a title="View EventMachine: scalable non-blocking i/o in ruby on Scribd" href="http://www.scribd.com/doc/28253878/EventMachine-scalable-non-blocking-i-o-in-ruby" style="margin: 12px auto 6px auto; font-family: Helvetica,Arial,Sans-serif; font-style: normal; font-variant: normal; font-weight: normal; font-size: 14px; line-height: normal; font-size-adjust: none; font-stretch: normal; -x-system-font: none; display: block; text-decoration: underline;">EventMachine: scalable non-blocking i/o in ruby</a> <object id="doc_298923438833050" name="doc_298923438833050" height="600" width="100%" type="application/x-shockwave-flash" data="http://d1.scribdassets.com/ScribdViewer.swf" style="outline:none;" ><param name="movie" value="http://d1.scribdassets.com/ScribdViewer.swf"><param name="wmode" value="opaque"><param name="bgcolor" value="#ffffff"><param name="allowFullScreen" value="true"><param name="allowScriptAccess" value="always"><param name="FlashVars" value="document_id=28253878&#038;access_key=key-1rb2iijpl7bew7i1f04i&#038;page=1&#038;viewMode=slideshow"><embed id="doc_298923438833050" name="doc_298923438833050" src="http://d1.scribdassets.com/ScribdViewer.swf?document_id=28253878&#038;access_key=key-1rb2iijpl7bew7i1f04i&#038;page=1&#038;viewMode=slideshow" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" height="600" width="100%" wmode="opaque" bgcolor="#ffffff"></embed></object></p>
]]></content:encoded>
			<wfw:commentRss>http://timetobleed.com/eventmachine-scalable-non-blocking-io-in-ruby/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Enabling BIOS options on a live server with no rebooting</title>
		<link>http://timetobleed.com/enabling-bios-options-on-a-live-server-with-no-rebooting/</link>
		<comments>http://timetobleed.com/enabling-bios-options-on-a-live-server-with-no-rebooting/#comments</comments>
		<pubDate>Mon, 06 Jul 2009 13:00:16 +0000</pubDate>
		<dc:creator>Joe Damato</dc:creator>
				<category><![CDATA[linux]]></category>
		<category><![CDATA[scaling]]></category>
		<category><![CDATA[systems]]></category>
		<category><![CDATA[x86]]></category>
		<category><![CDATA[BIOS]]></category>
		<category><![CDATA[kernel]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[x86_64]]></category>

		<guid isPermaLink="false">http://timetobleed.com/?p=888</guid>
		<description><![CDATA[This blog post is going to describe a C program that toggles some CPU and chipset registers directly to enable Direct Cache Access without needing a reboot or a switch in the BIOS. A very fun hack to write and investigate. Special thanks&#8230; Special thanks going out to Roman Nurik for helping me make the [...]]]></description>
			<content:encoded><![CDATA[<p><center><img src="http://timetobleed.com/images/bios.gif" height=200 width=300/></center><br />

<p>This blog post is going to describe a C program that toggles some CPU and chipset registers directly to enable Direct Cache Access without needing a reboot or a switch in the BIOS. A very fun hack to write and investigate.</p>
<h2>Special thanks&#8230; </h2>
<p>Special thanks going out to <a href="http://twitter.com/romannurik">Roman Nurik</a> for helping me make the code CSS much, much prettier and easier to read.</p>
<p>Special thanks going out to <a href="http://twitter.com/jakedouglas">Jake Douglas</a> for convincing me that I shouldn&#8217;t use a stupid sensationalist title for this blog article :)</p>
<h2>Intel I/OAT and Direct Cache Access (DCA)</h2>
<p>From the Linux Foundation I/OAT project page<sup>1</sup>:</p>
<blockquote><p>I/OAT (I/O Acceleration Technology) is the name for a collection of techniques by Intel to improve network throughput. The most significant of these is the DMA engine. The DMA engine is meant to offload from the CPU the copying of  [socket buffer] data to the user buffer. This is not a zero-copy receive, but does allow the CPU to do other work while the copy operations are performed by the DMA engine.</p></blockquote>
<p></p>
<p><b>Cool!</b> So by using I/OAT the network stack in the Linux kernel can offload copy operations to increase throughput. I/OAT also includes a feature called Direct Cache Access (DCA) which can <i>deliver data directly into processor caches</i>. This is particularly cool because when a network interrupt arrives and data is copied to system memory, the CPU that accesses this data will <b>not</b> take a cache miss, because DCA has already put the data it needs in the cache. Sick.</p>
<p>Measurements from the Linux Foundation project<sup>2</sup> indicate a 10% reduction in CPU usage, while the Myri-10G NIC website claims they&#8217;ve measured a <i>40%</i> reduction in CPU usage<sup>3</sup>. For more information describing the performance benefits of DCA see this incredibly detailed paper: <a href="http://www.stanford.edu/group/comparch/papers/huggahalli05.pdf">Direct Cache Access for High Bandwidth Network I/O</a>.</p>
<h2>How to get I/OAT and DCA</h2>
<p>To get I/OAT and DCA you need a few things:</p>
<ul>
<li>Intel XEON CPU(s)</li>
<li>NIC(s) with DCA support</li>
<li>A chipset which supports DCA</li>
<li>The <code>ioatdma</code> and <code>dca</code> Linux kernel modules</li>
<li>And last but not least, a switch in your BIOS to turn DCA on</li>
</ul>
<p>That last item can actually be a bit more tricky than it sounds for several reasons:</p>
<ul>
<li>Some BIOSes <i>don&#8217;t expose a way to turn DCA on even though it is supported by the CPU, chipset, and NIC!</i></li>
<li>Your hosting provider may not allow BIOS access</li>
<li>Your system might be up and running and you don&#8217;t want to reboot to enter the BIOS to enable DCA</li>
</ul>
<p><b>Let&#8217;s see what you can do to coerce DCA into working on your system if one of the above applies to you</b>.</p>
<h2>Build <code>ioatdma</code> kernel module</h2>
<p>This is pretty easy, just <code>make menuconfig</code> and toggle I/OAT as a module. You <b>must</b> build it as a module if you cannot or do not want to enable DCA in your BIOS.</p>
<p>The option can be found in <code>Device Drivers -> DMA Engine Support -> Intel I/OAT DMA Support</code>.</p>
<p>Toggling that option will build the <code>ioatdma</code> and <code>dca</code> modules. Build and install the new module.</p>
<h2>Enabling DCA without a reboot or BIOS access: Hack overview</h2>
<p>In order to enable DCA a few special registers need to be touched.</p>
<ul>
<li>The DCA capability bit in the PCI Express Control Register 4 in the configuration space for the PCI bridge your NIC(s) are attached to.</li>
<li>The DCA Model Specific Register on your CPU(s)</li>
</ul>
<p>Let&#8217;s take a closer look at each stage of the hack.</p>
<h2>Enable DCA in PCI Configuration Space</h2>
<p><b>PCI configuration space</b> is a memory region where control registers for PCI devices live. By changing register values, you can enable/disable specific features of that PCI device. The configuration space is addressable if you know the PCI bus, device, and function bits for a specific PCI device and the feature you care about.</p>
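<p>To make the bus/device/function addressing a bit more concrete, here is a minimal sketch of how the legacy x86 PCI &#8220;configuration mechanism #1&#8221; packs those fields into the 32-bit value written to I/O port 0xCF8. The helper name <code>pci_config_address</code> is made up for illustration, and the hack itself never pokes these ports directly; it goes through <code>libpci</code> instead.</p>

```c
#include <stdint.h>

/* Hypothetical helper (not part of dca_force): builds the 32-bit
 * CONFIG_ADDRESS value used by PCI configuration mechanism #1. */
static uint32_t pci_config_address(uint8_t bus, uint8_t dev,
                                   uint8_t func, uint8_t offset)
{
  return (1u << 31)                      /* enable bit */
       | ((uint32_t)bus << 16)           /* bus number: bits 23-16 */
       | ((uint32_t)(dev & 0x1f) << 11)  /* device number: bits 15-11 */
       | ((uint32_t)(func & 0x07) << 8)  /* function number: bits 10-8 */
       | (offset & 0xfcu);               /* dword-aligned register offset */
}
```

<p>As an arbitrary example, bus 0, device 8, function 0, register offset 0x64 encodes to 0x80004064.</p>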
<p>To find the DCA register for the  Intel 5000, 5100, and 7300 chipsets, we need to consult the documentation<sup>4</sup>:</p>
<p><center><img src="http://timetobleed.com/images/dca_pci.png"/></center><br />

<p>Cool, so the register needed lives at offset 0x64. To enable DCA, bit 6 needs to be set to 1.</p>
<p>Toggling these registers can be a bit cumbersome, but luckily there is <code>libpci</code>, which provides some simple APIs for scanning for PCI devices and accessing configuration space registers.</p>
<pre class="prettyprint lang-c">
#define INTEL_BRIDGE_DCAEN_OFFSET   0x64
#define INTEL_BRIDGE_DCAEN_BIT      6
#define PCI_HEADER_TYPE_BRIDGE     1
#define PCI_VENDOR_ID_INTEL        0x8086 /* lol @ intel */
#define PCI_HEADER_TYPE             0x0e
#define MSR_P6_DCA_CAP             0x000001f8

void check_dca(struct pci_dev *dev)
{
  /* read DCA status */
  u32 dca = pci_read_long(dev, INTEL_BRIDGE_DCAEN_OFFSET);

  /* if it's not enabled */
  if (!(dca &#038; (1 << INTEL_BRIDGE_DCAEN_BIT))) {
    printf("DCA disabled, enabling now.\n");

    /* enable it */
    dca |= 1 << INTEL_BRIDGE_DCAEN_BIT;

    /* write it back */
    pci_write_long(dev, INTEL_BRIDGE_DCAEN_OFFSET, dca);
  } else {
    printf("DCA already enabled!\n");
  }
}

int main(void)
{
  struct pci_access *pacc;
  struct pci_dev *dev;
  u8 type;

  pacc = pci_alloc();
  pci_init(pacc);

  /* scan the PCI bus */
  pci_scan_bus(pacc);

  /* for each device */
  for (dev = pacc->devices; dev; dev=dev->next) {
    pci_fill_info(dev, PCI_FILL_IDENT | PCI_FILL_BASES);

    /* if it's an intel device */
    if (dev->vendor_id == PCI_VENDOR_ID_INTEL) {

        /* read the header byte */
        type = pci_read_byte(dev, PCI_HEADER_TYPE);

        /* if it's a PCI bridge (bit 7 is only the multi-function flag), check and enable DCA */
        if ((type &#038; 0x7f) == PCI_HEADER_TYPE_BRIDGE) {
          check_dca(dev);
        }
    }
  }

  msr_dca_enable();
  return 0;
}
</pre>
<h2>Enable DCA in the CPU MSR</h2>
<p>A <b>model specific register (MSR)</b> is a control register that is provided by a CPU to enable a feature that exists on a specific CPU. In this case, we care about the DCA MSR. In order to find its address, let&#8217;s consult the Intel Developer’s Manual 3B<sup>5</sup>.<br />
<center><img src="http://timetobleed.com/images/dca_msr.png"></center></p>
<p></p>
<p>This register lives at address 0x1f8. We just need to set bit 0 to 1 and we should be good to go.</p>
<p>Thankfully, there are device files in <code>/dev</code> for the MSRs of each CPU (provided by the <code>msr</code> kernel module, so load it first if they are missing):</p>
<pre class="prettyprint lang-c">
#define MSR_P6_DCA_CAP      0x000001f8
void msr_dca_enable(void)
{
  char msr_file_name[64];
  int fd = 0, i = 0;
  u64 data;

  /* for each CPU */
  for (;i < NUM_CPUS; i++) {
    sprintf(msr_file_name, "/dev/cpu/%d/msr", i);

    /* open the MSR device file */
    fd = open(msr_file_name, O_RDWR);
    if (fd < 0) {
      perror("open failed!");
      exit(1);
    }

    /* read the current DCA status */
    if (pread(fd, &#038;data, sizeof(data), MSR_P6_DCA_CAP) != sizeof(data)) {
      perror("reading msr failed!");
      exit(1);
    }

    printf("got msr value: %*llx\n", 1, (unsigned long long)data);

    /* if DCA is not enabled */
    if (!(data &#038; 1)) {

      /* enable it */
      data |= 1;

      /* write it back */
      if (pwrite(fd, &#038;data, sizeof(data), MSR_P6_DCA_CAP) != sizeof(data)) {
        perror("writing msr failed!");
        exit(1);
      }
    } else {
      printf("msr already enabled for CPU %d\n", i);
    }

    /* done with this CPU's MSR device file */
    close(fd);
  }
}
</pre>
<h2>Code for the hack is on github</h2>
<p>Get it here: <a href="http://github.com/ice799/dca_force/tree/master">http://github.com/ice799/dca_force/tree/master</a></p>
<h2>Putting it all together to get your speed boost</h2>
<p>
<ol>
<li>Check out the hack from github: <code>git clone git://github.com/ice799/dca_force.git</code></li>
<li>Build the hack: <code>make NUM_CPUS=whatever</code></li>
<li>Run it: <code>sudo ./dca_force</code></li>
<li>Load the kernel module: <code>sudo modprobe ioatdma</code></li>
<li>Check your dmesg: <code>dmesg | tail </code></li>
</ol>
<p>You should see:</p>
<p><pre>
[   72.782249] dca service started, version 1.8
[   72.838853] ioatdma 0000:00:08.0: setting latency timer to 64
[   72.838865] ioatdma 0000:00:08.0: Intel(R) I/OAT DMA Engine found, 4 channels, device version 0x12, driver version 3.64
[   72.904027]   alloc irq_desc for 56 on cpu 0 node 0
[   72.904030]   alloc kstat_irqs on cpu 0 node 0
[   72.904039] ioatdma 0000:00:08.0: irq 56 for MSI/MSI-X
</pre>
</p>
<p></p>
<p>in your dmesg.</p>
<p>You should <b>NOT SEE</b></p>
<p><pre>
[    8.367333] ioatdma 0000:00:08.0: DCA is disabled in BIOS
</pre>
</p>
<p></p>
<p>You can now enjoy the DCA performance boost your BIOS or hosting provider didn't want you to have!</p>
<h2>Conclusion</h2>
<ul>
<li>Intel I/OAT and DCA are pretty cool, and enabling them can give pretty substantial performance wins</li>
<li>Cool features are sometimes stuffed away in the BIOS</li>
<li>If you don't have access to your BIOS, you should ask your provider nicely to do it for you</li>
<li>If your BIOS doesn't have a toggle switch for the feature you need, do a BIOS update</li>
<li>If all else fails and you know what you are doing, you can sometimes pull off nasty hacks like this in userland to get what you want</li>
</ul>
<p>Thanks for reading and don't forget to <a href="http://feeds.feedburner.com/TimeToBleed" rel="alternate" type="application/rss+xml">subscribe (via RSS or e-mail)</a> and <a href="http://twitter.com/joedamato">follow me on twitter.</a></p>
<h2>P.S.</h2>
<p>I know, I know. I skipped Part 2 of the signals post (<a href="http://timetobleed.com/a-few-things-you-didnt-know-about-signals-in-linux-part-1/">here's Part 1</a> if you missed it). Part 2 is coming soon!</p>
<h2>References</h2>
<ol class="footnotes"><li id="footnote_0_888" class="footnote"><a href="http://www.linuxfoundation.org/en/Net:I/OAT">http://www.linuxfoundation.org/en/Net:I/OAT</a></li><li id="footnote_1_888" class="footnote"><a href="http://www.linuxfoundation.org/en/Net:I/OAT">http://www.linuxfoundation.org/en/Net:I/OAT</a></li><li id="footnote_2_888" class="footnote"><a href="http://www.myri.com/serve/cache/626.html">http://www.myri.com/serve/cache/626.html</a></li><li id="footnote_3_888" class="footnote"><a href="http://www.intel.com/assets/pdf/designguide/318086.pdf">Intel® 7300 Chipset Memory Controller Hub (MCH) Datasheet, Section 4.8.12.6</a></li><li id="footnote_4_888" class="footnote"><a href="http://www.intel.com/Assets/PDF/manual/253669.pdf">Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3B: System Programming Guide, Part 2, Appendix B-19</a></li></ol>]]></content:encoded>
			<wfw:commentRss>http://timetobleed.com/enabling-bios-options-on-a-live-server-with-no-rebooting/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Fixing Threads in Ruby 1.8: A 2-10x performance boost</title>
		<link>http://timetobleed.com/fixing-threads-in-ruby-18-a-2-10x-performance-boost/</link>
		<comments>http://timetobleed.com/fixing-threads-in-ruby-18-a-2-10x-performance-boost/#comments</comments>
		<pubDate>Mon, 18 May 2009 10:00:50 +0000</pubDate>
		<dc:creator>Joe Damato</dc:creator>
				<category><![CDATA[bugfix]]></category>
		<category><![CDATA[debugging]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[scaling]]></category>
		<category><![CDATA[systems]]></category>
		<category><![CDATA[x86]]></category>
		<category><![CDATA[fibers]]></category>
		<category><![CDATA[kernel]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[patch]]></category>
		<category><![CDATA[patches]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[threading]]></category>
		<category><![CDATA[threads]]></category>

		<guid isPermaLink="false">http://timetobleed.com/?p=685</guid>
		<description><![CDATA[Quick notes before things get crazy OK, things might get a little crazy in this blog post so let&#8217;s clear a few things up before we get moving. I like the gritty details, and this article in particular has a lot of gritty info. To reduce the length of the article for the casual reader, [...]]]></description>
			<content:encoded><![CDATA[<p><center><img src="http://timetobleed.com/images/ruby-threads.jpg" alt="" width="400" height="300" /></center></p>
<h2>Quick notes before things get crazy</h2>
<p>OK, things might get a little crazy in this blog post so let&#8217;s clear a few things up before we get moving.</p>
<ul>
<li>I like the gritty details, and this article in particular has a lot of gritty info. To reduce the length of the article for the casual reader, I&#8217;ve put a portion of the really gritty stuff in the Epilogue below. Definitely check it out if that is your thing.</li>
<li>This article, the code, and the patches below are for Linux and OSX for the x86 and x86_64 platforms, only.</li>
<li>Even though there are code paths for both x86 and x86_64, I&#8217;m going to use the 64bit register names and (briefly) mention the 64bit binary interface.</li>
<li>Let&#8217;s assume the binary is built with <code>-fno-omit-frame-pointer</code>. The patches don&#8217;t care, but it&#8217;ll make the explanation a bit simpler later.</li>
<li>If you don&#8217;t know what the above two things mean, don&#8217;t worry; I got your back chief.</li>
</ul>
<h2>How threads work in Ruby</h2>
<p>Ruby 1.8 implements pre-emptible userland threads, also known as &#8220;green threads.&#8221; (Want to know more about threading models? See <a href="http://timetobleed.com/threading-models-so-many-different-ways-to-get-stuff-done/">this post</a>.) The major performance killer in Ruby&#8217;s implementation of green threads is that the <strong>entire thread stack is copied</strong> to and from the heap <strong>on every context switch</strong>. Let&#8217;s take a high-level look at what happens when you run:</p>
<pre class="prettyprint lang-rb">a = []
Thread.new{
	10000.times {
		a &lt;&lt; "a"
		a.pop
	}
}</pre>
<p>

<ol>
<li>A thread control block (tcb) is allocated in Ruby.</li>
<li>The infamous thread timer is initialized, either as a pthread or as an itimer.</li>
<li>Ruby scope information is copied to the heap.</li>
<li>The new thread is added to the list of threads.</li>
<li>The current thread is set as the new thread.</li>
<li>rb_thread_yield is called to yield to the block you passed in.</li>
<li>Your block starts executing.</li>
<li>The timer interrupts the executing thread.</li>
<li>The current thread&#8217;s state is stored:
<ul>
<li><code>memcpy()</code> #1 (sometimes): If the stack has grown since the last save, <code>realloc</code> is called. If the allocator cannot extend the size of the current block in place, it may decide to move the data to a new block that is large enough. If that happens <code>memcpy()</code> is called to move the data over.</li>
<li><code>memcpy()</code> #2 (always): A copy of this thread&#8217;s <strong>entire stack</strong> (starting from the top of the interpreter&#8217;s stack) is put on the heap.</li>
</ul>
</li>
<li>The next thread&#8217;s state is restored.
<ul>
<li><code>memcpy()</code> #3 (always): A copy of this thread&#8217;s <strong>entire stack</strong> is placed on the stack.</li>
</ul>
</li>
</ol>
<p>Steps 9 and 10 <strong>crush performance</strong> when even small amounts of Ruby code are executed.</p>
<p>Many of the functions the interpreter uses to evaluate code are <em>massive</em>. They allocate a large number of local variables creating stack frames up to <strong>4 kilobytes</strong> per function call. Those functions also call themselves recursively many times in a single expression. This leads to huge stacks, huge <code>memcpy()s</code>, and an incredible performance penalty.</p>
<p>If we can eliminate the <code>memcpy()s</code> we can get a lot of performance back. So, let&#8217;s do it.</p>
<h2>Increase performance by putting thread stacks on the heap</h2>
<p><strong>[Remember: we are only talking about x86_64]</strong></p>
<h3>How stacks work &#8211; a refresher</h3>
<p>Stacks grow <strong>downward</strong> from high addresses to low addresses. As data is <code>push</code>ed on to the stack, it grows downward. As stuff is <code>pop</code>ped, it shrinks upward. The register <code>%rsp</code> points to the most recently pushed item, which is the lowest address the stack currently uses. When it is decremented or incremented the stack grows or shrinks, respectively. The <strong>special property</strong> of the program stack is that <strong>it will grow</strong> until you run out of memory (or are killed by the OS for being bad). The operating system handles the automatic growth. See the Epilogue for some more information about this.</p>
<h3>How to actually switch stacks</h3>
<p>The <code>%rsp</code> register can be (and is) changed and adjusted directly by user code. So all we have to do is put the address of our stack in <code>%rsp</code>, and we&#8217;ve switched stacks. Then we can just call our thread start function. Pretty easy. A small blob of inline assembly should do the trick:</p>
<pre class="prettyprint lang-c">__asm__ __volatile__ ("movq %0, %%rsp\n\t"
                      "callq *%1\n"
                      :: "r" (th-&gt;stk_base),
                         "r" (rb_thread_start_2));</pre>
<p>
<p>
Two instructions, not too bad.</p>
<ol>
<li><code>movq %0, %%rsp</code> moves a quad-word (<code>th-&gt;stk_base</code>) into <code>%rsp</code>. <em>Quad-word</em> is Intel speak for 4 words, where 1 Intel word is 2 bytes, so 8 bytes in total.</li>
<li><code>callq *%1</code> calls a function at the address &#8220;rb_thread_start_2.&#8221; This has a side-effect or two, which I&#8217;ll mention in the Epilogue below, for those interested in a few more details.</li>
</ol>
<p>The above code is called <em>once per thread</em>. Calling <code>rb_thread_start_2</code> spins up your thread and it never returns.</p>
<h3>Where do we get stack space from?</h3>
<p>When the tcb is created, we&#8217;ll allocate some space with <code>mmap</code> and set a pointer to it.</p>
<pre class="prettyprint lang-c">/* error checking omitted for brevity, but exists in the patch =] */
stack_area = mmap(NULL, total_size, PROT_READ | PROT_WRITE | PROT_EXEC,
			MAP_PRIVATE | MAP_ANON, -1, 0);

th-&gt;stk_ptr = th-&gt;stk_pos = stack_area;
th-&gt;stk_base = th-&gt;stk_ptr + (total_size - sizeof(int))/sizeof(VALUE *);</pre>
<p>
<p>
Remember, stacks <strong>grow downward</strong> so that last line: <code>th-&gt;stk_base = ... </code> is necessary because the base of the stack is actually at the <em>top</em> of the memory region returned by <code>mmap()</code>. The ugly math in there is for alignment, to comply with the x86_64 binary interface. Those curious about more gritty details should see the Epilogue below.</p>
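<p>For a feel of what &#8220;base at the top, aligned&#8221; means, here is a small sketch that picks the highest 16-byte-aligned address inside a region. This is not the arithmetic from the patch; <code>stack_top</code> is a made-up helper, and the 16-byte figure is the stack alignment the x86_64 System V binary interface expects at a call.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the patch): returns the highest
 * 16-byte-aligned address inside [region, region + size).
 * Assumes size is at least 32 bytes. */
static void *stack_top(void *region, size_t size)
{
  uintptr_t end = (uintptr_t)region + size;  /* one past the last byte */

  /* step back 16 bytes, then round down to a 16-byte boundary */
  return (void *)((end - 16) & ~(uintptr_t)15);
}
```

<p>The result is what you would load into <code>%rsp</code> before calling into the new thread; everything the thread pushes then lands below it, inside the region.</p>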
<p><strong>BUT WAIT, I thought stacks were supposed to grow automatically?</strong></p>
<p>Yeah, the OS does that for the normal program stack. Not gonna happen for our <code>mmap</code>&#8217;d regions. The best we can do is pick a good default size and export a tuning lever so that advanced users can adjust the stack size as they see fit.</p>
<p><strong>BUT WAIT, isn&#8217;t that dangerous? If you fall off your stack, wouldn&#8217;t you just overwrite memory below?</strong></p>
<p>Yep, but there is a fix for that too. It&#8217;s called a guard page. We&#8217;ll create a guard page below each stack that has its permission bits set to <code>PROT_NONE</code>. This means, if a thread falls off the bottom of its stack and tries to read, write, or execute the memory below the thread stack, a signal (usually <code>SIGSEGV</code> or <code>SIGBUS</code>) will be sent to the process.</p>
<p>The code for the guard page is pretty simple, too:</p>
<pre class="prettyprint lang-c">/* omit error checking for brevity */
mprotect(th-&gt;stk_ptr, getpagesize(), PROT_NONE);</pre>
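<p>To see a guard page in action outside of Ruby, here is a minimal, self-contained sketch. It is not code from the patch: the names <code>guard_page_faults</code> and <code>on_fault</code> are made up, the mapping skips <code>PROT_EXEC</code>, and it assumes a POSIX system with <code>mmap</code>, <code>mprotect</code>, and <code>sigaction</code>. It maps a region, protects the lowest page, and then checks that a write just above the guard succeeds while a write into it raises a signal.</p>

```c
#define _DEFAULT_SOURCE 1  /* for MAP_ANON and getpagesize() on glibc */
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_ANON
#define MAP_ANON MAP_ANONYMOUS
#endif

static sigjmp_buf guard_jmp;

/* Jump back out of the fault instead of letting the process die. */
static void on_fault(int sig)
{
  (void)sig;
  siglongjmp(guard_jmp, 1);
}

/* Hypothetical demo: returns 1 if writing to the guard page faults,
 * 0 if the write went through, and -1 if setup failed. */
static int guard_page_faults(size_t usable_size)
{
  size_t page = (size_t)getpagesize();
  size_t total = usable_size + page;
  struct sigaction sa, old_segv, old_bus;
  volatile int result = 0;
  volatile char *area;

  area = mmap(NULL, total, PROT_READ | PROT_WRITE,
              MAP_PRIVATE | MAP_ANON, -1, 0);
  if (area == MAP_FAILED)
    return -1;

  /* the lowest page of the region becomes the guard page */
  if (mprotect((void *)area, page, PROT_NONE) != 0) {
    munmap((void *)area, total);
    return -1;
  }

  sa.sa_handler = on_fault;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = 0;
  sigaction(SIGSEGV, &sa, &old_segv);
  sigaction(SIGBUS, &sa, &old_bus);

  if (sigsetjmp(guard_jmp, 1) == 0) {
    area[page] = 1;      /* first byte above the guard: fine */
    area[page - 1] = 1;  /* last byte of the guard: should fault */
    result = 0;          /* only reached if the guard did not work */
  } else {
    result = 1;          /* we came back here from on_fault() */
  }

  /* restore the previous handlers and release the mapping */
  sigaction(SIGSEGV, &old_segv, NULL);
  sigaction(SIGBUS, &old_bus, NULL);
  munmap((void *)area, total);
  return result;
}
```

<p>Calling <code>guard_page_faults(64 * 1024)</code> should return 1 on Linux and OSX. Jumping out of a signal handler like this is fine for a demo; an interpreter would more likely report the overflow and bail out.</p>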
<p>
<p>
Cool, let&#8217;s modify the SIGSEGV and SIGBUS signal handlers to check for stack overflow:</p>
<pre class="prettyprint lang-c">/* if the address which generated the fault is within the current thread's guard page... */
  if(fault_addr &lt;= (caddr_t)rb_curr_thread-&gt;guard &#038;&#038;
     fault_addr &gt;= (caddr_t)rb_curr_thread-&gt;stk_ptr) {
  /* we hit the guard page, print out a warning to help app developers */
  rb_bug("Thread stack overflow! Try increasing it!");
}</pre>
<p>
<p>
See the epilogue for more details about this signal handler trick.</p>
<h2>Patches</h2>
<p><strong>As always, this is super-alpha software.</strong></p>
<table style="height: 60px;" border="0" cellspacing="1" cellpadding="1" width="300" summary="”&quot;">
<tbody>
<tr>
<td>Ruby 1.8.6</td>
<td><a href="http://github.com/ice799/matzruby/tree/heap_stacks_186">github</a></td>
<td><a href="http://timetobleed.com/files/186-hs.patch">raw .patch</a></td>
</tr>
<tr>
<td>Ruby 1.8.7</td>
<td><a href="http://github.com/ice799/matzruby/tree/heap_stacks">github</a></td>
<td><a href="http://timetobleed.com/files/187-hs.patch">raw .patch</a></td>
</tr>
</tbody>
</table>
<h2>Benchmarks</h2>
<p>The <a href="http://shootout.alioth.debian.org/">computer language shootout</a> has a thread test called thread-ring; let&#8217;s start with that.</p>
<pre class="prettyprint lang-rb">require 'thread'
THREAD_NUM = 403
number = ARGV.first.to_i

threads = []
for i in 1..THREAD_NUM
   threads &lt;&lt; Thread.new(i) do |thr_num|
      while true
         Thread.stop
         if number &gt; 0
            number -= 1
         else
            puts thr_num
            exit 0
         end
      end
   end
end

prev_thread = threads.last
while true
   for thread in threads
      Thread.pass until prev_thread.stop?
      thread.run
      prev_thread = thread
   end
end</pre>
<p>
<p>
Results (ARGV[0] = 50000000):</p>
<table style="height: 60px;" border="0" cellspacing="1" cellpadding="1" width="300" summary="”&quot;">
<tbody>
<tr>
<td>Ruby 1.8.6</td>
<td>1389.52s</td>
</tr>
<tr>
<td>Ruby 1.8.6 w/ heap stacks</td>
<td>793.06s</td>
</tr>
<tr>
<td>Ruby 1.9.1</td>
<td>752.44s</td>
</tr>
</tbody>
</table>
<p>
<p>
A <strong>speed up of about 1.75x</strong> (1389.52s / 793.06s) compared to Ruby 1.8.6. A bit slower than Ruby 1.9.1.
</p>
<p>
<p>
That is a pretty strong showing, for sure. Let&#8217;s modify the test slightly to illustrate the true power of this implementation.</p>
<p>
<p>Since our implementation does no <code>memcpy</code>()s we <i>expect</i> the cost of context switching to stay constant regardless of thread stack size. Moreover, the unmodified Ruby 1.8.6 should perform worse as thread stack size increases (therefore increasing the amount of time the CPU is doing <code>memcpy</code>()s).</p>
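<p>To make the expectation concrete, here is a heavily simplified model of the two switching strategies. It is only a sketch: <code>switch_by_copy</code> and <code>switch_by_pointer</code> are made-up names, and neither function is how Ruby actually performs a context switch. They only show which costs depend on stack size.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy-based switch (the stock 1.8 approach, heavily simplified):
 * save the outgoing thread's live stack bytes, then copy the incoming
 * thread's saved bytes back. Returns the number of bytes moved. */
static size_t switch_by_copy(char *machine_stack,
                             char *out_save, size_t out_len,
                             const char *in_save, size_t in_len)
{
    memcpy(out_save, machine_stack, out_len);
    memcpy(machine_stack, in_save, in_len);
    return out_len + in_len;
}

/* Pointer-based switch (the heap stacks idea): every thread owns its
 * stack, so switching only changes which stack pointer is current. */
static size_t switch_by_pointer(char **current_sp, char *next_sp)
{
    *current_sp = next_sp;
    return 0; /* no stack bytes copied, whatever the stack depth */
}
```

<p>In this model the copying switch moves more bytes as the stacks get deeper, while the pointer switch does the same work at any depth.</p>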
<p>
<p>Let&#8217;s <b>test this hypothesis</b> by modifying thread-ring slightly so that it increases the size of the stack after spawning threads.</p>
<pre class="prettyprint lang-rb">def grow_stack n=0, &#038;blk
  unless n &gt; 100
    grow_stack n+1, &#038;blk
  else
    yield
  end
end

require 'thread'
THREAD_NUM = 403
number = ARGV.first.to_i

threads = []
for i in 1..THREAD_NUM
  threads &lt;&lt; Thread.new(i) do |thr_num|
    grow_stack do
      while true
        Thread.stop
        if number &gt; 0
          number -= 1
        else
          puts thr_num
          exit 0
        end
      end
    end
  end
end

prev_thread = threads.last
while true
   for thread in threads
      Thread.pass until prev_thread.stop?
      thread.run
      prev_thread = thread
   end
end</pre>
<p>
<p>
Results (ARGV[0] = 50000000):</p>
<table style="height: 60px;" border="0" cellspacing="1" cellpadding="1" width="300" summary="”&quot;">
<tbody>
<tr>
<td>Ruby 1.8.6</td>
<td>7493.50s</td>
</tr>
<tr>
<td>Ruby 1.8.6 w/ heap stacks</td>
<td>799.52s</td>
</tr>
<tr>
<td>Ruby 1.9.1</td>
<td>680.92s</td>
</tr>
</tbody>
</table>
<p>
<p>
A <strong>speed up of about 9.4x</strong> compared to Ruby 1.8.6. A bit slower than Ruby 1.9.1.</p>
<p>Now, let&#8217;s benchmark mongrel+sinatra.</p>
<pre class="prettyprint lang-rb">
require 'rubygems'
require 'sinatra'

disable :reload

set :server, 'mongrel' 

get '/' do
  'hi'
end
</pre>
<p>
<p>
Results:</p>
<table style="height: 60px;" border="0" cellspacing="1" cellpadding="1" width="400" summary="”&quot;">
<tbody>
<tr>
<td>Ruby 1.8.6</td>
<td>1395.43 request/sec</td>
</tr>
<tr>
<td>Ruby 1.8.6 w/ heap stacks</td>
<td>1770.26 request/sec</td>
</tr>
</tbody>
</table>
<p>
<p>
An <b>increase of about 1.27x</b> in the <i>most naive case possible</i>.</p>
<p>
<p> Of course, if the handler did anything more than simply write &#8220;hi&#8221; (like use memcache or make SQL queries) there would be more function calls, more context switches, and <b>much greater savings.</b></p>
<h2>Conclusion</h2>
<p>A couple lessons learned this time:</p>
<ul>
<li>Hacking a VM like Ruby is kind of like hacking a kernel. Some subset of the tricks used in kernel hacking are useful in userland.</li>
<li>The x86_64 ABI is a <em>must read</em> if you plan on doing any low-level hacking.</li>
<li>Keep your CPU manuals close by; they come in handy even in userland.</li>
<li>Installing your own signal handlers is really useful for debugging, even if they are dumping architecture-specific information.</li>
</ul>
<p>Hope everyone enjoyed this blog post. I&#8217;m always looking for things to blog about. If there is something you want explained or talked about, send me an email or a tweet!</p>
<p>Don&#8217;t forget to <a href="http://feeds.feedburner.com/TimeToBleed">subscribe</a> and <a href="http://twitter.com/joedamato">follow me</a> and <a href="http://twitter.com/tmm1">Aman</a> on twitter.</p>
<h2>Epilogue</h2>
<h3>Automatic stack growth</h3>
<p>This can be achieved pretty easily with a little help from virtual memory and the CPU&#8217;s exception mechanism. The idea is pretty simple. When you (or your shell on your behalf) call <code>exec()</code> to execute a binary, the OS will map a bunch of pages of memory for the stack and set the stack pointer of the process to the top of the memory. Once the stack space is exhausted and a <code>push</code> lands on un-mapped memory, a page fault will be generated.</p>
<p>The OS&#8217;s page fault handler (registered in the interrupt descriptor table, or IDT) will fire. The OS can then check the address that generated the exception and see that you fell off the bottom of your stack. This works very similarly to the guard page idea we added to protect Ruby thread stacks. The OS can then just map more memory to that area and tell your process to continue executing. Your process doesn&#8217;t know anything bad happened.</p>
<p>I hope to chat a little bit about interrupt and exception handlers in an upcoming blog post. Stay tuned!</p>
<h3><code>callq</code> side-effects</h3>
<p>When a <code>callq</code> instruction is executed, the CPU pushes the return address on to the stack and then begins executing the function that was called. This is important because when the function you are calling executes a <code>ret</code> instruction, a quad-word is popped from the stack and put into the instruction pointer (<code>%rip</code>).</p>
<h3>x86_64 Application Binary Interface</h3>
<p>The x86_64 ABI is an extension of the x86 ABI. It specifies architecture programming information such as the fundamental types, caller and callee saved registers, alignment considerations and more. It is a really important document for any programmer messing with x86_64 architecture-specific code.<br />
The particular piece of information relevant for this blog post is found buried in section 3.2.2:</p>
<blockquote><p>The end of the input argument area shall be aligned on a 16 &#8230; byte boundary.</p></blockquote>
<p>This is important to keep in mind when constructing thread stacks. We decided to avoid messing with alignment issues. As such we did not pass any arguments to <code>rb_thread_start_2</code>. We wanted to avoid the arithmetic errors that could creep in if we tried to align the memory ourselves after pushing some data. We also wanted to avoid writing more assembly than we had to, so we avoided passing the arguments in registers, too.</p>
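<p>As a small illustration of the alignment arithmetic in question, this is roughly what rounding a stack top down to a 16-byte boundary looks like. The helper names are hypothetical. A real implementation also has to account for the return address that <code>callq</code> pushes, which this sketch does not attempt.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round an address down to a 16-byte boundary. Stacks grow down on
 * x86_64, so rounding down keeps the result inside the mapping. */
static uintptr_t align_down_16(uintptr_t addr)
{
    return addr & ~(uintptr_t)0xF;
}

/* Hypothetical helper: choose the initial stack pointer for a fresh
 * thread stack that occupies [base, base + size). */
static uintptr_t initial_stack_pointer(uintptr_t base, size_t size)
{
    return align_down_16(base + size);
}
```

<p>Pushing data after this point changes the alignment again, which is exactly the bookkeeping that is easy to get wrong by hand.</p>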
<h3>Signal handler trick</h3>
<p>The signal handler &#8220;trick&#8221; to check if you have hit the guard page is made possible by the <code>sigaltstack()</code> system call and the POSIX <code>sa_sigaction</code> interface.</p>
<p><code>sigaltstack()</code> lets us specify a memory region to be used as the stack when a signal is delivered. This is extremely important for the signal handler trick because once we fall off our thread stack, we certainly cannot expect to handle a signal using that stack space.</p>
<p>POSIX provides two ways for signals to be handled:</p>
<ul>
<li>sa_handler interface: calls your handler and passes in the signal number.</li>
<li>sa_sigaction interface: calls your handler and passes in the signal number, a <code>siginfo_t</code> struct, and a <code>ucontext_t</code>. The <code>siginfo_t</code> struct contains (among other things) the address which generated the fault. Simply check this address to see if it&#8217;s in the guard page and if so let the user know they just overflowed their stack. Another useful, but <em>extremely non-portable</em>, modification that was added to Ruby&#8217;s signal handlers was a dump of the contents of <code>ucontext_t</code> to provide useful debugging information. This structure contains the register state at the time of the signal. Dumping it can help debugging by showing which values are in what registers.</li>
</ul>
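<p>The pieces described above might fit together roughly like this. It is a sketch under assumptions: <code>guard_lo</code>, <code>guard_len</code>, <code>in_guard_page</code>, <code>on_fault</code> and <code>install_overflow_handler</code> are made-up names, and the handler simply prints a message and exits rather than calling <code>rb_bug</code>.</p>

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical globals describing the current thread's guard page;
 * the real patch keeps this state on rb_curr_thread. */
static uintptr_t guard_lo;  /* lowest address of the guard page */
static size_t guard_len;    /* its length, normally one page    */

/* Pure range check: is addr inside [lo, lo + len)? */
static int in_guard_page(uintptr_t addr, uintptr_t lo, size_t len)
{
    return addr >= lo && addr - lo < len;
}

/* sa_sigaction-style handler: si_addr is the faulting address. */
static void on_fault(int sig, siginfo_t *info, void *uctx)
{
    (void)sig;
    (void)uctx;
    if (in_guard_page((uintptr_t)info->si_addr, guard_lo, guard_len)) {
        static const char msg[] = "Thread stack overflow!\n";
        /* write() is async-signal-safe; printf() is not. */
        ssize_t n = write(STDERR_FILENO, msg, sizeof msg - 1);
        (void)n;
    }
    _exit(1);
}

/* Run SIGSEGV/SIGBUS handlers on a separate stack, since the faulting
 * thread's own stack is exactly what just ran out. Returns 0 on success. */
static int install_overflow_handler(void)
{
    stack_t ss;
    struct sigaction sa;

    ss.ss_sp = malloc(SIGSTKSZ);
    if (ss.ss_sp == NULL)
        return -1;
    ss.ss_size = SIGSTKSZ;
    ss.ss_flags = 0;
    if (sigaltstack(&ss, NULL) != 0)
        return -1;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO | SA_ONSTACK;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return -1;
    return sigaction(SIGBUS, &sa, NULL);
}
```

<p>Without <code>SA_ONSTACK</code> the kernel would try to run the handler on the overflowed stack, and without <code>SA_SIGINFO</code> the handler would only receive the signal number.</p>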
]]></content:encoded>
			<wfw:commentRss>http://timetobleed.com/fixing-threads-in-ruby-18-a-2-10x-performance-boost/feed/</wfw:commentRss>
		<slash:comments>20</slash:comments>
		</item>
		<item>
		<title>Fix a bug in Ruby&#8217;s configure.in and get a ~30% performance boost.</title>
		<link>http://timetobleed.com/fix-a-bug-in-rubys-configurein-and-get-a-30-performance-boost/</link>
		<comments>http://timetobleed.com/fix-a-bug-in-rubys-configurein-and-get-a-30-performance-boost/#comments</comments>
		<pubDate>Tue, 05 May 2009 08:20:29 +0000</pubDate>
		<dc:creator>Joe Damato</dc:creator>
				<category><![CDATA[bugfix]]></category>
		<category><![CDATA[debugging]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[monitoring]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[scaling]]></category>
		<category><![CDATA[testing]]></category>
		<category><![CDATA[x86]]></category>
		<category><![CDATA[debug]]></category>
		<category><![CDATA[kernel]]></category>
		<category><![CDATA[patch]]></category>
		<category><![CDATA[patches]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[strace]]></category>
		<category><![CDATA[syscall]]></category>
		<category><![CDATA[systems]]></category>
		<category><![CDATA[threading]]></category>
		<category><![CDATA[threads]]></category>

		<guid isPermaLink="false">http://timetobleed.com/?p=615</guid>
		<description><![CDATA[Special thanks&#8230; Going out to Jake Douglas for pushing the initial investigation and getting the ball rolling. The whole --enable-pthread thing Ask any Ruby hacker how to easily increase performance in a threaded Ruby application and they&#8217;ll probably tell you: Yo dude&#8230; Everyone knows you need to configure Ruby with --disable-pthread. And it&#8217;s true; configure [...]]]></description>
			<content:encoded><![CDATA[<p><center><img src="http://timetobleed.com/images/ruby_bug.jpg"/></center><br />
</p>
<p>
<h2>Special thanks&#8230;</h2>
<p>Going out to <a href="http://twitter.com/jakedouglas">Jake Douglas</a> for pushing the initial investigation and getting the ball rolling.</p>
<p><h2>The whole <code>--enable-pthread</code> thing</h2>
<p>Ask any Ruby hacker how to easily increase performance in a threaded Ruby application and they&#8217;ll probably tell you:<br />
<b><br />
Yo dude&#8230; <i>Everyone</i> knows you need to <code>configure</code> Ruby with <code>--disable-pthread</code>.<br />
</b><br />
And it&#8217;s true; <code>configure</code> Ruby with <code>--disable-pthread</code> and you get a ~30% performance boost. But&#8230; <b><i>why?</i></b></p>
<p> For this, we&#8217;ll have to turn to our handy tool <a href="http://timetobleed.com/hello-world/">strace</a>. We&#8217;ll also need a simple Ruby program for this one. How about something like this:</p>
<p>
<pre class="prettyprint lang-rb">
def make_thread
  Thread.new {
    a = []
    10_000_000.times {
      a << "a"
      a.pop
    }
  }
end

t = make_thread
t1 = make_thread 

t.join
t1.join</pre>
<p></p>
<p>Now, let's run <code>strace</code> on a version of Ruby <code>configure</code>'d with <code>--enable-pthread</code> and point it at our test script. The output from <code>strace</code> looks like this:</p>
<p>
<pre class="prettyprint lang-c">
22:46:16.706136 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706177 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706218 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706259 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000005>
22:46:16.706301 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706342 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706383 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706425 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706466 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004></pre>
<p></p>
<p><b>Pages and pages and pages</b> of <code>sigprocmask</code> system calls (Actually, running with <code>strace -c</code>, I get about <b>20,054,180</b> calls to <code>sigprocmask</code>, <b>WOW</b>). Running the <i>same test script</i> against a Ruby built with <code>--disable-pthread</code> produces output that does <b>not</b> have pages and pages of <code>sigprocmask</code> calls (only <b>3</b>, a <b>HUGE</b> reduction).
</p>
<p><h2>OK, so let's just set a breakpoint in GDB... right?</h2>
<p>OK, so we should just be able to set a <code>breakpoint</code> on <code>sigprocmask</code> and figure out who is calling it.</p>
<p><b>Well, not exactly.</b> You can try it, but the breakpoint <b>won't trigger</b> (we'll see why a little bit later).</p>
<p>Hrm, that kinda sucks and is confusing. This will make it harder to track down who is calling <code>sigprocmask</code> in the threaded case.</p>
<p> Well, we know that when you run <code>configure</code> the script creates a <code>config.h</code> with a bunch of <code>define</code>s that Ruby uses to decide which functions to use for what. So let's compare <code>./configure --enable-pthread</code> with <code>./configure --disable-pthread</code>:</p>
<pre class="prettyprint lang-bsh">
[joe@mawu:/home/joe/ruby]% diff config.h config.h.pthread
> #define _REENTRANT 1
> #define _THREAD_SAFE 1
> #define HAVE_LIBPTHREAD 1
> #define HAVE_NANOSLEEP 1
> #define HAVE_GETCONTEXT 1
> #define HAVE_SETCONTEXT 1</pre>
</p>
<p>
<br />
OK, now if we <code>grep</code> the Ruby source code, we see that whenever <code>HAVE_[SG]ETCONTEXT</code> are set, Ruby uses the libc functions <code>setcontext()</code> and <code>getcontext()</code> to save and restore state for context switching and for exception handling (via the <code>EXEC_TAG</code>). </p>
<p>What about when <code>HAVE_[SG]ETCONTEXT</code> are <b>not</b> <code>define</code>'d? Well in that case, Ruby uses <code>_setjmp/_longjmp</code>.</p>
<p><b>Bingo!</b></p>
<p>That's what's going on! From the <code>_setjmp/_longjmp</code> man page:</p>
<blockquote><p>... The _longjmp()  and  _setjmp()  functions  shall  be  equivalent  to  longjmp() and setjmp(), respectively, with the additional restriction that _longjmp() and _setjmp() shall not manipulate the signal mask...</p></blockquote>
<p>And from the <code>[sg]etcontext</code> man page:</p>
<blockquote><p>... uc_sigmask is the set of signals blocked in this context (see sigprocmask(2)) ...</p></blockquote>
<p>
<br />The issue is that <code>getcontext</code> calls <code>sigprocmask</code> on <b>every invocation</b> but <code>_setjmp</code> does not.</p>
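<p>A rough way to see the difference outside of Ruby is a pair of loops like the following. This is only a sketch, assuming Linux with glibc; the function names <code>save_with_getcontext</code> and <code>save_with_setjmp</code> are made up for illustration.</p>

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <ucontext.h>

/* Save state n times with getcontext(). On Linux/glibc each call also
 * records the signal mask, which costs an rt_sigprocmask system call. */
static int save_with_getcontext(int n)
{
    ucontext_t uc;
    int ok = 0;
    for (int i = 0; i < n; i++)
        if (getcontext(&uc) == 0)
            ok++;
    return ok;
}

/* Same loop with _setjmp(), which does not touch the signal mask.
 * We never longjmp, so _setjmp() always returns 0 here. */
static int save_with_setjmp(int n)
{
    jmp_buf env;
    volatile int ok = 0;
    for (volatile int i = 0; i < n; i++)
        if (_setjmp(env) == 0)
            ok++;
    return ok;
}
```

<p>Calling each function from a small driver under <code>strace -c</code> should show roughly one <code>rt_sigprocmask</code> per iteration for the first loop and none for the second.</p>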
<p><b>BUT WAIT</b> if that's true why didn't <code>GDB</code> hit a <code>sigprocmask</code> breakpoint before?</p>
<p><h2>x86_64 assembly FTW, again</h2>
</p>
<p>
Let's fire up <code>gdb</code> and figure out this breakpoint-not-breaking thing. First, let's start by disassembling <code>getcontext</code> (snipped for brevity):<br />
<code><br />
(gdb) p getcontext<br />
$1 = {<text variable, no debug info>} 0x7ffff7825100 <getcontext><br />
(gdb) disas getcontext<br />
...<br />
0x00007ffff782517f <getcontext+127>:	mov    $0xe,%rax<br />
0x00007ffff7825186 <getcontext+134>:	syscall<br />
...<br />
</code></p>
<p>Yeah, that's pretty weird. I'll explain why in a minute, but let's look at the disassembly of <code>sigprocmask</code> first:<br />
<code><br />
(gdb) p sigprocmask<br />
$2 = {<text variable, no debug info>} 0x7ffff7817340 <__sigprocmask><br />
(gdb) disas sigprocmask<br />
...<br />
0x00007ffff7817383 <__sigprocmask+67>:	mov    $0xe,%rax<br />
0x00007ffff7817388 <__sigprocmask+72>:	syscall<br />
...<br />
</code><br />
Yeah, this is a bit confusing, but here's the deal.</p>
<p>
On x86_64, Linux makes system calls with the <code>syscall</code> instruction (the 64-bit cousin of the <code>sysenter/sysexit</code> pair used on 32-bit x86). These instructions were created because the old way (<code>int $0x80</code>) turned out to be pretty slow, so CPU vendors added dedicated instructions to execute system calls without such huge overhead.</p>
<p> All you need to know right now (I'll try to blog more about this in the future) is that the <code>%rax</code> register holds the system call number. The <code>syscall</code> instruction transfers control to the kernel and the kernel figures out which syscall you wanted by checking the value in <code>%rax</code>. Let's just make sure that <code>sigprocmask</code> is actually 0xe:</p>
<pre class="prettyprint lang-c">
[joe@pluto:/usr/include]% grep -Hrn "sigprocmask" asm-x86_64/unistd.h
asm-x86_64/unistd.h:44:#define __NR_rt_sigprocmask                     14</pre>
<p>
<br />
<b>Bingo. It's calling <code>sigprocmask</code> (albeit a bit obscurely).</b></p>
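<p>The same thing can be reproduced from C by issuing the system call by number, without ever entering the <code>sigprocmask</code> wrapper. This is a sketch assuming Linux; <code>raw_get_sigmask</code> and <code>sigprocmask_syscall_number</code> are hypothetical helper names.</p>

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* The number the kernel sees in %rax; on x86_64 Linux this is 14 (0xe). */
static long sigprocmask_syscall_number(void)
{
    return SYS_rt_sigprocmask;
}

/* Query the current signal mask with a raw rt_sigprocmask call. With a
 * NULL new set the first argument is ignored and the mask is only read.
 * The last argument is the kernel's sigset size in bytes, the 8 seen in
 * the strace output. Returns 0 on success. */
static long raw_get_sigmask(uint64_t *mask_out)
{
    return syscall(SYS_rt_sigprocmask, SIG_BLOCK, NULL, mask_out,
                   sizeof *mask_out);
}
```

<p>A breakpoint on <code>sigprocmask</code> would not fire for <code>raw_get_sigmask</code> either, for the same reason it does not fire for <code>getcontext</code>: the wrapper function is never entered.</p>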
<p>
OK, so <code>getcontext</code> isn't calling <code>sigprocmask</code> directly, instead it replicates a bunch of code that <code>sigprocmask</code> has in its function body. That's why we didn't hit the <code>sigprocmask</code> breakpoint; <code>GDB</code> was going to break if you landed on the address <code>0x7ffff7817340</code> but <b>you didn't</b>. </p>
<p>Instead, <code>getcontext</code> reimplements the wrapper code for <code>sigprocmask</code> itself and <code>GDB</code> is none the wiser. </p>
<p><b>Mystery solved</b>.</p>
<p><h2>The patch</h2>
</p>
<p>
Get it <b><a href="http://github.com/ice799/matzruby/commit/0b9b69f9653782a33aee2b8937d405eae245b60c">HERE</a></b></p>
<p>
The patch works by adding a new configure flag called <code>--disable-ucontext</code> that lets you specifically disable <code>[sg]etcontext</code> from being called. You <b>use this in conjunction with</b> <code>--enable-pthread</code>, like this:<br />
<code><br />
./configure --disable-ucontext --enable-pthread</code><br />
<br />
After you build Ruby configured like that, its performance is on par with (and sometimes slightly faster than) Ruby built with <code>--disable-pthread</code>, for about a 30% performance boost when compared to <code>--enable-pthread</code>.</p>
<p>I added the switch because I wanted to preserve the original Ruby behavior. If you just pass <code>--enable-pthread</code> <b>without</b> <code>--disable-ucontext</code>, Ruby will do the old thing and generate piles of <code>sigprocmask</code> calls.</p>
<h2>Conclusion</h2>
<ol>
<li> Things aren't always what they seem - GDB may lie to you. Be careful. </li>
<li> Use the source, Luke. Libraries can do unexpected things, debug builds of libc can help!</li>
<li> I know I keep saying this, assembly is useful. Start learning it today!</li>
</ol>
<p>
If you enjoyed this blog post, consider <a href="http://feeds.feedburner.com/TimeToBleed" rel="alternate" type="application/rss+xml">subscribing (via RSS)</a> or <a href="http://twitter.com/joedamato">following (via twitter)</a>.</p>
<p><b>You'll want to stay tuned; <a href="http://twitter.com/tmm1">tmm1</a> and I have been on a roll the past week. Lots of cool stuff coming out!</b></p>
]]></content:encoded>
			<wfw:commentRss>http://timetobleed.com/fix-a-bug-in-rubys-configurein-and-get-a-30-performance-boost/feed/</wfw:commentRss>
		<slash:comments>43</slash:comments>
		</item>
		<item>
		<title>6 Line EventMachine Bugfix = 2x faster GC, +1300% requests/sec</title>
		<link>http://timetobleed.com/6-line-eventmachine-bugfix-2x-faster-gc-1300-requestssec/</link>
		<comments>http://timetobleed.com/6-line-eventmachine-bugfix-2x-faster-gc-1300-requestssec/#comments</comments>
		<pubDate>Wed, 29 Apr 2009 06:36:09 +0000</pubDate>
		<dc:creator>Joe Damato</dc:creator>
				<category><![CDATA[bugfix]]></category>
		<category><![CDATA[debugging]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[scaling]]></category>
		<category><![CDATA[systems]]></category>
		<category><![CDATA[x86]]></category>
		<category><![CDATA[debug]]></category>
		<category><![CDATA[garbage collection]]></category>
		<category><![CDATA[GC]]></category>
		<category><![CDATA[memory]]></category>
		<category><![CDATA[patch]]></category>
		<category><![CDATA[patches]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[threading]]></category>
		<category><![CDATA[threads]]></category>

		<guid isPermaLink="false">http://timetobleed.com/?p=554</guid>
		<description><![CDATA[Nothing is possible without lunch So Aman Gupta (tmm1) and I were eating lunch at the Oaxacan Kitchen on Tuesday and as usual, we were talking about scaling Ruby. We got into a small debate about which phase of garbage collection took the most CPU time. Aman&#8217;s claim: The mark phase, specifically the stack marking [...]]]></description>
			<content:encoded><![CDATA[<p><center><br />
<img src="http://timetobleed.com/images/oaxacan.jpg"/><br />
</center><br />
</p>
<p><h2>Nothing is possible without lunch</h2>
<p>So Aman Gupta (<a href="http://twitter.com/tmm1">tmm1</a>) and I were eating lunch at the <a href="http://www.theoaxacankitchen.com/">Oaxacan Kitchen</a> on Tuesday and as usual, we were talking about scaling Ruby. We got into a small debate about which phase of garbage collection took the most CPU time.</p>
<p>Aman&#8217;s claim:</p>
<ul>
<li>The mark phase, specifically the stack marking phase because of the huge stack frames created by rb_eval</li>
</ul>
<p>My claim:</p>
<ul>
<li>The sweep phase, because every single object has to be touched and some freeing happens.</li>
</ul>
<p>I told Aman that I didn&#8217;t believe the stack frames were that large, and we bet on how big we thought they would be. Couldn&#8217;t be more than a couple kilobytes, could it? <b>Little did we know how wrong our estimates were.</b>
</p>
<h2>Quick note about Ruby&#8217;s GC</h2>
<p>Ruby MRI has a mark-and-sweep garbage collector. As part of the mark phase, it <b>scans the process stack</b>. This is required because a pointer to a Ruby object can be passed to a C extension (like EventMachine, or Hpricot, or whatever). If that happens, it isn&#8217;t safe to free the object yet. So Ruby does a simple scan and checks if <b>each word on the stack</b> is a pointer to the Ruby heap; if so, the object it points to cannot be freed.<br />
</p>
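<p>The scan described above can be sketched in a few lines. This is not Ruby&#8217;s actual marking code: <code>scan_for_heap_pointers</code> is a made-up name, and it counts candidate pointers instead of marking objects.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the words in [lo, hi) whose value falls inside the object heap
 * [heap_lo, heap_hi). A real collector would mark each hit; counting
 * keeps the sketch small. The cost is linear in the size of the range. */
static size_t scan_for_heap_pointers(const uintptr_t *lo, const uintptr_t *hi,
                                     uintptr_t heap_lo, uintptr_t heap_hi)
{
    size_t hits = 0;
    for (const uintptr_t *p = lo; p < hi; p++)
        if (*p >= heap_lo && *p < heap_hi)
            hits++;
    return hits;
}
```

<p>Because every word in the range is visited, the time spent here grows with the stack depth, whether or not those words hold anything interesting.</p>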
<h2>GDB to the rescue</h2>
<p>We get back from lunch, launch our application, attach GDB and set a breakpoint. The breakpoint gets triggered and we see this seemingly innocuous stack trace [Note: To help with debugging, we compiled the EventMachine gem with -fno-omit-frame-pointer]:<br />
<code><br />
#0  0x00007ffff77629ac in epoll_wait () from /lib/libc.so.6<br />
#1  0x00007ffff6c0b220 in EventMachine_t::_RunEpollOnce (this=0x158d7e0) at em.cpp:461<br />
#2  0x00007ffff6c0b86c in EventMachine_t::_RunOnce (this=0x158d7e0) at em.cpp:423<br />
#3  0x00007ffff6c0bbd6 in EventMachine_t::Run (this=0x158d7e0) at em.cpp:404<br />
#4  0x00007ffff6c06638 in evma_run_machine () at cmain.cpp:83<br />
#5  0x00007ffff6c1897f in t_run_machine_without_threads (self=26066936) at rubymain.cpp:154<br />
#6  0x000000000041d598 in call_cfunc (func=0x7ffff6c1896e <t_run_machine_without_threads>, recv=26066936, len=0, argc=0, argv=0x0) at eval.c:5759<br />
#7  0x000000000041c92f in rb_call0 (klass=26065816, recv=26066936, id=29417, oid=29417, argc=0, argv=0x0, body=0x18dba10, flags=0) at eval.c:5911<br />
#8  0x000000000041e0ad in rb_call (klass=26065816, recv=26066936, mid=29417, argc=0, argv=0x0, scope=2, self=26066936) at eval.c:6158<br />
#9  0x00000000004160d5 in rb_eval (self=26066936, n=0x1940330) at eval.c:3514<br />
#10 0x00000000004150b7 in rb_eval (self=26066936, n=0x1941018) at eval.c:3357<br />
#11 0x000000000041d196 in rb_call0 (klass=26065816, recv=26066936, id=5393, oid=5393, argc=0, argv=0x0, body=0x1941018, flags=0) at eval.c:6062<br />
#12 0x000000000041e0ad in rb_call (klass=26065816, recv=26066936, mid=5393, argc=0, argv=0x0, scope=0, self=47127864) at eval.c:6158<br />
#13 0x0000000000415d01 in rb_eval (self=47127864, n=0x2cf5298) at eval.c:3493<br />
#14 0x00000000004148b2 in rb_eval (self=47127864, n=0x2cf4380) at eval.c:3223<br />
#15 0x000000000041d196 in rb_call0 (klass=47127808, recv=47127864, id=5313, oid=5313, argc=0, argv=0x0, body=0x2cf4380, flags=0) at eval.c:6062<br />
#16 0x000000000041e0ad in rb_call (klass=47127808, recv=47127864, mid=5313, argc=0, argv=0x0, scope=0, self=9606072) at eval.c:6158<br />
#17 0x0000000000415d01 in rb_eval (self=9606072, n=0x194b2a0) at eval.c:3493<br />
#18 0x00000000004148b2 in rb_eval (self=9606072, n=0x19587b0) at eval.c:3223<br />
#19 0x000000000041072c in eval_node (self=9606072, node=0x19587b0) at eval.c:1437<br />
#20 0x0000000000410dff in ruby_exec_internal () at eval.c:1642<br />
#21 0x0000000000410e4f in ruby_exec () at eval.c:1662<br />
#22 0x0000000000410e72 in ruby_run () at eval.c:1672<br />
#23 0x000000000040e78a in main (argc=3, argv=0x7fffffffebd8, envp=0x7fffffffebf8) at main.c:48<br />
</code><br />
Looks pretty normal, nothing to worry about, <i>right</i>?</p>
<p>We started checking the rb_eval frames because we assumed that those would be the largest stack frames. The rb_eval function inlines other functions and calls itself recursively. So how big is one of the rb_eval frames?<br />
<code><br />
(gdb) frame 10<br />
#10 0x00000000004150b7 in rb_eval (self=26066936, n=0x1941018) at eval.c:3357<br />
3357		    result = rb_eval(self, node->nd_head);<br />
(gdb) p $rbp-$rsp<br />
$2 = 1904<br />
</code><br />
1,904 bytes &#8211; pretty large. If all the stack frames are that large, we are looking at <i>around</i> 47,600 bytes. Pretty serious. Let&#8217;s verify that Ruby thinks the stack is a sane size. There is a global in the Ruby interpreter called <code>rb_gc_stack_start</code>. It gets set when the Ruby stack is created in <code>Init_stack()</code>. When Ruby calculates the stack size it subtracts the current stack pointer from <code>rb_gc_stack_start</code> [<b>remember</b> on x86_64, the stack grows from high addresses to low addresses]. Let&#8217;s do that and see how big Ruby thinks the stack is.<br />
<code><br />
(gdb) p (unsigned int)rb_gc_stack_start - (unsigned int)$rsp<br />
$3 = 802688<br />
</code><br />
<b>Wait, wait, wait. 802,688 bytes with only 24 stack frames? WTF?!</b> Something is wrong. We started at the top and checked <i>all the rb_eval stack frames</i>, but none of them are larger than 2kb. We did find something <b>quite a bit larger than 2kb</b>, though.<br />
<code><br />
(gdb) frame 1<br />
#1  0x00007ffff6c0b220 in EventMachine_t::_RunEpollOnce (this=0x158d7e0) at em.cpp:461<br />
461		s = epoll_wait (epfd, ev, MaxEpollDescriptors, timeout == 0 ? 5 : timeout);<br />
(gdb) p $rbp-$rsp<br />
$28 = 786816<br />
</code><br />
Uh, the RunEpollOnce stack frame is <b>786,816 bytes</b>? That&#8217;s <i>got</i> to be wrong. <b>WTF?</b></p>
<p>Time to bring out the big guns.</p>
<h2>objdump + x86_64 asm FTW</h2>
<p>I pumped EventMachine&#8217;s shared object into <code>objdump</code> and captured the assembly dump:<br />
<code><br />
objdump -d rubyeventmachine.so > em.S<br />
</code><br />
I headed down to the <code>RunEpollOnce</code> function and saw the following:<br />
<code><br />
2f12b:       48 81 ec 78 01 0c 00    sub    $0xc0178,%rsp<br />
</code><br />
<b>Interesting</b>. So the code is moving <code>%rsp</code> down by 786,808 bytes to make room for something <b>big</b>. So, let&#8217;s see if the EventMachine code matches up with the assembly output.<br />
<code><br />
struct epoll_event ev [MaxEpollDescriptors];<br />
</code><br />
Where <code>MaxEpollDescriptors = 64*1024</code> and <code>sizeof(struct epoll_event) == 12</code>. That matches up with the assembly dump and the GDB output.</p>
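<p>The arithmetic is easy to check with a tiny helper. This is a sketch assuming Linux; <code>MAX_EPOLL_DESCRIPTORS</code> and <code>epoll_array_bytes</code> are illustrative names standing in for the EventMachine constant and array.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <sys/epoll.h>

enum { MAX_EPOLL_DESCRIPTORS = 64 * 1024 }; /* value quoted in the post */

/* Bytes consumed by a local array declared as
 * struct epoll_event ev[MAX_EPOLL_DESCRIPTORS]. */
static size_t epoll_array_bytes(void)
{
    return sizeof(struct epoll_event) * MAX_EPOLL_DESCRIPTORS;
}
```

<p>With a 12-byte <code>struct epoll_event</code> this comes to 786,432 bytes, which is 376 bytes short of the 786,808 (<code>0xc0178</code>) subtracted from <code>%rsp</code>. The difference is presumably the function&#8217;s other locals and padding.</p>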
<p>Doing something like that in C/C++ is usually OK. Avoiding the heap whenever you can is a good idea because you avoid heap-lock contention, fragmenting the heap, and memory overhead for tracking the memory region. <b>When writing Ruby extensions, this isn&#8217;t necessarily true.</b> Remember, Ruby&#8217;s GC algorithm scans the <i>entire process stack</i> searching for references to Ruby objects. This EventMachine code causes Ruby to search an <i>extra</i> ~800,000 bytes, drastically slowing down garbage collection.</p>
<h2>The patch</h2>
<p>Get the patch <a href="http://github.com/eventmachine/eventmachine/commit/1f6a4c912256b8110af94e270f7dde486f3c9d75">HERE</a></p>
<p> The patch simply moves the stack allocated <code>struct epoll_event ev</code> to the class definition so that it is allocated on the heap when an instance of the class is created with <code>new</code>. This <b>does not</b> change the memory usage of the process at all. It just moves the object off the stack. This makes all the difference because Ruby&#8217;s GC scans the <i>process stack</i> and <b>not</b> the process heap.</p>
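<p>The same idea in plain C might look like the sketch below. It is an analogy to the C++ change, not the patch itself: <code>struct reactor</code>, <code>reactor_new</code>, <code>reactor_run_once</code> and <code>reactor_free</code> are hypothetical names.</p>

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/epoll.h>
#include <unistd.h>

enum { MAX_EVENTS = 64 * 1024 };

/* Hypothetical C stand-in for the EventMachine_t class: the event
 * buffer is a member, so it lives wherever the object lives. */
struct reactor {
    int epfd;
    struct epoll_event ev[MAX_EVENTS];
};

/* Heap-allocate the reactor; the big array never touches the stack. */
static struct reactor *reactor_new(void)
{
    struct reactor *r = calloc(1, sizeof *r);
    if (r == NULL)
        return NULL;
    r->epfd = epoll_create1(0);
    if (r->epfd < 0) {
        free(r);
        return NULL;
    }
    return r;
}

/* One poll iteration: this frame holds only a couple of scalars. */
static int reactor_run_once(struct reactor *r, int timeout_ms)
{
    return epoll_wait(r->epfd, r->ev, MAX_EVENTS, timeout_ms);
}

static void reactor_free(struct reactor *r)
{
    if (r != NULL) {
        close(r->epfd);
        free(r);
    }
}
```

<p>The process uses the same amount of memory either way. The difference is that a conservative stack scan no longer has to walk the event buffer.</p>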
<p>On top of all that, this patch helps with Ruby&#8217;s green threads, too. If the <code>epoll_wait</code> causes a Ruby event to fire and that event creates a Ruby thread, that Ruby thread gets an entire <b>copy</b> of the existing stack. Each time that thread is switched into and out of, that thread stack has to be memcpy&#8217;d into and out of place. Reducing those memcpys by ~800,000 bytes is a <b>HUGE</b> performance win. Want to learn more about threading implementations? Check out my threading models post: <a href="http://timetobleed.com/threading-models-so-many-different-ways-to-get-stuff-done/">here</a>.
</p>
<p>
Fixing this turned out to be pretty simple. A six (<b>6!!</b>) line patch:
</p>
<ul>
<li>Speeds up GC by <b>2-3x</b> because of the <i>huge</i> decrease in stack frame size.</li>
<li>Fixes an open bug in EventMachine where using threads with Epoll causes lots of slowness. The reason is that each thread will <b>inherit an ~800,000 byte stack</b> that gets copied in and out <b>every context switch</b>.</li>
<li>This results in an increase from <b>500 requests/sec to 7000 requests/sec</b> when using Sinatra+Thin+Epoll+Threads. <b>That is pretty ill.</b></li>
</ul>
<h2>Conclusion</h2>
<p>All in all, a productive debugging session lasting about an hour. The result was a simple patch, with 2 big performance improvements.</p>
<p>A couple things to take away from this experience:</p>
<ul>
<li>Spend time learning your debugging tools because it pays off, especially <code>nm</code>, <code>objdump</code>, and of course <code>GDB</code>.</li>
<li>Getting familiar with x86_64 assembly is crucial if you hope to debug complex software and optimize it correctly.</li>
</ul>
<p>Keep your eyes open for upcoming blog posts about x86_64 assembly! Don&#8217;t forget to <a href="http://feeds.feedburner.com/TimeToBleed" rel="alternate" type="application/rss+xml">subscribe via RSS</a> or <a href="http://twitter.com/joedamato">follow me on twitter</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://timetobleed.com/6-line-eventmachine-bugfix-2x-faster-gc-1300-requestssec/feed/</wfw:commentRss>
		<slash:comments>46</slash:comments>
		</item>
		<item>
		<title>Yo Dawg: Using a package management system to install a package management system</title>
		<link>http://timetobleed.com/yo-dawg-using-a-package-management-system-to-install-a-package-management-system/</link>
		<comments>http://timetobleed.com/yo-dawg-using-a-package-management-system-to-install-a-package-management-system/#comments</comments>
		<pubDate>Mon, 27 Apr 2009 05:21:37 +0000</pubDate>
		<dc:creator>Joe Damato</dc:creator>
				<category><![CDATA[scaling]]></category>
		<category><![CDATA[systems]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[package management]]></category>
		<category><![CDATA[patches]]></category>

		<guid isPermaLink="false">http://timetobleed.com/?p=183</guid>
		<description><![CDATA[Consider the following scenario: You would like to run a common Linux distro (Debian Etch, Centos/RHEL, whatever) for stability, the large community surrounding it, and maybe even for third-party support. There&#8217;s a catch though. You also want to easily use and deploy a small number of custom packages. Why? Maybe you want to apply a [...]]]></description>
			<content:encoded><![CDATA[<p><center><img src="http://timetobleed.com/images/packages.jpg" alt="" width="400" height="300" /></center><br />
</p>
<p>Consider the following scenario: You would like to run a common Linux distro (Debian Etch, Centos/RHEL, whatever) for stability, the large community surrounding it, and maybe even for third-party support.<br />
<br />
There&#8217;s a catch though.<br />
<br />
You also want to easily use and deploy a small number of custom packages. Why? Maybe you want to apply a patch for a library, compiler, interpreter, or something else you use. Sure, you could build a <code>.deb</code> or <code>.rpm</code>, but there is a bit of a learning curve; is that learning curve worth it just so you can apply a handful of patches?<br />
At Kickball Labs, we wanted to use the &#8220;stable&#8221; versions of packages that come bundled with Debian for the base system, but we also wanted to be able to use new packages that have features we are interested in. We decided to layer <code>pacman</code> on top of <code>apt</code> and install a small number of custom packages to a <code>/custom</code> directory on the filesystem. This enables us to use stable packages by default, but lets us override them when we feel it is necessary.
</p>
<h2>What sucks about <code>RPM</code> and <code>APT</code> (imho)</h2>
<ol>
<li><b>Getting other people to use them</b> &#8211; OK, so you&#8217;ve bought in to <code>RPM</code> or <code>APT</code> and you don&#8217;t mind reading all the docs and cuddling up with the man pages. But what about the <b>rest of your team?</b> Unless there is only one person constantly cranking out custom packages, everyone is going to have to learn <code>RPM</code> or <code>APT</code>. Do you really want to waste valuable engineer brain cycles reading and debugging busted packages when instead you could be writing code?</li>
<li><b>Too much work to add 1 patch</b> &#8211; Let&#8217;s say I want to add one patch to <code>libX</code> to fix a memory leak. Here&#8217;s what I have to do for Debian packages:
<ol>
<li>Download and unpack the library source.</li>
<li>Add a <code>debian/</code> sub-directory.</li>
<li>Create a <code>changelog</code>, <code>control</code>, and <code>files</code> file.</li>
<li>Create a file with a list of the patches that are being applied.</li>
<li>Drop in the patch.</li>
<li>Test the package.</li>
</ol>
<p><b>Wow.</b> <i>Extremely</i> painful. Especially for just one patch. Hell, you might even <b>throw the deb away</b> afterwards if you decide you don&#8217;t like the patch.</p></li>
<li><b>Source control</b> &#8211; So you don&#8217;t mind the previous points. They don&#8217;t bother you all <i>that</i> much. But what about source control? How do you keep track of your Debian package files? You <i>could</i> keep an entire copy of the library&#8217;s source with your <code>debian/</code> sub-directory in your <code>git/svn/whatever</code>. That kind of sucks, though. What if you got your source code from the <code>git/svn</code> of the project instead of via a tarball? Yeah, I <i>guess</i> you could put all that into source control too. You could also check in your <code>debian/</code> sub-dirs into a repository and then symlink them into the source for the library&#8230;. <b>What a pain.</b>
</ol>
<h2><code>pacman</code> and the almighty <code>PKGBUILD</code></h2>
<p>
This is where <a href="http://www.archlinux.org/pacman/"><code>pacman</code></a> saves the day.</p>
<ol>
<li><a href="http://www.archlinux.org/pacman/pacman.8.html"><code>pacman</code></a> is simple &#8211; It doesn&#8217;t try to solve Global Warming. It just provides a dead simple set of command line switches for installing, removing, upgrading, and syncing packages. Not many options, but that is <b>exactly</b> what I want. You can just put a bunch of packages in a directory, point a web server at it, and it&#8217;s a <code>pacman</code> package server.</li>
<li><a href="http://www.archlinux.org/pacman/PKGBUILD.5.html"><code>PKGBUILD</code></a> files are simple &#8211; <code>PKGBUILD</code> files are just plain text files with a few fields. The fields are easy to understand and you can learn how to write your first PKGBUILD in 5 minutes.</li>
<li>Easy to use with source control &#8211; Since the actual <code>PKGBUILD</code> file is plain text, your source control system should be able to easily keep track of changes. You don&#8217;t need to check in all the source, either. You can just point the PKGBUILD at a URL and it will automagically run wget and unpack the source. You can include a source tarball if you really want to, of course.</li>
<li>Quickly create a new <code>PKGBUILD</code> or add a patch to an existing one &#8211; To add a new patch to an existing <code>PKGBUILD</code>, I just add the filename to the <code>source=()</code> array and add a <code>patch -pN &lt; file</code> line, and I'm done. If the <code>PKGBUILD</code> doesn't exist, I can easily create a new one because the file format is <b>dead simple</b>.</li>
</ol>
<h2>Getting it on Debian</h2>
<p>This part is kind of weird. We want to get <code>pacman</code> on Debian. There isn't an <code>apt</code> package, so what now? Well, we can build a <code>.deb</code> file that installs <code>pacman</code> so we can use PKGBUILDs. Basically, we use a package management system to <i>install</i> a package management system.<br />
<br />
There's gotta be a "Yo Dawg" in there somewhere.<br />
<br />
Get it <a href="http://timetobleed.com/files/pacman-arch_3.2.1-2_amd64.deb">here</a> and be sure to get its dependency (libdownload) <a href="http://timetobleed.com/files/libdownload_1.3-1_amd64.deb">here</a>.</p>
<h2>A look at some PKGBUILDs</h2>
<p>
Let's take a look at some PKGBUILDs that we use at Kickball Labs.
</p>
<p>The first is a simple PKGBUILD for ltrace, a program like <a href="http://timetobleed.com/hello-world/">strace</a> but for library calls. It just downloads the source, passes in some custom options to configure, builds the binary, and then installs to the package directory.</p>
<div class="dean_ch" style="white-space: wrap;">
<span class="re2">pkgname=</span>ltrace<br />
<span class="re2">pkgver=</span><span class="nu0">0.5</span><span class="nu0">.1</span><br />
<span class="re2">pkgrel=</span><span class="nu0">1</span><br />
<span class="re2">pkgdesc=</span><span class="st0">&quot;ltrace is a debugging program which runs a specified command until it exits&quot;</span><br />
<span class="re2">url=</span><span class="st0">&quot;http://packages.debian.org/unstable/utils/ltrace&quot;</span><br />
<span class="re2">arch=</span><span class="br0">&#40;</span><span class="st0">'x86_64'</span><span class="br0">&#41;</span><br />
<span class="re2">source=</span><span class="br0">&#40;</span>http://<span class="kw2">ftp</span>.debian.org/debian/pool/main/l/ltrace/<span class="re0">$<span class="br0">&#123;</span>pkgname<span class="br0">&#125;</span></span>_<span class="re0">$<span class="br0">&#123;</span>pkgver<span class="br0">&#125;</span></span>.orig.<span class="kw2">tar</span>.gz<span class="br0">&#41;</span></p>
<p>build<span class="br0">&#40;</span><span class="br0">&#41;</span><br />
<span class="br0">&#123;</span><br />
&nbsp; <span class="kw3">cd</span> <span class="re1">$startdir</span>/src/<span class="re1">$pkgname</span>-<span class="re1">$pkgver</span></p>
<p>&nbsp; ./configure --<span class="re2">prefix=</span>/custom --<span class="re2">sysconfdir=</span>/custom/etc<br />
&nbsp; <span class="kw2">make</span> || <span class="kw3">return</span> <span class="nu0">1</span><br />
&nbsp; <span class="kw2">make</span> <span class="re2">DESTDIR=</span><span class="re1">$startdir</span>/pkg <span class="kw2">install</span><br />
<span class="br0">&#125;</span></div>
<p>
<p>
Download it <a href="http://timetobleed.com/files/pkgbuild-ltrace">here.</a></p>
<p>This next PKGBUILD is a bit more intense. It is our PKGBUILD for Ruby, with a bunch of extra patches (<a href="http://timetobleed.com/fibers-implemented-for-ruby-1867/">fibers</a>, <a href="http://timetobleed.com/plugging-ruby-memory-leaks-heapstack-dump-patches-to-help-take-out-the-trash/">ruby GC patches</a>, and <a href="http://timetobleed.com/ruby-threading-bugfix-small-fix-goes-a-long-way/">ruby thread bugfixes</a>).</p>
<div class="dean_ch" style="white-space: wrap;">
<span class="re2">pkgname=</span>ruby<br />
<span class="re2">pkgver=</span><span class="nu0">1.8</span>.7_p72<br />
<span class="re2">_pkgver=</span><span class="nu0">1.8</span><span class="nu0">.7</span>-p72<br />
<span class="re2">pkgrel=</span><span class="nu0">27</span><br />
<span class="re2">pkgdesc=</span><span class="st0">&quot;An object-oriented language for quick and easy programming&quot;</span><br />
<span class="re2">arch=</span><span class="br0">&#40;</span>i686 x86_64<span class="br0">&#41;</span><br />
<span class="re2">license=</span><span class="br0">&#40;</span><span class="st0">'custom'</span><span class="br0">&#41;</span><br />
<span class="re2">url=</span><span class="st0">&quot;http://www.ruby-lang.org/en/&quot;</span><br />
<span class="re2">depends=</span><span class="br0">&#40;</span>google-perftools<span class="br0">&#41;</span><br />
<span class="re2">provides=</span><span class="br0">&#40;</span>ruby<span class="br0">&#41;</span><br />
<span class="re2">conflicts=</span><span class="br0">&#40;</span>ruby<span class="br0">&#41;</span><br />
<span class="re2">source=</span><span class="br0">&#40;</span><span class="kw2">ftp</span>://<span class="kw2">ftp</span>.ruby-lang.org/pub/ruby/stable/ruby-<span class="re0">$<span class="br0">&#123;</span>_pkgver<span class="br0">&#125;</span></span>.<span class="kw2">tar</span>.bz2 thread_timer.<span class="kw2">patch</span> fibers.<span class="kw2">patch</span> ruby<span class="nu0">-186</span>-gc-new.<span class="kw2">patch</span> dump_heap.<span class="kw2">patch</span><span class="br0">&#41;</span></p>
<p><span class="re2">options=</span><span class="br0">&#40;</span><span class="st0">'!emptydirs'</span> <span class="st0">'force'</span><span class="br0">&#41;</span></p>
<p>build<span class="br0">&#40;</span><span class="br0">&#41;</span> <span class="br0">&#123;</span><br />
&nbsp; <span class="kw2">sudo</span> apt-get <span class="kw2">install</span> libreadline5-dev zlib1g-dev libncurses5-dev libssl-dev libgdbm-dev libdb4<span class="nu0">.4</span>-dev</p>
<p>&nbsp; <span class="kw3">cd</span> <span class="re0">$<span class="br0">&#123;</span>startdir<span class="br0">&#125;</span></span>/src/<span class="re0">$<span class="br0">&#123;</span>pkgname<span class="br0">&#125;</span></span>-<span class="re0">$<span class="br0">&#123;</span>_pkgver<span class="br0">&#125;</span></span></p>
<p>&nbsp; <span class="kw2">patch</span> -p1 &lt; <span class="re0">$<span class="br0">&#123;</span>startdir<span class="br0">&#125;</span></span>/src/fibers.<span class="kw2">patch</span> || <span class="kw3">return</span> <span class="nu0">1</span><br />
&nbsp; <span class="kw2">patch</span> -p0 &lt; <span class="re0">$<span class="br0">&#123;</span>startdir<span class="br0">&#125;</span></span>/src/thread_timer.<span class="kw2">patch</span> || <span class="kw3">return</span> <span class="nu0">1</span><br />
&nbsp; <span class="kw2">patch</span> -p1 &lt; <span class="re0">$<span class="br0">&#123;</span>startdir<span class="br0">&#125;</span></span>/src/ruby<span class="nu0">-186</span>-gc-new.<span class="kw2">patch</span> || <span class="kw3">return</span> <span class="nu0">1</span><br />
&nbsp; <span class="kw2">patch</span> -p1 &lt; <span class="re0">$<span class="br0">&#123;</span>startdir<span class="br0">&#125;</span></span>/src/dump_heap.<span class="kw2">patch</span> || <span class="kw3">return</span> <span class="nu0">1</span></p>
<p>&nbsp; <span class="re3"># include /custom <span class="kw1">in</span> cflags/ldflags so extensions compile</span><br />
&nbsp; <span class="kw3">export</span> <span class="re2">CFLAGS=</span><span class="st0">&quot;-I/custom/include -g3 -gdwarf-2 -ggdb -O0&quot;</span><br />
&nbsp; <span class="kw3">export</span> <span class="re2">LDFLAGS=</span><span class="st0">&quot;-L/custom/lib&quot;</span><br />
&nbsp; <span class="kw3">export</span> <span class="re2">LIBS=</span><span class="st0">&quot;-L/custom/lib -ltcmalloc_minimal&quot;</span></p>
<p>&nbsp; ./configure --<span class="re2">prefix=</span>/custom --enable-shared --disable-pthread<br />
&nbsp; <span class="kw2">make</span> || <span class="kw3">return</span> <span class="nu0">1</span><br />
&nbsp; <span class="kw2">make</span> <span class="re2">DESTDIR=</span><span class="re0">$<span class="br0">&#123;</span>startdir<span class="br0">&#125;</span></span>/pkg <span class="kw2">install</span><br />
<span class="br0">&#125;</span></div>
<p>
<p>
Download it <a href="http://timetobleed.com/files/pkgbuild-ruby">here.</a></p>
<h2>Conclusion</h2>
<p>Package management is painful. If you have any plans on building a service that scales to multiple machines, you had better have a good solution for creating and distributing packages. <code>pacman</code> is good for this because:</p>
<ol>
<li>It's easy to learn and use, encouraging you to make everything (from libraries to configuration files and more) a PKGBUILD.</li>
<li>The simple plain text file format works great with your source control system of choice.</li>
<li>Applied a patch you didn't like? Just roll the PKGBUILD file back with your source control system.</li>
<li>Create a <code>PKGBUILD</code> repository by just putting the tarballs generated from your <code>PKGBUILD</code> files in a directory and pointing a web server at it. This is great for bringing up new hardware in a datacenter - just install <code>pacman</code>, point it at your repository, and install your base package, which sets up all your <code>passwd</code>, <code>host</code>, or other config files.</li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://timetobleed.com/yo-dawg-using-a-package-management-system-to-install-a-package-management-system/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>a/b test mallocs against your memory footprint</title>
		<link>http://timetobleed.com/ab-test-mallocs-against-your-memory-footprint/</link>
		<comments>http://timetobleed.com/ab-test-mallocs-against-your-memory-footprint/#comments</comments>
		<pubDate>Tue, 17 Mar 2009 01:39:42 +0000</pubDate>
		<dc:creator>Joe Damato</dc:creator>
				<category><![CDATA[debugging]]></category>
		<category><![CDATA[monitoring]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[scaling]]></category>
		<category><![CDATA[systems]]></category>
		<category><![CDATA[testing]]></category>
		<category><![CDATA[allocator]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[malloc]]></category>
		<category><![CDATA[memory]]></category>
		<category><![CDATA[profiling]]></category>
		<category><![CDATA[system health]]></category>

		<guid isPermaLink="false">http://timetobleed.com/?p=317</guid>
		<description><![CDATA[The other day at Kickball Labs we were discussing whether linking Ruby against tcmalloc (or ptmalloc3, nedmalloc, or any other malloc) would have any noticeable effect on application latency. After taking a side in the argument, I started wondering how we could test this scenario. We had a couple different ideas about testing: Look at [...]]]></description>
			<content:encoded><![CDATA[<p><center><img src="http://timetobleed.com/images/brain.jpg"/></center><br />
</p>
<p>The other day at Kickball Labs we were discussing whether linking Ruby against tcmalloc (or ptmalloc3, nedmalloc, or any other malloc) would have any noticeable effect on application latency. After taking a side in the argument, I started wondering how we could test this scenario.</p>
<p>We had a couple different ideas about testing:</p>
<ul>
<li><b>Look at other people&#8217;s benchmarks</b><br />BUT do the memory workloads tested in the benchmarks actually match our own workload at all?</li>
<li><b>Run different allocators on different Ruby backends</b><br />BUT different backends will get different users who will use the system differently and cause different allocation patterns</li>
<li><b>Try to recreate our application&#8217;s memory footprint and test that against different mallocs</b><br />
BUT how?</li>
</ul>
<p>I decided to explore <strong>the last option</strong> and came up with an interesting solution. Let&#8217;s dive into how to do this.</p>
<h2>Get the code:</h2>
<p><a href="http://github.com/ice799/malloc_wrap/tree/master">http://github.com/ice799/malloc_wrap/tree/master</a><br />
</p>
<h2>Step 1: We need to get a memory footprint of our process</h2>
<p>So we have some random binary (in this case it happens to be a Ruby interpreter, but it could be anything) and we&#8217;d like to track when it calls malloc/realloc/calloc and free (from now on I&#8217;ll refer to all of these as malloc-family for brevity). There are two ways to do this: the right way and the wrong/hacky/unsafe way.</p>
<ul>
<li>
<h3>The &#8220;right&#8221; way to do this, with libc malloc hooks:</h3>
<p>Edit your application code to use the malloc debugging hooks provided by libc. When a malloc-family function is called, your hook executes and outputs to a file which function was called and what arguments were passed to it.</li>
<li>
<h3>The &#8220;wrong/hacky/unsafe&#8221; way to do this, with LD_PRELOAD:</h3>
<p>Create a shim library and point LD_PRELOAD at it. The shim exports the malloc-family symbols, and when your application calls one of those functions, the shim code gets executed. The shim logs which function was called and with what arguments. The shim then calls the libc version of the function (so that memory is actually allocated/freed) and returns control to the application.</li>
</ul>
<p>I chose to do it <strong>the second way</strong>, because I like living on the edge. <strong>The second way is unsafe because you can&#8217;t call any functions which use a malloc-family function before your hooks are set up. If you do, you can end up in an infinite loop and crash the application.</strong></p>
<p>You can check out my implementation for the shim library here: <a href="http://github.com/ice799/malloc_wrap/blob/master/malloc_wrap.c">malloc_wrap.c</a></p>
<h3>Why does your shim output such weirdly formatted data?</h3>
<p>The answer is sort of complicated, but let&#8217;s keep it simple: I originally had a different idea about how I was going to use the output. When that first try failed, I tried something else and translated the data to the format I needed it in, instead of rewriting the shim. What can I say, I&#8217;m a lazy programmer.</p>
<p>OK, so once you&#8217;ve built the shim (<b>gcc -O2 -Wall -fPIC -shared -o malloc_wrap.so malloc_wrap.c -ldl</b>), you can launch your binary like this:</p>
<div class="dean_ch" style="white-space: wrap;">% <span class="re2">LD_PRELOAD=</span>/path/to/shim/malloc_wrap.so /path/to/your/binary -your -args</div>
<p>You should now see output in <code>/tmp/malloc-footprint.PID</code>, where PID is the process ID of your binary.</p>
<h2>Step 2: Translate the data into a more usable format</h2>
<p>Yeah, I should have gone back and rewritten the shim, but nothing happens exactly as planned. So, I wrote a quick Ruby script to convert my output into a more usable format. The script sorts through the output and renames memory addresses to unique integer IDs starting at 1 (0 is hardcoded to NULL).</p>
<p>The format is pretty simple. The first line of the file has the number of calls to malloc-family functions, followed by a blank line, and then the memory footprint. Each line of the memory footprint has 1 character which represents the function called followed by a few arguments. For the free() function, there is only one argument, the ID of the memory block to free. malloc/calloc/realloc have different arguments, but the first argument following the one character is always the ID of the return value. The next arguments are the arguments usually passed to malloc/calloc/realloc in the same order.</p>
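<p>To make the renaming concrete, here is a minimal sketch of the idea. It is <b>not</b> the real <code>build_trace_file.rb</code>: the input events (plain Ruby arrays with hex-string addresses), the one-letter codes <code>m</code>/<code>c</code>/<code>r</code>/<code>f</code>, and the choice to hand out a fresh ID every time an address is returned are all assumptions made for illustration. Failed reallocs are not handled.</p>

```ruby
# Sketch only: the event layout and the letter codes are hypothetical.
# Each event is [function, *values], with addresses as hex strings.
def build_trace(events)
  ids = { "0x0" => 0 }     # ID 0 is reserved for NULL
  next_id = 0

  # Every non-NULL address returned by an allocation gets a fresh ID.
  fresh = lambda do |addr|
    addr == "0x0" ? 0 : (ids[addr] = (next_id += 1))
  end

  # Look up the ID of a block that is going away and forget its address,
  # so a later allocation at the same address is treated as a new block.
  retire = lambda do |addr|
    addr == "0x0" ? 0 : ids.delete(addr) || 0
  end

  lines = events.map do |func, *args|
    case func
    when :malloc              # [:malloc, ret, size]
      ret, size = args
      "m #{fresh.call(ret)} #{size}"
    when :calloc              # [:calloc, ret, nmemb, size]
      ret, nmemb, size = args
      "c #{fresh.call(ret)} #{nmemb} #{size}"
    when :realloc             # [:realloc, ret, old_ptr, size]
      ret, old, size = args
      old_id = retire.call(old)
      "r #{fresh.call(ret)} #{old_id} #{size}"
    when :free                # [:free, ptr]
      "f #{retire.call(args.first)}"
    end
  end

  # Header: number of calls, then a blank line, then one record per call.
  [lines.size.to_s, "", *lines].join("\n")
end

sample = [
  [:malloc,  "0x1000", 32],
  [:realloc, "0x2000", "0x1000", 64],
  [:free,    "0x2000"],
]
puts build_trace(sample)
```

<p>For the three sample events this prints a <code>3</code>, a blank line, and then <code>m 1 32</code>, <code>r 2 1 64</code>, and <code>f 2</code>. Working with small integer IDs instead of raw addresses means the replayer can keep its live blocks in a simple array.</p>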
<p>Have a look at my ruby script here: <a href="http://github.com/ice799/malloc_wrap/blob/master/build_trace_file.rb">build_trace_file.rb</a></p>
<p>It might take a while to convert your data to this format, so I suggest running this in a <a href="http://www.gnu.org/software/screen/">screen</a> session, especially if your memory footprint data is large. Just as a warning, we collected 15 <b>gigabytes</b> of data over a 10 hour period. This script took <b>10 hours</b> to convert the data. We ended up with a 7.8 gigabyte file.</p>
<div class="dean_ch" style="white-space: wrap;">% ruby /path/to/script/build_trace_file.rb /path/to/raw/malloc-footprint.PID /path/to/converted/my-memory-footprint</div>
<h2>Step 3: Replay the allocation data with different allocators and measure time and memory usage</h2>
<p>OK, so we now have a file which represents the memory footprint of our application. It&#8217;s time to build the replayer, link against your malloc implementation of choice, fire it up and start measuring time spent in allocator functions and memory usage.</p>
<p>Have a look at the replayer here: <a href="http://github.com/ice799/malloc_wrap/blob/master/alloc_tester.c">alloc_tester.c</a><br />
Build the replayer: <b>gcc -ggdb -Wall -fPIC -o tester alloc_tester.c -ldl</b></p>
<h3>Use ltrace</h3>
<p>ltrace is similar to <a href="http://timetobleed.com/hello-world/">strace</a>, but for library calls. You can use <code>ltrace -c</code> to sum the amount of time spent in different library calls and output a cool table at the end. It will look something like this:</p>
<pre>
% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
86.70   37.305797          62    600003 fscanf
10.64    4.578968          33    138532 malloc
2.36    1.014294          18     55263 free
0.25    0.109550          18      5948 realloc
0.03    0.011407          45       253 printf
0.02    0.010665          42       252 puts
0.00    0.000167          20         8 calloc
0.00    0.000048          48         1 fopen
------ ----------- ----------- --------- --------------------
100.00   43.030896                800260 total</pre>
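<p>Most of the time in that table goes to <code>fscanf</code>, which is presumably the replayer reading the trace file rather than the allocator doing work. When comparing mallocs, the number you care about is the time in the malloc-family rows. Here is a rough sketch that pulls that out of <code>ltrace -c</code> output; the function name <code>allocator_seconds</code> is made up for this example and the parsing assumes the column layout shown above.</p>

```ruby
# Sketch only: sums the "seconds" column for malloc-family rows of an
# `ltrace -c` summary. Assumes the five-column layout shown in the post.
ALLOC_FUNCS = %w[malloc calloc realloc free]

def allocator_seconds(ltrace_summary)
  ltrace_summary.each_line.inject(0.0) do |sum, line|
    # Header, separator, and total lines never end in a malloc-family
    # name, so they fall through untouched.
    _pct, seconds, _usecs, _calls, func = line.split
    ALLOC_FUNCS.include?(func) ? sum + seconds.to_f : sum
  end
end

sample = [
  "86.70   37.305797          62    600003 fscanf",
  "10.64    4.578968          33    138532 malloc",
  " 2.36    1.014294          18     55263 free",
  " 0.25    0.109550          18      5948 realloc",
  " 0.00    0.000167          20         8 calloc",
  "100.00   43.030896                800260 total",
].join("\n")

printf("%.6f seconds spent in malloc-family calls\n", allocator_seconds(sample))
```

<p>Run this once per allocator you replay against and compare the totals, instead of comparing the whole-run time that is dominated by reading the trace.</p>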
<h2>Conclusion</h2>
<p>Using a different malloc implementation can provide speed or memory improvements, depending on your allocation patterns. Hopefully the code provided will help you test different allocators to determine whether or not swapping out the default libc allocator is the right choice for you. Our results are still pending; we had a lot of allocator data (15g!) and it takes several hours to replay the data with just one malloc implementation. Once we&#8217;ve gathered some data about the different implementations and their effects, I&#8217;ll post the results and some analysis. As always, stay tuned and thanks for reading!</p>
]]></content:encoded>
			<wfw:commentRss>http://timetobleed.com/ab-test-mallocs-against-your-memory-footprint/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Fibers implemented for Ruby 1.8.{6,7}</title>
		<link>http://timetobleed.com/fibers-implemented-for-ruby-1867/</link>
		<comments>http://timetobleed.com/fibers-implemented-for-ruby-1867/#comments</comments>
		<pubDate>Thu, 05 Feb 2009 22:25:56 +0000</pubDate>
		<dc:creator>Joe Damato</dc:creator>
				<category><![CDATA[ruby]]></category>
		<category><![CDATA[scaling]]></category>
		<category><![CDATA[systems]]></category>
		<category><![CDATA[fibers]]></category>
		<category><![CDATA[patch]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[threading]]></category>

		<guid isPermaLink="false">http://timetobleed.com/?p=231</guid>
		<description><![CDATA[At Kickball Labs, Aman Gupta (http://github.com/tmm1) and I (http://github.com/ice799) have been working on an implementation of Fibers for Ruby 1.8.{6,7}. It is API compatible to Fibers in Ruby 1.9, except for the &#8220;transfer&#8221; method, which is currently unimplemented. This patch will allow you to use fibers with mysqlplus and neverblock. THIS IS ALPHA SOFTWARE (we [...]]]></description>
			<content:encoded><![CDATA[<p><center><img src="http://timetobleed.com/images/fibers.jpg" alt="" width="400" height="300" /></center></p>
<p>At Kickball Labs, Aman Gupta (<a href="http://github.com/tmm1">http://github.com/tmm1</a>) and I (<a href="http://github.com/ice799">http://github.com/ice799</a>) have been working on an implementation of Fibers for Ruby 1.8.{6,7}. It is <strong>API compatible with Fibers in Ruby 1.9</strong>, except for the &#8220;transfer&#8221; method, which is currently unimplemented. This patch will allow you to use fibers with <a href="http://github.com/oldmoe/mysqlplus/tree/master">mysqlplus</a> and <a href="http://www.espace.com.eg/neverblock">neverblock</a>.</p>
<p><center><strong>THIS IS ALPHA SOFTWARE (we are using it in production, though), USE WITH CAUTION.</strong></center></p>
<h2>Raw patches</h2>
<p>Patch against ruby-1.8.7_p72: <strong><a href="http://timetobleed.com/files/fibers-187p72.patch">HERE</a></strong>. </p>
<p>Patch against ruby-1.8.6_p287: <strong><a href="http://timetobleed.com/files/fibers-186p287.patch">HERE</a></strong>.</p>
<p>To use the patch:<br />
Download the Ruby source for <a href="ftp://ftp.ruby-lang.org/pub/ruby/1.8/ruby-1.8.7-p72.tar.gz">Ruby 1.8.7-p72</a> or, if you prefer, <a href="ftp://ftp.ruby-lang.org/pub/ruby/1.8/ruby-1.8.6-p287.tar.gz">Ruby 1.8.6-p287</a>.</p>
<p>Then, perform the following:</p>
<div class="dean_ch" style="white-space: wrap;">
<span class="kw3">cd</span> your-ruby-src-directory/<br />
<span class="kw2">wget</span> http://timetobleed.com/files/fibers-RUBY_VERSION.<span class="kw2">patch</span><br />
<span class="kw2">patch</span> -p1 &lt; fibers-RUBY_VERSION.<span class="kw2">patch</span><br />
./configure --disable-pthread --<span class="re2">prefix=</span>/tmp/ruby-with-fibers/ &amp;&amp; <span class="kw2">make</span> &amp;&amp; <span class="kw2">sudo</span> <span class="kw2">make</span> <span class="kw2">install</span><br />
/tmp/ruby-with-fibers/bin/ruby <span class="kw3">test</span>/test_fiber.rb</div>
<p>This will patch ruby and install it to a custom location: /tmp/ruby-with-fibers so you can test and play around with it without overwriting your existing Ruby installation.</p>
<h2>Github</h2>
<p>I am currently working on getting the ruby 1.8.6 patched code up on github, but Aman has a branch of ruby 1.8.7_p72 called fibers with the code at <a href="http://github.com/tmm1/ruby187/tree/fibers">http://github.com/tmm1/ruby187/tree/fibers</a></p>
<h2>What are fibers?</h2>
<p>Fibers are (usually) non-preemptible lightweight user-land threads.</p>
<h3>But I thought Ruby 1.8.{6,7} already had green threads?</h3>
<p>You are right; it does. Fibers are simply Ruby green threads without preemption. The programmer (you), rather than a timer, gets to decide when to pause and resume execution of a fiber.</p>
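<p>Here is a tiny sketch of what &#8220;you decide when to pause and resume&#8221; looks like. It is written against the Ruby 1.9-style <code>Fiber</code> API that the patch aims to match; the variable names are just for illustration.</p>

```ruby
log = []

fiber = Fiber.new do
  log.push("step 1")
  Fiber.yield            # pause here until someone calls resume again
  log.push("step 2")
  :finished              # value of the block is returned by the last resume
end

fiber.resume             # runs the block up to Fiber.yield
log.push("main")         # the fiber is paused; nothing preempts this code
result = fiber.resume    # continues right after Fiber.yield

puts log.inspect
puts result.inspect
```

<p>The log ends up as <code>step 1</code>, <code>main</code>, <code>step 2</code>: nothing inside the fiber runs unless the surrounding code explicitly resumes it, and the fiber only gives up control where it calls <code>Fiber.yield</code>.</p>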
<h2>Why would I use fibers?</h2>
<p>Bottom line: Your I/O should be asynchronous whenever possible, but sometimes re-writing your entire code base to be asynch and have callbacks can be difficult or painful. A simple solution to this problem is to create or use (see: NeverBlock) some middleware that wraps code paths which make I/O requests in a fiber.</p>
<p>The middleware can issue the asynch I/O operation in a fiber, and yield. Once the middleware&#8217;s asynch callback is hit, the Fiber can be resumed. Using NeverBlock (or rolling something similar yourself) <strong>should require only minimal code changes to your application, and will essentially make all of your I/O requests asynchronous without much pain at all.</strong></p>
<h2>How do I use fibers?</h2>
<p>There are already lots of great tutorials about fiber basics <a href="http://www.infoq.com/news/2007/08/ruby-1-9-fibers">here</a> and <a href="http://www.davidflanagan.com/2007/08/">here</a>.</p>
<p>Let&#8217;s take a look at something that drives home the point about being able to drop in some middleware to make synchronous code act asynchronous with minimal changes. </p>
<p>Consider the following code snippet:</p>
<div class="dean_ch" style="white-space: wrap;">
<span class="kw3">require</span> <span class="st0">'rubygems'</span><br />
<span class="kw3">require</span> <span class="st0">'sinatra'</span></p>
<p><span class="co1"># eventmachine/thin</span><br />
<span class="kw3">require</span> <span class="st0">'eventmachine'</span><br />
<span class="kw3">require</span> <span class="st0">'thin'</span></p>
<p><span class="co1"># mysql</span><br />
<span class="kw3">require</span> <span class="st0">'mysqlplus'</span></p>
<p><span class="co1"># single threaded</span><br />
DB = Mysql.<span class="me1">connect</span></p>
<p>disable <span class="re3">:reload</span></p>
<p>get <span class="st0">'/'</span> <span class="kw1">do</span><br />
&nbsp; <span class="nu0">4</span>.<span class="me1">times</span> <span class="kw1">do</span><br />
&nbsp; &nbsp; DB.<span class="me1">query</span><span class="br0">&#40;</span><span class="st0">'select sleep(0.25)'</span><span class="br0">&#41;</span><br />
&nbsp; <span class="kw1">end</span><br />
&nbsp; <span class="st0">'done'</span><br />
<span class="kw1">end</span><br />
&nbsp;</div>
<p>This code snippet creates a simple webservice which connects to a mysql database and issues long running queries (in this case, 4 queries which execute for a total of 1 second). </p>
<p>In this implementation, only one request can be handled at a time; the DB.query blocks, so the other users have to wait to have their queries executed.</p>
<p>This sucks because certainly mysql can handle more than just 4 sleep(0.25) queries a second! But, what are our options? </p>
<p>Well, we can rewrite the code to be asynchronous and string together some callbacks. For my contrived example, doing that would be pretty easy and it&#8217;d be only slightly harder to read. Let&#8217;s use our imaginations. Let&#8217;s pretend the code snippet I just showed you was some <i>huge, ugly, scary</i> blob of code and rewriting it to be asynchronous would not only take a long time, it would also make the code very ugly and difficult to read.</p>
<p>Now, let&#8217;s drop in fibers:</p>
<div class="dean_ch" style="white-space: wrap;">
<span class="kw3">require</span> <span class="st0">&#8216;rubygems&#8217;</span><br />
<span class="kw3">require</span> <span class="st0">&#8216;sinatra&#8217;</span></p>
<p><span class="co1"># eventmachine/thin</span><br />
<span class="kw3">require</span> <span class="st0">&#8216;eventmachine&#8217;</span><br />
<span class="kw3">require</span> <span class="st0">&#8216;thin&#8217;</span></p>
<p><span class="co1"># mysql</span><br />
<span class="kw3">require</span> <span class="st0">&#8216;mysqlplus&#8217;</span></p>
<p><span class="co1"># fibered</span><br />
<span class="kw3">require</span> <span class="st0">&#8216;neverblock&#8217;</span><br />
<span class="kw3">require</span> <span class="st0">&#8216;never_block/servers/thin&#8217;</span><br />
<span class="kw3">require</span> <span class="st0">&#8216;neverblock-mysql&#8217;</span><br />
<span class="kw1">class</span> <span class="re2">Thin::Server</span><br />
&nbsp;<span class="kw1">def</span> fiber_pool<span class="br0">&#40;</span><span class="br0">&#41;</span> <span class="re1">@fiber_pool</span> ||= <span class="re2">NB::Pool::FiberPool</span>.<span class="me1">new</span><span class="br0">&#40;</span><span class="nu0">20</span><span class="br0">&#41;</span> <span class="kw1">end</span><br />
<span class="kw1">end</span></p>
<p>DB = <span class="re2">NB::DB::PooledDBConnection</span>.<span class="me1">new</span><span class="br0">&#40;</span><span class="nu0">20</span><span class="br0">&#41;</span><span class="br0">&#123;</span> <span class="re2">NB::DB::FMysql</span>.<span class="me1">connect</span> <span class="br0">&#125;</span></p>
<p>
disable <span class="re3">:reload</span></p>
<p>get <span class="st0">'/'</span> <span class="kw1">do</span><br />
&nbsp; <span class="nu0">4</span>.<span class="me1">times</span> <span class="kw1">do</span><br />
&nbsp; &nbsp; DB.<span class="me1">query</span><span class="br0">&#40;</span><span class="st0">'select sleep(0.25)'</span><span class="br0">&#41;</span><br />
&nbsp; <span class="kw1">end</span><br />
&nbsp; <span class="st0">'done'</span><br />
<span class="kw1">end</span><br />
&nbsp;</div>
<p><b>NOTICE: The application code hasn&#8217;t changed</b>; we simply monkey-patched Thin to use a pool of fibers. </p>
<p>Suddenly, our application can handle 20 connections at once. This is all handled by NeverBlock and mysqlplus. </p>
<ul>
<li> NeverBlock uses the fiber pool to issue an async DB query via mysqlplus.</li>
<li> After the async query is issued, NeverBlock pauses the executing fiber.</li>
<li> At this point other requests can be serviced</li>
<li> When the data comes back from the mysql server, a callback in NeverBlock is executed. </li>
<li> The callback resumes the paused fiber, which continues executing.</li>
</ul>
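<p>If you want to see the shape of that pause/resume dance without installing anything, here is a minimal sketch. It assumes the core Fiber class that ships with Ruby 1.9 and later (not the 1.8 patch described in this post), and everything else in it is made up for illustration: PENDING is a toy stand-in for the event loop, and async_query is a toy stand-in for an async driver call. It is not how NeverBlock is actually written; it only shows the same idea of parking a fiber and resuming it from a callback.</p>

```ruby
# Fiber is in core on Ruby 1.9+; on older 1.9/2.x, this require adds Fiber.current.
require 'fiber'

# Hypothetical stand-in for the event loop: callbacks waiting for their "data".
PENDING = []
# Records the order in which things happen.
LOG = []

# Hypothetical stand-in for an async driver call: it only registers a callback
# and returns immediately, without waiting for a result.
def async_query(sql, &callback)
  PENDING << lambda { callback.call("result of #{sql}") }
end

# Looks synchronous to the caller, but parks the current fiber until the
# callback fires and resumes it with the result.
def query(sql)
  fiber = Fiber.current
  async_query(sql) { |result| fiber.resume(result) }
  Fiber.yield # pause here; control goes back to whoever resumed this fiber
end

# Each request runs in its own fiber, so the request body can stay "blocking".
def handle_request(id)
  Fiber.new do
    LOG << "request #{id}: query sent"
    result = query("select #{id}")
    LOG << "request #{id}: got #{result}"
  end.resume
end

handle_request(1)
handle_request(2)

# Toy event loop: "data comes back", so run each callback, which resumes
# the fiber that was paused waiting for it.
PENDING.shift.call until PENDING.empty?

puts LOG
```

<p>Both requests get their query sent before either one sees a result, even though the body of handle_request reads like plain blocking code.</p>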
<p>Pretty sick, right?</p>
<h2>Memory consumption, context switches, cooperative multi-threading, oh my!</h2>
<p>In our implementation, fibers <em>are</em> ruby green threads, but with no scheduler or preemption. In fact, our fiber implementation shares many code-paths with the existing green thread implementation. As a result, there is very little difference in memory consumption between green threads and our fiber implementation.</p>
<p>Context switches are a different matter altogether. The whole point of building a fiber implementation is to allow the programmer to decide when context switching is appropriate. In most circumstances, the application should undergo far fewer context switches with fibers, and the context switches that do happen occur precisely when needed. As a result, the application will tend to run faster (fewer context switches ==&gt; fewer stack copies ==&gt; fewer CPU cycles).</p>
<p>The major advantage of fibers over green threads is that you get to control when execution starts and stops. The major disadvantage of fibers is that you have to code carefully to ensure that you are starting and stopping your fibers appropriately.</p>
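<p>To make the &#8220;you control when execution starts and stops&#8221; point concrete, here is a tiny sketch. It again assumes the core Fiber class from Ruby 1.9 and later, and the TRACE array and the worker fiber are just illustrative names. The only places where control can change hands are the explicit resume and Fiber.yield calls.</p>

```ruby
# Records the order of execution so the switch points are visible.
TRACE = []

worker = Fiber.new do
  3.times do |i|
    TRACE << "worker step #{i}"
    Fiber.yield # the only points where the worker gives up control
  end
  :finished     # value handed back by the final resume
end

TRACE << "main before"
worker.resume          # runs the worker up to its first Fiber.yield
TRACE << "main between"
worker.resume          # worker step 1
worker.resume          # worker step 2
last = worker.resume   # the loop is done; the block returns :finished
TRACE << "worker returned #{last.inspect}"

puts TRACE
```

<p>Nothing preempts the worker: delete the Fiber.yield and the first resume runs all three steps before main gets another turn. That is the &#8220;code carefully&#8221; part.</p>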
<h2>Future Directions</h2>
<p>Next stop will be &#8220;stackless&#8221; fibers. I have a fork of the fibers implementation in the works that pre-allocates fiber stacks on the ruby process&#8217; heap. I am hoping to eliminate the overhead associated with switching between fibers by simply shuffling pointers around.</p>
<p>A preliminary version seems to work, although a few bugs that crop up when you use fibers and threads together need to be squashed before the code can be considered &#8220;alpha&#8221; stage. When it&#8217;s done, you&#8217;ll find it right here.</p>
]]></content:encoded>
			<wfw:commentRss>http://timetobleed.com/fibers-implemented-for-ruby-1867/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
	</channel>
</rss>
