time to bleed by Joe Damato

technical ramblings from a wanna-be unix dinosaur

Archive for the ‘kernel’ tag

Useful kernel and driver performance tweaks for your Linux server

View Comments


This article is going to address some kernel and driver tweaks that are interesting and useful. We use several of these in production with excellent performance, but you should proceed with caution and do research prior to trying anything listed below.

Tickless System

The tickless kernel feature allows for on-demand timer interrupts. This means that during idle periods, fewer timer interrupts will fire, which should lead to power savings, cooler running systems, and fewer useless context switches.

Kernel option: CONFIG_NO_HZ=y

Timer Frequency

You can select the rate at which timer interrupts in the kernel will fire. When a timer interrupt fires on a CPU, the process running on that CPU is interrupted while the timer interrupt is handled. Reducing the rate at which the timer fires allows for fewer interruptions of your running processes. This option is particularly useful for servers with multiple CPUs where processes are not running interactively.

Kernel options: CONFIG_HZ_100=y and CONFIG_HZ=100

Connector

The connector module is a kernel module which reports process events such as fork, exec, and exit to userland. This is extremely useful for process monitoring. You can build a simple system (or use an existing one like god) to watch mission-critical processes. If the processes die due to a signal (like SIGSEGV, or SIGBUS) or exit unexpectedly you’ll get an asynchronous notification from the kernel. The processes can then be restarted by your monitor keeping downtime to a minimum when unexpected events occur.

Kernel options: CONFIG_CONNECTOR=y and CONFIG_PROC_EVENTS=y

TCP segmentation offload (TSO)

A popular feature among newer NICs is TCP segmentation offload (TSO). This feature allows the kernel to offload the work of dividing large packets into smaller packets to the NIC. This frees up the CPU to do more useful work and reduces the amount of overhead that the CPU passes along the bus. If your NIC supports this feature, you can enable it with ethtool:

[joe@timetobleed]% sudo ethtool -K eth1 tso on

Let’s quickly verify that this worked:

[joe@timetobleed]% sudo ethtool -k eth1
Offload parameters for eth1:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: on
large receive offload: off

[joe@timetobleed]% dmesg | tail -1
[892528.450378] 0000:04:00.1: eth1: TSO is Enabled

Intel I/OAT DMA Engine

This kernel option enables the Intel I/OAT DMA engine that is present in recent Xeon CPUs. This option increases network throughput as the DMA engine allows the kernel to offload network data copying from the CPU to the DMA engine. This frees up the CPU to do more useful work.

Check to see if it’s enabled:

[joe@timetobleed]% dmesg | grep ioat
ioatdma 0000:00:08.0: setting latency timer to 64
ioatdma 0000:00:08.0: Intel(R) I/OAT DMA Engine found, 4 channels, device version 0x12, driver version 3.64
ioatdma 0000:00:08.0: irq 56 for MSI/MSI-X

There’s also a sysfs interface where you can get some statistics about the DMA engine. Check the directories under /sys/class/dma/.

Kernel options: CONFIG_DMADEVICES=y and CONFIG_INTEL_IOATDMA=y and CONFIG_DMA_ENGINE=y and CONFIG_NET_DMA=y and CONFIG_ASYNC_TX_DMA=y

Direct Cache Access (DCA)

Intel’s I/OAT also includes a feature called Direct Cache Access (DCA). DCA allows a driver to warm a CPU cache. A few NICs support DCA, the most popular (to my knowledge) is the Intel 10GbE driver (ixgbe). Refer to your NIC driver documentation to see if your NIC supports DCA. To enable DCA, a switch in the BIOS must be flipped. Some vendors supply machines that support DCA, but don’t expose a switch for DCA. If that is the case, see my last blog post for how to enable DCA manually.

You can check if DCA is enabled:

[joe@timetobleed]% dmesg | grep dca
dca service started, version 1.8

If DCA is possible on your system but disabled you’ll see:

ioatdma 0000:00:08.0: DCA is disabled in BIOS

Which means you’ll need to enable it in the BIOS or manually.

Kernel option: CONFIG_DCA=y

NAPI

The “New API” (NAPI) is a rework of the packet processing code in the kernel to improve performance for high speed networking. NAPI provides two major features1:

Interrupt mitigation: High-speed networking can create thousands of interrupts per second, all of which tell the system something it already knew: it has lots of packets to process. NAPI allows drivers to run with (some) interrupts disabled during times of high traffic, with a corresponding decrease in system load.

Packet throttling: When the system is overwhelmed and must drop packets, it’s better if those packets are disposed of before much effort goes into processing them. NAPI-compliant drivers can often cause packets to be dropped in the network adaptor itself, before the kernel sees them at all.

Many recent NIC drivers automatically support NAPI, so you don’t need to do anything. Some drivers need you to explicitly specify NAPI in the kernel config or on the command line when compiling the driver. If you are unsure, check your driver documentation. A good place to look for docs is in your kernel source under Documentation, available on the web here: http://lxr.linux.no/linux+v2.6.30/Documentation/networking/ but be sure to select the correct kernel version, first!

Older e1000 drivers (newer drivers, do nothing): make CFLAGS_EXTRA=-DE1000_NAPI install

Throttle NIC Interrupts

Some drivers allow the user to specify the rate at which the NIC will generate interrupts. The e1000e driver allows you to pass a command line option InterruptThrottleRate

when loading the module with insmod. For the e1000e there are two dynamic interrupt throttle mechanisms, specified on the command line as 1 (dynamic) and 3 (dynamic conservative). The adaptive algorithm traffic into different classes and adjusts the interrupt rate appropriately. The difference between dynamic and dynamic conservative is the the rate for the “Lowest Latency” traffic class, dynamic (1) has a much more aggressive interrupt rate for this traffic class.

As always, check your driver documentation for more information.

With modprobe: insmod e1000e.o InterruptThrottleRate=1

Process and IRQ affinity

Linux allows the user to specify which CPUs processes and interrupt handlers are bound.

  • Processes You can use taskset to specify which CPUs a process can run on
  • Interrupt Handlers The interrupt map can be found in /proc/interrupts, and the affinity for each interrupt can be set in the file smp_affinity in the directory for each interrupt under /proc/irq/

This is useful because you can pin the interrupt handlers for your NICs to specific CPUs so that when a shared resource is touched (a lock in the network stack) and loaded to a CPU cache, the next time the handler runs, it will be put on the same CPU avoiding costly cache invalidations that can occur if the handler is put on a different CPU.

However, reports2 of up to a 24% improvement can be had if processes and the IRQs for the NICs the processes get data from are pinned to the same CPUs. Doing this ensures that the data loaded into the CPU cache by the interrupt handler can be used (without invalidation) by the process; extremely high cache locality is achieved.

oprofile

oprofile is a system wide profiler that can profile both kernel and application level code. There is a kernel driver for oprofile which generates collects data in the x86′s Model Specific Registers (MSRs) to give very detailed information about the performance of running code. oprofile can also annotate source code with performance information to make fixing bottlenecks easy. See oprofile’s homepage for more information.

Kernel options: CONFIG_OPROFILE=y and CONFIG_HAVE_OPROFILE=y

epoll

epoll(7) is useful for applications which must watch for events on large numbers of file descriptors. The epoll interface is designed to easily scale to large numbers of file descriptors. epoll is already enabled in most recent kernels, but some strange distributions (which will remain nameless) have this feature disabled.

Kernel option: CONFIG_EPOLL=y

Conclusion

  • There are a lot of useful levers that can be pulled when trying to squeeze every last bit of performance out of your system
  • It is extremely important to read and understand your hardware documentation if you hope to achieve the maximum throughput your system can achieve
  • You can find documentation for your kernel online at the Linux LXR. Make sure to select the correct kernel version because docs change as the source changes!

Thanks for reading and don’t forget to subscribe (via RSS or e-mail) and follow me on twitter.

References

  1. http://www.linuxfoundation.org/en/Net:NAPI []
  2. http://software.intel.com/en-us/articles/improved-linux-smp-scaling-user-directed-processor-affinity/ []

Written by Joe Damato

July 28th, 2009 at 3:20 am

Enabling BIOS options on a live server with no rebooting

View Comments


This blog post is going to describe a C program that toggles some CPU and chipset registers directly to enable Direct Cache Access without needing a reboot or a switch in the BIOS. A very fun hack to write and investigate.

Special thanks…

Special thanks going out to Roman Nurik for helping me make the code CSS much, much prettier and easier to read.

Special thanks going out to Jake Douglas for convincing me that I shouldn’t use a stupid sensationalist title for this blog article :)

Intel I/OAT and Direct Cache Access (DCA)

From the Linux Foundation I/OAT project page1:

I/OAT (I/O Acceleration Technology) is the name for a collection of techniques by Intel to improve network throughput. The most significant of these is the DMA engine. The DMA engine is meant to offload from the CPU the copying of [socket buffer] data to the user buffer. This is not a zero-copy receive, but does allow the CPU to do other work while the copy operations are performed by the DMA engine.

Cool! So by using I/OAT the network stack in the Linux kernel can offload copy operations to increase throughput. I/OAT also includes a feature called Direct Cache Access (DCA) which can deliver data directly into processor caches. This is particularly cool because when a network interrupt arrives and data is copied to system memory, the CPU which will access this data will not cause a cache-miss on the CPU because DCA has already put the data it needs in the cache. Sick.

Measurements from the Linux Foundation project2 indicate a 10% reduction in CPU usage, while the Myri-10G NIC website claims they’ve measured a 40% reduction in CPU usage3. For more information describing the performance benefits of DCA see this incredibly detailed paper: Direct Cache Access for High Bandwidth Network I/O.

How to get I/OAT and DCA

To get I/OAT and DCA you need a few things:

  • Intel XEON CPU(s)
  • A NIC(s) which has DCA support
  • A chipset which supports DCA
  • The ioatdma and dca Linux kernel modules
  • And last but not least, a switch in your BIOS to turn DCA on

That last item can actually be a bit more tricky than it sounds for several reasons:

  • some BIOSes don’t expose a way to turn DCA on even though it is supported by the CPU, chipset, and NIC!
  • Your hosting provider may not allow BIOS access
  • Your system might be up and running and you don’t want to reboot to enter the BIOS to enable DCA

Let’s see what you can do to coerce DCA into working on your system if one of the above applies to you.

Build ioatdma kernel module

This is pretty easy, just make menuconfig and toggle I/OAT as a module. You must build it as a module if you cannot or do not want to enable DCA in your BIOS.

The option can be found in Device Drivers -> DMA Engine Support -> Intel I/OAT DMA Support.

Toggling that option will build the ioatdma and dca modules. Build and install the new module.

Enabling DCA without a reboot or BIOS access: Hack overview

In order to enable DCA a few special registers need to be touched.

  • The DCA capability bit in the PCI Express Control Register 4 in the configuration space for the PCI bridge your NIC(s) are attached to.
  • The DCA Model Specific Register on your CPU(s)

Let’s take a closer look at each stage of the hack.

Enable DCA in PCI Configuration Space

PCI configuration space is a memory region where control registers for PCI devices live. By changing register values, you can enable/disable specific features of that PCI device. The configuration space is addressable if you know the PCI bus, device, and function bits for a specific PCI device and the feature you care about.

To find the DCA register for the Intel 5000, 5100, and 7300 chipsets, we need to consult the documentation4:


Cool, so the register needed lives at offset 0×64. To enable DCA, bit 6 needs to be set to 1.

Toggling these register can be a bit cumbersome, but luckily there is libpci which provides some simple APIs to scan for PCI devices and accessing configuration space registers.

#define INTEL_BRIDGE_DCAEN_OFFSET   0x64
#define INTEL_BRIDGE_DCAEN_BIT      6
#define PCI_HEADER_TYPE_BRIDGE     1
#define PCI_VENDOR_ID_INTEL        0x8086 /* lol @ intel */
#define PCI_HEADER_TYPE             0x0e
#define MSR_P6_DCA_CAP             0x000001f8

void check_dca(struct pci_dev *dev)
{
  /* read DCA status */
  u32 dca = pci_read_long(dev, INTEL_BRIDGE_DCAEN_OFFSET);

  /* if it's not enabled */
  if (!(dca & (1 << INTEL_BRIDGE_DCAEN_BIT))) {
    printf("DCA disabled, enabling now.\n");

    /* enable it */
    dca |= 1 << INTEL_BRIDGE_DCAEN_BIT;

    /* write it back */
    pci_write_long(dev, INTEL_BRIDGE_DCAEN_OFFSET, dca);
  } else {
    printf("DCA already enabled!\n");
  }
}

int main(void)
{
  struct pci_access *pacc;
  struct pci_dev *dev;
  u8 type;

  pacc = pci_alloc();
  pci_init(pacc);

  /* scan the PCI bus */
  pci_scan_bus(pacc);

  /* for each device */
  for (dev = pacc->devices; dev; dev=dev->next) {
    pci_fill_info(dev, PCI_FILL_IDENT | PCI_FILL_BASES);

    /* if it's an intel device */
    if (dev->vendor_id == PCI_VENDOR_ID_INTEL) {

        /* read the header byte */
        type = pci_read_byte(dev, PCI_HEADER_TYPE);

        /* if its a PCI bridge, check and enable DCA */
        if (type == PCI_HEADER_TYPE_BRIDGE) {
          check_dca(dev);
        }
    }
  }

  msr_dca_enable();
  return 0;
}

Enable DCA in the CPU MSR

A model specific register (MSR) is a control register that is provided by a CPU to enable a feature that exists on a specific CPU. In this case, we care about the DCA MSR. In order to find it’s address, let’s consult the Intel Developer’s Manual 3B5.

This register lives at offset 0x1f8. We just need to set it to 1 and we should be good to go.

Thankfully, there are device files in /dev for the MSRs of each CPU:

#define MSR_P6_DCA_CAP      0x000001f8
void msr_dca_enable(void)
{
  char msr_file_name[64];
  int fd = 0, i = 0;
  u64 data;

  /* for each CPU */
  for (;i < NUM_CPUS; i++) {
    sprintf(msr_file_name, "/dev/cpu/%d/msr", i);

    /* open the MSR device file */
    fd = open(msr_file_name, O_RDWR);
    if (fd < 0) {
      perror("open failed!");
      exit(1);
    }

    /* read the current DCA status */
    if (pread(fd, &data, sizeof(data), MSR_P6_DCA_CAP) != sizeof(data)) {
      perror("reading msr failed!");
      exit(1);
    }

    printf("got msr value: %*llx\n", 1, (unsigned long long)data);

    /* if DCA is not enabled */
    if (!(data & 1)) {

      /* enable it */
      data |= 1;

      /* write it back */
      if (pwrite(fd, &data, sizeof(data), MSR_P6_DCA_CAP) != sizeof(data)) {
        perror("writing msr failed!");
        exit(1);
      }
    } else {
      printf("msr already enabled for CPU %d\n", i);
    }
  }
}

Code for the hack is on github

Get it here: http://github.com/ice799/dca_force/tree/master

Putting it all together to get your speed boost

  1. Checkout the hack from github: git clone git://github.com/ice799/dca_force.git
  2. Build the hack: make NUM_CPUS=whatever
  3. Run it: sudo ./dca_force
  4. Load the kernel module: sudo modprobe ioatdma
  5. Check your dmesg: dmesg | tail

You should see:

[   72.782249] dca service started, version 1.8
[   72.838853] ioatdma 0000:00:08.0: setting latency timer to 64
[   72.838865] ioatdma 0000:00:08.0: Intel(R) I/OAT DMA Engine found, 4 channels, device version 0x12, driver version 3.64
[   72.904027]   alloc irq_desc for 56 on cpu 0 node 0
[   72.904030]   alloc kstat_irqs on cpu 0 node 0
[   72.904039] ioatdma 0000:00:08.0: irq 56 for MSI/MSI-X

in your dmesg.

You should NOT SEE

[    8.367333] ioatdma 0000:00:08.0: DCA is disabled in BIOS

You can now enjoy the DCA performance boost your BIOS or hosting provider didn't want you to have!

Conclusion

  • Intel I/OAT and DCA is pretty cool, and enabling it can give pretty substantial performance wins
  • Cool features are sometimes stuffed away in the BIOS
  • If you don't have access to your BIOS, you should ask you provider nicely to do it for you
  • If your BIOS doesn't have a toggle switch for the feature you need, do a BIOS update
  • If all else fails and you know what you are doing, you can sometimes pull off nasty hacks like this in userland to get what you want

Thanks for reading and don't forget to subscribe (via RSS or e-mail) and follow me on twitter.

P.S.

I know, I know. I skipped Part 2 of the signals post (here's Part 1 if you missed it). Part 2 is coming soon!

References

  1. http://www.linuxfoundation.org/en/Net:I/OAT []
  2. http://www.linuxfoundation.org/en/Net:I/OAT []
  3. http://www.myri.com/serve/cache/626.html []
  4. Intel® 7300 Chipset Memory Controller Hub (MCH) Datasheet, Section 4.8.12.6 []
  5. Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3B: System Programming Guide, Part 2, Appendix B-19 []

Written by Joe Damato

July 6th, 2009 at 8:00 am

A Few Things You Didn’t Know about Signals in Linux Part 1

View Comments

Another post about signal handling?

There are probably lots of people who have blogged about signal handling in Linux, but this series is going to be different. In this blog post, I’m going to unravel the signal handling code paths in the Linux kernel starting at the hardware level, working though the kernel, and ending in the userland signal handler. I’ve tried to use footnotes for code samples which have links to the code in the Linux lxr. Many of the code examples have been snipped for brevity.

As always, this post is specific to the x86_64 CPU architecture and Linux kernel 2.6.29.

Hardware or software generated

Signals are not generated directly by hardware, but certain hardware states can cause the Linux kernel to generate signals. As such we can imagine two ways to generate signals:

  1. Hardware – the CPU does something bad (divides by 0, touches a bad address, etc) which causes the kernel to create and deliver (unless the signal is SIGKILL or SIGSTOP, of course) a signal (SIGFPE, SIGSEGV, etc) to the running process.
  2. Software – an application executes a kill() system call and sends a signal to a specific process.

Both types of signals share a common code path, but hardware generated signals have a very interesting birth. Let’s start there and as we work our way up to userland we’ll stumble into the software signal code path along the way.

Exceptions on the x86

Let’s start by first understanding what an x86 exception is. For that, we’ll turn to the documentation1:

[...] exceptions are events that indicate that a condition exists somewhere in the system, the processor, or within the currently executing program or task that requires the attention of a processor. They typically result in a forced transfer of execution from the currently running program or task to a special software routine [...] or an exception handler.

At a high level this is pretty simple to understand; the system gets in a weird state and the CPU immediately begins executing a predefined handler function to try to fix things (if it can) or die gracefully.

Let’s take a look at how the kernel creates and installs handler functions that the CPU executes when an exception occurs.

Low-level exception handlers

Low level exception handlers are specified in the Interrupt Descriptor Table (IDT). The IDT can hold up to 256 entries and it can live anywhere in memory. Each entry in the IDT is mapped to a different exception. For example, #DE Divide Error is the first entry in the IDT, IDT[0]; #PF Page Fault ‘s handler lives at IDT[14]. When a specific exception is encountered, the CPU looks up the handler function in the IDT, puts some data on the stack, and executes the handler.

What does an entry in the IDT look like? Let’s take a look at a picture2 from Intel:


Take a look at the fields labeled ‘Offset’ – this is field that contains the address of the function to execute. As you can see, there are three fields labeled ‘Offset.’ Can you guess why?

In order to actually set the address of the function you want to execute, you’ll need to do some bit-shifting. Each ‘Offset’ field is indicates which bits of the address of the handler function it wants. You need to be really careful when writing the code that is responsible for creating IDT entries. A bug here could cause your system to do really bizarre things.

We know what an IDT entry looks like, but what about the data that the CPU pushes on the stack before executing a handler? Unfortunately, I couldn’t track down a picture of what the x86_64 puts on the stack and I can’t draw. So, here is a picture of the data the x86 CPU puts on the stack from Intel3:

When an exception occurs, the CPU pushes the stack pointer, the CPU flags register value, the code and stack segment selectors, and the instruction pointer on to the stack before executing the handler.

Nice, but where does the IDT itself live?

The address of the IDT is stored in a CPU register that can be accessed with the instructions lidt and sidt to load and store (respectively) the address of the IDT. Usually, the address of the IDT is set during the initialization of the kernel.

Let’s see where this happens in Linux4:

void __init x86_64_start_kernel(char * real_mode_data)
{
        int i;

        /* [...] */

        for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) {
                set_intr_gate(i, early_idt_handler);
        }

        load_idt((const struct desc_ptr *)&idt_descr);

	/* [...] */
}

Cool, so Linux creates a bunch of early handlers in case something goes bad during boot and then a few function calls later (not shown), Linux calls start_kernel()5, which calls trap_init()6 for your architecture which actually sets the handlers.

This is pretty important, so let's take a look at the code for this. Thankfully, Linux includes some descriptive function names, so we can see which exceptions are being set.

void __init trap_init(void)
{
   /* ... */

	set_intr_gate(0, &divide_error);

	/* ... */

        set_intr_gate(5, &bounds);
        set_intr_gate(6, &invalid_op);
        set_intr_gate(7, &device_not_available);

	/* ... */

        set_intr_gate(13, &general_protection);
 	set_intr_gate(14, &page_fault);
 	set_intr_gate(15, &spurious_interrupt_bug);
 	set_intr_gate(16, &coprocessor_error);
 	set_intr_gate(17, &alignment_check);

   /* ... */
}

Awesome, now let's try to track down where these exception handlers are defined. As it turns out, there is a little bit of C and assembly magic to string this all together.

The low-level exception handlers have a common entry and exit point and are "templated" with a macro. Let's take a look at the macro7 and some of the handlers8 in the kernel:

.macro zeroentry sym do_sym
ENTRY(\sym)
        INTR_FRAME
        pushq_cfi $-1           /* ORIG_RAX: no syscall to restart */
        subq $15*8,%rsp
        call error_entry
        DEFAULT_FRAME 0
        movq %rsp,%rdi          /* pt_regs pointer */
        xorl %esi,%esi          /* no error code */
        call \do_sym
        jmp error_exit          /* %ebx: no swapgs flag */
 ND(\sym)
.endm

So the macro uses the first argument sym as the name of the function, and the second argument do_sym is a C function that is called from this assembly stub.

We also notice from the stub above a very important (and somewhat subtle) piece of code: movq %rsp,%rdi This piece of code puts the address of the stack pointer in %rdi and we'll see why shortly. First, let's look at how the macro is used to get a better idea how it works:

zeroentry divide_error do_divide_error
zeroentry overflow do_overflow
zeroentry bounds do_bounds
zeroentry invalid_op do_invalid_op
zeroentry device_not_available do_device_not_available

This block of code uses the macro above to create symbols named divide_error, overflow, and more which call out to C functions named do_divide_error, do_overflow, etc. The craziness doesn't end there. These C functions are also generated with macros9:

#define DO_ERROR(trapnr, signr, str, name)                              \
dotraplinkage void do_##name(struct pt_regs *regs, long error_code)     \
{                                                                       \
        if (notify_die(DIE_TRAP, str, regs, error_code, trapnr, signr)  \
                                                        == NOTIFY_STOP) \
                return;                                                 \
        conditional_sti(regs);                                          \
        do_trap(trapnr, signr, str, regs, error_code, NULL);            \
}
/*...*/
DO_ERROR(4, SIGSEGV, "overflow", overflow)
DO_ERROR(5, SIGSEGV, "bounds", bounds)

Those last two lines get substituted with the macro above, creating do_overflow, do_bounds, and more. As you might have noticed, the functions generated have dotraplinkage which is a macro for a gcc attribute regparm which tells gcc to pass arguments to the function in registers and not on the stack.

Remember the movq %rsp,%rdi from the common assembly stub? That line of code exists to pass the address of the state the CPU dumped to the do_* functions via the %rdi register.

The do_* functions notify interested parties about the exception, re-enable interrupts/exceptions if they were disabled, and finally tells the upper layer signal handling code of the kernel that a signal should be generated and hands over the associated CPU state at the time the exception was generated.

Conclusion for Part 1

Wow. What a wild ride that was.

  1. There is a lot of trickery and subtle hacks in the Linux kernel. Reading and understanding the code can make you a more clever programmer. Dig in!
  2. It is pretty cool (imho) to understand how you actually converse with the CPU and how the CPU talks to the kernel, and how that data is pushed to the upper layers.
  3. The Intel CPU manuals, the gcc man page, and the Linux lxr are a big time help for deciphering this code, which can be cryptic at times.
  4. Understanding what information you have at your disposal can let you do pretty crazy things in userland, as we'll see in the next piece of this series.

Stay tuned, in the next piece of this series I'll walk through the signal handling code in the kernel and show some crazy non-portable tricks you can do in userland.

Thanks for reading and don't forget to subscribe (via RSS or e-mail) and follow me on twitter.

References

  1. Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3A: System Programming Guide, Part 1, 5.1: Interrupt and Exception Overview []
  2. Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3A: System Programming Guide, Part 1, 5.1: Interrupt and Exception Overview []
  3. Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3A: System Programming Guide, Part 1, 5.12.1.1: Exception- or Interrupt-Handler Procedures []
  4. http://lxr.linux.no/linux+v2.6.30/arch/x86/kernel/head64.c#L89 []
  5. http://lxr.linux.no/linux+v2.6.29/init/main.c#L590 []
  6. http://lxr.linux.no/linux+v2.6.29/arch/x86/kernel/traps.c#L953 []
  7. http://lxr.linux.no/linux+v2.6.29/arch/x86/kernel/entry_64.S#L1028 []
  8. http://lxr.linux.no/linux+v2.6.29/arch/x86/kernel/entry_64.S#L1121 []
  9. http://lxr.linux.no/linux+v2.6.29/arch/x86/kernel/traps.c#L236 []

Written by Joe Damato

June 15th, 2009 at 6:00 am

Fixing Threads in Ruby 1.8: A 2-10x performance boost

View Comments

Quick notes before things get crazy

OK, things might get a little crazy in this blog post so let’s clear a few things up before we get moving.

  • I like the gritty details, and this article in particular has a lot of gritty info. To reduce the length of the article for the casual reader, I’ve put a portion of the really gritty stuff in the Epilogue below. Definitely check it out if that is your thing.
  • This article, the code, and the patches below are for Linux and OSX for the x86 and x86_64 platforms, only.
  • Even though there are code paths for both x86 and x86_64, I’m going to use the 64bit register names and (briefly) mention the 64bit binary interface.
  • Let’s assume the binary is built with -fno-omit-frame-pointer, the patches don’t care, but it’ll make the explanation a bit simpler later.
  • If you don’t know what the above two things mean, don’t worry; I got your back chief.

How threads work in Ruby

Ruby 1.8 implements pre-emptible userland threads, also known as “green threads.” (Want to know more about threading models? See this post.) The major performance killer in Ruby’s implementation of green threads is that the entire thread stack is copied to and from the heap every context switch. Let’s take a look at a high level what happens when you:

Thread.new{
	10000.times {
		a << "a"
		a.pop
	}
}

  1. A thread control block (tcb) is allocated in Ruby.
  2. The infamous thread timer is initialized, either as a pthread or as an itimer.
  3. Ruby scope information is copied to the heap.
  4. The new thread is added to the list of threads.
  5. The current thread is set as the new thread.
  6. rb_thread_yield is called to yield to the block you passed in.
  7. Your block starts executing.
  8. The timer interrupts the executing thread.
  9. The current thread’s state is stored:
    • memcpy() #1 (sometimes): If the stack has grown since the last save, realloc is called. If the allocator cannot extend the size of the current block in place, it may decide to move the data to a new block that is large enough. If that happens memcpy() is called to move the data over.
    • memcpy() #2 (always): A copy of this thread’s entire stack (starting from the top of the interpreter’s stack) is put on the heap.
  10. The next thread’s state is restored.
    • memcpy() #3 (always): A copy of this thread’s entire stack is placed on the stack.

Steps 9 and 10 crush performance when even small amounts of Ruby code are executed.

Many of the functions the interpreter uses to evaluate code are massive. They allocate a large number of local variables creating stack frames up to 4 kilobytes per function call. Those functions also call themselves recursively many times in a single expression. This leads to huge stacks, huge memcpy()s, and an incredible performance penalty.

If we can eliminate the memcpy()s we can get a lot of performance back. So, let’s do it.

Increase performance by putting thread stacks on the heap

[Remember: we are only talking about x86_64]

How stacks work – a refresher

Stacks grow downward from high addresses to low addresses. As data is pushed on to the stack, it grows downward. As stuff is popped, it shrinks upward. The register %rsp serves as a pointer to the bottom of the stack. When it is decremented or incremented the stack grows or shrinks, respectively. The special property of the program stack is that it will grow until you run out of memory (or are killed by the OS for being bad). The operating system handles the automatic growth. See the Epilogue for some more information about this.

How to actually switch stacks

The %rsp register can be (and is) changed and adjusted directly by user code. So all we have to do is put the address of our stack in %rsp, and we’ve switched stacks. Then we can just call our thread start function. Pretty easy. A small blob of inline assembly should do the trick:

__asm__ __volatile__ ("movq %0, %%rsp\n\t"
                      "callq *%1\n"
                      :: "r" (th->stk_base),
                         "r" (rb_thread_start_2));

Two instructions, not too bad.

  1. movq %0, %%rsp moves a quad-word (th->stk_base) into the %rsp. Quad-word is Intel speak for 4 words, where 1 Intel word is 2 bytes.
  2. callq *%1 calls a function at the address “rb_thread_start_2.” This has a side-effect or two, which I’ll mention in the Epilogue below, for those interested in a few more details.

The above code is called once per thread. Calling rb_thread_start_2 spins up your thread and it never returns.

Where do we get stack space from?

When the tcb is created, we’ll allocate some space with mmap and set a pointer to it.

/* error checking omitted for brevity, but exists in the patch =] */
stack_area = mmap(NULL, total_size, PROT_READ | PROT_WRITE | PROT_EXEC,
			MAP_PRIVATE | MAP_ANON, -1, 0);

th->stk_ptr = th->stk_pos = stack_area;
th->stk_base = th->stk_ptr + (total_size - sizeof(int))/sizeof(VALUE *);

Remember, stacks grow downward so that last line: th->stk_base = ... is necessary because the base of the stack is actually at the top of the memory region return by mmap(). The ugly math in there is for alignment, to comply with the x86_64 binary interface. Those curious about more gritty details should see the Epilogue below.

BUT WAIT, I thought stacks were supposed to grow automatically?

Yeah, the OS does that for the normal program stack. Not gonna happen for our mmap‘d regions. The best we can do is pick a good default size and export a tuning lever so that advanced users can adjust the stack size as they see fit.

BUT WAIT, isn’t that dangerous? If you fall off your stack, wouldn’t you just overwrite memory below?

Yep, but there is a fix for that too. It’s called a guard page. We’ll create a guard page below each stack that has its permission bits set to PROT_NONE. This means, if a thread falls off the bottom of its stack and tries to read, write, or execute the memory below the thread stack, a signal (usually SIGSEGV or SIGBUS) will be sent to the process.

The code for the guard page is pretty simple, too:

/* omit error checking for brevity */
mprotect(th->stk_ptr, getpagesize(), PROT_NONE);

Cool, let’s modify the SIGSEGV and SIGBUS signal handlers to check for stack overflow:

/* if the address which generated the fault is within the current thread's guard page... */
  if(fault_addr <= (caddr_t)rb_curr_thread->guard &&
     fault_addr >= (caddr_t)rb_curr_thread->stk_ptr) {
  /* we hit the guard page, print out a warning to help app developers */
  rb_bug("Thread stack overflow! Try increasing it!");
}

See the epilogue for more details about this signal handler trick.

Patches

As always, this is super-alpha software.

Ruby 1.8.6 github raw .patch
Ruby 1.8.7 github raw .patch

Benchmarks

The computer language shootout has a thread test called thread-ring; let’s start with that.

require 'thread'
THREAD_NUM = 403
number = ARGV.first.to_i

threads = []
for i in 1..THREAD_NUM
   threads << Thread.new(i) do |thr_num|
      while true
         Thread.stop
         if number > 0
            number -= 1
         else
            puts thr_num
            exit 0
         end
      end
   end
end

prev_thread = threads.last
while true
   for thread in threads
      Thread.pass until prev_thread.stop?
      thread.run
      prev_thread = thread
   end
end

Results (ARGV[0] = 50000000):

Ruby 1.8.6 1389.52s
Ruby 1.8.6 w/ heap stacks 793.06s
Ruby 1.9.1 752.44s

A speed up of about 2.3x compared to Ruby 1.8.6. A bit slower than Ruby 1.9.1.

That is a pretty strong showing, for sure. Let’s modify the test slightly to illustrate the true power of this implementation.

Since our implementation does no memcpy()s we expect the cost of context switching to stay constant regardless of thread stack size. Moreover, the unmodified Ruby 1.8.6 should perform worse as thread stack size increases (therefore increasing the amount of time the CPU is doing memcpy()s).

Let’s test this hypothesis by modifying thread-ring slightly so that it increases the size of the stack after spawning threads.

def grow_stack n=0, &blk
  unless n > 100
    grow_stack n+1, &blk
  else
    yield
  end
end

require 'thread'
THREAD_NUM = 403
number = ARGV.first.to_i

threads = []
for i in 1..THREAD_NUM
  threads << Thread.new(i) do |thr_num|
    grow_stack do
      while true
        Thread.stop
        if number > 0
          number -= 1
        else
          puts thr_num
          exit 0
        end
      end
    end
  end
end

prev_thread = threads.last
while true
   for thread in threads
      Thread.pass until prev_thread.stop?
      thread.run
      prev_thread = thread
   end
end

Results (ARGV[0] = 50000000):

Ruby 1.8.6 7493.50s
Ruby 1.8.6 w/ heap stacks 799.52s
Ruby 1.9.1 680.92s

A speed up of about 9.4x compared to Ruby 1.8.6. A bit slower than Ruby 1.9.1.

Now, lets benchmark mongrel+sinatra.

require 'rubygems'
require 'sinatra'

disable :reload

set :server, 'mongrel' 

get '/' do
  'hi'
end

Results:

Ruby 1.8.6 1395.43 request/sec
Ruby 1.8.6 w/ heap stacks 1770.26 request/sec

An increase of about 1.26x in the most naive case possible.

Of course, if the handler did anything more than simply write “hi” (like use memcache or make sql queries) there would be more function calls, more context switches, and a much greater savings.

Conclusion

A couple lessons learned this time:

  • Hacking a VM like Ruby is kind of like hacking a kernel. Some subset of the tricks used in kernel hacking are useful in userland.
  • The x86_64 ABI is a must read if you plan on doing any low-level hacking.
  • Keep your CPU manuals close by, they come in handy even in userland.
  • Installing your own signal handlers is really useful for debugging, even if they are dumping architecture specific information.

Hope everyone enjoyed this blog post. I’m always looking for things to blog about. If there is something you want explained or talked about, send me an email or a tweet!

Don’t forget to subscribe and follow me and Aman on twitter.

Epilogue

Automatic stack growth

This can be achieved pretty easily with a little help from virtual memory and the programmable interrupt controller (PIC). The idea is pretty simple. When you (or your shell on your behalf) calls exec() to execute a binary, the OS will map a bunch of pages of memory for the stack and set the stack pointer of the process to the top of the memory. Once the stack space is exhausted, and the stack pointer is pushed onto un-mapped memory, a page fault will be generated.

The OS’s page fault handler (installed via the PIC) will fire. The OS can then check the address that generated the exception and see that you fell off the bottom of your stack. This works very similarly to the guard page idea we added to protect Ruby thread stacks. It can then just map more memory to that area, and tell your process to continue executing. Your process doesn’t know anything bad happened.

I hope to chat a little bit about interrupt and exception handlers in an upcoming blog post. Stay tuned!

callq side-effects

When a callq instruction is executed, the CPU pushes the return address on to the stack and then begins executing the function that was called. This is important because when the function you are calling executes a ret instruction, a quad-word is popped from the stack and put into the instruction pointer (%rip).

x86_64 Application Binary Interface

The x86_64 ABI is an extension of the x86 ABI. It specifies architecture programming information such as the fundamental types, caller and callee saved registers, alignment considerations and more. It is a really important document for any programmer messing with x86_64 architecture specific code.
The particular piece of information relevant for this blog post is found buried in section 3.2.2

The end of the input argument area shall be aligned on a 16 … byte boundary.

This is important to keep in mind when constructing thread stacks. We decided to avoid messing with alignment issues. As such we did not pass any arguments to rb_thread_start_2. We wanted to avoid mathematical error that could happen if we try to align the memory ourselves after pushing some data. We also wanted to avoid writing more assembly than we had to, so we avoided passing the arguments in registers, too.

Signal handler trick

The signal handler “trick” to check if you have hit the guard page is made possible by the sigaltstack() system call and the POSIX sa_sigaction interface.

sigaltstack() lets us specify a memory region to be used as the stack when a signal is delivered. This extremely important for the signal handler trick because once we fall off our thread stack, we certainly cannot expect to handle a signal using that stack space.

POSIX provides two ways for signals to be handled:

  • sa_handler interface: calls your handler and passes in the signal number.
  • sa_sigaction interface: calls your handler and passes in the signal number, a siginfo_t struct, and a ucontext_t. The siginfo_t struct contains (among other things), the address which generated the fault. Simply check this address to see if its in the guard page and if so let the user know they just overflowed their stack. Another useful, but extremely non-portable modification that was added to Ruby’ signal handlers was a dump of the contents in ucontext_t to provide useful debugging information. This structure contains the register state at the time of signal. Dumping it can help debugging by showing which values are in what registers.

Written by Joe Damato

May 18th, 2009 at 5:00 am

Fix a bug in Ruby’s configure.in and get a ~30% performance boost.

View Comments


Special thanks…

Going out to Jake Douglas for pushing the initial investigation and getting the ball rolling.

The whole --enable-pthread thing

Ask any Ruby hacker how to easily increase performance in a threaded Ruby application and they’ll probably tell you:

Yo dude… Everyone knows you need to configure Ruby with --disable-pthread.

And it’s true; configure Ruby with --disable-pthread and you get a ~30% performance boost. But… why?

For this, we’ll have to turn to our handy tool strace. We’ll also need a simple Ruby program to this one. How about something like this:

def make_thread
  Thread.new {
    a = []
    10_000_000.times {
      a << "a"
      a.pop
    }
  }
end

t = make_thread
t1 = make_thread 

t.join
t1.join

Now, let's run strace on a version of Ruby configure'd with --enable-pthread and point it at our test script. The output from strace looks like this:

22:46:16.706136 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706177 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706218 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706259 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000005>
22:46:16.706301 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706342 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706383 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706425 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>
22:46:16.706466 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000004>

Pages and pages and pages of sigprocmask system calls (Actually, running with strace -c, I get about 20,054,180 calls to sigprocmask, WOW). Running the same test script against a Ruby built with --disable-pthread and the output does not have pages and pages of sigprocmask calls (only 3 times, a HUGE reduction).

OK, so let's just set a breakpoint in GDB... right?

OK, so we should just be able to set a breakpoint on sigprocmask and figure out who is calling it.

Well, not exactly. You can try it, but the breakpoint won't trigger (we'll see why a little bit later).

Hrm, that kinda sucks and is confusing. This will make it harder to track down who is calling sigprocmask in the threaded case.

Well, we know that when you run configure the script creates a config.h with a bunch of defines that Ruby uses to decide which functions to use for what. So let's compare ./configure --enable-pthread with ./configure --disable-pthread:

[joe@mawu:/home/joe/ruby]% diff config.h config.h.pthread
> #define _REENTRANT 1
> #define _THREAD_SAFE 1
> #define HAVE_LIBPTHREAD 1
> #define HAVE_NANOSLEEP 1
> #define HAVE_GETCONTEXT 1
> #define HAVE_SETCONTEXT 1


OK, now if we grep the Ruby source code, we see that whenever HAVE_[SG]ETCONTEXT are set, Ruby uses the system calls setcontext() and getcontext() to save and restore state for context switching and for exception handling (via the EXEC_TAG).

What about when HAVE_[SG]ETCONTEXT are not define'd? Well in that case, Ruby uses _setjmp/_longjmp.

Bingo!

That's what's going on! From the _setjmp/_longjmp man page:

... The _longjmp() and _setjmp() functions shall be equivalent to longjmp() and setjmp(), respectively, with the additional restriction that _longjmp() and _setjmp() shall not manipulate the signal mask...

And from the [sg]etcontext man page:

... uc_sigmask is the set of signals blocked in this context (see sigprocmask(2)) ...


The issue is that getcontext calls sigprocmask on every invocation but _setjmp does not.

BUT WAIT if that's true why didn't GDB hit a sigprocmask breakpoint before?

x86_64 assembly FTW, again

Let's fire up gdb and figure out this breakpoint-not-breaking thing. First, let's start by disassembling getcontext (snipped for brevity):

(gdb) p getcontext
$1 = {} 0x7ffff7825100
(gdb) disas getcontext
...
0x00007ffff782517f : mov $0xe,%rax
0x00007ffff7825186 : syscall
...

Yeah, that's pretty weird. I'll explain why in a minute, but let's look at the disassembly of sigprocmask first:

(gdb) p sigprocmask
$2 = {} 0x7ffff7817340 <__sigprocmask>
(gdb) disas sigprocmask
...
0x00007ffff7817383 <__sigprocmask+67>: mov $0xe,%rax
0x00007ffff7817388 <__sigprocmask+72>: syscall
...

Yeah, this is a bit confusing, but here's the deal.

Recent Linux kernels implement a shiny new method for calling system calls called sysenter/sysexit. This new way was created because the old way (int $0x80) turned out to be pretty slow. So Intel created some new instructions to execute system calls without such huge overhead.

All you need to know right now (I'll try to blog more about this in the future) is that the %rax register holds the system call number. The syscall instruction transfers control to the kernel and the kernel figures out which syscall you wanted by checking the value in %rax. Let's just make sure that sigprocmask is actually 0xe:

[joe@pluto:/usr/include]% grep -Hrn "sigprocmask" asm-x86_64/unistd.h
asm-x86_64/unistd.h:44:#define __NR_rt_sigprocmask                     14


Bingo. It's calling sigprocmask (albeit a bit obscurely).

OK, so getcontext isn't calling sigprocmask directly, instead it replicates a bunch of code that sigprocmask has in its function body. That's why we didn't hit the sigprocmask breakpoint; GDB was going to break if you landed on the address 0x7ffff7817340 but you didn't.

Instead, getcontext reimplements the wrapper code for sigprocmask itself and GDB is none the wiser.

Mystery solved.

The patch

Get it HERE

The patch works by adding a new configure flag called --disable-ucontext to allow you to specifically disable [sg]etcontext from being called, you use this in conjunction with --enable-pthread, like this:

./configure --disable-ucontext --enable-pthread


After you build Ruby configured like that, its performance is on par with (and sometimes slightly faster) than Ruby built with --disable-pthread for about a 30% performance boost when compared to --enable-pthread.

I added the switch because I wanted to preserve the original Ruby behavior, if you just pass --enable-pthread without --disable-ucontext Ruby will do the old thing and generate piles of sigprocmasks.

Conclusion

  1. Things aren't always what they seem - GDB may lie to you. Be careful.
  2. Use the source, Luke. Libraries can do unexpected things, debug builds of libc can help!
  3. I know I keep saying this, assembly is useful. Start learning it today!

If you enjoyed this blog post, consider subscribing (via RSS) or following (via twitter).

You'll want to stay tuned; tmm1 and I have been on a roll the past week. Lots of cool stuff coming out!

Written by Joe Damato

May 5th, 2009 at 3:20 am