time to bleed by Joe Damato

technical ramblings from a wanna-be unix dinosaur

Enabling BIOS options on a live server with no rebooting


This blog post is going to describe a C program that toggles some CPU and chipset registers directly to enable Direct Cache Access without needing a reboot or a switch in the BIOS. A very fun hack to write and investigate.

Special thanks…

Special thanks going out to Roman Nurik for helping me make the code CSS much, much prettier and easier to read.

Special thanks going out to Jake Douglas for convincing me that I shouldn’t use a stupid sensationalist title for this blog article :)

Intel I/OAT and Direct Cache Access (DCA)

From the Linux Foundation I/OAT project page [1]:

I/OAT (I/O Acceleration Technology) is the name for a collection of techniques by Intel to improve network throughput. The most significant of these is the DMA engine. The DMA engine is meant to offload from the CPU the copying of [socket buffer] data to the user buffer. This is not a zero-copy receive, but does allow the CPU to do other work while the copy operations are performed by the DMA engine.

Cool! So by using I/OAT the network stack in the Linux kernel can offload copy operations to increase throughput. I/OAT also includes a feature called Direct Cache Access (DCA) which can deliver data directly into processor caches. This is particularly cool because when a network interrupt arrives and data is copied to system memory, the CPU that goes on to process this data will not take a cache miss on it: DCA has already put the data it needs in the cache. Sick.

Measurements from the Linux Foundation project [2] indicate a 10% reduction in CPU usage, while the Myri-10G NIC website claims they’ve measured a 40% reduction in CPU usage [3]. For more information describing the performance benefits of DCA see this incredibly detailed paper: Direct Cache Access for High Bandwidth Network I/O.

How to get I/OAT and DCA

To get I/OAT and DCA you need a few things:

  • Intel XEON CPU(s)
  • NIC(s) with DCA support
  • A chipset which supports DCA
  • The ioatdma and dca Linux kernel modules
  • And last but not least, a switch in your BIOS to turn DCA on

That last item can actually be a bit more tricky than it sounds for several reasons:

  • Some BIOSes don’t expose a way to turn DCA on even though it is supported by the CPU, chipset, and NIC!
  • Your hosting provider may not allow BIOS access
  • Your system might be up and running and you don’t want to reboot to enter the BIOS to enable DCA

Let’s see what you can do to coerce DCA into working on your system if one of the above applies to you.

Build ioatdma kernel module

This is pretty easy: just run make menuconfig and set I/OAT to build as a module. You must build it as a module if you cannot or do not want to enable DCA in your BIOS.

The option can be found in Device Drivers -> DMA Engine Support -> Intel I/OAT DMA Support.

Toggling that option will build the ioatdma and dca modules. Build and install the new modules.

Enabling DCA without a reboot or BIOS access: Hack overview

In order to enable DCA a few special registers need to be touched.

  • The DCA capability bit in the PCI Express Control Register 4 in the configuration space for the PCI bridge your NIC(s) are attached to.
  • The DCA Model Specific Register on your CPU(s)

Let’s take a closer look at each stage of the hack.

Enable DCA in PCI Configuration Space

PCI configuration space is a memory region where control registers for PCI devices live. By changing register values, you can enable/disable specific features of that PCI device. The configuration space is addressable if you know the PCI bus, device, and function bits for a specific PCI device and the feature you care about.
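To make the bus/device/function addressing a little more concrete, here is a minimal sketch of how the legacy x86 configuration mechanism packs those values into the 32-bit address written to I/O port 0xCF8. The helper name pci_conf1_address is made up for illustration and is not part of libpci; the hack below lets libpci handle this detail.

```c
#include <stdint.h>

/* Hypothetical helper (not part of libpci): build the 32-bit value that
 * the legacy x86 "configuration mechanism #1" expects in I/O port 0xCF8. */
uint32_t pci_conf1_address(uint8_t bus, uint8_t dev, uint8_t func,
                           uint8_t offset)
{
    return 0x80000000u                      /* enable bit */
         | ((uint32_t)bus << 16)            /* bus number: 8 bits */
         | ((uint32_t)(dev & 0x1f) << 11)   /* device number: 5 bits */
         | ((uint32_t)(func & 0x07) << 8)   /* function number: 3 bits */
         | (offset & 0xfcu);                /* dword-aligned register offset */
}
```

For example, register 0x64 on bus 0, device 8, function 0 would be addressed as pci_conf1_address(0, 8, 0, 0x64).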

To find the DCA register for the Intel 5000, 5100, and 7300 chipsets, we need to consult the documentation [4]:


Cool, so the register needed lives at offset 0x64. To enable DCA, bit 6 needs to be set to 1.

Toggling these registers can be a bit cumbersome, but luckily there is libpci, which provides some simple APIs for scanning PCI devices and accessing configuration space registers.

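#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <pci/pci.h>

#define INTEL_BRIDGE_DCAEN_OFFSET   0x64
#define INTEL_BRIDGE_DCAEN_BIT      6
#define MSR_P6_DCA_CAP              0x000001f8

/* the libpci headers may already define these; only fill in what is missing */
#ifndef PCI_HEADER_TYPE
#define PCI_HEADER_TYPE             0x0e
#endif
#ifndef PCI_HEADER_TYPE_BRIDGE
#define PCI_HEADER_TYPE_BRIDGE      1
#endif
#ifndef PCI_VENDOR_ID_INTEL
#define PCI_VENDOR_ID_INTEL         0x8086 /* lol @ intel */
#endif

/* defined in the MSR section further down */
void msr_dca_enable(void);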

void check_dca(struct pci_dev *dev)
{
  /* read DCA status */
  u32 dca = pci_read_long(dev, INTEL_BRIDGE_DCAEN_OFFSET);

  /* if it's not enabled */
  if (!(dca & (1 << INTEL_BRIDGE_DCAEN_BIT))) {
    printf("DCA disabled, enabling now.\n");
   
    /* enable it */
    dca |= 1 << INTEL_BRIDGE_DCAEN_BIT;

    /* write it back */
    pci_write_long(dev, INTEL_BRIDGE_DCAEN_OFFSET, dca);
  } else {
    printf("DCA already enabled!\n");
  }
}

int main(void)
{
  struct pci_access *pacc;
  struct pci_dev *dev;
  u8 type;

  pacc = pci_alloc();
  pci_init(pacc);

  /* scan the PCI bus */
  pci_scan_bus(pacc);

  /* for each device */
  for (dev = pacc->devices; dev; dev=dev->next) {
    pci_fill_info(dev, PCI_FILL_IDENT | PCI_FILL_BASES);

    /* if it's an intel device */
    if (dev->vendor_id == PCI_VENDOR_ID_INTEL) {

        /* read the header type; bit 7 only flags a multi-function
           device, so mask it off before comparing */
        type = pci_read_byte(dev, PCI_HEADER_TYPE) & 0x7f;

        /* if it's a PCI bridge, check and enable DCA */
        if (type == PCI_HEADER_TYPE_BRIDGE) {
          check_dca(dev);
        }
    }
  }

  msr_dca_enable();

  /* release libpci resources */
  pci_cleanup(pacc);
  return 0;
}

Enable DCA in the CPU MSR

A model specific register (MSR) is a control register that is provided by a CPU to enable a feature that exists on a specific CPU. In this case, we care about the DCA MSR. In order to find its address, let’s consult the Intel Developer’s Manual 3B [5].

This register lives at offset 0x1f8. We just need to set its lowest bit to 1 and we should be good to go.

Thankfully, the msr kernel module provides device files in /dev for the MSRs of each CPU:

#define MSR_P6_DCA_CAP      0x000001f8
void msr_dca_enable(void)
{
  char msr_file_name[64];
  int fd = 0, i = 0;
  u64 data;

  /* for each CPU */
  for (;i < NUM_CPUS; i++) {
    snprintf(msr_file_name, sizeof(msr_file_name), "/dev/cpu/%d/msr", i);
    
    /* open the MSR device file */
    fd = open(msr_file_name, O_RDWR);
    if (fd < 0) {
      perror("open failed!");
      exit(1);
    }

    /* read the current DCA status */
    if (pread(fd, &data, sizeof(data), MSR_P6_DCA_CAP) != sizeof(data)) {
      perror("reading msr failed!");
      exit(1);
    }

    printf("got msr value: %*llx\n", 1, (unsigned long long)data);

    /* if DCA is not enabled */
    if (!(data & 1)) {

      /* enable it */
      data |= 1;

      /* write it back */
      if (pwrite(fd, &data, sizeof(data), MSR_P6_DCA_CAP) != sizeof(data)) {
        perror("writing msr failed!");
        exit(1);
      }
    } else {
      printf("msr already enabled for CPU %d\n", i);
    }

    /* done with this CPU's MSR device */
    close(fd);
  }
}
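The loop above relies on NUM_CPUS being supplied at build time. As a rough sketch of an alternative (not what dca_force does), the CPU count could be queried at runtime. The helper names online_cpu_count and msr_path are made up for this example, and _SC_NPROCESSORS_ONLN is a common extension rather than strict POSIX. It also assumes CPU numbers are contiguous, which may not hold if some CPUs are offline.

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: number of CPUs currently online, or 1 if the
 * query fails. */
long online_cpu_count(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? n : 1;
}

/* Hypothetical helper: format the MSR device path for one CPU.
 * Returns 0 on success, -1 if the buffer is too small. */
int msr_path(char *buf, size_t len, long cpu)
{
    int n = snprintf(buf, len, "/dev/cpu/%ld/msr", cpu);
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}
```

With these, the for loop could run from 0 to online_cpu_count() and call msr_path() for each CPU instead of depending on a compile-time constant.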

Code for the hack is on github

Get it here: http://github.com/ice799/dca_force/tree/master

Putting it all together to get your speed boost

  1. Check out the hack from github: git clone git://github.com/ice799/dca_force.git
  2. Build the hack: make NUM_CPUS=whatever
  3. Run it: sudo ./dca_force
  4. Load the kernel module: sudo modprobe ioatdma
  5. Check your dmesg: dmesg | tail

You should see:

[   72.782249] dca service started, version 1.8
[   72.838853] ioatdma 0000:00:08.0: setting latency timer to 64
[   72.838865] ioatdma 0000:00:08.0: Intel(R) I/OAT DMA Engine found, 4 channels, device version 0x12, driver version 3.64
[   72.904027]   alloc irq_desc for 56 on cpu 0 node 0
[   72.904030]   alloc kstat_irqs on cpu 0 node 0
[   72.904039] ioatdma 0000:00:08.0: irq 56 for MSI/MSI-X

in your dmesg.

You should NOT SEE

[    8.367333] ioatdma 0000:00:08.0: DCA is disabled in BIOS

You can now enjoy the DCA performance boost your BIOS or hosting provider didn't want you to have!

Conclusion

  • Intel I/OAT and DCA are pretty cool, and enabling them can give pretty substantial performance wins
  • Cool features are sometimes stuffed away in the BIOS
  • If you don't have access to your BIOS, you should ask your provider nicely to do it for you
  • If your BIOS doesn't have a toggle switch for the feature you need, do a BIOS update
  • If all else fails and you know what you are doing, you can sometimes pull off nasty hacks like this in userland to get what you want

Thanks for reading and don't forget to subscribe (via RSS or e-mail) and follow me on twitter.

P.S.

I know, I know. I skipped Part 2 of the signals post (here's Part 1 if you missed it). Part 2 is coming soon!

References

  1. http://www.linuxfoundation.org/en/Net:I/OAT
  2. http://www.linuxfoundation.org/en/Net:I/OAT
  3. http://www.myri.com/serve/cache/626.html
  4. Intel® 7300 Chipset Memory Controller Hub (MCH) Datasheet, Section 4.8.12.6
  5. Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3B: System Programming Guide, Part 2, Appendix B-19

Written by Joe Damato

July 6th, 2009 at 8:00 am

A Few Things You Didn’t Know about Signals in Linux Part 1

Another post about signal handling?

There are probably lots of people who have blogged about signal handling in Linux, but this series is going to be different. In this blog post, I’m going to unravel the signal handling code paths in the Linux kernel starting at the hardware level, working through the kernel, and ending in the userland signal handler. I’ve tried to use footnotes for code samples which have links to the code in the Linux lxr. Many of the code examples have been snipped for brevity.

As always, this post is specific to the x86_64 CPU architecture and Linux kernel 2.6.29.

Hardware or software generated

Signals are not generated directly by hardware, but certain hardware states can cause the Linux kernel to generate signals. As such we can imagine two ways to generate signals:

  1. Hardware – the CPU does something bad (divides by 0, touches a bad address, etc) which causes the kernel to create and deliver (unless the signal is SIGKILL or SIGSTOP, of course) a signal (SIGFPE, SIGSEGV, etc) to the running process.
  2. Software – an application executes a kill() system call and sends a signal to a specific process.

Both types of signals share a common code path, but hardware generated signals have a very interesting birth. Let’s start there and as we work our way up to userland we’ll stumble into the software signal code path along the way.
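From userland, one way to tell the two kinds apart is the si_code field handed to a SA_SIGINFO handler. Here is a minimal sketch of the software path: the process sends SIGUSR1 to itself with kill() and records what the handler saw. The names on_signal and send_self_sigusr1 are made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Written by the handler, read by the caller afterwards. */
static volatile sig_atomic_t last_signo;
static volatile sig_atomic_t last_code;

/* SA_SIGINFO-style handler: remember which signal arrived and why. */
static void on_signal(int signo, siginfo_t *info, void *ucontext)
{
    (void)ucontext;
    last_signo = signo;
    last_code = info->si_code;
}

/* Hypothetical demo: install the handler, send SIGUSR1 to ourselves
 * with kill(), and return the si_code the handler observed.
 * Returns -1 if installing the handler or sending the signal fails. */
int send_self_sigusr1(void)
{
    struct sigaction sa = {0};

    sa.sa_sigaction = on_signal;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);

    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;
    if (kill(getpid(), SIGUSR1) != 0)
        return -1;

    /* In a single-threaded process with SIGUSR1 unblocked, the signal
     * is delivered before kill() returns. */
    return last_code;
}
```

A signal sent with kill() reports SI_USER here. A hardware-triggered one carries a code describing the fault instead, such as FPE_INTDIV for an integer divide by zero.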

Exceptions on the x86

Let’s start by first understanding what an x86 exception is. For that, we’ll turn to the documentation [1]:

[...] exceptions are events that indicate that a condition exists somewhere in the system, the processor, or within the currently executing program or task that requires the attention of a processor. They typically result in a forced transfer of execution from the currently running program or task to a special software routine [...] or an exception handler.

At a high level this is pretty simple to understand; the system gets in a weird state and the CPU immediately begins executing a predefined handler function to try to fix things (if it can) or die gracefully.

Let’s take a look at how the kernel creates and installs handler functions that the CPU executes when an exception occurs.

Low-level exception handlers

Low-level exception handlers are specified in the Interrupt Descriptor Table (IDT). The IDT can hold up to 256 entries and it can live anywhere in memory. Each entry in the IDT is mapped to a different exception. For example, #DE Divide Error is the first entry in the IDT, IDT[0]; the handler for #PF Page Fault lives at IDT[14]. When a specific exception is encountered, the CPU looks up the handler function in the IDT, puts some data on the stack, and executes the handler.

What does an entry in the IDT look like? Let’s take a look at a picture [2] from Intel:


Take a look at the fields labeled ‘Offset’ – together they hold the address of the function to execute. As you can see, there are three fields labeled ‘Offset.’ Can you guess why?

In order to actually set the address of the function you want to execute, you’ll need to do some bit-shifting. Each ‘Offset’ field indicates which bits of the handler function’s address it holds. You need to be really careful when writing the code that is responsible for creating IDT entries. A bug here could cause your system to do really bizarre things.
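Here is a small sketch of that bit-shifting for a 64-bit interrupt gate, where the handler address is split across three fields of 16, 16, and 32 bits. The struct and function names are made up for this example and are not the kernel's own definitions.

```c
#include <stdint.h>

/* Illustrative layout of a 16-byte x86_64 interrupt gate.
 * Field names are invented for this sketch. */
struct idt_gate64 {
    uint16_t offset_low;     /* handler address bits 0..15 */
    uint16_t selector;       /* code segment selector */
    uint8_t  ist;            /* bits 0..2: interrupt stack table index */
    uint8_t  type_attr;      /* gate type, DPL, present bit */
    uint16_t offset_middle;  /* handler address bits 16..31 */
    uint32_t offset_high;    /* handler address bits 32..63 */
    uint32_t reserved;
} __attribute__((packed));

/* Scatter a 64-bit handler address across the three Offset fields. */
void idt_gate_set_offset(struct idt_gate64 *g, uint64_t handler)
{
    g->offset_low    = (uint16_t)(handler & 0xffff);
    g->offset_middle = (uint16_t)((handler >> 16) & 0xffff);
    g->offset_high   = (uint32_t)(handler >> 32);
}

/* Reassemble the handler address from the three Offset fields. */
uint64_t idt_gate_get_offset(const struct idt_gate64 *g)
{
    return (uint64_t)g->offset_low
         | ((uint64_t)g->offset_middle << 16)
         | ((uint64_t)g->offset_high << 32);
}
```

Getting one shift or mask wrong sends the CPU to a garbage address on the next exception, which is why this code deserves extra care.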

We know what an IDT entry looks like, but what about the data that the CPU pushes on the stack before executing a handler? Unfortunately, I couldn’t track down a picture of what the x86_64 puts on the stack and I can’t draw. So, here is a picture of the data the x86 CPU puts on the stack from Intel [3]:

When an exception occurs, the CPU pushes the stack pointer, the CPU flags register value, the code and stack segment selectors, and the instruction pointer on to the stack before executing the handler.
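In lieu of a drawing, here is a rough sketch in C of what that frame looks like in 64-bit mode, ordered from the lowest address (the top of the stack) upward. The struct and helper names are made up for this example, and exceptions that carry an error code push one extra word below rip that is not shown here.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative view of the frame the CPU pushes in 64-bit mode,
 * lowest address first. Names are invented for this sketch. */
struct exception_frame64 {
    uint64_t rip;     /* instruction pointer at the time of the exception */
    uint64_t cs;      /* code segment selector */
    uint64_t rflags;  /* CPU flags register */
    uint64_t rsp;     /* stack pointer before the exception */
    uint64_t ss;      /* stack segment selector */
};

/* The low two bits of the saved CS hold the privilege level:
 * 3 means the exception interrupted userland code. */
int frame_from_user_mode(const struct exception_frame64 *f)
{
    return (f->cs & 3) == 3;
}
```

Checking the saved cs this way is how a handler can tell whether the fault came from userland or from the kernel itself.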

Nice, but where does the IDT itself live?

The address of the IDT is stored in a CPU register that can be accessed with the instructions lidt and sidt to load and store (respectively) the address of the IDT. Usually, the address of the IDT is set during the initialization of the kernel.

Let’s see where this happens in Linux [4]:

void __init x86_64_start_kernel(char * real_mode_data)
{
        int i;

        /* [...] */

        for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) {
                set_intr_gate(i, early_idt_handler);
        }

        load_idt((const struct desc_ptr *)&idt_descr);

        /* [...] */
}

Cool, so Linux creates a bunch of early handlers in case something goes bad during boot and then a few function calls later (not shown), Linux calls start_kernel() [5], which calls trap_init() [6] for your architecture which actually sets the handlers.

This is pretty important, so let's take a look at the code for this. Thankfully, Linux includes some descriptive function names, so we can see which exceptions are being set.

void __init trap_init(void)
{
        /* ... */

        set_intr_gate(0, &divide_error);

        /* ... */

        set_intr_gate(5, &bounds);
        set_intr_gate(6, &invalid_op);
        set_intr_gate(7, &device_not_available);

        /* ... */

        set_intr_gate(13, &general_protection);
        set_intr_gate(14, &page_fault);
        set_intr_gate(15, &spurious_interrupt_bug);
        set_intr_gate(16, &coprocessor_error);
        set_intr_gate(17, &alignment_check);

        /* ... */
}

Awesome, now let's try to track down where these exception handlers are defined. As it turns out, there is a little bit of C and assembly magic to string this all together.

The low-level exception handlers have a common entry and exit point and are "templated" with a macro. Let's take a look at the macro [7] and some of the handlers [8] in the kernel:

.macro zeroentry sym do_sym
ENTRY(\sym)
        INTR_FRAME
        pushq_cfi $-1           /* ORIG_RAX: no syscall to restart */
        subq $15*8,%rsp
        call error_entry
        DEFAULT_FRAME 0
        movq %rsp,%rdi          /* pt_regs pointer */
        xorl %esi,%esi          /* no error code */
        call \do_sym
        jmp error_exit          /* %ebx: no swapgs flag */
END(\sym)
.endm

So the macro uses the first argument sym as the name of the function, and the second argument do_sym is a C function that is called from this assembly stub.

We also notice from the stub above a very important (and somewhat subtle) piece of code: movq %rsp,%rdi. This instruction copies the stack pointer into %rdi, so %rdi now points at the register state just saved on the stack; we'll see why shortly. First, let's look at how the macro is used to get a better idea how it works:

zeroentry divide_error do_divide_error
zeroentry overflow do_overflow
zeroentry bounds do_bounds
zeroentry invalid_op do_invalid_op
zeroentry device_not_available do_device_not_available

This block of code uses the macro above to create symbols named divide_error, overflow, and more which call out to C functions named do_divide_error, do_overflow, etc. The craziness doesn't end there. These C functions are also generated with macros [9]:

#define DO_ERROR(trapnr, signr, str, name)                              \
dotraplinkage void do_##name(struct pt_regs *regs, long error_code)     \
{                                                                       \
        if (notify_die(DIE_TRAP, str, regs, error_code, trapnr, signr)  \
                                                        == NOTIFY_STOP) \
                return;                                                 \
        conditional_sti(regs);                                          \
        do_trap(trapnr, signr, str, regs, error_code, NULL);            \
}
/*...*/
DO_ERROR(4, SIGSEGV, "overflow", overflow)
DO_ERROR(5, SIGSEGV, "bounds", bounds)

Those last two lines get substituted with the macro above, creating do_overflow, do_bounds, and more. As you might have noticed, the functions generated have dotraplinkage which is a macro for a gcc attribute regparm which tells gcc to pass arguments to the function in registers and not on the stack.
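If the ## token pasting is unfamiliar, here is a stripped-down userland sketch of the same trick. DEFINE_DO_ERROR and record_trap are made-up stand-ins for the kernel's DO_ERROR and do_trap; they only record their arguments so the generated functions can be inspected.

```c
#include <signal.h>

/* Hypothetical stand-in for the kernel's do_trap(): it just records
 * what it was asked to do. */
static int last_trapnr;
static int last_signr;
static const char *last_str;

static void record_trap(int trapnr, int signr, const char *str)
{
    last_trapnr = trapnr;
    last_signr = signr;
    last_str = str;
}

/* Each use of this macro defines a whole function; do_##name pastes
 * the "do_" prefix onto the last argument to form the function name. */
#define DEFINE_DO_ERROR(trapnr, signr, str, name)  \
void do_##name(void)                               \
{                                                  \
        record_trap(trapnr, signr, str);           \
}

/* These two lines expand to functions named do_overflow and do_bounds. */
DEFINE_DO_ERROR(4, SIGSEGV, "overflow", overflow)
DEFINE_DO_ERROR(5, SIGSEGV, "bounds", bounds)
```

Calling do_overflow() then records trap number 4 and SIGSEGV, just as if the function had been written out by hand.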

Remember the movq %rsp,%rdi from the common assembly stub? That line of code exists to pass the address of the state the CPU dumped to the do_* functions via the %rdi register.

The do_* functions notify interested parties about the exception, re-enable interrupts/exceptions if they were disabled, and finally tell the upper-layer signal handling code of the kernel that a signal should be generated, handing over the associated CPU state at the time the exception was generated.

Conclusion for Part 1

Wow. What a wild ride that was.

  1. There are a lot of tricks and subtle hacks in the Linux kernel. Reading and understanding the code can make you a more clever programmer. Dig in!
  2. It is pretty cool (imho) to understand how you actually converse with the CPU and how the CPU talks to the kernel, and how that data is pushed to the upper layers.
  3. The Intel CPU manuals, the gcc man page, and the Linux lxr are a big time help for deciphering this code, which can be cryptic at times.
  4. Understanding what information you have at your disposal can let you do pretty crazy things in userland, as we'll see in the next piece of this series.

Stay tuned: in the next piece of this series I'll walk through the signal handling code in the kernel and show some crazy non-portable tricks you can do in userland.

References

  1. Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3A: System Programming Guide, Part 1, 5.1: Interrupt and Exception Overview
  2. Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3A: System Programming Guide, Part 1, 5.1: Interrupt and Exception Overview
  3. Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3A: System Programming Guide, Part 1, 5.12.1.1: Exception- or Interrupt-Handler Procedures
  4. http://lxr.linux.no/linux+v2.6.30/arch/x86/kernel/head64.c#L89
  5. http://lxr.linux.no/linux+v2.6.29/init/main.c#L590
  6. http://lxr.linux.no/linux+v2.6.29/arch/x86/kernel/traps.c#L953
  7. http://lxr.linux.no/linux+v2.6.29/arch/x86/kernel/entry_64.S#L1028
  8. http://lxr.linux.no/linux+v2.6.29/arch/x86/kernel/entry_64.S#L1121
  9. http://lxr.linux.no/linux+v2.6.29/arch/x86/kernel/traps.c#L236

Written by Joe Damato

June 15th, 2009 at 6:00 am