This article is going to address some kernel and driver tweaks that are interesting and useful. We use several of these in production with excellent performance, but you should proceed with caution and do research prior to trying anything listed below.
Tickless kernel
The tickless kernel feature allows for on-demand timer interrupts. This means that during idle periods, fewer timer interrupts will fire, which should lead to power savings, cooler-running systems, and fewer useless context switches.
Kernel option: CONFIG_NO_HZ=y
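You can check whether the running kernel was built with a given option before recompiling anything. A minimal sketch (the config file location varies by distro, and `check_kconfig` is an ad-hoc helper written for illustration, not a standard tool):

```shell
# check_kconfig: grep a kernel config option out of the build config.
# Takes an optional config-file argument; defaults to the common /boot location.
check_kconfig() {
    file="${2:-/boot/config-$(uname -r)}"
    [ -r "$file" ] && grep "^$1=" "$file"
}

check_kconfig CONFIG_NO_HZ || echo "CONFIG_NO_HZ not set (or config unavailable)"
```

Some distros expose the config at /proc/config.gz instead; point the second argument there (via zcat to a temp file) if /boot has nothing.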
Timer frequency
You can select the rate at which timer interrupts in the kernel will fire. When a timer interrupt fires on a CPU, the process running on that CPU is interrupted while the timer interrupt is handled. Reducing the rate at which the timer fires allows for fewer interruptions of your running processes. This option is particularly useful for servers with multiple CPUs where processes are not running interactively.
Kernel options: CONFIG_HZ_100=y and CONFIG_HZ=100
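One way to see the effect of HZ on a live box is to watch the local timer interrupt counters in /proc/interrupts. A rough sketch (`loc_total` is an ad-hoc helper; the "LOC" row label is the common x86 case and can vary by architecture):

```shell
# Sum the per-CPU "LOC" (local timer) interrupt counts from /proc/interrupts.
loc_total() {
    awk '$1 == "LOC:" { for (i = 2; i <= NF; i++) if ($i ~ /^[0-9]+$/) total += $i }
         END { print total + 0 }' "${1:-/proc/interrupts}"
}

before=$(loc_total)
sleep 1
after=$(loc_total)
echo "timer interrupts/sec across all CPUs: $((after - before))"
```

On an otherwise idle multi-CPU machine, the per-second delta should drop roughly in proportion when moving from HZ=1000 to HZ=100.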
Connector
The connector module is a kernel module that reports process events, such as exit, to userland. This is extremely useful for process monitoring. You can build a simple system (or use an existing one, like god) to watch mission-critical processes. If a process dies due to a signal (like SIGBUS) or exits unexpectedly, you'll get an asynchronous notification from the kernel. The process can then be restarted by your monitor, keeping downtime to a minimum when unexpected events occur.
Kernel options: CONFIG_CONNECTOR=y and CONFIG_PROC_EVENTS=y
TCP segmentation offload (TSO)
A popular feature among newer NICs is TCP segmentation offload (TSO). This feature allows the kernel to offload the work of dividing large packets into smaller packets to the NIC. This frees up the CPU to do more useful work and reduces the overhead the CPU incurs pushing data across the bus. If your NIC supports this feature, you can enable it with:
[joe@timetobleed]% sudo ethtool -K eth1 tso on
Let’s quickly verify that this worked:
[joe@timetobleed]% sudo ethtool -k eth1
Offload parameters for eth1:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: on
large receive offload: off

[joe@timetobleed]% dmesg | tail -1
[892528.450378] 0000:04:00.1: eth1: TSO is Enabled
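If you need this check in a script, the ethtool output is easy to parse. A small sketch (`tso_enabled` is a hypothetical helper name, and the exact label text can differ between ethtool versions):

```shell
# Read `ethtool -k` style output on stdin and print the TSO state ("on"/"off").
tso_enabled() {
    awk -F': ' '$1 == "tcp segmentation offload" { print $2 }'
}

# Real usage would be: sudo ethtool -k eth1 | tso_enabled
printf 'rx-checksumming: on\ntcp segmentation offload: on\n' | tso_enabled
```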
Intel I/OAT DMA Engine
This kernel option enables the Intel I/OAT DMA engine that is present in recent Xeon CPUs. This option increases network throughput as the DMA engine allows the kernel to offload network data copying from the CPU to the DMA engine. This frees up the CPU to do more useful work.
Check to see if it’s enabled:
[joe@timetobleed]% dmesg | grep ioat
ioatdma 0000:00:08.0: setting latency timer to 64
ioatdma 0000:00:08.0: Intel(R) I/OAT DMA Engine found, 4 channels, device version 0x12, driver version 3.64
ioatdma 0000:00:08.0: irq 56 for MSI/MSI-X
There’s also a sysfs interface where you can get some statistics about the DMA engine. Check the directories under
Kernel options: CONFIG_DMADEVICES=y and CONFIG_INTEL_IOATDMA=y and CONFIG_DMA_ENGINE=y and CONFIG_NET_DMA=y and CONFIG_ASYNC_TX_DMA=y
Direct Cache Access (DCA)
Intel’s I/OAT also includes a feature called Direct Cache Access (DCA). DCA allows a driver to warm a CPU cache. A few NICs support DCA; the most popular (to my knowledge) is the Intel 10GbE driver (ixgbe). Refer to your NIC driver documentation to see if your NIC supports DCA. To enable DCA, a switch in the BIOS must be flipped. Some vendors supply machines that support DCA but don’t expose a switch for it. If that is the case, see my last blog post for how to enable DCA manually.
You can check if DCA is enabled:
[joe@timetobleed]% dmesg | grep dca
dca service started, version 1.8
If DCA is possible on your system but disabled, you’ll see:
ioatdma 0000:00:08.0: DCA is disabled in BIOS
This means you’ll need to enable it in the BIOS or manually.
Kernel option: CONFIG_DCA=y
NAPI
The “New API” (NAPI) is a rework of the kernel’s packet processing code to improve performance for high-speed networking. NAPI provides two major features [1]:
Interrupt mitigation: High-speed networking can create thousands of interrupts per second, all of which tell the system something it already knew: it has lots of packets to process. NAPI allows drivers to run with (some) interrupts disabled during times of high traffic, with a corresponding decrease in system load.
Packet throttling: When the system is overwhelmed and must drop packets, it’s better if those packets are disposed of before much effort goes into processing them. NAPI-compliant drivers can often cause packets to be dropped in the network adaptor itself, before the kernel sees them at all.
Many recent NIC drivers automatically support NAPI, so you don’t need to do anything. Some drivers require you to explicitly enable NAPI in the kernel config or on the command line when compiling the driver. If you are unsure, check your driver documentation. A good place to look for docs is the Documentation directory in your kernel source, also available on the web at http://lxr.linux.no/linux+v2.6.30/Documentation/networking/; be sure to select the correct kernel version first!
For older e1000 drivers, NAPI must be enabled at build time (newer drivers need no extra step):
make CFLAGS_EXTRA=-DE1000_NAPI install
Throttle NIC Interrupts
Some drivers allow the user to specify the rate at which the NIC will generate interrupts. The e1000e driver allows you to pass a command line option when loading the module with insmod. For the e1000e there are two dynamic interrupt throttle mechanisms, specified on the command line as 1 (dynamic) and 3 (dynamic conservative). The adaptive algorithm sorts incoming traffic into different classes and adjusts the interrupt rate appropriately. The difference between dynamic and dynamic conservative is the interrupt rate for the “Lowest Latency” traffic class: dynamic (1) uses a much more aggressive interrupt rate for that class.
insmod e1000e.o InterruptThrottleRate=1
As always, check your driver documentation for more information.
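An insmod option doesn’t survive a reboot. A more persistent approach is a modprobe options file; a sketch (the filename is illustrative):

```
# /etc/modprobe.d/e1000e.conf (illustrative filename)
options e1000e InterruptThrottleRate=1
```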
Process and IRQ affinity
Linux allows the user to specify which CPUs processes and interrupt handlers are bound to.
- Processes: You can use taskset to specify which CPUs a process can run on.
- Interrupt Handlers: The interrupt map can be found in /proc/interrupts, and the affinity for each interrupt can be set in the file smp_affinity in the directory for each interrupt under /proc/irq/.
This is useful because you can pin the interrupt handlers for your NICs to specific CPUs. When a shared resource (such as a lock in the network stack) is touched and loaded into a CPU cache, the next time the handler runs it will be put on the same CPU, avoiding the costly cache invalidations that occur if the handler lands on a different CPU.
Furthermore, reports [2] show that improvements of up to 24% can be had if processes and the IRQs for the NICs those processes read data from are pinned to the same CPUs. Doing this ensures that the data loaded into the CPU cache by the interrupt handler can be used (without invalidation) by the process; extremely high cache locality is achieved.
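A concrete sketch of the pinning described above. The IRQ number (56) and PID (1234) are illustrative; look up your NIC’s IRQ in /proc/interrupts. `cpu_mask` is an ad-hoc helper, not a standard command:

```shell
# smp_affinity takes a hexadecimal bitmask of allowed CPUs: bit N = CPU N.
cpu_mask() {
    printf '%x\n' $((1 << $1))
}

cpu_mask 2    # CPU 2 -> mask "4"

# Pin IRQ 56 and PID 1234 to CPU 2 (needs root; commented out here):
# echo "$(cpu_mask 2)" | sudo tee /proc/irq/56/smp_affinity
# taskset -pc 2 1234
```

Note that irqbalance, if running, may rewrite smp_affinity behind your back; disable or configure it before pinning by hand.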
oprofile
oprofile is a system-wide profiler that can profile both kernel and application level code. There is a kernel driver for oprofile which collects data from the x86's Model Specific Registers (MSRs) to give very detailed information about the performance of running code. oprofile can also annotate source code with performance information to make fixing bottlenecks easy. See oprofile's homepage for more information.
Kernel options: CONFIG_OPROFILE=y and CONFIG_HAVE_OPROFILE=y
epoll
epoll(7) is useful for applications that must watch for events on large numbers of file descriptors. The epoll interface is designed to scale easily to large numbers of file descriptors.
epoll is already enabled in most recent kernels, but some strange distributions (which will remain nameless) have this feature disabled.
Kernel option: CONFIG_EPOLL=y
Conclusion
- There are a lot of useful levers that can be pulled when trying to squeeze every last bit of performance out of your system.
- It is extremely important to read and understand your hardware documentation if you hope to achieve the maximum throughput your system is capable of.
- You can find documentation for your kernel online at the Linux LXR. Make sure to select the correct kernel version, because docs change as the source changes!