time to bleed by Joe Damato

technical ramblings from a wanna-be unix dinosaur

Archive for the ‘system health’ tag

a/b test mallocs against your memory footprint

The other day at Kickball Labs we were discussing whether linking Ruby against tcmalloc (or ptmalloc3, nedmalloc, or any other malloc) would have any noticeable effect on application latency. After taking a side in the argument, I started wondering how we could test this scenario.

We had a few different ideas about how to test it:

  • Look at other people’s benchmarks
    BUT do the memory workloads tested in the benchmarks actually match our own workload at all?
  • Run different allocators on different Ruby backends
    BUT different backends will get different users who will use the system differently and cause different allocation patterns
  • Try to recreate our application’s memory footprint and test that against different mallocs
    BUT how?

I decided to explore the last option and came up with an interesting solution. Let’s dive into how to do this.

Step 1: We need to get a memory footprint of our process

So we have some random binary (in this case it happens to be a Ruby interpreter, but it could be anything) and we’d like to track when it calls malloc/realloc/calloc and free (from now on I’ll refer to all of these as malloc-family for brevity). There are two ways to do this: the right way and the wrong/hacky/unsafe way.

  • The “right” way to do this, with libc malloc hooks:

    Edit your application code to use the malloc debugging hooks provided by libc. When a malloc-family function is called, your hook executes and outputs to a file which function was called and what arguments were passed to it.

  • The “wrong/hacky/unsafe” way to do this, with LD_PRELOAD:

    Create a shim library and point LD_PRELOAD at it. The shim exports the malloc-family symbols, and when your application calls one of those functions, the shim code gets executed. The shim logs which function was called and with what arguments. The shim then calls the libc version of the function (so that memory is actually allocated/freed) and returns control to the application.

I chose to do it the second way, because I like living on the edge. The second way is unsafe because the shim can’t call any function that itself uses a malloc-family function before the hooks are set up (dlsym, which a shim typically uses to find the libc versions, may allocate, for example). If it does, you can end up in infinite recursion and crash the application.

You can check out my implementation for the shim library here: malloc_wrap.c

Why does your shim output such weirdly formatted data?

The answer is sort of complicated, but let’s keep it simple: I originally had a different idea about how I was going to use the output. When that first try failed, I tried something else and translated the data to the format I needed, instead of rewriting the shim. What can I say, I’m a lazy programmer.

OK, so once you’ve built the shim (gcc -O2 -Wall -fPIC -shared -o malloc_wrap.so malloc_wrap.c -ldl), you can launch your binary like this:

% LD_PRELOAD=/path/to/shim/malloc_wrap.so /path/to/your/binary -your -args

You should now see output in /tmp/malloc-footprint.PID, where PID is the process ID of your binary.

Step 2: Translate the data into a more usable format

Yeah, I should have gone back and rewritten the shim, but nothing happens exactly as planned. So, I wrote a quick Ruby script to convert my output into a more usable format. The script sorts through the output and renames memory addresses to unique integer IDs starting at 1 (0 is hardcoded to NULL).

The format is pretty simple. The first line of the file has the number of calls to malloc-family functions, followed by a blank line, and then the memory footprint. Each line of the memory footprint has 1 character which represents the function called followed by a few arguments. For the free() function, there is only one argument, the ID of the memory block to free. malloc/calloc/realloc have different arguments, but the first argument following the one character is always the ID of the return value. The next arguments are the arguments usually passed to malloc/calloc/realloc in the same order.
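As a rough illustration of the renaming step, here is a minimal Ruby sketch. It is not build_trace_file.rb: the raw input format (lines like "malloc 0x1000 32", with the returned address first), the m/c/r/f record letters, and the build_trace name are all assumptions made up for this example. It also holds everything in memory, which a multi-gigabyte log would not allow.

```ruby
# Sketch only. Assumed raw log lines (hypothetical format):
#   malloc  <returned addr> <size>
#   calloc  <returned addr> <nmemb> <size>
#   realloc <returned addr> <old addr> <size>
#   free    <addr>
def build_trace(raw_lines)
  ids = { 0 => 0 }                      # address 0 (NULL) is always ID 0
  id_for = lambda do |address_text|
    address = Integer(address_text)     # Integer() understands the 0x prefix
    ids[address] ||= ids.size           # first new address gets 1, the next 2, ...
  end

  records = raw_lines.map do |line|
    name, *args = line.split
    case name
    when "malloc"  then "m #{id_for.call(args[0])} #{args[1]}"
    when "calloc"  then "c #{id_for.call(args[0])} #{args[1]} #{args[2]}"
    when "realloc" then "r #{id_for.call(args[0])} #{id_for.call(args[1])} #{args[2]}"
    when "free"    then "f #{id_for.call(args[0])}"
    else raise ArgumentError, "unrecognized line: #{line}"
    end
  end

  # First line: number of calls, then a blank line, then one record per call.
  [records.size.to_s, "", *records].join("\n")
end

puts build_trace(["malloc 0x1000 32", "realloc 0x3000 0x1000 64", "free 0x3000"])
```

Giving each distinct address a single ID is the simplest choice; a converter could just as well hand out a fresh ID every time an address is reused after a free.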

Have a look at my Ruby script here: build_trace_file.rb

It might take a while to convert your data to this format, so I suggest running it in a screen session, especially if your memory footprint data is large. Just as a warning, we collected 15 *gigabytes* of data over a 10 hour period. The script took *10 hours* to convert the data, and we ended up with a 7.8 gigabyte file.

% ruby /path/to/script/build_trace_file.rb /path/to/raw/malloc-footprint.PID /path/to/converted/my-memory-footprint

Step 3: Replay the allocation data with different allocators and measure time and memory usage

OK, so we now have a file which represents the memory footprint of our application. It’s time to build the replayer, link against your malloc implementation of choice, fire it up and start measuring time spent in allocator functions and memory usage.

Have a look at the replayer here: alloc_tester.c
Build the replayer: gcc -ggdb -Wall -fPIC -o tester alloc_tester.c -ldl
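Since a full replay takes hours, it can be worth sanity-checking a trace first. The Ruby sketch below is not alloc_tester.c and never calls a real allocator. It walks a trace in the format from Step 2 and reports the peak number of requested bytes that were live at once, keeping a table of sizes indexed by block ID much as a replayer would keep a table of pointers. The m/c/r/f record letters and the peak_live_bytes name are assumptions for illustration.

```ruby
# Sketch only. Assumed records: "m id size", "c id nmemb size",
# "r id old_id size", "f id"; ID 0 stands for NULL.
def peak_live_bytes(trace_text)
  lines = trace_text.lines.map(&:strip)
  expected = Integer(lines.first)
  records = lines.drop(1).reject(&:empty?)
  unless records.size == expected
    raise ArgumentError, "header says #{expected} records, found #{records.size}"
  end

  live = {}      # block ID => bytes requested for that block
  current = 0
  peak = 0
  records.each do |record|
    op, *args = record.split
    args = args.map { |arg| Integer(arg) }
    ret = nil
    size = 0
    case op
    when "m"
      ret, size = args
    when "c"
      ret = args[0]
      size = args[1] * args[2]
    when "r"
      ret, old, size = args
      # Simplification: the old block is always treated as released,
      # even though a failed realloc would leave it allocated.
      current -= live.delete(old) || 0
    when "f"
      current -= live.delete(args[0]) || 0
    else
      raise ArgumentError, "unknown record: #{record}"
    end
    next if ret.nil? || ret.zero?   # a free, or an allocation that returned NULL

    current -= live[ret] || 0       # tolerate an ID that was never freed
    live[ret] = size
    current += size
    peak = current if current > peak
  end
  peak
end

puts peak_live_bytes("3\n\nm 1 100\nm 2 50\nf 1\n")
```

This counts requested bytes only; the allocator's own overhead and fragmentation are exactly what the real replay is meant to measure.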

Use ltrace

ltrace is similar to strace, but for library calls. You can use ltrace -c to sum the amount of time spent in different library calls and output a cool table at the end. It will look something like this:

% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
 86.70   37.305797          62    600003 fscanf
 10.64    4.578968          33    138532 malloc
  2.36    1.014294          18     55263 free
  0.25    0.109550          18      5948 realloc
  0.03    0.011407          45       253 printf
  0.02    0.010665          42       252 puts
  0.00    0.000167          20         8 calloc
  0.00    0.000048          48         1 fopen
------ ----------- ----------- --------- --------------------
100.00   43.030896                800260 total
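Note that in this sample most of the time goes to fscanf (presumably the replayer reading the trace file), so the number worth comparing between allocators is the malloc-family share, which works out to about 13% here. A small Ruby sketch for pulling that out of the table follows; the function name and the returned hash layout are made up for illustration.

```ruby
MALLOC_FAMILY = %w[malloc calloc realloc free].freeze

# Sums the malloc-family rows of an `ltrace -c` summary table.
def malloc_family_summary(ltrace_table)
  seconds = 0.0
  calls = 0
  total_seconds = nil

  ltrace_table.each_line do |line|
    fields = line.split
    # Skip the header, the dashed rules, and blank lines: only rows that
    # start with a percentage such as "10.64" carry data.
    next unless fields.first.to_s.match?(/\A\d+\.\d+\z/)

    name = fields.last
    if name == "total"
      total_seconds = Float(fields[1])
    elsif MALLOC_FAMILY.include?(name)
      seconds += Float(fields[1])   # column 2: seconds
      calls += Integer(fields[3])   # column 4: calls
    end
  end

  raise ArgumentError, "no total row found" if total_seconds.nil?
  { seconds: seconds, calls: calls, share: seconds / total_seconds }
end
```

Feed it the saved output of each run and compare the :seconds values across malloc implementations.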


Using a different malloc implementation can provide speed or memory improvements, depending on your allocation patterns. Hopefully the code provided will help you test different allocators to determine whether or not swapping out the default libc allocator is the right choice for you. Our results are still pending; we had a lot of allocator data (15 GB!) and it takes several hours to replay the data with just one malloc implementation. Once we’ve gathered some data about the different implementations and their effects, I’ll post the results and some analysis. As always, stay tuned and thanks for reading!

Written by Joe Damato

March 16th, 2009 at 8:39 pm

It’s 10PM: Do you know your RAID/BBU/consistency status?

Huh? RAID status? Consistency status?

The status of your RAID array tells you whether the array has degraded and which disk(s) are the culprit. Most RAID status reports also include details such as temperature and installed memory.

You also need to run consistency checks to ensure that data on bad blocks will either be moved or rewritten to good blocks. Why is this important? Consider the following scenario: You have a RAID 10 array. One disk dies, say disk A of mirrored pair 1. You replace that disk and start a rebuild of the array. You never ran a consistency check, and it turns out that there were bad blocks on disk B of mirrored pair 1 that were never reallocated to good blocks. During the rebuild, the replacement disk is populated from disk B, but disk B may not be able to read the data in its bad blocks. Corrupt data then gets written to the replacement disk, and you likely won’t notice a problem until the box crashes or you are missing data due to corruption.

Whoa that is pretty serious. How can I keep track of all that?

The two common failure notifications for a logical failure I’ve seen are alarms and RAID status changes.

In my opinion, alarms are generally useless unless you are sitting near your server. What good is an alarm if you don’t hear it? While I wouldn’t rely on an alarm as the first line of defense against a RAID failure, it can definitely grab the attention of a nearby tech in the data center when a problem arises.

RAID status changes are probably the most useful way to determine when a RAID array degrades.

For physical disk failures, you’ll only know when a consistency check is run, when you lose data, or when the box dies. Some RAID adapters can be set up to run consistency checks automatically; others need the check to be invoked manually each time.

Speaking of consistency, don’t forget about that battery backup unit (BBU)!

A battery backup unit is necessary for a RAID array which has its write cache enabled. If write requests are sitting in the cache when the system loses power, the BBU keeps the cache powered so that the outstanding writes can be flushed to the array once power is restored. If you have the write cache enabled but don’t have a BBU when power is lost, the data on the system could end up corrupt because the writes in the cache may never reach the disks.

How do I check my RAID/BBU status?

Checking your RAID/BBU status is very vendor-specific. Each vendor has its own method, but the most common one by far is to expose a management interface (in the form of a character device) which accepts queries from userspace via ioctl.

Most hardware RAID vendors include a small binary or script which will send ioctls to the management interface and give you detailed information about the status of your device. I’ve listed the names of the management apps for Adaptec and 3ware RAID devices below and included a sample output from an aacraid device at the bottom of this post.

Adaptec aacraid – /usr/StorMan/arcconf

3ware RAID – /usr/bin/tw_cli

You can write a script that runs as a cron job, parses the output of the management binary, and sends an email/page when a status change occurs.
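Here is a minimal Ruby sketch of the parsing half of such a script. It assumes the layout of the arcconf getconfig output shown at the bottom of this post (section titles on their own line, "name : value" pairs beneath them); the function names and the choice of which fields to flag are mine, not anything arcconf provides.

```ruby
# Parses "name : value" lines into { section title => { name => value } }.
def parse_arcconf(text)
  report = Hash.new { |hash, section| hash[section] = {} }
  section = "(none)"
  text.each_line do |line|
    line = line.strip
    next if line.empty? || line.match?(/\A-+\z/)   # blank lines and dashed rules
    if (pair = line.match(/\A(.+?)\s+:\s*(.*)\z/))
      report[section][pair[1]] = pair[2]
    else
      section = line                               # a line without " : " starts a section
    end
  end
  report
end

# Returns a list of human-readable problems; an empty list means all clear.
def raid_problems(report)
  info = report["Controller information"]
  battery = report["Controller Battery Information"]
  problems = []

  status = info["Controller Status"]
  problems << "controller status is #{status || 'unknown'}" unless status == "Optimal"

  battery_status = battery["Status"]
  if battery_status && battery_status != "Optimal"
    problems << "battery status is #{battery_status}"
  end

  counts = info.fetch("Logical devices/Failed/Degraded", "0/0/0")
  _total, failed, degraded = counts.split("/").map(&:to_i)
  problems << "#{failed} failed logical device(s)" if failed > 0
  problems << "#{degraded} degraded logical device(s)" if degraded > 0

  defunct = info.fetch("Defunct disk drive count", "0").to_i
  problems << "#{defunct} defunct disk drive(s)" if defunct > 0

  if info["Background consistency check"] == "Disabled"
    problems << "background consistency check is disabled"
  end
  problems
end
```

In the cron job you would pass in the text returned by running /usr/StorMan/arcconf getconfig 1 AD and print whatever raid_problems returns. Cron normally mails a job's output to its owner, so printing only when the list is non-empty (or differs from the previous run) gives you a basic alert.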

How can I run consistency checks?

This is also incredibly vendor-specific. The consistency check can usually be run/scheduled via the CLI. You should check the documentation for the CLI tool. With an aacraid controller, a consistency check can be run by using the datascrub command:

/usr/StorMan/arcconf datascrub 1 period 10

This will perform a consistency check in the background that has 10 days to complete.

How can I protect myself from a single disk failure?

There are many different RAID configurations, but the most common ones which can protect you from a single disk failure are:

  • RAID 1
  • RAID 5
  • RAID 6
  • RAID 10

What about a multiple disk failure?

Again, there are many different RAID configurations, but there are two major ways to survive a multiple-disk failure. Unfortunately, one of them involves being really lucky.

  • RAID 10 – You have to be pretty lucky here. As long as there is at least one working disk in each mirrored pair, you should be OK.
  • Double Parity RAID 6 – This configuration can survive a failure of any two disks.
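To put a number on “pretty lucky”, the Ruby sketch below enumerates every way a given number of disks can fail in a RAID 10 array and counts the cases where every mirrored pair keeps at least one disk. It assumes two-way mirrors and that every combination of failed disks is equally likely, which real-world failures (same batch, same enclosure) often are not. The function name is made up for this example.

```ruby
# Assumes two-way mirrors: disks 0 and 1 form one pair, 2 and 3 the next, and so on.
def raid10_survival_odds(disk_count, failed_count)
  raise ArgumentError, "need an even number of disks" if disk_count.odd?
  raise ArgumentError, "more failures than disks" if failed_count > disk_count

  outcomes = (0...disk_count).to_a.combination(failed_count).to_a
  survivable = outcomes.count do |failed|
    # The array survives only if no two failed disks share a pair.
    failed.map { |disk| disk / 2 }.uniq.size == failed.size
  end
  Rational(survivable, outcomes.size)
end

puts raid10_survival_odds(4, 2)   # four disks, two failures
puts raid10_survival_odds(8, 2)   # eight disks, two failures
```

For a four-disk array, two simultaneous failures are survivable in 2 out of 3 cases; with eight disks that rises to 6 out of 7. RAID 6 survives any two failures regardless.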


Read your RAID device documentation carefully and follow any relevant suggestions. If you don’t have RAID status monitoring set up, do it now. The minimal time investment to set this up can save you down the road when a hardware failure occurs.

You should also run a consistency check as soon as possible and schedule further checks at regular intervals. Check your RAID docs for more info about how to run a consistency check.

Sample output from an aacraid device that doesn’t have consistency checks running:

sudo /usr/StorMan/arcconf getconfig 1 AD

Controller information
   Controller Status                        : Optimal
   Channel description                      : SAS/SATA
   Controller Model                         : Adaptec 3405
   Controller Serial Number                 : 7C391118F8E
   Physical Slot                            : 2
   Temperature                              : 43 C/ 109 F (Normal)
   Installed memory                         : 128 MB
   Copyback                                 : Disabled
   Background consistency check             : Disabled
   Automatic Failover                       : Enabled
   Defunct disk drive count                 : 0
   Logical devices/Failed/Degraded          : 1/0/0
   Controller Version Information
   BIOS                                     : 5.2-0 (15753)
   Firmware                                 : 5.2-0 (15753)
   Driver                                   : 1.1-5 (2456)
   Boot Flash                               : 5.2-0 (15753)
   Controller Battery Information
   Status                                   : Optimal
   Over temperature                         : No
   Capacity remaining                       : 100 percent
   Time remaining (at current draw)         : 3 days, 1 hours, 31 minutes

Written by Joe Damato

January 11th, 2009 at 8:28 pm