time to bleed by Joe Damato

technical ramblings from a wanna-be unix dinosaur

Archive for the ‘monitoring’ tag

an obscure kernel feature to get more info about dying processes



If you enjoy this article, subscribe (via RSS or e-mail) and follow me on twitter.

tl;dr

This post will describe how I stumbled upon a code path in the Linux kernel which allows external programs to be launched when a core dump is about to happen. I provide a link to a short and ugly Ruby script which captures a faulting process, runs gdb to get a backtrace (and other information), captures the core dump, and then generates a notification email.

I don’t care about faults because I use monit, god, etc

Chill.

Your processes may get automatically restarted when a fault occurs and you may even get an email letting you know your process died. Both of those things are useful, but it turns out that with just a tiny bit of extra work you can actually get very detailed emails showing a stack trace, register information, and a snapshot of the process’ files in /proc.

random walking the linux kernel

One day I was sitting around wondering how exactly the coredump code paths are wired. I cracked open the kernel source and started reading.

It wasn’t long until I saw this piece of code from exec.c[1]:

void do_coredump(long signr, int exit_code, struct pt_regs *regs)
{
  /* .... */
  lock_kernel();
  ispipe = format_corename(corename, signr);
  unlock_kernel();

   if (ispipe) {
   /* ... */

Hrm. ispipe? That seems interesting. I wonder what format_corename does and what ispipe means. Following through and reading format_corename[2]:

static int format_corename(char *corename, long signr)
{
	/* ... */

        const char *pat_ptr = core_pattern;
        int ispipe = (*pat_ptr == '|');

	/* ... */

        return ispipe;
}

In the above code, core_pattern (which can be set with a sysctl or via /proc/sys/kernel/core_pattern) is examined to determine whether its first character is a |. If so, format_corename returns 1. So | seems relatively important, but at this point I’m still unclear on what it actually means.

Scanning the rest of the code for do_coredump reveals something very interesting[3] (this is more code from the function in the first code snippet above):

     /* ... */

     helper_argv = argv_split(GFP_KERNEL, corename+1, NULL);

     /* ... */

     retval = call_usermodehelper_fns(helper_argv[0], helper_argv,
                             NULL, UMH_WAIT_EXEC, umh_pipe_setup,
                             NULL, &cprm);

    /* ... */

WTF? call_usermodehelper_fns? umh_pipe_setup? This is looking pretty interesting. If you follow the code down a few layers, you end up at call_usermodehelper_exec which has the following very enlightening comment:

/**
 * call_usermodehelper_exec - start a usermode application
 *
 *  . . .
 *
 * Runs a user-space application.  The application is started
 * asynchronously if wait is not set, and runs as a child of keventd.
 * (ie. it runs with full root capabilities).
 */

what it all means

Taken all together, this is actually pretty fucking sick:

  • You can set /proc/sys/kernel/core_pattern to run a script when a process is going to dump core.
  • Your script is run before the process is killed.
  • A pipe is opened and attached to your script. The kernel writes the coredump to the pipe. Your script can read it and write it to storage.
  • Your script can attach GDB, get a backtrace, and gather other information to send a detailed email.

But the coolest part of all:

  • All of the files in /proc/[pid] for that process remain intact and can be inspected. You can check the open file descriptors, the process’s memory map, and much much more.

ruby codez to harness this amazing code path

I whipped up a pretty simple, ugly Ruby script. You can get it here. I set up my system to use it like this:

% echo "|/path/to/core_helper.rb %p %s %u %g" > /proc/sys/kernel/core_pattern 

Where:

  • %p – PID of the dying process
  • %s – signal number causing the core dump
  • %u – real user ID of the dying process
  • %g – real group ID of the dying process
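
To make the mechanics concrete, here’s a stripped-down, hypothetical C sketch of what a core_pattern pipe helper does (the linked Ruby script is what actually attaches gdb and builds the email; this is just the skeleton). The kernel execs the helper with the core dump attached to stdin, so the helper saves stdin to disk and can then snapshot whatever it likes from /proc/[pid]:

/* core_helper.c -- hypothetical sketch, not the linked Ruby script
 * build:   gcc -o core_helper core_helper.c
 * install: echo "|/path/to/core_helper %p %s" > /proc/sys/kernel/core_pattern
 */
#include <stdio.h>

int main(int argc, char *argv[])
{
  char path[256], buf[8192];
  size_t n;

  if (argc < 3)
    return 1;

  const char *pid = argv[1];  /* %p */
  const char *sig = argv[2];  /* %s */

  /* 1. save the core dump the kernel is piping to our stdin */
  snprintf(path, sizeof(path), "/tmp/core.%s.%s", pid, sig);
  FILE *core = fopen(path, "w");
  if (!core)
    return 1;
  while ((n = fread(buf, 1, sizeof(buf), stdin)) > 0)
    fwrite(buf, 1, n, core);
  fclose(core);

  /* 2. the dying process hasn't been reaped yet, so /proc/[pid] is intact;
   *    snapshot its memory map (fd/, smaps, etc. work the same way)      */
  snprintf(path, sizeof(path), "/proc/%s/maps", pid);
  FILE *maps = fopen(path, "r");
  if (maps) {
    while (fgets(buf, sizeof(buf), maps))
      fputs(buf, stderr);  /* in real life: append to the report/email body */
    fclose(maps);
  }
  return 0;
}

Remember that, per the call_usermodehelper comment above, the helper runs with full root privileges, so treat it accordingly.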

Why didn’t you read the documentation instead?

This (as far as I can tell) little-known feature is documented at linux-kernel-source/Documentation/sysctl/kernel.txt under the “core_pattern” section. I didn’t read the documentation because (little-known fact) I actually don’t know how to read. I found the code path randomly, and it was much more fun and interesting to discover this little feature by diving into the code.

Conclusion

  • This could/should probably be a feature/plugin/whatever for god/monit/etc instead of a stand-alone script.
  • Reading code to discover features doesn’t scale very well, but it is a lot more fun than reading documentation all the time. Also, you learn stuff and reading code makes you a better programmer.

References

  1. http://lxr.linux.no/linux+v2.6.35.4/fs/exec.c#L1836
  2. http://lxr.linux.no/linux+v2.6.35.4/fs/exec.c#L1446
  3. http://lxr.linux.no/linux+v2.6.35.4/fs/exec.c#L1836

Written by Joe Damato

September 20th, 2010 at 4:59 am

Slides from Defcon 18: Function hooking for OSX and Linux


Written by Aman Gupta

August 1st, 2010 at 11:24 am

memprof: A Ruby level memory profiler



If you enjoy this article, subscribe (via RSS or e-mail) and follow me on twitter.

What is memprof and why do I care?

memprof is a Ruby gem which supplies memory profiler functionality similar to bleak_house without patching the Ruby VM. You just install the gem, call a function or two, and off you go.

Where do I get it?

memprof is available on gemcutter, so you can just:

gem install memprof

Feel free to browse the source code at: http://github.com/ice799/memprof.

How do I use it?

Using memprof is simple. Before we look at some examples, let me explain more precisely what memprof is measuring.

memprof is measuring the number of objects created and not destroyed during a segment of Ruby code. The ideal use case for memprof is to show you where objects that do not get destroyed are being created:

  • Objects are created and not destroyed when you create new classes. This is a good thing.
  • Sometimes garbage objects sit around until garbage_collect has had a chance to run. These objects will go away.
  • In yet other cases, you might be holding a reference to a large chain of objects without knowing it. Until you remove this reference, the entire chain of objects will remain in memory, taking up space.

memprof will show objects created in all cases listed above.

OK, now let’s take a look at two examples and their output.

A simple program with an obvious memory “leak”:

require 'memprof'

@blah = Hash.new([])

Memprof.start
100.times {
  @blah[1] << "aaaaa"
}

1000.times {
   @blah[2] << "bbbbb"
}
Memprof.stats
Memprof.stop

This program creates 1100 objects between the calls to Memprof.start and Memprof.stop which are not destroyed, because a reference is held for each object created.

Let's look at the output from memprof:

   1000 test.rb:11:String
    100 test.rb:7:String

In this example memprof shows the 1100 objects created, broken down by file, line number, and type.

Let's take a look at another example:

require 'memprof'
Memprof.start
require "stringio"
StringIO.new
Memprof.stats

This simple program is measuring the number of objects created when requiring stringio.

Let's take a look at the output:

    108 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:__node__
     14 test2.rb:3:String
      2 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:Class
      1 test2.rb:4:StringIO
      1 test2.rb:4:String
      1 test2.rb:3:Array
      1 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:Enumerable

This output shows an internal Ruby interpreter type __node__ was created (these represent code), as well as a few Strings and other objects. Some of these objects are just garbage objects which haven't had a chance to be recycled yet.

What if we nudge the garbage collector along a little bit, just for our example? Let's add the following two lines of code to our previous example:

GC.start
Memprof.stats

We're now nudging the garbage collector and outputting memprof stats information again. This should show fewer objects, as the garbage collector will recycle some of the garbage objects:

    108 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:__node__
      2 test2.rb:3:String
      2 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:Class
      1 /custom/ree/lib/ruby/1.8/x86_64-linux/stringio.so:0:Enumerable

As you can see above, a few Strings and other objects went away after the garbage collector ran.

Which Rubies and systems are supported?

  • Only unstripped binaries are supported. To determine if your Ruby binary is stripped, simply run: file `which ruby`. If it is, consult your package manager's documentation. Most Linux distributions offer a package with an unstripped Ruby binary.
  • Only x86_64 is supported at this time. Hopefully, I'll have time to add support for i386/i686 in the immediate future.
  • Linux Ruby Enterprise Edition (1.8.6 and 1.8.7) is supported.
  • Linux MRI Ruby 1.8.6 and 1.8.7 built with --disable-shared are supported. Support for --enable-shared binaries is coming soon.
  • Snow Leopard support is experimental at this time.
  • Ruby 1.9 support coming soon.

How does it work?

If you've been reading my blog over the last week or so, you'll have noticed two previous blog posts (here and here) that describe some tricks I came up with for modifying a running binary image in memory.

memprof is a combination of all those tricks and other hacks to allow memory profiling in Ruby without the need for custom patches to the Ruby VM. You simply require the gem and off you go.

memprof works by inserting trampolines on object allocation and deallocation routines. It gathers metadata about the objects and outputs this information when the stats method is called.

What else is planned?

Jake Douglas, Aman Gupta, and I have lots of interesting ideas for new features. We don't want to ruin the surprise, but stay tuned. More cool stuff coming really soon :)

Thanks for reading and don't forget to subscribe (via RSS or e-mail) and follow me on twitter.

Written by Joe Damato

December 11th, 2009 at 5:59 am

Extending ltrace to make your Ruby/Python/Perl/PHP apps faster



If you enjoy this article, subscribe (via RSS or e-mail) and follow me on twitter.

A few days ago, Aman (@tmm1) was complaining to me about a slow running process:

I want to see what is happening in userland and trace calls to extensions. Why doesn’t ltrace work for Ruby processes? I want to figure out which MySQL queries are causing my app to be slow.

It turns out that ltrace did not have support for libraries loaded with libdl. This is a problem for languages like Ruby, Python, PHP, Perl, and others because in many cases extensions, libraries, and plugins for these languages are loaded by the VM using libdl. This means that ltrace is somewhat useless for tracking down performance issues in dynamic languages.

A couple of late nights of hacking later, I managed to finagle libdl support into ltrace. Since most people probably don’t care about the technical details of how it was implemented, I’ll start by showing how to use the patch I wrote and what sort of output you can expect. This patch has made tracking down slow queries (among other things) really easy, and I hope others will find it useful.

How to use ltrace:

After you’ve applied my patch (below) and rebuilt ltrace, let’s say you’d like to trace MySQL queries and have ltrace tell you when the query was executed and how long it took. There are two steps:

  1. Give ltrace info so it can pretty print – echo "int mysql_real_query(addr,string,ulong);" > custom.conf
  2. Tell ltrace you want to hear about mysql_real_query: ltrace -F custom.conf -ttTgx mysql_real_query -p <pid>

Here’s what those arguments mean:

  • -F use a custom config file when pretty-printing (default: /etc/ltrace.conf, add your stuff there to avoid -F if you wish).
  • -tt print the time (including microseconds) when the call was executed
  • -T time the call and print how long it took
  • -x tells ltrace the name of the function you care about
  • -g avoid placing breakpoints on all library calls except the ones you specify with -x. This is optional, but it makes ltrace produce much less output and is a lot easier to read if you only care about your one function.

PHP

Test script

mysql_connect("localhost", "root");
while(true){
    mysql_query("SELECT sleep(1)");
}

ltrace output

22:31:50.507523 zend_hash_find(0x025dc3a0, "mysql_query", 12) = 0 <0.000029>
22:31:50.507781 mysql_real_query(0x027bc540, "SELECT sleep(1)", 15) = 0 <1.000600>
22:31:51.508531 zend_hash_find(0x025dc3a0, "mysql_query", 12) = 0 <0.000025>
22:31:51.508675 mysql_real_query(0x027bc540, "SELECT sleep(1)", 15) = 0 <1.000926>

ltrace command

ltrace -ttTg -x zend_hash_find -x mysql_real_query -p [pid of script above]

Python

Test script

import MySQLdb
db = MySQLdb.connect("localhost", "root", "", "test")
cursor = db.cursor()
sql = """SELECT sleep(1)"""
while True:
	cursor.execute(sql)
	data = cursor.fetchone()
db.close()

ltrace output

22:24:39.104786 PyEval_SaveThread() = 0x21222e0 <0.000029>
22:24:39.105020 PyEval_SaveThread() = 0x21222e0 <0.000024>
22:24:39.105210 PyEval_SaveThread() = 0x21222e0 <0.000024>
22:24:39.105303 mysql_real_query(0x021d01d0, "SELECT sleep(1)", 15) = 0 <1.002083>
22:24:40.107553 PyEval_SaveThread() = 0x21222e0 <0.000026>
22:24:40.107713 PyEval_SaveThread() = 0x21222e0 <0.000024>
22:24:40.107909 PyEval_SaveThread() = 0x21222e0 <0.000025>
22:24:40.108013 mysql_real_query(0x021d01d0, "SELECT sleep(1)", 15) = 0 <1.001821>

ltrace command

ltrace -ttTg -x PyEval_SaveThread -x mysql_real_query -p [pid of script above]

Perl

Test script

#!/usr/bin/perl
use DBI;

$dsn = "DBI:mysql:database=test;host=localhost";
$dbh = DBI->connect($dsn, "root", "");
$drh = DBI->install_driver("mysql");
@databases = DBI->data_sources("mysql");
$sth = $dbh->prepare("SELECT SLEEP(1)");

while (1) {
  $sth->execute;
}

ltrace output

22:42:11.194073 Perl_push_scope(0x01bd3010) =  <0.000028>
22:42:11.194299 mysql_real_query(0x01bfbf40, "SELECT SLEEP(1)", 15) = 0 <1.000876>
22:42:12.195302 Perl_push_scope(0x01bd3010) =  <0.000024>
22:42:12.195408 mysql_real_query(0x01bfbf40, "SELECT SLEEP(1)", 15) = 0 <1.000967>

ltrace command

ltrace -ttTg -x mysql_real_query -x Perl_push_scope -p [pid of script above]

Ruby

Test script

require 'rubygems'
require 'sequel'

DB = Sequel.connect('mysql://root@localhost/test')

while true
  p DB['select sleep(1)'].select.first
  GC.start
end

snip of ltrace output

22:10:00.195814 garbage_collect()  = 0 <0.022194>
22:10:00.218438 mysql_real_query(0x02740000, "select sleep(1)", 15) = 0 <1.001100>
22:10:01.219884 garbage_collect() = 0 <0.021401>
22:10:01.241679 mysql_real_query(0x02740000, "select sleep(1)", 15) = 0 <1.000812>

ltrace command used:

ltrace -ttTg -x garbage_collect -x mysql_real_query -p [pid of script above]

Where to get it

How ltrace works normally

ltrace works by setting software breakpoints on entries in a process’ Procedure Linkage Table (PLT).

What is a software breakpoint?

A software breakpoint is just a special one-byte instruction (0xcc on the x86 and x86_64) that raises a debug interrupt (interrupt 3 on the x86 and x86_64). When interrupt 3 is raised, the CPU executes a handler installed by the kernel. The kernel then sends a signal to the process that generated the interrupt. (Want to know more about how signals and interrupts work? Check out an earlier blog post: here)
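
To see the mechanism in isolation, here is a tiny hypothetical C program (x86/x86_64 Linux only) that catches SIGTRAP and then executes the 0xcc instruction directly. A debugger or ltrace gets the same trap by overwriting a byte of the target’s code with 0xcc and letting execution reach it:

/* trapdemo.c -- hypothetical example; build: gcc -o trapdemo trapdemo.c */
#include <signal.h>
#include <unistd.h>

static void trap_handler(int sig)
{
  (void)sig;
  /* async-signal-safe output: write(), not printf() */
  const char msg[] = "caught SIGTRAP from int3\n";
  write(STDOUT_FILENO, msg, sizeof(msg) - 1);
  _exit(0);
}

int main(void)
{
  signal(SIGTRAP, trap_handler);

  /* int3 assembles to the single byte 0xcc -- the same byte a debugger
   * writes over an instruction to plant a software breakpoint          */
  __asm__ volatile("int3");

  return 1; /* never reached */
}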

What is a PLT and how does it work?

A PLT is a table of absolute addresses to functions. It is used because the link editor doesn’t know where functions in shared objects will be located. Instead, a table is created so that the program and the dynamic linker can work together to find and execute functions in shared objects. I’ve simplified the explanation a bit[1], but at a high level:

  1. The program calls a function in a shared object; the link editor has arranged for that call to jump to a slot in the PLT.
  2. The program sets some data up for the dynamic linker and then hands control over to it.
  3. The dynamic linker looks at the info set up by the program and fills in the absolute address of the function that was called in the PLT.
  4. Then the dynamic linker calls the function.
  5. Subsequent calls to the same function jump to the same slot in the PLT, but from then on the absolute address is already filled in, so the dynamic linker is not involved again.

Since all calls to library functions occur via the PLT, ltrace sets breakpoints on each PLT entry in a program.

Why ltrace didn’t work with libdl loaded libraries

Libraries loaded with libdl are loaded at run time and functions (and other symbols) are accessed by querying the dynamic linker (by calling dlsym()). The compiler and link editor don’t know anything about libraries loaded this way (they may not even exist!) and as such no PLT entries are created for them.

Since no PLT entries exist, ltrace can’t trace these functions.
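
Here’s a small hypothetical C example of the problem. It loads libm with dlopen() and calls cos() through a pointer returned by dlsym(). An unpatched ltrace would show the dlopen()/dlsym() calls (those do go through the executable’s PLT), but would never see the cos() call, because the executable has no PLT entry for cos:

/* dlsym_demo.c -- hypothetical example
 * build: gcc -o dlsym_demo dlsym_demo.c -ldl */
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
  /* load libm at run time; the link editor never heard about it */
  void *handle = dlopen("libm.so.6", RTLD_LAZY);
  if (!handle) {
    fprintf(stderr, "dlopen: %s\n", dlerror());
    return 1;
  }

  /* resolve cos() by asking the dynamic linker -- no PLT entry involved */
  double (*cosine)(double) = (double (*)(double))dlsym(handle, "cos");
  if (!cosine) {
    fprintf(stderr, "dlsym: %s\n", dlerror());
    return 1;
  }

  printf("cos(0.0) = %f\n", cosine(0.0));
  dlclose(handle);
  return 0;
}

This is exactly how Ruby, Python, PHP, and Perl load their native extensions, which is why those calls were invisible to stock ltrace.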

What needed to be done to make ltrace libdl-aware

OK, so we understand the problem. ltrace only sets breakpoints on PLT entries and libdl loaded libraries don’t have PLT entries. How can this be fixed?

Luckily, the dynamic linker and ELF all work together to save your ass.

Executable and Linking Format (ELF) is a file format for executables, shared libraries, and more[2]. The file format can get a bit complicated, but all you really need to know is: ELF consists of different sections which hold different types of entries. There is a section called .dynamic which has an entry named DT_DEBUG. This entry stores the address of a debugging structure in the address space of the process. In Linux, this struct has type struct r_debug.

How to use struct r_debug to win the game

The debug structure is updated by the dynamic linker at runtime to reflect the current state of shared object loading. The structure contains 3 things that will help us in our quest:

  1. state – the current state of the mapping change taking place (begin add, begin delete, consistent)
  2. brk – the address of a function internal to the dynamic linker that will be called when the linker maps, unmaps, or has completed mapping a shared object.
  3. link map – Pointer to the start of a list of currently loaded objects. This list is called the link map and is represented as a struct link_map in Linux.
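
To get a feel for this structure, here’s a small hypothetical C program (glibc-specific: glibc happens to export the debug structure as the symbol _r_debug) that prints the state and brk address from items 1 and 2 and then walks the link map from item 3:

/* linkmap.c -- hypothetical example; build: gcc -o linkmap linkmap.c */
#define _GNU_SOURCE
#include <link.h>   /* struct r_debug, struct link_map */
#include <stdio.h>

/* glibc exports the debug structure described above as _r_debug */
extern struct r_debug _r_debug;

int main(void)
{
  printf("r_state = %d, r_brk = %#lx\n",
         (int)_r_debug.r_state, (unsigned long)_r_debug.r_brk);

  /* walk the link map: one entry per currently loaded object */
  for (struct link_map *lm = _r_debug.r_map; lm != NULL; lm = lm->l_next)
    printf("%#14lx  %s\n", (unsigned long)lm->l_addr,
           lm->l_name[0] ? lm->l_name : "[main executable]");

  return 0;
}

ltrace’s libdl support works with the same data, but gets at it the debugger way: by reading DT_DEBUG out of the traced process’s .dynamic section rather than linking against the structure directly.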

Tie it all together and bring it home

To add support for libdl loaded libraries to ltrace, the steps are:

  1. Find the address of the debug structure in the .dynamic section of the program.
  2. Set a software breakpoint on brk.
  3. When the dynamic linker updates the link map, it will trigger the software breakpoint.
  4. When the breakpoint is triggered, check state in the debug structure.
  5. If a new library has been added, walk the link map and figure out what was added.
  6. Search the added library’s symbol table for the symbols we care about.
  7. Set software breakpoints on whatever is found.
  8. Steps 3-7 repeat.

That isn’t too hard, all thanks to the dynamic linker providing a way for us to hook into its internal events.

Conclusion

  • Read the System V ABI for your CPU. It is filled with insanely useful information that can help you be a better programmer.
  • Use the source. A few times while hacking on this patch I looked through the source for GDB and glibc to help me figure out what was going on.
  • Understanding how things work at a low-level can help you build tools to solve your high-level problems.

Thanks for reading and don’t forget to subscribe (via RSS or e-mail) and follow me on twitter.

References

  1. System V Application Binary Interface, AMD64 Architecture Processor Supplement, p. 78
  2. Executable and Linking Format (ELF) Specification

Written by Joe Damato

October 8th, 2009 at 4:59 am

a/b test mallocs against your memory footprint



The other day at Kickball Labs we were discussing whether linking Ruby against tcmalloc (or ptmalloc3, nedmalloc, or any other malloc) would have any noticeable effect on application latency. After taking a side in the argument, I started wondering how we could test this scenario.

We had a couple different ideas about testing:

  • Look at other people’s benchmarks
    BUT do the memory workloads tested in the benchmarks actually match our own workload at all?
  • Run different allocators on different Ruby backends
    BUT different backends will get different users who will use the system differently and cause different allocation patterns
  • Try to recreate our application’s memory footprint and test that against different mallocs
    BUT how?

I decided to explore the last option and came up with an interesting solution. Let’s dive into how to do this.

Get the code:

http://github.com/ice799/malloc_wrap/tree/master

Step 1: We need to get a memory footprint of our process

So we have some random binary (in this case it happens to be a Ruby interpreter, but it could be anything) and we’d like to track when it calls malloc/realloc/calloc and free (from now on I’ll refer to all of these as malloc-family functions for brevity). There are two ways to do this: the right way and the wrong/hacky/unsafe way.

  • The “right” way to do this, with libc malloc hooks:

    Edit your application code to use the malloc debugging hooks provided by libc. When a malloc-family function is called, your hook executes and outputs to a file which function was called and what arguments were passed to it.

  • The “wrong/hacky/unsafe” way to do this, with LD_PRELOAD:

    Create a shim library and point LD_PRELOAD at it. The shim exports the malloc-family symbols, and when your application calls one of those functions, the shim code gets executed. The shim logs which function was called and with what arguments. The shim then calls the libc version of the function (so that memory is actually allocated/freed) and returns control to the application.

I chose to do it the second way, because I like living on the edge. The second way is unsafe because you can’t call any functions which use a malloc-family function before your hooks are set up. If you do, you can end up in an infinite loop and crash the application.

You can check out my implementation for the shim library here: malloc_wrap.c
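
For reference, here’s a stripped-down, hypothetical sketch of the LD_PRELOAD approach. It is not malloc_wrap.c: it only wraps malloc and free, it uses a made-up log format, and it glosses over the bootstrap corner cases (dlsym itself can allocate) that a real shim has to worry about:

/* shim.c -- hypothetical sketch of an LD_PRELOAD malloc wrapper
 * build: gcc -O2 -Wall -fPIC -shared -o shim.so shim.c -ldl
 * use:   LD_PRELOAD=/path/to/shim.so /path/to/your/binary -your -args */
#define _GNU_SOURCE          /* for RTLD_NEXT */
#include <dlfcn.h>
#include <stdio.h>
#include <unistd.h>

static void *(*real_malloc)(size_t) = NULL;
static void (*real_free)(void *)    = NULL;

static void resolve_real(void)
{
  /* RTLD_NEXT = "the next definition after this shim", i.e. libc's */
  real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");
  real_free   = (void (*)(void *))dlsym(RTLD_NEXT, "free");
}

void *malloc(size_t size)
{
  if (!real_malloc)
    resolve_real();
  void *ptr = real_malloc(size);

  /* log with a stack buffer and write(): calling printf() here could
   * allocate and recurse straight back into this shim               */
  char buf[64];
  int n = snprintf(buf, sizeof(buf), "m %p %zu\n", ptr, size);
  if (n > 0)
    write(STDERR_FILENO, buf, (size_t)n);
  return ptr;
}

void free(void *ptr)
{
  if (!real_free)
    resolve_real();

  char buf[64];
  int n = snprintf(buf, sizeof(buf), "f %p\n", ptr);
  if (n > 0)
    write(STDERR_FILENO, buf, (size_t)n);
  real_free(ptr);
}

The real shim also wraps calloc and realloc and writes its log to /tmp/malloc-footprint.pid, which is what the converter in step 2 consumes.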

Why does your shim output such weirdly formatted data?

The answer is sort of complicated, but let’s keep it simple: I originally had a different idea about how I was going to use the output. When that first try failed, I tried something else and translated the data to the format I needed, instead of re-writing the shim. What can I say, I’m a lazy programmer.

OK, so once you’ve built the shim (gcc -O2 -Wall -ldl -fPIC -o malloc_wrap.so -shared malloc_wrap.c), you can launch your binary like this:

% LD_PRELOAD=/path/to/shim/malloc_wrap.so /path/to/your/binary -your -args

You should now see output in /tmp/malloc-footprint.pid

Step 2: Translate the data into a more usable format

Yeah, I should have gone back and re-written the shim, but nothing happens exactly as planned. So, I wrote a quick Ruby script to convert my output into a more usable format. The script sorts through the output and renames memory addresses to unique integer IDs starting at 1 (0 is hardcoded to NULL).

The format is pretty simple. The first line of the file has the number of calls to malloc-family functions, followed by a blank line, and then the memory footprint. Each line of the memory footprint starts with a single character representing the function called, followed by its arguments. For free(), there is only one argument: the ID of the memory block to free. For malloc/calloc/realloc, the first argument after the character is always the ID of the return value, and the remaining arguments are the arguments usually passed to malloc/calloc/realloc, in the same order.

Have a look at my ruby script here: build_trace_file.rb

It might take a while to convert your data to this format; I suggest running this in a screen session, especially if your memory footprint data is large. Just as a warning, we collected 15 *gigabytes* of data over a 10 hour period. This script took *10 hours* to convert the data. We ended up with a 7.8 gigabyte file.

% ruby /path/to/script/build_trace_file.rb /path/to/raw/malloc-footprint.PID /path/to/converted/my-memory-footprint

Step 3: Replay the allocation data with different allocators and measure time, memory usage.

OK, so we now have a file which represents the memory footprint of our application. It’s time to build the replayer, link against your malloc implementation of choice, fire it up and start measuring time spent in allocator functions and memory usage.

Have a look at the replayer here: alloc_tester.c
Build the replayer: gcc -ggdb -Wall -ldl -fPIC -o tester alloc_tester.c

Use ltrace

ltrace is similar to strace, but for library calls. You can use ltrace -c to sum the amount of time spent in different library calls and output a cool table at the end; it will look something like this:

% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
 86.70   37.305797          62    600003 fscanf
 10.64    4.578968          33    138532 malloc
  2.36    1.014294          18     55263 free
  0.25    0.109550          18      5948 realloc
  0.03    0.011407          45       253 printf
  0.02    0.010665          42       252 puts
  0.00    0.000167          20         8 calloc
  0.00    0.000048          48         1 fopen
------ ----------- ----------- --------- --------------------
100.00   43.030896                800260 total

Conclusion

Using a different malloc implementation can provide speed or memory improvements depending on your allocation patterns. Hopefully the code provided will help you test different allocators to determine whether or not swapping out the default libc allocator is the right choice for you. Our results are still pending; we had a lot of allocator data (15 GB!) and it takes several hours to replay the data with just one malloc implementation. Once we’ve gathered some data about the different implementations and their effects, I’ll post the results and some analysis. As always, stay tuned and thanks for reading!

Written by Joe Damato

March 16th, 2009 at 8:39 pm