time to bleed by Joe Damato

technical ramblings from a wanna-be unix dinosaur

MySQL Doesn’t Always Suck; This Time it’s AMD



UPDATED! Added an explanation below of how to check whether you are affected. Thanks to vicaya on Hacker News for commenting about this.

This article takes a look at a rather nasty low-level hardware bug that can bite people writing multi-threaded applications. The bug manifests itself in MySQL, as we’ll see a bit later. Part of the blame for the nastiness of this bug certainly lies with a third party who should have (and could have) publicly released a detailed erratum, as we’ll see in a bit.

Consider this article a bit of a thought experiment; when you read this article ask yourself: “How would I have debugged this?”

A quick cautionary note: We’re going to look at a little bit of assembly. If you don’t know assembly, don’t worry.

Starting at the atom

An important concept when building complex systems with multiple threads or processes is atomicity. Very briefly, atomicity is the idea that some set of instructions, when executed, appears to the rest of the system as a single indivisible unit.

The Intel x86 architecture has two ways to execute atomic operations:

  • guaranteed atomic operations
  • system bus locking – a processor asserts a signal and other processors are unable to gain control of the system bus during this time

Guaranteed atomic operations

The less interesting of the two. On the Intel 486 and above:

  • Reading or writing a byte
  • Reading or writing a word aligned on a 16-bit boundary
  • Reading or writing a doubleword aligned on a 32-bit boundary

are all guaranteed to be atomic.

[NOTE: There are a few more guaranteed atomic operations for the Pentium and P6 families, but I won't go into the details of them here as there is way too much bleeding involved. If you are the type of person who runs toward a burning building check out: Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3A Chapter 7 for more information about the other guaranteed atomic operations.]

This list is good, but it doesn’t give us enough power to actually implement a synchronization primitive. You can imagine two processes competing for a resource: one process reads a flag to check whether the resource is available, but before it has time to flip the flag, the other process reads the same value, and now both believe they own the resource.
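
To make the race concrete, here is a hypothetical sketch in C (available and broken_claim are made-up names, not from any real code); the read and the write are two separate operations, and another thread can sneak in between them:

volatile int available = 1;     /* 1 = the resource is free */

void broken_claim(void)
{
    while (!available)          /* read: wait until the resource looks free */
        ;
    /* another thread can run right here, also see available == 1 ...      */
    available = 0;              /* write: claim it; by now both threads may
                                   believe they own the resource            */
}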

We need a little more juice: we’d like to be able to read, write, and maybe compare as one indivisible operation. This is where bus locking and two x86 instructions step in to save the day.

Bus locking

There are two instructions on the x86 which are commonly used as atomic operations:

  • xchg

    exchange a source and destination.

  • cmpxchg

    compare the EAX register with the destination; if they are equal, load the source into the destination. Otherwise, load the destination into EAX.

And there are two flavors of bus locking:

  • automatic bus locking – the bus is locked automatically whenever xchg and certain other special instructions are executed
  • software controlled bus locking – a small set of instructions (cmpxchg, dec, and others) can be executed atomically if preceded with the lock prefix

OK, cool. Now we’re talking. We have enough firepower to build some synchronization primitives.

Let’s build a mutex

Let’s build a simple mutex using cmpxchg.

Assume a function exists:

/**
 * compare old with the destination memory. if they match, put new
 *  in the destination and return old. Otherwise, return new.
 */

int cmpxchg(void *memory, int old, int new);

This implementation is left as an exercise to the reader. Wow, I’ve always wanted to say that.
(Protip: If you don’t feel like bleeding, you can just: #include <asm/system.h> which will take care of it for you on Linux.)
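
For the curious, here is a minimal sketch of one possible implementation using GCC-style inline assembly and the lock-prefixed cmpxchg instruction discussed above (the wrapper and constraint choices are mine, not from this article, and it assumes 32-bit ints on x86):

static inline int cmpxchg(void *memory, int old, int new)
{
    int prev = old;

    /* lock cmpxchg: compare %eax (old) with *memory; if equal, store new
     * into *memory, otherwise load *memory into %eax. The lock prefix is
     * what makes the read-compare-write sequence atomic. */
    __asm__ __volatile__("lock; cmpxchgl %2, %1"
                         : "+a" (prev), "+m" (*(int *)memory)
                         : "r" (new)
                         : "memory", "cc");

    return (prev == old) ? old : new;   /* old on success, new on failure */
}

GCC’s __sync_val_compare_and_swap() builtin (mentioned in the comments below) gets you essentially the same thing without any hand-written assembly.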

Then we can implement a stupid and simple spinlock like this:

typedef int mutex;
void mutex_init (mutex *m) { *m = 1; }
void mutex_lock (mutex *m) { while (!cmpxchg(m, 1, 0));  }
void mutex_unlock (mutex *m) { *m = 1; }

That was pretty easy! Admittedly, it is a pretty stupid implementation. You could imagine adding more features, and even building more complex synchronization primitives (semaphores, barriers, read-write locks, etc). Go crazy and build them out.
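
To see the mutex do some work, here is a hedged usage sketch (it assumes the mutex type and functions above, a working cmpxchg(), and pthreads; worker and counter are made-up names): two threads hammer a shared counter and the lock keeps the increments from racing.

#include <pthread.h>
#include <stdio.h>

static mutex m;
static int counter;

/* Each thread performs one million locked increments of the shared counter. */
static void *worker(void *arg)
{
    int i;
    (void) arg;
    for (i = 0; i < 1000000; i++) {
        mutex_lock(&m);
        counter++;                /* read-modify-write protected by the lock */
        mutex_unlock(&m);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    mutex_init(&m);
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("counter = %d\n", counter);    /* should print 2000000 every time */
    return 0;
}

Build with something like gcc -pthread; without the lock, the two unsynchronized increments would race and the final count would usually come up short.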

Synchronization bugs are nasty

So we’ve talked a little bit about how atomic operations work, and we’ve built a simple mutex. Complex multi-threaded applications can use lots of synchronization primitives to control access to a shared resource (like a piece of memory, or a file, etc). As you can imagine, weird bugs can crop up like deadlock, livelock, and memory corruption if synchronization primitives are used incorrectly.

A large multi-threaded application used by lots of people in the Web 2.0 world is MySQL. MySQL, like any other application, has bugs. Take a look at an interesting bug report filed on Feb 5, 2007 regarding a MySQL crash: http://bugs.mysql.com/bug.php?id=26081

So, it looks as if an assertion is being tripped in the source code, but the code in the second update to the bug looks very straightforward. Maybe someone else is writing all over the mutex and corrupting its state? Definitely possible, but there is something strange about the bug reports that start to roll in afterward.

It seems that everyone hitting this bug is using AMD Opteron CPUs. Maybe this is a hardware bug?

AMD Opteron bug causes a MySQL crash

It turns out that MySQL was crashing because of a hardware bug in AMD Opteron Revision E and pre-release Revision F processors.

No detailed official errata document has been published about this, but we can find clues in the OpenSolaris mutex implementation.

From OpenSolaris’ IA-32 lock_prim.S file (NOTE: AT&T syntax):

560 	ENTRY_NP(mutex_enter)
...
563 	lock
564 	cmpxchgq %rdx, (%rdi)
565 	jnz	mutex_vector_enter
566 	.mutex_enter_lockstat_patch_point:
567 	#if defined(OPTERON_WORKAROUND_6323525)
568 	.mutex_enter_6323525_patch_point:
569 	ret					/* nop space for lfence */
570 	nop
571 	nop
....

Note that on line 563 the lock prefix is used, so the cmpxchg instruction should be atomic. But something weird is going on here: notice how line 567 checks for a macro called OPTERON_WORKAROUND_6323525? That looks suspicious, as does the comment on line 569. The nop instruction is a no-op; it doesn’t do anything. In this case it looks as if the nop instructions are being used to set aside space that will be overwritten later with an lfence if you have an affected AMD CPU.

If you look up lfence and nop in the Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2A: Instruction Set Reference, A-M, you can see that the nop instruction is 1 byte wide and the lfence instruction is 3 bytes wide.

Looks like the OpenSolaris guys left just enough room for one lfence: the ret plus the two nops add up to exactly 3 bytes.

If we dig a little deeper in lock_prim.S, we discover a helpful comment:

1133 /*
1134  * patch_workaround_6323525: provide workaround for 6323525
1135  *
1136  * The workaround is to place a fencing instruction (lfence) between the
1137  * mutex operation and the subsequent read-modify-write instruction.
1138  *
1139  * This routine hot patches the lfence instruction on top of the space
1140  * reserved by nops in the lock enter routines.
1141  */

Wow.

That is extremely painful. So it turns out that an lfence instruction is needed after the cmpxchg. lfence (load fence) forces all memory reads preceding the lfence to complete before any memory reads following it. Without the lfence, a read that comes after the cmpxchg in program order can be satisfied before the cmpxchg completes, and it can end up returning stale/incomplete state information.

The workaround hot patches the kernel at runtime, writing an lfence instruction over the space set aside on lines 569 through 571.
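
In the toy spinlock from earlier, the equivalent fix would be to drop a fence in right after the atomic acquire. A hedged sketch, assuming GCC-style inline assembly (on an unaffected CPU the fence is just a few wasted cycles):

void mutex_lock(mutex *m)
{
    while (!cmpxchg(m, 1, 0))
        ;                                         /* spin until we win the lock */
    __asm__ __volatile__("lfence" ::: "memory");  /* keep reads in the critical
                                                     section from being satisfied
                                                     before the cmpxchg completes */
}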

Thankfully, MySQL is well written and a runtime assertion caught this bug before it could do real damage or corrupt data.

But what about your multi-threaded code? Is it equally well-written? Would you have caught the same hardware bug with runtime assertions? Now is definitely a good time to go add those assertions you’ve been putting off.

Current state of affairs of this bug

This bug is fixed in recent versions of OpenSolaris, and it is unclear whether or not it is currently an issue on Linux. Depending on which version of the kernel you are running, Linux uses an atomic decrement (dec) in the mutex fast path and an xchg in the slow path. With the lack of official public errata about this issue, it is difficult to tell which other instructions could be affected.

How to check if you are affected

Special thanks to vicaya on hacker news for suggesting this!

If you are using Linux you can simply check /proc/cpuinfo:

$ cat /proc/cpuinfo

If your CPU family is 15 and your model is between 32 and 63 inclusive, your processor is buggy!
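
If you would rather not eyeball the output, an awk one-liner along these lines (a sketch; it assumes the usual /proc/cpuinfo field layout and a homogeneous system) will do the comparison for you:

$ awk -F: '/^cpu family/ { fam = $2 } /^model[ \t]*:/ { mod = $2 }
  END { if (fam + 0 == 15 && mod + 0 >= 32 && mod + 0 <= 63)
            print "family/model in the affected Opteron Rev E range";
        else
            print "not in the affected family/model range" }' /proc/cpuinfo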

Conclusion

There are a couple things to take away from this:

  1. It is pretty surprising how many different kinds of bugs exist in really important areas like atomic operations. It is truly amazing that computers actually boot and do useful things even some of the time.
  2. Staying sharp with your low level skills is still important, even in the Web 2.0 world.
  3. CPUs aren’t normally associated with such subtle bugs, but remember that CPUs are just like any other system component and they have bugs, too.
  4. Last, but not least: Stay away from AMD Opteron Revision E!

Written by Joe Damato

April 6th, 2009 at 12:31 am


  • silverjam

    The kernel bug:

    http://bugzilla.kernel.org/sho...

    Which references an AMD "errata 147" from "Revision Guide for AMD Athlon™ 64 and AMD Opteron™ Processors."

    http://support.amd.com/us/Proc...

    Which says:
    """
    Potential Violation of Read Ordering Rules Between Semaphore Operations and Unlocked Read-Modify-Write Instructions

    Description

    Under a highly specific set of internal timing circumstances, the memory read ordering between a semaphore operation and a subsequent read-modify-write instruction (an instruction that uses the same memory location as both a source and destination) may be incorrect and allow the read-modify-write instruction to operate on the memory location ahead of the completion of the semaphore operation. The erratum will not occur if there is a LOCK prefix on the read-modify-write instruction. This erratum does not apply if the read-only value in MSRC001_1023h[33] is 1b.

    Potential Effect on System

    In the unlikely event that the condition described above occurs, the read-modify-write instruction (in the critical section) may operate on data that existed prior to the semaphore operation. This erratum can only occur in multiprocessor or multicore configurations.

    Suggested Workaround

    To provide a workaround for this unlikely event, software can perform any of the following actions for multiprocessor or multicore systems:
    • Place a LFENCE instruction between the semaphore operation and any subsequent read-modify-write instruction(s) in the critical section.
    • Use a LOCK prefix with the read-modify-write instruction.
    • Decompose the read-modify-write instruction into separate instructions.

    No workaround is necessary if software checks that MSRC001_1023h[33] is set on all processors that may execute the code. The value in MSRC001_1023h[33] may not be the same on all processors in a multi-processor system.

    Fix Planned: Yes
    """

  • bd_

    Patching this on Linux would be complicated by the fact that some synchronization primitives are implemented partially in user-space libraries - the kernel would need to be able to patch this, but glibc would need to do so as well, and so would anything else that uses the builtin atomic primitives...

  • Joe Damato

    @brian - thanks for the tip! didn't know there were builtins for atomics in gcc. Very cool.

  • Brian

    __sync_val_compare_and_swap() for non-bleeding on not just linux, but all OSs that gcc supports (on architectures that have an atomic instruction for it).

    http://gcc.gnu.org/onlinedocs/...
