Comments:"Attack of the Cosmic Rays! (Ksplice Blog)"
URL:https://blogs.oracle.com/ksplice/entry/attack_of_the_cosmic_rays1
It's a well-documented fact that RAM in modern computers is susceptible to occasional random bit flips due to various sources of noise, most commonly high-energy cosmic rays. By some estimates, you can even expect error rates as high as one error per 4GB of RAM per day! Many servers these days have ECC RAM, which uses extra bits to store error-correcting codes that let them correct most bit errors, but ECC RAM is still fairly rare in desktops, and unheard-of in laptops.
For me, bitflips due to cosmic rays are one of those problems I always assumed happen to "other people". I also assumed that even if I saw random cosmic-ray bitflips, my computer would probably just crash, and I'd never really be able to tell the difference from some random kernel bug.
A few weeks ago, though, I encountered some bizarre behavior on my desktop that honestly just didn't make sense. I spent about half an hour digging to figure out what had gone wrong, and eventually determined, conclusively, that my problem was a single undetected flipped bit in RAM. I can't prove whether the problem was due to cosmic rays, bad RAM, or something else, but in any case, I hope you find this story interesting and informative.
The problem
The symptom that I observed was that the expr program, used by shell scripts to do basic arithmetic, had started consistently segfaulting. This first manifested itself when I tried to build a software project, since the GNU autotools make heavy use of this program:
[nelhage@psychotique]$ autoreconf -fvi
autoreconf: Entering directory `.'
autoreconf: configure.ac: not using Gettext
autoreconf: running: aclocal --force -I m4
autoreconf: configure.ac: tracing
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
…
dmesg revealed that the segfaulting program was expr:
psychotique kernel: [105127.372705] expr[7756]: segfault at 1a70 ip 0000000000001a70 sp 00007fff2ee0cc40 error 4 in expr
And I was easily able to reproduce the problem by hand:
[nelhage@psychotique]$ expr 3 + 3
Segmentation fault
expr definitely hadn't been segfaulting as of a day ago or so, so something had clearly gone suddenly, and strangely, wrong. I had no idea what, but I decided to find out.
Check the dumb things
I run Ubuntu, so the first things I checked were the /var/log/dpkg.log and /var/log/aptitude.log files, to determine whether any suspicious packages had been upgraded recently. Perhaps Ubuntu accidentally let a buggy package slip into the release. I didn't recall doing any significant upgrades, but maybe dependencies had pulled in an upgrade I had missed.
The logs revealed I hadn't upgraded anything of note in the last several days, so that theory was out.
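For the record, that check amounts to a one-liner. This is my own sketch, not a command from the original post; it relies on dpkg logging each upgrade as a line containing the word "upgrade":

[nelhage@psychotique]$ grep ' upgrade ' /var/log/dpkg.log | tail -5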
Next up, I checked env | grep ^LD. The dynamic linker takes input from a number of environment variables, all of whose names start with LD_. Was it possible I had somehow ended up setting some variable that was messing up the dynamic linker, causing it to link a broken library or something?
[nelhage@psychotique]$ env | grep ^LD
[nelhage@psychotique]$
That, too, turned up nothing.
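Had anything turned up there, the dynamic linker can be asked to narrate how it is being steered. This is an aside of mine, not part of the original post; LD_DEBUG=libs makes ld.so print its library search decisions to stderr:

[nelhage@psychotique]$ LD_DEBUG=libs expr 3 + 3 2>&1 | head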
Start digging
I was fortunate in that, although this failure was strange and sudden, it seemed perfectly reproducible, which meant I had the luxury of being able to run as many tests as I wanted in order to debug it.
The problem was a segfault, so I decided to pull up a debugger and figure out where it was segfaulting. First, though, I'd want debug symbols, so I could make heads or tails of the crashed program. Fortunately, Ubuntu provides debug symbols for every package they ship, in a separate repository. I already had the debug sources enabled, so I used dpkg -S to determine that expr belongs to the coreutils package:
[nelhage@psychotique]$ dpkg -S $(which expr)
coreutils: /usr/bin/expr
And installed the coreutils debug symbols:
[nelhage@psychotique]$ sudo aptitude install coreutils-dbgsym
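As an aside: if you don't already have the debug-symbol archive enabled, it lives in a separate "ddebs" repository. Roughly, and from my own recollection rather than from the original post, enabling it on the Ubuntu of that era looked like this (you also need to import the archive's signing key so apt trusts it):

[nelhage@psychotique]$ echo "deb http://ddebs.ubuntu.com $(lsb_release -cs) main restricted universe multiverse" | sudo tee /etc/apt/sources.list.d/ddebs.list
[nelhage@psychotique]$ sudo aptitude update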
Now, I could run expr inside gdb, catch the segfault, and get a stack trace:
[nelhage@psychotique]$ gdb --args expr 3 + 3
…
(gdb) run
Starting program: /usr/bin/expr 3 + 3

Program received signal SIGSEGV, Segmentation fault.
0x0000000000001a70 in ?? ()
(gdb) bt
#0  0x0000000000001a70 in ?? ()
#1  0x0000000000402782 in eval5 (evaluate=true) at expr.c:745
#2  0x00000000004027dd in eval4 (evaluate=true) at expr.c:773
#3  0x000000000040291d in eval3 (evaluate=true) at expr.c:812
#4  0x000000000040208d in eval2 (evaluate=true) at expr.c:842
#5  0x0000000000402280 in eval1 (evaluate=<value optimized out>) at expr.c:921
#6  0x0000000000402320 in eval (evaluate=<value optimized out>) at expr.c:952
#7  0x0000000000402da5 in main (argc=2, argv=0x0) at expr.c:329
So, for some reason, the eval5 function has jumped off into an invalid memory address, which of course causes a segfault. Repeating the test a few times confirmed that the crash was totally deterministic, with the same stack trace each time. But what is eval5 trying to do that's causing it to jump off into nowhere? Let's grab the source and find out:
[nelhage@psychotique]$ apt-get source coreutils
[nelhage@psychotique]$ cd coreutils-7.4/src/
[nelhage@psychotique]$ gdb --args expr 3 + 3
# Run gdb, wait for the segfault
(gdb) up
#1  0x0000000000402782 in eval5 (evaluate=true) at expr.c:745
745           if (nextarg (":"))
(gdb) l
740       trace ("eval5");
741     #endif
742       l = eval6 (evaluate);
743       while (1)
744         {
745           if (nextarg (":"))
746             {
747               r = eval6 (evaluate);
748               if (evaluate)
749                 {
I used the apt-get source command to download the source package from Ubuntu, and ran gdb in the source directory, so it could find the files referred to by the debug symbols. I then used the up command in gdb to go up a stack frame, to the frame where eval5 called off into nowhere.
From the source, we see that eval5 is trying to call the nextarg function. gdb will happily tell us where that function is supposed to be located:
(gdb) p nextarg
$1 = {_Bool (const char *)} 0x401a70 <nextarg>
Comparing that address with the address in the stack trace above, we see that they differ by a single bit. So it appears that somewhere a single bit has been flipped, causing that call to go off into nowhere.
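As a quick sanity check (mine, not from the original post), XORing the two addresses confirms that exactly one bit differs between them:

[nelhage@psychotique]$ printf '%#x\n' $(( 0x401a70 ^ 0x1a70 ))
0x400000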
But why?
So there's a flipped bit. But why, and how did it happen? First off, let's determine where the problem is. Is it in the expr binary itself, or is something more subtle going on?
[nelhage@psychotique]$ debsums coreutils | grep FAILED
/usr/bin/expr FAILED
The debsums program compares checksums of files on disk with a manifest contained in the Debian package they came from. In this case, examining the coreutils package, we see that the expr binary has in fact been modified since it was installed. We can see exactly how it differs by downloading a pristine copy of the package and comparing the files:
[nelhage@psychotique]$ aptitude download coreutils
[nelhage@psychotique]$ mkdir coreutils
[nelhage@psychotique]$ dpkg -x coreutils_7.4-2ubuntu1_amd64.deb coreutils
[nelhage@psychotique]$ cmp -bl coreutils/usr/bin/expr /usr/bin/expr
10113 377 M-^? 277 M-?
aptitude download downloads a .deb package, instead of actually installing it. I used dpkg -x to just extract the contents of the package, and cmp to compare the packaged expr with the installed one. -b tells cmp to list any bytes that differ, and -l tells it to list all differences, not just the first one. So we can see that a single byte differs, and by a single bit (octal 0377 in the packaged binary versus 0277 in the installed one), which agrees with the failure we saw. So somehow the installed expr binary is corrupted.
So how did that happen? We can check the mtime ("modified time") field on the program to determine when the file on disk was modified (assuming, for the moment, that whatever modified it didn't fix up the mtime, which seems unlikely):
[nelhage@psychotique]$ ls -l /usr/bin/expr
-rwxr-xr-x 1 root root 111K 2009-10-06 07:06 /usr/bin/expr*
Curious. The mtime on the binary is from last year, presumably from whenever it was built by Ubuntu, and preserved by the package manager when the package was installed. So unless something really fishy is going on, the binary on disk hasn't been touched.
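One extra check I like here, though it isn't in the original post: plain stat also prints the ctime (the "Change" line), which, unlike the mtime, can't simply be set back from userspace by whatever touched the file:

[nelhage@psychotique]$ stat /usr/bin/expr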
Memory is a tricky thing.
But hold on. I have 12GB of RAM on my desktop, most of which, at any moment, is being used by the operating system to cache the contents of files on disk. expr is a pretty small program, and frequently used, so there's a good chance it will be entirely in cache, and my OS has basically never touched the disk to load it since it first did so, probably when I booted my computer. So it's likely that this corruption is entirely in memory. But how can we test that? Simple: by forcing the OS to discard the cached version and re-read it from disk.
On Linux, we can do this by writing to the /proc/sys/vm/drop_caches file, as root. We'll take a checksum of the binary first, drop the caches, and compare the checksum after forcing it to be re-read:
[nelhage@psychotique]$ sha256sum /usr/bin/expr
4b86435899caef4830aaae2bbf713b8dbf7a21466067690a796fa05c363e6089  /usr/bin/expr
[nelhage@psychotique]$ echo 3 | sudo tee /proc/sys/vm/drop_caches
3
[nelhage@psychotique]$ sha256sum /usr/bin/expr
5dbe7ab7660268c578184a11ae43359e67b8bd940f15412c7dc44f4b6408a949  /usr/bin/expr
[nelhage@psychotique]$ sha256sum coreutils/usr/bin/expr
5dbe7ab7660268c578184a11ae43359e67b8bd940f15412c7dc44f4b6408a949  coreutils/usr/bin/expr
And behold, the file changed. The corruption was entirely in memory. And, furthermore, expr no longer segfaults, and matches the version I downloaded earlier.
(The sudo tee idiom is a common one for writing to a file as root from a normal user shell. sudo echo 3 > /proc/sys/vm/drop_caches won't work, of course, because the redirection is performed by my shell, which doesn't have the required permissions; only the echo runs under sudo.)
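An equivalent spelling, not used in the post, is to run the redirection itself inside a root shell:

[nelhage@psychotique]$ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'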
Conclusion
As I mentioned earlier, I can't prove this was due to a cosmic ray, or even to a hardware error. It could have been some bug in my kernel that accidentally did a wild write into my memory in a way that happened to flip only a single bit. But that'd be a pretty weird bug.
And in fact, since that incident, I've had several other, similar problems. I haven't gotten around to memtesting my machine, but that does suggest I might just have a bad RAM chip on my hands. But even with bad RAM, I'd guess that flipped bits come from noise somewhere -- a bad chip is just susceptible to lower levels of noise. So it could just mean I'm more susceptible to the low-energy cosmic rays that are always falling. Regardless of the cause, though, I hope this post inspires you to think about the dangers of your RAM corrupting your work, and that the tale of my debugging helps you learn some new tools that you might find useful some day.

Now that I've written this post, I'm going to go memtest my machine and check prices on ECC RAM. In the meanwhile, leave your stories in the comments -- have you ever tracked a problem down to memory corruption? What are your practices for coping with the risk of these problems?
Edited to add a note that this could well just be bad RAM, in addition to a one-off cosmic-ray event.