Archive for May, 2011
Executive summary: Don’t use btrfs on Debian Squeeze.
Longer summary: Don’t use btrfs RAID with the kernel Debian Squeeze comes with.
About six months ago, I set up a new server to handle this web site, mail, and various other things. The system and most services (including web and mail) were set up to use an MD RAID 1 array across two small partitions on two separate disks, and the remaining space was set up as three different btrfs file systems:
- One btrfs RAID 0 for shared data I wouldn’t mind having offline while fixing issues on one disk
- One btrfs RAID 1 for shared data I would mind having offline while fixing issues on one disk
- One last btrfs RAID 0 for entirely throwable things such as build chroots
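For the record, a setup along these lines can be created with mdadm and mkfs.btrfs; the device names and partition layout below are illustrative, not the exact ones used here:

```shell
# Hypothetical layout: sd[ab]1 for the system RAID 1, the rest for btrfs.
# MD RAID 1 for the system and main services:
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1

# btrfs RAID 0 across two partitions (data and metadata both striped):
mkfs.btrfs -d raid0 -m raid0 /dev/sda2 /dev/sdb2

# btrfs RAID 1 across two partitions (data and metadata both mirrored):
mkfs.btrfs -d raid1 -m raid1 /dev/sda3 /dev/sdb3
```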
Three days ago, this happened:
May 10 10:18:04 goemon kernel: [3545898.548311] ata4: hard resetting link
May 10 10:18:04 goemon kernel: [3545898.867556] ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
May 10 10:18:04 goemon kernel: [3545898.874973] ata4.00: configured for UDMA/33
followed by other ATA-related messages, then garbage such as:
May 10 10:18:07 goemon kernel: [3545901.28123] sd3000 d]SneKy:AotdCmad[urn][ecitr
May 10 10:18:07 goemon kernel: 4[550.821 ecio es aawt es ecitr i e)
May 10 10:18:07 goemon kernel: 6[550.824 20 00 00 00 00 00 00 00 <>3491225 16 44 <>3491216]s ::::[d]Ad es:N diinlsneifrain<>3491216]s ::::[d]C:Ra(0:2 00 03 80 06 0<>3491217]edrqet / ro,dvsb etr2272
May 10 10:18:07 goemon kernel: 3[550.837 ad:sb:rshdln etr2252
May 10 10:18:07 goemon kernel: 6s ::::[d]Rsl:hsbt=I_Kdiebt=RVRSNE<>3491215]s ::::[d]SneKy:AotdCmad[urn][ecitr
May 10 10:18:07 goemon kernel: 4[550.833 ecitrsnedt ihsnedsrpos(nhx:<>3491216] 7 b0 00 00 c0 a8 00 00 0
Then later on:
May 10 12:01:18 goemon kernel: [3552089.226147] lost page write due to I/O error on sdb4
May 10 12:01:18 goemon kernel: [3552089.226312] lost page write due to I/O error on sdb4
May 10 12:10:14 goemon kernel: [3552624.625669] btrfs no csum found for inode 23642 start 0
May 10 12:10:14 goemon kernel: [3552624.625783] btrfs no csum found for inode 23642 start 4096
May 10 12:10:14 goemon kernel: [3552624.625884] btrfs no csum found for inode 23642 start 8192
etc. and more garbage.
At that point, I wanted to shut down the server, check the hardware, and reboot. Shutdown didn't want to proceed completely: btrfs just froze on the sync happening during the shutdown phase, so I had to power off forcibly. Nothing seemed really problematic on the hardware end, and after a reboot, both disks were working properly.
The MD RAID would resynchronize, and the btrfs filesystems would be automatically mounted. Things would work for a while, until messages like these appeared in the logs, with more garbage as above in between:
May 10 14:41:18 goemon kernel: [ 1253.455545] __ratelimit: 35363 callbacks suppressed
May 10 14:45:04 goemon kernel: [ 1478.717749] parent transid verify failed on 358190825472 wanted 42547 found 42525
May 10 14:45:04 goemon kernel: [ 1478.717936] parent transid verify failed on 358316642304 wanted 42547 found 42515
May 10 14:45:04 goemon kernel: [ 1478.717939] parent transid verify failed on 358190825472 wanted 42547 found 42525
May 10 14:45:04 goemon kernel: [ 1478.718128] parent transid verify failed on 358316642304 wanted 42547 found 42515
May 10 14:45:04 goemon kernel: [ 1478.718131] parent transid verify failed on 358190825472 wanted 42547 found 42525
Then kernel btrfs processes would go on and on, eating CPU and I/O, doing whatever they were doing. At such moments, most reads off one of the btrfs volumes would either take very long or freeze, and unmounting would just freeze. Considering that the advantages of btrfs (in my case, mostly snapshots) were outweighed by such issues (this wasn't my first btrfs fuck up, but it was by far the most dreadful), and that btrfs is just so slow compared to other filesystems, I decided I didn't care to try saving these filesystems from their agonizing death, and that I'd just go with ext4 on MD RAID instead. I also didn't want to just try again (with the possibility of going through similar pain) with a more recent kernel.
Fortunately, I had backups of most of the data (the only problem being the time required to restore that amount of data), but for the few remaining things which, by force of bad timing, I didn't have a backup of, I needed to somehow get them back from these btrfs volumes. So I created new file systems to replace the btrfs volumes I could directly throw away, and started recovering data from backups. At the same time, I tried to copy a big disk image off the remaining btrfs volume. Somehow, this worked, with the system load varying between 20 and 60… (with a lot of garbage in the logs, and other services deeply impacted as well). But when trying to copy the remaining files I wanted to recover, things got worse, so I had to initiate a shutdown and power cycle again.
Since the kernel apparently wasn't going to be very helpful, the next step was to get other things working, and to get the data back some other way. What I did was use a virtual machine to get the data off the remaining btrfs volume. The kernel in the VM could become as unusable as it wanted; I could just hard-reboot it without impacting the other services.
In the virtual machine, things got “interesting”. I did try various things I'd seen on the linux-btrfs list, but nothing really did anything at all, except spew some more parent transid messages. I should mention that the remaining btrfs volume was a RAID 0. To mount such a volume, you mount one of the constituting disks, like this:
$ mount /dev/sdb /mnt
Except that it would complain it couldn't find a valid something (I don't remember the exact term, and I threw the VM away already), so it wouldn't mount the volume. But mounting the other constituting disk would just work. Well, that's kind of understandable, but what is not is that on the next boot (I had to reboot a lot, see below), it would error out on the disk that worked previously, and work on the disk that was failing before.
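For what it's worth, a multi-device btrfs normally needs the kernel to know about all constituent devices before the mount succeeds; assuming two disks sdb and sdc, that looks like this:

```shell
# Register all btrfs devices with the kernel, then mount via any of them:
btrfs device scan
mount /dev/sdb /mnt

# Alternatively, list the other devices explicitly at mount time:
mount -o device=/dev/sdc /dev/sdb /mnt
```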
So, here is how things went:
- I would boot the VM and mount the volume,
- launch an rsync of the data to recover, which I’d send onto the host system,
- observe, from the host system, what was going on I/O wise,
- at some point (usually after something like 10 to 50 files had been rsync'ed), after throwing a bunch of parent transid error messages, the VM would just stop doing any kind of I/O (even if left alone for several minutes), at which point I'd hard-shutdown the VM and start over.
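That loop can be sketched roughly as follows; the VM name, paths, and the use of libvirt are assumptions for illustration, not what was actually scripted:

```shell
# Keep restarting the rescue VM and resuming the rsync until it completes.
# --partial keeps partially-transferred files, so each retry resumes progress.
while ! rsync -av --partial rescue-vm:/mnt/data/ /srv/recovered/; do
    virsh destroy rescue-vm      # hard shutdown of the wedged VM
    virsh start rescue-vm        # boot it again
    sleep 120                    # give it time to boot and mount the volume
done
```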
Ain’t that fun?
The good thing is that in the end, despite the pain, I recovered all that needed to be recovered. I’m in the process of recreating my build chroots from scratch, but that’s not exactly difficult. It would just have taken a lot more time to recover them the same way, 50 files at a time.
Side note: yes, I did try newer versions of btrfsck; yes, I did try newer kernels. No, nothing made these btrfs volumes viable again. And no, I don't have an image of these completely fucked up volumes.
It has been two weeks since we switched to faster Linux builds. After some “fun” last week, it is time to look back.
The news that Mozilla will be providing faster Linux builds made it to quite a lot of news sites, apparently. Most of the time with titles as misleading as “as fast as Windows builds”. I love that kind of journalism where “much closer to” is spelled “as fast as”. Anyways, I’ve also seen a number of ensuing comments basically saying that we sucked and that some people had been successfully building with GCC 4.5 for a while, and now with GCC 4.6, so why can’t we do that as well?
Well, for starters, I doubt they've been building with GCC 4.6 for long, and definitely not Firefox 4.0, because we only recently fixed a bunch of C++ conformance problems that GCC 4.6 doesn't like. Update: now that I think of it, I might have mixed things up. That bunch might only be a problem when compiling in C++0x mode (which is now enabled when supported on mozilla-central).
Then, there are fundamental differences between a build someone does for her own use, and Mozilla builds:
- Mozilla builds need to work on as many machines as possible, on as many Linux distros as possible,
- Mozilla builds are heavily tested (yet, not enough).
Builds that run (almost) everywhere
Part of the challenge of using a modern compiler is that newer versions of GCC like to change subtle things in their C++ standard library, making compiled binaries depend on a newer version of libstdc++. Whether that happens pretty much depends on which C++ standard library features are used.
For quite a while, Mozilla builds have been compiled with GCC 4.3, but up to Firefox 3.6, only libstdc++ 4.1 was required. Some new code added in Firefox 4.0, however, changed that, and libstdc++ 4.3 is now required. This is why Firefox 4.0 doesn't work on RedHat/CentOS 5 while Firefox 3.6 did: these systems don't have libstdc++ version 4.3.
Switching to GCC 4.5 (or 4.6, for that matter) would, in Firefox's case, mean requiring libstdc++ version 4.5 (or 4.6). While this is not a problem for people building for their own system, or for distros, it is when you want the binary you distribute to work on most systems, because libstdc++ version 4.5 is much less widespread. So on one hand, we had an outdated toolchain that couldn't handle Profile Guided Optimization properly, and on the other, a more modern toolchain that creates a dependency on a libstdc++ version that is not widespread enough.
At this point, I should point out that an easy way out exists: statically linking libstdc++. The downside is that it makes the binaries significantly bigger.
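For reference, statically linking libstdc++ is straightforward with GCC 4.5 and later, and possible on older versions with a well-known symlink trick; the file names here are illustrative:

```shell
# GCC 4.5+ has a dedicated flag:
g++ -static-libstdc++ -o myapp main.cpp

# On older GCC, expose only the static archive to the linker, so that
# -L. is searched first and no shared libstdc++ is found there:
ln -s "$(g++ -print-file-name=libstdc++.a)" .
g++ -L. -o myapp main.cpp
```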
Fortunately, we found a hackish way to avoid these dependencies on newer libstdc++. It has been extended since, and now allows building Firefox with GCC up to version 4.7, with or without the experimental C++0x mode enabled. The resulting binaries only depend on libstdc++ 4.1, meaning they should work on RedHat/CentOS 5.
Passing the test suites
We have a big test suite, which is probably an understatement: we have many thousands of unit tests, and we try to avoid regressing them. I don't think most people building Firefox run them. Actually, most of the hundreds of Linux distributions don't.
I know, because I also happen to be the Debian maintainer, that Debian does run the test suites on all its architectures, though it skips mochitests because they take too long. As Debian switched to GCC 4.5 a while ago, I knew there weren't regressions in the test suites it runs, at least at the optimization level used by default.
And after the switch to faster Linux builds, we haven't seen regressions either. Well, not exactly, but I'll come back to that further below.
GCC 4.5, optimization levels, and Murphy’s Law
Sadly, after the switch, we weren't getting symbols in crash reports anymore. The problem was that the program used to dump debugging symbols from our binaries, in a form usable for crash report post-processing, didn't output function information. This, in turn, was due to a combination of a missing feature in the dump program and a bug in GCC 4.5 (which seems to be fixed in GCC 4.6) that prevented the necessary information from being present in the DWARF sections when the -freorder-blocks-and-partition option is used. I'll come back to this issue in a subsequent blog post. The short-term (and most probably long-term) solution was to remove the incriminated option.
But while searching for the root cause, we completely disabled PGO, leaving the optimization level at -O3. I had tested GCC 4.5 and -O3 without PGO a few times on the Try server, with no problems other than a few unimportant rounding errors we had decided to ignore by modifying the relevant tests, so I wasn't expecting anything bad.
That was without counting on Murphy's Law, in the form of a permanent Linux x86 reftest regression. That error didn't appear in my previous tests, so it had to have been introduced by some change in the tree. After some quite painful bisecting (I couldn't reproduce the problem with local builds, so I had to resort to the Try server, each build+test run taking between 1 and 2 hours), I narrowed it down to the first part of bug 641426 triggering a change in how GCC optimizes some code and, as a side effect, changing some floating point operations on x86, using memory instead of registers or vice versa, introducing rounding discrepancies between different parts of the code.
While searching for that root cause, we also backed out the switch to aggressive optimization and went back to -Os instead of -O3, so the only remaining change from the switch was the GCC version. And Murphy's Law kicked in yet again, in the form of a permanent Linux x86/x86-64 a11y mochitest regression. As it turned out, that regression had already been spotted on the tracemonkey tree during the couple of days it had PGO enabled without -O3, and it disappeared when the -O3 switch was merged from mozilla-central. But at the time, we didn't track it down. We disabled the tests to open the tree for development, but the issue was still there, just hidden. Now that we're back to aggressive optimization and PGO, we have re-enabled the tests and the issue has gone away, which is kind of scary. We definitely need to find the real issue, which might be related to some uninitialized memory.
What does this all mean?
First, it means that, in some cases, a newer compiler uncovers dormant bugs in our code, and that with the same compiler, different optimization options can lead to different results or breakages.
By extension, this means it is important that we carefully choose our default optimization options, especially when PGO is not used (which is most of the time for non-Mozilla builds). I'm even tempted to say it would be important for us to test these non-PGO defaults, but we can't try all possible compiler versions either.
This also means it is important that Linux distros run our test suites with their builds, especially when they use newer compilers.
A few related thoughts
While handling the transition to this new toolchain, it became clear that the lack of correlation between our code base and our mozconfig files is painful. The best demonstration is the Try server, which now uses GCC 4.5 for all builds by default. But if you push a commit there that doesn't have the necessary libstdc++ compatibility hack, the builds will fail. There are many other cases of changes in our mozconfigs requiring changes in e.g. configure.in, and these are even more reasons to get mozconfigs into our code base.
The various issues we hit in the process also made me reflect on our random oranges. I think we lack one important piece of information when we have a test failure: does it happen reliably with a given build? Chances are that most random oranges don't (like the two I mentioned above), but those that do may point to subtle problems of compiler optimizations breaking some of our assumptions (though so far, most of the time, they just turn into permanent oranges). The self-serve API does help in that regard, allowing us to re-trigger a given test suite on the same build, but I think we should enhance our test harnesses to automatically retry failing tests.
What about GCC 4.6?
I think it's too early to think about GCC 4.6. While it has some improvements over GCC 4.5, it may also bring its own set of surprises. GCC also has a pretty bad history of screwing things up in dot-zero releases, so it is better to wait for 4.6.1, which I hear is planned for release soon. And GCC 4.6 would make things even harder for the Try server and some other branches, considering the C++ conformance problems I mentioned.
Also, most of the people mentioning GCC 4.6 also mention Link Time Optimization, which is the main nicety it brings. Unfortunately, linking Firefox with LTO requires gigabytes of memory, which means several things:
- We need that much memory on our build bots, which I'm not sure they currently have.
- It actually exhausts the 32-bit address space, which means we'd need to cross-compile the 32-bit builds on 64-bit hosts with a 64-bit toolchain. Which, in turn, means changing build bots, and maybe some fun with our build system.
GCC people are working on decreasing the amount of memory required to link, but it's a work in progress and won't be usable until GCC 4.7 (or, who knows, even later). We might have switched to clang before that ;-)
- Go to the Debian Mozilla Team page.
- Select the Debian version you are running, then “Iceweasel” and the version, “5.0”.
- Follow the instructions.
Only amd64 and i386 packages are available. Note that there is another Iceweasel “version” available there: “aurora”. Currently, it is the same as “5.0”, but when Firefox 5.0 reaches the beta stage, “aurora” will become 6.0a2. Feel free to use “aurora” if you want to keep using these pre-beta builds.