Executive summary: Don’t use btrfs on Debian Squeeze. Longer summary: Don’t use btrfs RAID with the kernel Debian Squeeze comes with.
About six months ago, I set up a new server to handle this web site, mail, and various other things. The system and most services (including web and mail) was set to use an MD RAID 1 array across two small partitions on two separate disks, and the remaining space was setup in three different btrfs file systems:
- One btrfs RAID 0 for shared data I wouldn’t mind having offline while fixing issues on one disk
- One btrfs RAID 1 for shared data I would mind having offline while fixing issues on one disk
- One last btrfs RAID 0 for entirely throwable things such as build chroots
Three days ago, this happened:
May 10 10:18:04 goemon kernel: [3545898.548311] ata4: hard resetting link May 10 10:18:04 goemon kernel: [3545898.867556] ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310) May 10 10:18:04 goemon kernel: [3545898.874973] ata4.00: configured for UDMA/33
followed by other ATA related messages, then, garbage such as:
May 10 10:18:07 goemon kernel: [3545901.28123] sd3000 d]SneKy:AotdCmad[urn][ecitr May 10 10:18:07 goemon kernel: 4[550.821 ecio es aawt es ecitr i e) May 10 10:18:07 goemon kernel: 6[550.824 20 00 00 00 00 00 00 00 <>3491225 16 44 <>3491216]s ::::[d]Ad es:N diinlsneifrain<>3491216]s ::::[d]C:Ra(0:2 00 03 80 06 03491217]edrqet / ro,dvsb etr2272 May 10 10:18:07 goemon kernel: 3[550.837 ad:sb:rshdln etr2252 May 10 10:18:07 goemon kernel: 6s ::::[d]Rsl:hsbt=I_Kdiebt=RVRSNE<>3491215]s ::::[d]SneKy:AotdCmad[urn][ecitr May 10 10:18:07 goemon kernel: 4[550.833 ecitrsnedt ihsnedsrpos(nhx:<>3491216] 7 b0 00 00 c0 a8 00 00 0
Then later on:
May 10 12:01:18 goemon kernel: [3552089.226147] lost page write due to I/O error on sdb4 May 10 12:01:18 goemon kernel: [3552089.226312] lost page write due to I/O error on sdb4 May 10 12:10:14 goemon kernel: [3552624.625669] btrfs no csum found for inode 23642 start 0 May 10 12:10:14 goemon kernel: [3552624.625783] btrfs no csum found for inode 23642 start 4096 May 10 12:10:14 goemon kernel: [3552624.625884] btrfs no csum found for inode 23642 start 8192
etc. and more garbage.
At that point, I wanted to shutdown the server, check the hardware, and reboot. Shutdown didn’t want to proceed completely. Btrfs just froze on the
sync happening during the shutdown phase, so I had to power off violently. Nothing seemed really problematic on the hardware end, and after a reboot, both disks were properly working.
The MD RAID would resynchronize, and the btrfs filesystems would be automatically mounted. It would work for a while, until such things could be seen in the logs, with more garbage as above in between:
May 10 14:41:18 goemon kernel: [ 1253.455545] __ratelimit: 35363 callbacks suppressed May 10 14:45:04 goemon kernel: [ 1478.717749] parent transid verify failed on 358190825472 wanted 42547 found 42525 May 10 14:45:04 goemon kernel: [ 1478.717936] parent transid verify failed on 358316642304 wanted 42547 found 42515 May 10 14:45:04 goemon kernel: [ 1478.717939] parent transid verify failed on 358190825472 wanted 42547 found 42525 May 10 14:45:04 goemon kernel: [ 1478.718128] parent transid verify failed on 358316642304 wanted 42547 found 42515 May 10 14:45:04 goemon kernel: [ 1478.718131] parent transid verify failed on 358190825472 wanted 42547 found 42525
Then there would be kernel btrfs processes going on and on sucking CPU and I/O, doing whatever it was doing. At such moment, most file reading off one of the btrfs volumes would either take very long or freeze, and un-mounting would only freeze. At that point, considering the advantages of btrfs (in my case, mostly, snapshots) were outweighed by such issues (this wasn’t my first btrfs fuck up, but by large, the most dreadful) and the fact that btrfs is just so slow compared to other filesystems, I decided I didn’t want to care trying to save these filesystems from their agonizing death, and that I’d just go with ext4 on MD RAID instead. Also, I didn’t want to just try (with the possibility of going through similar pain) again with a more recent kernel.
Fortunately, I had backups of most of the data (only problem being the time required to restore that amount of data), but for the few remaining things which, by force of bad timing, I didn’t have a backup of, I needed to somehow get them back from these btrfs volumes. So I created new file systems to replace the btrfs volumes I could directly throw away and started recovering data from backups. I also, at the same time, tried to copy a big disk image from the remaining btrfs volume. Somehow, this worked, with the system load varying between 20 and 60… (with a lot of garbage in the logs and other services deeply impacted as well) But when trying to copy the remaining files I wanted to recover, things got worse, so I had to initiate a shutdown, and power cycle again.
Since apparently the kernel wasn’t going to be very helpful, the next step was to just get other things working, and get the data back some other way. What I did was to use a virtual machine to get the data off the remaining btrfs volume. The kernel could become unusable all it wanted to, I could just hard reboot without impacting the other services.
In the virtual machine, things got “interesting”. I did try various things I’ve seen on the linux-btrfs list, but nothing really did anything at all except spew some more
parent transid messages. I should mention that the remaining btrfs volume was a RAID 0. To mount those, you’d mount one of the constituting disks like this:
$ mount /dev/sdb /mnt
Except that it would complain that it can’t find a valid whatever (I don’t remember the exact term, and I threw the VM away already) so it wouldn’t mount the volume. But when mounting the other constituting disk, it would just work. Well, that’s kind of understandable, but what is not is that on the next boot (I had to reboot a lot, see below), it would error out on the disk that worked previously, and work on the disk that was failing before.
So, here is how things went:
- I would boot the VM and mount the volume,
- launch an rsync of the data to recover, which I’d send onto the host system,
- observe, from the host system, what was going on I/O wise,
- at some point (usually after something like 10 to 50 files rsync’ed), after throwing a bunch of
parent transiderror messages, the VM would just stop doing any kind of I/O (even if left alone for several minutes), at which point I’d hard shutdown the VM and start over.
Ain’t that fun?
The good thing is that in the end, despite the pain, I recovered all that needed to be recovered. I’m in the process of recreating my build chroots from scratch, but that’s not exactly difficult. It would just have taken a lot more time to recover them the same way, 50 files at a time.
Side note: yes, I did try newer versions of btrfsck ; yes I did try newer kernels. No, nothing worked to make these btrfs volumes viable. No, I don’t have an image of these completely fucked up volumes.