Debian Squeeze + btrfs = FAIL

Executive summary: Don't use btrfs on Debian Squeeze.
Longer summary: Don't use btrfs RAID with the kernel Debian Squeeze comes with.

About six months ago, I set up a new server to handle this web site, mail, and various other things. The system and most services (including web and mail) were set up to use an MD RAID 1 array across two small partitions on two separate disks, and the remaining space was set up as three different btrfs file systems (a rough sketch of the layout follows the list):

  • One btrfs RAID 0 for shared data I wouldn't mind having offline while fixing issues on one disk
  • One btrfs RAID 1 for shared data I would mind having offline while fixing issues on one disk
  • One last btrfs RAID 0 for entirely throwable things such as build chroots
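
The layout was roughly along these lines; a minimal sketch, with hypothetical device names and partition numbers rather than the exact commands used at the time:

# MD RAID 1 for the system and most services (hypothetical partitions; the filesystem on top is not shown)
$ mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1

# btrfs RAID 0 for shared data that can stay offline while fixing a disk
$ mkfs.btrfs -d raid0 /dev/sda2 /dev/sdb2

# btrfs RAID 1 (data and metadata) for shared data that cannot
$ mkfs.btrfs -d raid1 -m raid1 /dev/sda3 /dev/sdb3

# btrfs RAID 0 for throwaway data such as build chroots
$ mkfs.btrfs -d raid0 /dev/sda4 /dev/sdb4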

Three days ago, this happened:

May 10 10:18:04 goemon kernel: [3545898.548311] ata4: hard resetting link
May 10 10:18:04 goemon kernel: [3545898.867556] ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
May 10 10:18:04 goemon kernel: [3545898.874973] ata4.00: configured for UDMA/33

followed by other ATA-related messages, then garbage such as:

May 10 10:18:07 goemon kernel: [3545901.28123] sd3000 d]SneKy:AotdCmad[urn][ecitr
May 10 10:18:07 goemon kernel: 4[550.821 ecio es aawt es ecitr i e)
May 10 10:18:07 goemon kernel: 6[550.824     20 00 00 00 00 00 00 00 <>3491225     16 44 <>3491216]s ::::[d]Ad es:N diinlsneifrain<>3491216]s ::::[d]C:Ra(0:2 00 03 80 06 0<>3491217]edrqet / ro,dvsb etr2272
May 10 10:18:07 goemon kernel: 3[550.837 ad:sb:rshdln etr2252
May 10 10:18:07 goemon kernel: 6[551214]s ::::[d]Rsl:hsbt=I_Kdiebt=RVRSNE<>3491215]s ::::[d]SneKy:AotdCmad[urn][ecitr
May 10 10:18:07 goemon kernel: 4[550.833 ecitrsnedt ihsnedsrpos(nhx:<>3491216]    7 b0 00 00 c0 a8 00 00 0

Then later on:

May 10 12:01:18 goemon kernel: [3552089.226147] lost page write due to I/O error on sdb4
May 10 12:01:18 goemon kernel: [3552089.226312] lost page write due to I/O error on sdb4
May 10 12:10:14 goemon kernel: [3552624.625669] btrfs no csum found for inode 23642 start 0
May 10 12:10:14 goemon kernel: [3552624.625783] btrfs no csum found for inode 23642 start 4096
May 10 12:10:14 goemon kernel: [3552624.625884] btrfs no csum found for inode 23642 start 8192

etc. and more garbage.

At that point, I wanted to shut down the server, check the hardware, and reboot. The shutdown wouldn't proceed completely: btrfs just froze on the sync happening during the shutdown phase, so I had to power off violently. Nothing seemed really problematic on the hardware end, and after a reboot, both disks were working properly.

The MD RAID would resynchronize, and the btrfs filesystems would be mounted automatically. Things would work for a while, until messages like these showed up in the logs, with more garbage like the above in between:

May 10 14:41:18 goemon kernel: [ 1253.455545] __ratelimit: 35363 callbacks suppressed
May 10 14:45:04 goemon kernel: [ 1478.717749] parent transid verify failed on 358190825472 wanted 42547 found 42525
May 10 14:45:04 goemon kernel: [ 1478.717936] parent transid verify failed on 358316642304 wanted 42547 found 42515
May 10 14:45:04 goemon kernel: [ 1478.717939] parent transid verify failed on 358190825472 wanted 42547 found 42525
May 10 14:45:04 goemon kernel: [ 1478.718128] parent transid verify failed on 358316642304 wanted 42547 found 42515
May 10 14:45:04 goemon kernel: [ 1478.718131] parent transid verify failed on 358190825472 wanted 42547 found 42525

Then kernel btrfs processes would go on and on, sucking CPU and I/O, doing whatever they were doing. Whenever that happened, most reads from one of the btrfs volumes would either take very long or freeze, and unmounting would only freeze. At that point, considering that the advantages of btrfs (in my case, mostly snapshots) were outweighed by such issues (this wasn't my first btrfs fuck-up, but it was by far the most dreadful) and by the fact that btrfs is just so slow compared to other filesystems, I decided I didn't want to bother trying to save these filesystems from their agonizing death, and that I'd just go with ext4 on MD RAID instead. I also didn't want to just try again (with the possibility of going through similar pain) with a more recent kernel.

Fortunately, I had backups of most of the data (the only problem being the time required to restore that amount of data), but the few remaining things which, by force of bad timing, I didn't have a backup of, I needed to somehow get back from these btrfs volumes. So I created new file systems to replace the btrfs volumes I could throw away outright and started recovering data from backups. At the same time, I tried to copy a big disk image off the remaining btrfs volume. Somehow, this worked, with the system load varying between 20 and 60 (with a lot of garbage in the logs, and other services deeply impacted as well). But when trying to copy the remaining files I wanted to recover, things got worse, so I had to initiate a shutdown and power cycle again.

Since the kernel apparently wasn't going to be very helpful, the next step was to just get other things working and get the data back some other way. What I did was use a virtual machine to get the data off the remaining btrfs volume: its kernel could become as unusable as it wanted, I could just hard-reboot it without impacting the other services.
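
The idea is simply to hand the raw member disks to a disposable guest, so that only the guest kernel gets wedged; a minimal sketch with qemu-kvm, assuming hypothetical device names and a rescue image:

# give the guest raw access to both members of the btrfs RAID 0 (hypothetical devices)
# and boot it from a rescue image (rescue.iso is a placeholder)
$ kvm -m 1024 \
      -drive file=/dev/sdb,if=virtio,format=raw \
      -drive file=/dev/sdc,if=virtio,format=raw \
      -cdrom rescue.iso -boot d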

In the virtual machine, things got "interesting". I did try various things I'd seen on the linux-btrfs list, but nothing really did anything at all except spew some more parent transid messages. I should mention that the remaining btrfs volume was a RAID 0. To mount those, you'd mount one of the constituting disks like this:

$ mount /dev/sdb /mnt

Except that it would complain that it couldn't find a valid something (I don't remember the exact term, and I've thrown the VM away already), so it wouldn't mount the volume. But when mounting the other constituting disk, it would just work. Well, that's kind of understandable, but what is not is that on the next boot (I had to reboot a lot, see below), it would error out on the disk that had worked previously, and work on the disk that had been failing before.
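
For reference, the usual way to get a multi-device btrfs mounted is to let the kernel know about all of its members first, either with a device scan or with the device= mount option; a minimal sketch, assuming /dev/sdb and /dev/sdc are the two constituting disks:

# register every btrfs member device with the kernel, then mount via any one of them
$ btrfs device scan
$ mount /dev/sdb /mnt

# or name the other member explicitly at mount time
$ mount -o device=/dev/sdc /dev/sdb /mnt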

So, here is how things went:

  • I would boot the VM and mount the volume,
  • launch an rsync of the data to recover, which I'd send onto the host system,
  • observe, from the host system, what was going on I/O wise,
  • at some point (usually after something like 10 to 50 files had been rsync'ed), after throwing a bunch of parent transid error messages, the VM would just stop doing any kind of I/O (even when left alone for several minutes), at which point I'd hard shutdown the VM and start over (one such iteration is sketched below).
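
One iteration of that loop looked roughly like this; a sketch with hypothetical paths and guest device names, not the exact commands used:

# inside the guest: mount whichever constituting disk happens to work on this boot
$ mount /dev/vda /mnt || mount /dev/vdb /mnt

# push the files to recover back to the host (hypothetical paths); rsync skips files
# already copied, so each pass resumes roughly where the previous one stalled
$ rsync -av /mnt/data/to/recover/ user@host:/srv/recovered/

# ...until all I/O stops, then hard-reset the guest from the host and start over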

Ain't that fun?

The good thing is that in the end, despite the pain, I recovered all that needed to be recovered. I'm in the process of recreating my build chroots from scratch, but that's not exactly difficult. It would just have taken a lot more time to recover them the same way, 50 files at a time.

Side note: yes, I did try newer versions of btrfsck; yes, I did try newer kernels. No, nothing made these btrfs volumes viable again. No, I don't have an image of these completely fucked-up volumes.

2011-05-13 12:13:32+0900



15 Responses to “Debian Squeeze + btrfs = FAIL”

  1. Octoploid Says:

    Lesson learned:
    Never use a file system without a working fsck in production.

  2. paulcarroty Says:

    # mount -t btrfs /dev/sdb /mnt (in squeeze)
    I’ve been using btrfs for a year with no failures.
    > Never use a file system without a working fsck in production.
    Use btrfsck.

  3. Octoploid Says:

    paulcarroty:
    btrfsck is read-only. (IOW it doesn’t correct any problems)

  4. foo Says:

    How about the backported 2.6.38 kernel?

  5. glandium Says:

    foo: I can’t say whether I’d have had the same problems had I been running a backported 2.6.38 from the start. What I can say, however, is that using 2.6.38 on the hosed volumes didn’t make any difference. Actually, it did: it would only allow copying 2~5 files before stalling, instead of 10~50 with 2.6.32…

  6. Alex Says:

    I had the same problem about a year ago: After a crash, the system became unusable as soon as the btrfs partition was mounted.

    Fortunately linux 2.6.34 had just come out which introduced the “subvolid” mount option. This allowed me to directly specify the id (not the name) of the subvolume I wanted to mount.

    Since I had made hourly btrfs snapshots and only the main subvolume was damaged, I only lost 20 min of work. However, finding this solution and actually recovering the data took nearly a week.

    So after all: btrfs is super-cool, but I wouldn’t even use it on a not-so-important machine. Why? Because it will eventually break and you’ll spend (too) much time bringing the machine back online.

  7. gil Says:

    Does the word “experimental” ring a bell? For a file system it means what it means: more prone to crashes and not to be used in a production environment… So, too bad, but expected.

  8. glandium Says:

    gil: that’s why I didn’t put anything critical on these file systems. I wasn’t however expecting the whole system to be deeply impacted by accesses to a broken file system.

  9. Grzes Says:

    Offtopic, but maybe interesting for other people – when do you think we can expect a new Iceweasel in unstable?

  10. jidanni Says:

    Well I got vfat USB data corruption except in single user mode
    http://lists.debian.org/debian-user/2011/05/msg00872.html
    I just hope it doesn’t happen anymore.

  11. Debian Squeeze + btrfs = FAIL | Debian-News.net - Your one stop for news about Debian Says:

    […] Executive summary: Don’t use btrfs on Debian Squeeze. Longer summary: Don’t use btrfs RAID with the kernel Debian Squeeze comes with. More in this blog here […]

  12. Eric Says:

    btrfs is a mess once you actually try to use the advanced features. Snapshots are clones and come with a host of smaller issues (you can’t expose them to users or else they could be writing to them; they’re not great for backups either, since they are mutable). Clones are not recursive, so you can’t get a point-in-time clone of all sub-filesystems at once. Attempting to limit the volume size just returns nonsense at the prompt. Setting up a filesystem “raid1” is not really raid1, but copies=2, which is a subtle but significant difference. Developers are defensive on IRC, often attacking rather than attempting to understand these issues.

    I could go on all night long, because I so desperately want checksumming and snapshotting in a day-to-day linux filesystem.

  13. LVM snapshots in a file instead of wasting pre allocated space? Says:

    […] help others, this might be relevant: http://glandium.org/blog/?p=2059 here a few excepts: —- …Debian Squeeze + btrfs = FAIL …Lesson learned: Never use a file […]

  14. fgbreel Says:

    Use a recent kernel, 3.2 for example.
    DM on 2.6.35 doesn’t support dm merge; I had a problem with LVM merging an LVM snapshot, then I installed a 3.2 series kernel and that solved it.

  15. Dmitry Smirnov Says:

    The title of this article creates the wrong impression that Debian is somehow responsible for your Btrfs troubles. Perhaps it would be more accurate to re-title it as “Linux 2.6 + Btrfs = Fail”?
    Besides, in my experience Btrfs snapshots were not working properly/reliably until late revisions of Linux 3.16, while even basic Btrfs features stabilised only around 3.12…3.14 (with some regressions in early releases)…
