It has been two weeks since we switched to faster Linux builds. After some "fun" last week, it is time to look back.
The news that Mozilla will be providing faster Linux builds made it to quite a lot of news sites, apparently. Most of the time with titles as misleading as "as fast as Windows builds". I love that kind of journalism where "much closer to" is spelled "as fast as". Anyways, I've also seen a number of ensuing comments basically saying that we sucked and that some people had been successfully building with GCC 4.5 for a while, and now with GCC 4.6, so why can't we do that as well?
Well, for starters, I doubt they've been building with GCC 4.6 for long, and definitely not Firefox 4.0, because we only recently fixed a bunch of C++ conformance problems that GCC 4.6 doesn't like. Update: now that I think of it, I might have mixed things up. That bunch might only become a problem when compiling in C++0x mode (which is now enabled when supported on mozilla-central).
Then, there are fundamental differences between a build someone does for her own use, and Mozilla builds:
- Mozilla builds need to work on as many machines as possible, on as many Linux distros as possible,
- Mozilla builds are heavily tested (yet, not enough).
Builds that run (almost) everywhere
One part of the challenge of using a modern compiler is that newer versions of GCC like to change subtle things in their C++ standard library, making compiled binaries dependent on a newer version of libstdc++. Whether that happens pretty much depends on which C++ standard library features are used.
For quite a while, Mozilla builds have been compiled with GCC 4.3, but up to Firefox 3.6, only libstdc++ 4.1 was required. Some new code added in Firefox 4.0, however, changed that, and libstdc++ 4.3 is now required. This is why Firefox 4.0 doesn't work on RedHat/CentOS 5 while Firefox 3.6 did: these systems don't ship libstdc++ 4.3.
Switching to GCC 4.5 (or 4.6, for that matter) would, in Firefox's case, mean requiring libstdc++ 4.5 (or 4.6). While this is not a problem for people building for their own system, or for distros, it is a problem when you want the binary you distribute to work on most systems, because libstdc++ 4.5 is much less widespread.
So on one hand, we had an outdated toolchain that couldn't handle Profile Guided Optimization properly, and on the other, a more modern toolchain that creates a dependency on a libstdc++ version that is not widespread enough.
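As an aside, it is easy to check which libstdc++ a given binary ends up requiring, by looking at the versioned symbols it imports. A quick sketch (the libxul.so path is just an example, and the mapping from GLIBCXX_3.4.x symbol versions back to GCC releases is documented in the libstdc++ ABI documentation):

```sh
# List the GLIBCXX symbol versions a binary pulls from libstdc++.so.6.
# The highest version listed determines the oldest libstdc++ (and thus
# the oldest distros) the binary can run with.
objdump -T objdir/dist/bin/libxul.so | grep -o 'GLIBCXX_[0-9.]*' | sort -u
```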
At this point, I should point out that an easy way out exists: statically linking libstdc++. The downside is that it makes the binaries significantly bigger.
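For reference, recent GCC versions have a driver flag to do just that; in a mozconfig it would come down to something like this (a sketch, with the size cost mentioned above still applying):

```sh
# Link libstdc++ statically instead of depending on the system libstdc++.so.6.
# This removes the runtime version requirement, at the cost of noticeably
# bigger binaries.
export LDFLAGS="-static-libstdc++"
```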
Fortunately, we found a hackish way to avoid these dependencies on a newer libstdc++. It has been extended since, and now allows building Firefox with GCC up to version 4.7, with or without the experimental C++0x mode enabled. The resulting binaries only depend on libstdc++ 4.1, meaning they should work on RedHat/CentOS 5.
Passing the test suites
We have a big test suite, which is probably an understatement: we have many thousands of unit tests. And we try to keep these unit tests from regressing. I don't think most people building Firefox run them. Actually, most of the hundreds of Linux distributions don't.
I know, because I also happen to be the Debian maintainer, that Debian does run the test suites on all its architectures, though it skips mochitests because they take too long. As Debian switched to GCC 4.5 a while ago, I knew there were no regressions in the test suites it runs, at least at the optimization level used by default.
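For those who have never run them, the main suites can be launched from an objdir with a few make targets; roughly (the target names have changed over time, so treat this as an illustration):

```sh
# Run some of the main test suites from an existing build tree.
OBJDIR=obj-firefox                 # example objdir name

make -C "$OBJDIR" xpcshell-tests   # xpcshell unit tests
make -C "$OBJDIR" reftest          # rendering correctness tests
make -C "$OBJDIR" crashtest        # crash regression tests
make -C "$OBJDIR" mochitest-plain  # the (long) mochitests Debian skips
```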
And after the switch to faster Linux builds, we haven't seen regressions either. Well, not exactly, but I'll come back to that further below.
GCC 4.5, optimization levels, and Murphy's Law
Sadly, after the switch, we weren't getting symbols in crash reports anymore. The problem was that the program used to dump debugging symbols from our binaries, in a form usable for crash report post-processing, didn't output function information. This, in turn, was due to a combination of a lack of functionality in the dump program and a bug in GCC 4.5 (which seems to be fixed in GCC 4.6) that prevented the necessary information from being present in the DWARF sections when the -freorder-blocks-and-partition option is used. I'll come back to this issue in a subsequent blog post. The short term (and most probably long term) solution was to remove the offending option.
But while searching for that root cause, we completely disabled PGO, leaving the optimization level at -O3. I had tested GCC 4.5 and -O3 without PGO a few times on the Try server, with no other problems than a few unimportant rounding errors we decided to ignore by modifying the relevant tests, so I wasn't expecting anything bad.
That was without counting on Murphy's Law, in the form of a permanent Linux x86 reftest regression. That error didn't appear in my previous tests, so it had to have been introduced by some change in the tree. After some quite painful bisecting (I couldn't reproduce the problem with local builds, so I had to resort to the Try server, with each build+test run taking between 1 and 2 hours), I narrowed it down to the first part of bug 641426 triggering a change in how GCC optimizes some code and, as a side effect, changing some floating point operations on x86 to use memory instead of registers or vice versa, introducing rounding discrepancies between different parts of the code.
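To give an idea of the kind of discrepancy involved: the x87 unit computes with 80-bit precision, so whether an intermediate result stays in a register or gets spilled to memory (and thus rounded to 64 bits) can change the final value. A small standalone illustration, unrelated to the actual code involved in that bug (it assumes a GCC able to produce 32-bit x87 code):

```sh
cat > fp-demo.c << 'EOF'
#include <stdio.h>

int main(void)
{
    /* volatile keeps the compiler from folding everything at compile time */
    volatile double big = 1e16, small = 2.9999;
    double tmp = big + small;  /* may stay in an 80-bit x87 register...  */
    double r   = tmp - big;    /* ...or be rounded to 64 bits in memory  */
    printf("%.4f\n", r);
    return 0;
}
EOF

gcc -m32 -O2 -o fp-reg fp-demo.c                 # intermediate may stay in a register
gcc -m32 -O2 -ffloat-store -o fp-mem fp-demo.c   # force intermediates through memory
./fp-reg   # typically prints 2.9999
./fp-mem   # typically prints 2.0000
```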
Also while searching for that root cause, we backed out the switch to aggressive optimization and went back to -Os instead of -O3. The only remaining change from the switch was thus the GCC version. And Murphy's Law kicked in yet again, in the form of a permanent Linux x86/x86-64 a11y mochitest regression. As it turned out, that regression had already been spotted on the tracemonkey tree during the couple of days it had PGO enabled but wasn't yet using -O3, and it disappeared when the -O3 switch was merged from mozilla-central. At the time, though, we didn't track it down. We disabled the tests to reopen the tree for development, but the issue was still there, just hidden. Now that we're back to aggressive optimization and PGO, we have re-enabled the test and the issue has gone away, which is kind of scary. We definitely need to find the real issue, which might be related to some uninitialized memory.
We also had a couple of new intermittent failures that are thought to be related to the GCC 4.5 switch, but all of them go away if we simply re-run the test on the same build.
What does this all mean?
First, it means that in some cases a newer compiler unveils dormant bugs in our code. And that with the same compiler, different optimization options can lead to different results or breakages.
By extension, this means it is important that we carefully choose our default optimization options, especially when PGO is not used (which is most of the time for non-Mozilla builds). I'm even tempted to say it would be important for us to test these non-PGO defaults, but we can't try all possible compiler versions either.
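To make that a bit more concrete, the knobs in question boil down to a handful of mozconfig lines; the compiler version and flags below are illustrative values, not our official defaults:

```sh
# Illustrative mozconfig for a plain optimized (non-PGO) build.
export CC="gcc-4.5"
export CXX="g++-4.5"

ac_add_options --enable-application=browser
ac_add_options --enable-optimize="-O3"   # or -Os, the previous default
# PGO is handled separately by the build automation and is simply not
# enabled here.
```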
This also means it is important that Linux distros run our test suites with their builds, especially when they use newer compilers.
A few related thoughts
While handling the transition to this new toolchain, it became clear that the lack of correlation between our code base and our mozconfig files is painful. The best demonstration is the Try server, which is now using GCC 4.5 for all builds by default. But if you push a commit there that doesn't have the necessary libstdc++ compatibility hack, the builds will fail. There are many other cases of changes in our mozconfigs requiring changes in e.g. configure.in, and these are even more reasons to get mozconfigs into our code base.
The various issues we got in the process also made me reflect on our random oranges. I think we lack one important piece of information when we have a test failure: does it reliably happen with a given build? Chances are that most random oranges don't (like the two I mentioned further above), but those that do may point out subtle problems of compiler optimizations breaking some of our assumptions (though so far, most of the time, they just turn into permanent oranges). The self-serve API does help in that regard, allowing us to re-trigger a given test suite on the same build, but I think we should enhance our test harnesses to automatically retry failing tests.
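In the meantime, a crude way to get that information is to wrap a suite invocation so that a failure is immediately retried against the very same build; a sketch, reusing the illustrative make targets from above:

```sh
# Re-run a failing suite once on the same build, to distinguish
# "fails reliably with this binary" from "random orange".
run_twice_on_failure() {
    "$@" && return 0
    echo "First run failed; retrying on the same build..." >&2
    "$@"
}

run_twice_on_failure make -C obj-firefox mochitest-plain
```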
What about GCC 4.6?
I think it's too early to think about GCC 4.6. While it has some improvements over GCC 4.5, it may also bring its own set of surprises. GCC also has a pretty bad history of screwing things up in dot-zero releases, so it would be better to wait for 4.6.1, which I hear is planned for soon. And GCC 4.6 would make things even harder for the Try server and some other branches, considering the C++ conformance problems I mentioned above.
Also, most of the people mentioning GCC 4.6 also mention Link Time Optimization, which is the main nicety it brings. Unfortunately, linking with LTO requires gigabytes of memory, which means several things:
- We need that much memory on our build bots, which I'm not sure they currently have,
- It actually exhausts the 32-bit address space, which means we'd need to cross-compile the 32-bit builds on 64-bit hosts with a 64-bit toolchain. Which, in turn, means changing build bots, and maybe some fun with our build system.
GCC people are working on decreasing the amount of memory required to link, but it's work in progress and won't be workable until GCC 4.7 (or, who knows, even later). We might have switched to clang before that ;-)
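For the curious, experimenting with LTO in a local build would, in principle, come down to something like the mozconfig fragment below. This is only a sketch: the exact flags, and whether the build system propagates them to the link step, are assumptions, and as said above the link step would need a lot of memory (and possibly a 64-bit toolchain for 32-bit builds):

```sh
# Hypothetical mozconfig fragment for experimenting with LTO (GCC 4.6+).
export CC="gcc-4.6"
export CXX="g++-4.6"
ac_add_options --enable-optimize="-O3 -flto"
export LDFLAGS="-flto"   # -flto is needed at link time as well
```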