Aftermath of the Linux compiler and optimizations changes

It has been two weeks since we switched to faster Linux builds. After some "fun" last week, it is time to look back.

The news that Mozilla will be providing faster Linux builds made it to quite a lot of news sites, apparently, most of the time with titles as misleading as "as fast as Windows builds". I love that kind of journalism, where "much closer to" is spelled "as fast as". Anyways, I've also seen a number of ensuing comments basically saying that we sucked, and that some people had been successfully building with GCC 4.5 for a while, and now with GCC 4.6, so why couldn't we do that as well?

Well, for starters, I doubt they've been building with GCC 4.6 for long, and definitely not Firefox 4.0, because we only recently fixed a bunch of C++ conformance problems that GCC 4.6 doesn't like. Update: now that I think of it, I might have mixed things up. These problems might only appear when compiling in C++0x mode (which is now enabled when supported on mozilla-central).

Then, there are fundamental differences between a build someone does for her own use, and Mozilla builds:

  • Mozilla builds need to work on as many machines as possible, on as many Linux distros as possible,
  • Mozilla builds are heavily tested (yet not enough).

Builds that run (almost) everywhere

Part of the challenge of using a modern compiler is that newer versions of GCC like to change subtle things in their C++ standard library, making compiled binaries dependent on a newer version of libstdc++. Whether that happens largely depends on which C++ standard library features the code uses.

For quite a while, Mozilla builds have been compiled with GCC 4.3, but up to Firefox 3.6, only libstdc++ 4.1 was required. Some new code added to Firefox 4.0, however, changed that, and libstdc++ 4.3 is now required. This is why Firefox 4.0 doesn't work on RedHat/CentOS 5 while Firefox 3.6 did: these systems don't have libstdc++ version 4.3.

Switching to GCC 4.5 (or 4.6, for that matter), in Firefox's case, means requiring libstdc++ version 4.5 (or 4.6). While this is not a problem for people building for their own system, or for distros, it is one when you want the binaries you distribute to work on most systems, because libstdc++ version 4.5 is much less widespread.
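A quick way to see this effect for yourself (a minimal experiment of my own, not something from our build system): compile a trivial program with two different GCC versions and compare which versioned libstdc++ symbols the binaries import.

```cpp
// Build this with two different GCC versions, then compare the versioned
// libstdc++ symbols each binary imports, for instance with:
//   objdump -T ./a.out | grep GLIBCXX
// A binary produced by a newer GCC may import symbols tagged with a newer
// GLIBCXX version (e.g. GLIBCXX_3.4.14, which comes with GCC 4.5) that the
// system libstdc++ on an older distro such as RedHat/CentOS 5 cannot
// provide, so the dynamic linker refuses to start the program there.
#include <iostream>
#include <string>

int main() {
  std::string greeting = "hello"; // ordinary STL usage is enough to pull
  std::cout << greeting << '\n';  // versioned symbols out of libstdc++
  return 0;
}
```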

So on one hand, we had an outdated toolchain that couldn't handle Profile Guided Optimization properly, and on the other, a more modern toolchain that creates a dependency on a libstdc++ version that is not widespread enough.

At this point, I should point out that an easy way out exists: statically linking libstdc++ (for instance with GCC's -static-libstdc++ link flag). The downside is that it makes the binaries significantly bigger.

Fortunately, we found a hackish way to avoid these dependencies on newer libstdc++. It has since been extended, and now allows building Firefox with GCC up to version 4.7, with or without the experimental C++0x mode enabled. The resulting binaries only depend on libstdc++ 4.1, meaning they should work on RedHat/CentOS 5.
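To give an idea of the flavor of such a hack (a heavily simplified, hypothetical sketch, not the actual implementation; the real thing has to match the exact mangled names and semantics of the libstdc++ symbols involved): when the only thing tying a binary to a newer libstdc++ is a handful of out-of-line definitions that older versions lack, one can compile stand-in definitions directly into the binary, so the references bind locally instead of becoming versioned dynamic dependencies.

```cpp
// Hypothetical stand-in, NOT the actual Mozilla code: pretend a newer
// libstdc++ added an out-of-line helper that our STL usage ends up
// referencing. Linking our own definition into the binary means the
// dynamic linker never has to resolve it against the system libstdc++.
namespace std {
  struct hypothetical_helper {          // made-up name for illustration
    void _M_hypothetical_init();        // made-up member, same caveat
  };

  void hypothetical_helper::_M_hypothetical_init() {
    // a reimplementation of the helper's documented behaviour goes here
  }
}

int main() { return 0; } // placeholder so the sketch compiles on its own
```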

Passing the test suites

Saying we have a big test suite is probably an understatement: we have many thousands of unit tests, and we try to keep them from regressing. I don't think most people building Firefox run them. Actually, most of the hundreds of Linux distributions don't.

I know, since I also happen to be the Debian maintainer, that Debian does run the test suites on all its architectures, though it skips mochitests because they take too long. As Debian switched to GCC 4.5 a while ago, I knew there were no regressions in the test suites it runs, at least at the optimization level used by default.

And after the switch to faster Linux builds, we haven't seen regressions either. Well, not exactly, but I'll come back to that further below.

GCC 4.5, optimization levels, and Murphy's Law

Sadly, after the switch, we weren't getting symbols in crash reports anymore. The problem was that the program used to dump debugging symbols from our binaries into a form usable for crash report post-processing didn't output function information. This, in turn, was due to a combination of a missing feature in the dump program and a bug in GCC 4.5 (which seems to be fixed in GCC 4.6) that prevented the necessary information from being present in the DWARF sections when the -freorder-blocks-and-partition option is used. I'll come back to this issue in a subsequent blog post. The short term (and most probably long term) solution was to drop the offending option.

But while searching for that root cause, we completely disabled PGO, leaving the optimization level at -O3. I had tested GCC 4.5 and -O3 without PGO a few times on the Try server, with no other problems than a few unimportant rounding errors we had decided to ignore by modifying the relevant tests, so I wasn't expecting anything bad.

That was without counting on Murphy's Law, in the form of a permanent Linux x86 reftest regression. That error didn't appear in my previous tests, so it had to have been introduced by some change in the tree. After some quite painful bisecting (I couldn't reproduce the problem with local builds, so I had to resort to the Try server, each build+test run taking between 1 and 2 hours), I narrowed it down to the first part of bug 641426, which triggered a change in how GCC optimizes some code and, as a side effect, changed some floating point operations on x86, using memory instead of registers or vice versa, introducing rounding discrepancies between different parts of the code.
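To illustrate the kind of discrepancy involved (a classic x87 example of my own, not the actual failing test): on 32-bit x86, intermediate results can live in 80-bit x87 registers, so whether the compiler keeps a value in a register or spills it to a 64-bit memory slot changes the result.

```cpp
#include <cstdio>

int main() {
  // volatile prevents compile-time folding, so the FPU actually does the
  // arithmetic at run time.
  volatile double big = 1e16, one = 1.0;

  // 1e16 + 1 is exactly representable in an 80-bit x87 register, but not
  // as a 64-bit double: storing the intermediate to memory rounds it back
  // down to 1e16.
  double diff = (big + one) - big;

  // Typically prints 1 when the intermediate stays in an x87 register
  // (e.g. -O2 on x86) and 0 when it goes through memory (e.g. -O0,
  // -ffloat-store, or -mfpmath=sse, which computes in 64 bits throughout).
  std::printf("%g\n", diff);
  return 0;
}
```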

But while searching for that new root cause, we backed out the switch to aggressive optimization and went back to -Os instead of -O3. The only remaining change from the switch was thus the GCC version. And Murphy's Law kicked in yet again, in the form of a permanent Linux x86/x86-64 a11y mochitest regression. As it turned out, that regression had already been spotted on the tracemonkey tree, during the couple of days it had PGO enabled but wasn't using -O3, and it disappeared when the -O3 switch was merged from mozilla-central. At the time, though, we didn't track it down. We disabled the tests to reopen the tree for development, but the underlying issue was still there, just hidden. Now that we're back to aggressive optimization and PGO, we have re-enabled the tests and the failure no longer shows up, which is kind of scary. We definitely need to find the real issue, which might be related to some uninitialized memory.
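For illustration (a made-up example, not the actual a11y bug), here is how uninitialized memory can make a test pass at one optimization level and fail at another:

```cpp
#include <cstdio>

// `ready` is never initialized, so the function returns whatever happens
// to sit in the stack slot or register the compiler assigns to it.
static bool make_ready() {
  bool ready;   // oops: no initializer
  return ready; // undefined behaviour: the value may differ between -Os
}               // and -O3, or between compiler versions

int main() {
  // At one optimization level the stale value may reliably read as zero
  // and the "test" passes; at another it may be garbage and it fails.
  std::printf("%d\n", make_ready() ? 1 : 0);
  return 0;
}
```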

We also had a couple of new intermittent failures that are thought to be related to the GCC 4.5 switch, but all of them go away if we simply re-run the test on the same build.

What does this all mean?

First, it means that in some cases a newer compiler unveils dormant bugs in our code, and that with the same compiler, different optimization options can lead to different results or breakages.

By extension, this means it is important that we carefully choose our default optimization options, especially when PGO is not used (which is most of the time for non-Mozilla builds). I'm even tempted to say it would be important for us to test these non-PGO defaults, though we can't try all possible compiler versions either.

This also means it is important that Linux distros run our test suites with their builds, especially when they use newer compilers.

A few related thoughts

While handling the transition to this new toolchain, it became clear that the lack of correlation between our code base and our mozconfig files is painful. The best demonstration is the Try server, which is now using GCC 4.5 for all builds by default. But if you push a commit there that doesn't have the necessary libstdc++ compatibility hack, the builds will fail. There are many other cases of changes in our mozconfigs requiring changes in e.g. configure.in, and these are even more reasons to get the mozconfigs into our code base.

The various issues we hit in the process also made me reflect on our random oranges. I think we lack one important piece of information when we have a test failure: does it happen reliably with a given build? Chances are that most random oranges don't (like the two I mentioned above), but those that do may point to subtle problems of compiler optimizations breaking some of our assumptions (though so far, most of the time, they just turn into permanent oranges). The self-serve API does help in that regard, allowing us to re-trigger a given test suite on the same build, but I think we should enhance our test harnesses to automatically retry failing tests.

What about GCC 4.6?

I think it's too early to think about GCC 4.6. While it has some improvements over GCC 4.5, it may also bring its own set of surprises. GCC also has a pretty bad history of screwing things up in dot-zero releases, so it would be better to wait for 4.6.1, which I hear is planned for release soon. And GCC 4.6 would make things even harder for the Try server and some other branches, considering the C++ conformance problems I mentioned above.

Also, most of the people mentioning GCC 4.6 also mention Link Time Optimization, which is the main nicety it brings. Unfortunately, LTO linking requires gigabytes of memory (see the sketch after the list below), which means several things:

  • We need that much memory on our build bots, which I'm not sure they currently have,
  • It actually exhausts the 32-bit address space, which means we'd need to cross-compile the 32-bit builds on 64-bit hosts with a 64-bit toolchain. That, in turn, means changing build bots, and maybe some fun with our build system.
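For context, a minimal sketch of what LTO involves, assuming GCC 4.5/4.6 style -flto usage (my own illustration, not from the post itself):

```cpp
// a.cpp: half of a minimal two-file program to try LTO on. Build with:
//   g++ -O2 -flto -c a.cpp
//   g++ -O2 -flto -c b.cpp
//   g++ -O2 -flto a.o b.o -o app
// With -flto, the object files carry GCC's intermediate representation,
// and the "link" step re-optimizes the whole program at once (inlining
// answer() across files, for instance). On something the size of libxul,
// that whole-program step is where the gigabytes of memory go.
extern int answer(); // defined in b.cpp; LTO can inline it across files

int main() { return answer(); }

// b.cpp would contain just:
//   int answer() { return 42; }
```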

GCC people are working on decreasing the amount of memory required to link, but it's a work in progress and won't be workable until GCC 4.7 (or, who knows, even later). We might have switched to clang before that ;-)

2011-05-12 10:18:07+0900

p.m.o


26 Responses to “Aftermath of the Linux compiler and optimizations changes”

  1. Octoploid Says:

    Have you looked into “Identical Code Folding” with gold yet?
    It should save you a few hundred kb in libxul.
    Basically all you need is:
    “-Wl,--icf=all,--icf-iterations=3” in your LDFLAGS and
    maybe “-ffunction-sections -fdata-sections” in your CFLAGS.

    Chromium already uses this:
    http://groups.google.com/a/chromium.org/group/chromium-dev/browse_thread/thread/701ca63b4793268f

  2. Boris Says:

    Octoploid, as the thread you link to says, only gold supports that. We link with GNU ld, last I checked.

  3. firesnail Says:

    I think your version numbers in the GCC 4.6 section are off by one. LTO was introduced in GCC 4.5 but was unusably buggy.
    4.6 enabled WHOPR by default, which should decrease memory use. Are GCC developers making further optimizations?

  4. Scott Baker Says:

    I only understand about 50% of what you said, but I find it fascinating!

  5. Evan M Says:

    We (Chromium) have been hitting the 32-bit address space limit while linking a lot (we don’t build as separate .so files as you do), and we don’t have any good solutions either. On OS X I think we just require a 64-bit linker now, but I suspect cross-compiling is easier there than on Linux.

    One hack a coworker used was setting up a 64-bit system, installing a 32-bit chroot, then replacing the linker within the chroot with the 64-bit one.

    BTW, I can’t say enough nice things about gold. I’m surprised to see you don’t use it.

  6. Dan Says:

    What is the disadvantage of removing -freorder-blocks-and-partition? Does it have a noticeable effect?

  7. WL Says:

    Just a heads up in case we forget: Iceweasel in testing is still at 3.5.x while the rest of the world has been on 4.x for some time now.

    wheezy (testing) (web): Web browser based on Firefox
    3.5.19-2: amd64 armel i386 ia64 kfreebsd-amd64 kfreebsd-i386 mips mipsel powerpc s390 sparc

    “Mozilla will be providing faster Linux builds” doesn’t mean much to Debian (main repo), does it?

  8. Jan Hubicka Says:

    Yes, GCC LTO support is being worked on, and I do test it from time to time with Mozilla. Currently the 32-bit compiler fits in the address space again, and the link time is faster, too. Still, about 4-5GB (in a 32-bit build) is required. I expect this to go down quite noticeably before the 4.7 release. I think the main usability problems of LTO right now are the compile time and debug info quality; both are being worked on.

    The main advantage of GCC 4.6 for you should be the better behaviour of -Os with C++ code (GCC 4.5 regressed here compared to GCC 4.4, as was noticed only after the Mozilla switch; 4.6 should be better than 4.4 in this respect). Note that with PDO, -Os is implicit for all code that is cold in your train run, so it should be interesting.

    Also, you get function reordering for constructors/destructors, which should improve startup, among other things.

    I think the whole situation could be helped by better cooperation between Mozilla and GCC folks.
    Honza

  9. Jan Hubicka Says:

    As for the other questions: -freorder-blocks-and-partition makes GCC split cold code out of functions and put it into a cold text section. I can imagine it confuses the reporting tool, as you need to handle function fragments (i.e. a function is no longer a contiguous interval of code starting at its entry symbol). I am not sure which DWARF bug you are referring to, but I can double check the status in current mainline. For Mozilla I think it is not the most important optimization around, since you mostly worry about the overall size of the binary, which this flag does not decrease.

    I have been building Mozilla with GCC 4.6 snapshots since June or July last year, and I still do. I sporadically update my Mozilla tree, but since my main interest is tracking GCC, I don’t do that more than once a month. The C++ compatibility problems I noticed were minor, and I fixed them and reported them to Mozilla’s Bugzilla. So building Mozilla with 4.6 and GCC development snapshots has indeed been possible for a while. I try to track that status in http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45375

    My impression is that the difficulty of your switch came from combining the compiler update with a change of default optimization flags (-Os to -O3) and with enabling profile directed optimizations.

    This was partly forced by GCC 4.5 producing slower and bigger code at -Os for Mozilla. The size issue is caused by the fact that Mozilla contains a lot of COMDAT functions that are not really shared across units (and thus really should be static, but for some reason are not). While this can be called a coding style issue, it has turned out to be quite common in C++ code, and the fact that GCC hesitated to inline these functions because the overall program size could grow has turned out to be counterproductive. GCC 4.6+ has a fix making it assume that this sharing happens with a fixed probability; it was added not long after Taras Glek reported the problem last year.

    So I would optimistically claim that the switch to 4.6 should be less painful. Also, Taras reported some of the problems early enough in the 4.6 development cycle that they were fixed.

    I would like to understand what makes it so hard to test newer GCCs in your environment. I see that libstdc++ compatibility is one of the problems, but why can’t you just copy libstdc++ into the directory with the other Mozilla libraries? I do that in my builds (since I run GCC 4.6/4.7 binaries on GCC 4.5 systems) and it works fine for me. But perhaps this question is naive; I am not a libstdc++ developer. Paolo Carlini is probably the best person to contact about this.

    In general, noticing problems in a GCC that was released more than a year ago is much less likely to lead to a timely fix than noticing problems in the current development version of GCC (or at least in the most recent release or release branch before the release is made). It seems to me that if we tracked development together more consistently, GCC would get feedback on problems specific to Firefox, while the Firefox code would get more exposure to compatibility problems.

    I would say it should not be that hard to set up a tester and get more problems noticed early.

  10. glandium Says:

    Jan:
    – We actually won’t care about startup initializer position in the binary for Firefox 6, as, if everything goes according to plan, it’s going to preload its libraries.

    – Long story short: -freorder-blocks-and-partition on 4.5 doesn’t emit DWARF ranges for functions. That’ll be better detailed in an upcoming blog post.

    – Building with GCC 4.6 was simply not possible until recently without applying patches, patches that you’d have to know about or come up with. That wouldn’t be a problem for us, but not everyone is a GCC or Mozilla hacker.

    – Actually, everything (except the crash reports) was working fine when switching to GCC 4.5 and -O3 and turning on PDO/PGO. It’s because we had to disable PDO/PGO until we figured out what was wrong with the crash reports that we found out about the other issues. Which, by the way, we would never have seen had we not had the crash reports issue in the first place.

    – As far as my testing goes, with GCC 4.5, the size issue comes at -O3, where the resulting binary is 6 to 7MB bigger than at -Os. With PDO/PGO, which compiles cold code at -Os, there’s almost no growth compared to -Os.

    – Testing and using a new toolchain are different things. We’re unfortunately not currently testing new toolchains, but we are indeed planning to, and that should happen later this year. As for using a new toolchain for the release builds, we really first needed to get off GCC 4.3 to enter the modern era. Now we can go forward more easily.

    Octoploid, Evan M: First things first. Now that we changed GCC, we can go forward with other toolchain changes. But it’s probably going to take some time.

    WL: Your comment is off-topic here. And please try informing yourself on the issue beforehand.

  11. Jan Hubicka Says:

    Hi,
    the -Os issue I was speaking of was the one reported by Taras here: http://gcc.gnu.org/ml/gcc/2010-06/msg00715.html.
    This issue still exists on the 4.5 branch, so your -Os binary with GCC 4.5 should still be slower and bigger than a 4.3 binary. This is solved by -O3, but with -O3 and PDO you still have unnecessarily big and slow cold portions of the program. This may or may not really matter, depending on how well your train run matches your real workloads.

    For the DWARF ranges, please get a testcase into GCC Bugzilla. I am not sure if it is a bug that was fixed in the meantime or just a result of my reorganization of the function splitting code.

    I indeed hope that the move to new compilers will be smoother now ;)

    I don’t know of any significant patches needed for a GCC 4.6 build. The PR you link actually speaks of patches needed to build Firefox with clang. But I don’t update the Mozilla tree that often, so it is possible that I just got lucky and skipped the whole issue. I was however very pleasantly surprised to see Mozilla compiling without any problems with mainline GCC at the time I started playing with this a year ago. I anticipated problems, but there were simply none ;)

  12. Jan Hubicka Says:

    For the -Os issue, we usually tend to avoid touching the inliner heuristics on release branches, since that tends to imply random fun issues. I can however backport the patch for your local GCC 4.5 tree if you consider it important (and given enough backing we can update the 4.5 tree, too; we already did for some inlining issues on that release branch).

  13. Jan Hubicka Says:

    And reading your description, it seems to me that PDO masks the problem with -O3 just because it concludes that the code in question is cold and implicitly switches it to -Os.

    As I already wrote to Taras, it sounds like the usual x87 80-bit precision issue implied by the x86 ABI. You may consider simply setting the x87 control word to 64-bit precision at Mozilla startup (that solves a lot of double rounding issues and might make some FP code a bit faster on some chips), or defaulting to -mfpmath=sse -msse2, depending on whether or not you care about pre-SSE2 era CPUs in your official builds.

    The 64-bit precision trick might run into some issues with exotic code paths in glibc relying on the extra precision, but I don’t think there are that many (it is however outside my area of expertise).

  14. glandium Says:

    Jan: We got between 5 and 20% perf improvement all over the place on our perf test suites with GCC 4.5 + -O3 + PDO/PGO compared to GCC 4.3 + -Os. I’m not sure backporting the patch is worth the risk at this point. Time would probably be better spent on updating other parts of the toolchain and starting to test GCC 4.6.

    For the DWARF ranges, as I said, this is fixed in GCC 4.6 (but it wouldn’t have made a difference for our crash reports; our symbol dumper doesn’t understand them).

  15. Jan Hubicka Says:

    The patch is not particularly risky and should bring improvement for most C++ code. But indeed, moving to 4.6 soon is a lot better solution, and I hope it won’t cause you many headaches ;) Let me know if you run into any problems there. I got about a 4-fold speedup at -Os on our tramp3d C++ insanity benchmark with that change.

    BTW, the 64-bit control word should be a matter of adding -mpc64 to your linking flags. Windows defaults to that, so doing so should reduce the discrepancy between your Windows and Linux x86 test results involving floats. (x86-64 doesn’t care because of the -mfpmath=sse default.)

  16. glandium Says:

    Jan: I’d say it would be a matter of how much effort it would take to backport it versus how much there is to gain as a result (considering it only really matters for cold code that is built with -Os). Anyways, it’s probably too late in the Firefox 6 development cycle to matter much.

  17. glandium Says:

    Jan: I tried -mpc64; it doesn’t solve our reftest failure with -O3 without PDO/PGO (-ffloat-store did, but we don’t want to use that).

  18. oiaohm Says:

    This makes me go “what the?”.

    There is a dead simple solution to the libstdc++ issue. It’s part of the Linux Standard Base.

    No static linking. Use the Linux Standard Base’s dynamic linker and install libstdc++ as needed. At least with this approach, libstdc++ can be updated without rebuilding the binary, and when the distribution catches up you just delete the Firefox-provided copy. The Linux Standard Base dynamic linker adds an application-only lib directory that the application looks at first; if the library is there, it uses that copy.

    That’s a lot simpler than pruning functions. You can be sure distributions will catch up in time, so why do a hack? In the long term, the hack has to be undone, and new functionality that could give cleaner code goes unused.

  19. glandium Says:

    oiaohm: You’re missing the point. If you build with GCC 4.5, you use GCC 4.5 C++ headers, and link against GCC 4.5 libstdc++. When you use some particular C++ STL constructs, you end up requiring symbols that are only present in GCC 4.5 libstdc++, and not in any older libstdc++. See my blog post on the subject.

  20. jidanni Says:

    Sounds like RISKS Digest reading.
    Good thing there is diversity out there lest the whole Internet crash at once…

  21. Jan Hubicka Says:

    Sorry to hear that the control word change did not help. The computation probably happens in float then, and you cannot globally change precision to 32-bit or you’d upset a lot of the libraries you use.

    If the next two releases are bound to your current compiler and you can then move up to 4.6, backporting indeed makes little sense. As for preloading, I think it is more or less a distraction from the fact that the binary touches a lot more code segment pages than it ought to. So we want to solve that problem in any case.

  22. glandium Says:

    Jan: preloading will always be better than not preloading. At the moment, as you say, we touch many more code segment pages than we need, so preloading the entire file works well. But whenever we manage to reduce that amount, we’ll also switch to targeted preloading.

  23. Jan Hubicka Says:

    OK, so the idea would be to use FDO to lay out the code segment so that startup only uses functions at the beginning (and ideally to try to do something with the data segment, even though that is harder), and once the binary is done, feed a list of what to preload into the preloader via another feedback step?
    Or actually, GCC might put a marker into the code segment where the functions known to be executed early at startup end.

    Honza

  24. glandium Says:

    Jan: We could use different sections or whatever. We’re not there yet.

  25. CCG Says:

    http://gcc.gnu.org/ml/gcc/2011-06/msg00273.html

    The GCC 4.6.1 RC is out, with the final release next week. I hope you guys switch to it soon so we can have faster Linux builds.

  26. CCG Says:

    http://gcc.gnu.org/gcc-4.6/

    GCC 4.6.1 Final released today.
