Standing up the Cross-Compilation of Firefox for Windows on Linux
I've spent the past few weeks, and will spend the next few weeks, setting up cross-compiled builds of Firefox for Windows on Linux workers on Mozilla's CI. What follows is a long wall of text; if that's too much for you, you may want to check the TL;DR near the end. If you're a Windows user wondering about the Windows Subsystem for Linux, please at least check the end of the post.
What is it?
Traditionally, compiling software happens mostly on the platform it is going to run on. Obviously, this becomes less true when you're building software that runs on smartphones, because you're usually not developing on said smartphone. This is where Cross-Compilation comes in.
Cross-Compilation is compiling for a platform that is not the one you're compiling on.
Cross-Compilation is less frequent for desktop software, because most developers will be testing the software on the machine they are building it with, which means building software for macOS on a Windows PC is not all that interesting to begin with.
Continuous Integration, on the other hand, in the era of "build pipelines", doesn't necessarily care that the software is built in the same environment as the one it runs on, or is being tested on.
But... why?
Five years ago or so, we started building Firefox for macOS on Linux. The main drivers, as far as I can remember, were resources and performance, and they were both tied: the only (legal) way to run macOS in a datacenter is to rack... Macs. And it's not like Apple had been producing rackable, server-grade machines. Okay, they have, but that didn't last. So we were using aging Mac minis. Switching to Linux machines led to faster compilation times, and allowed us to recycle the Mac minis to grow the pool running tests.
But, you might say, Windows runs on standard, rackable, server-grade machines. Or on virtually all cloud providers. And that is true. But for the same hardware, it turns out Linux performs better (more on that below), and the cost per hour per machine is also increased by the Windows license.
But then... why only now?
Firefox has a legacy of more than 20 years of development. That shows in its build system. All the things that allow cross-compiling Firefox for Windows on Linux only lined up recently.
The first of them is the compiler. You might interject with "mingw something something", but the reality is that binary compatibility for accessibility (screen readers, etc.) and plugins (Flash is almost dead, but not quite) required Microsoft Visual C++ until recently. What changed the deal is clang-cl, and Mozilla stopped using MSVC for the builds of Firefox it ships as of Firefox 63, about 20 months ago.
Another is the process of creating the symbol files used to process crash reports, which was using one of the tools from breakpad to dump the debug info from PDB files in the right format. Unfortunately, that was using a Windows DLL to do so. What recently changed is that we now have a platform-independent tool to do this, that doesn't require that DLL. And to place credit where credit is due, this was thanks to the people from Sentry providing Rust crates for most of the pieces necessary to do so.
Another is the build system itself, which assumed in many places that building for Windows meant you were on Windows, which doesn't help cross-compiling for Windows. But worse than that, it also assumed that the compiler for the build machine and the compiler for the target were similar. This worked fine when cross-compiling for Android or macOS on Linux because compiling tools for the build itself (most notably a clang plugin) and compiling Firefox use compatible compilers that take the same kind of arguments. The story is different when one of the compilers is clang, which has command line arguments like GCC, and the other is clang-cl, which has command line arguments like MSVC. This changed recently with work to allow building Android Geckoview on Windows (I'm not entirely sure all the pieces for that are there just yet, but the ones in place surely helped me; I might have inadvertently broken some things, though).
So how does that work?
The above is unfortunately not the whole story, so when I started looking a few weeks ago, the idea was to figure out how far off we were, and what kind of shortcuts we could take to make it happen.
It turns out we weren't that far off, and for a few things, we could work around by... just running the necessary Windows programs with Wine, with some tweaks to the build system (ironically, that means the tool to create symbol files didn't matter). For others... more on that further below.
But let's start by looking at how you could try this for yourself, now that the blockers have been fixed.
First, what do you need?
- A copy of Microsoft Visual C++. Unfortunately, we still need some of the tools it contains, like the assembler, as well as the platform development files.
- A copy of the Windows 10 SDK.
- A copy of the Windows Debug Interface Access (DIA) SDK.
- A good old VFAT filesystem, large enough to hold a copy of all the above.
- A WOW64-supporting version of Wine (wine64).
- A full install of clang, including clang-cl (it usually comes along).
- A copy of the Windows version of clang-cl (yes, both a Linux clang-cl and a Windows clang-cl are required at the moment, more on this further below).
Next, you need to set up a .mozconfig that sets the right target:
ac_add_options --target=x86_64-pc-mingw32
(Note: the target will change in the future)
You also need to set a few environment variables:
- WINDOWSSDKDIR, with the full path to the base of the Windows 10 SDK in your VFAT filesystem.
- DIA_SDK_PATH, with the full path to the base of the Debug Interface Access SDK in your VFAT filesystem.
You also need to ensure all the following are reachable from your $PATH:
- wine64
- ml64.exe (somewhere in the copy of MSVC in your VFAT filesystem, under a Hostx64/x64 directory)
- clang-cl.exe (you also need to ensure it has the executable bit set)
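The requirements above lend themselves to a quick preflight check before starting a long build. What follows is only a minimal sketch in Python, not something that ships with the Firefox build system: the function name check_cross_build_env is made up for illustration, and the only things taken from the instructions above are the two environment variable names and the three program names.

```python
import os
import shutil

# Names taken from the setup instructions; everything else here is illustrative.
REQUIRED_ENV_VARS = ("WINDOWSSDKDIR", "DIA_SDK_PATH")
REQUIRED_TOOLS = ("wine64", "ml64.exe", "clang-cl.exe")


def check_cross_build_env(environ=None):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    environ = os.environ if environ is None else environ
    problems = []
    for name in REQUIRED_ENV_VARS:
        value = environ.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not os.path.isdir(value):
            problems.append(f"{name} does not point to a directory: {value}")
    search_path = environ.get("PATH", "")
    for tool in REQUIRED_TOOLS:
        # shutil.which only returns files that are executable, so this also
        # catches a clang-cl.exe that lacks the executable bit.
        if shutil.which(tool, path=search_path) is None:
            problems.append(f"{tool} is not reachable (or not executable) from PATH")
    return problems


if __name__ == "__main__":
    for problem in check_cross_build_env():
        print("problem:", problem)
```

Running it with no output means the variables point at existing directories and the three programs were found; it does not verify that the SDK directories actually contain what the build expects.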
And I think that's about it. If not, please leave a comment or ping me on Matrix (@glandium:mozilla.org), and I'll update the instructions above.
With an up-to-date mozilla-central, you should now be able to use ./mach build, and get a fresh build of Firefox for 64-bit Windows as a result (well, not right now as of writing: the final pieces only just landed on autoland, and they will be on mozilla-central in a few hours).
What's up with that VFAT filesystem?
You probably noticed I was fairly insistent about some things being in a VFAT filesystem. The reason is filesystem case-(in)sensitivity. As you probably know, filesystems on Windows are case-insensitive. If you create a file Foo, you can access it as foo, FOO, fOO, etc.
On Linux, filesystems are most usually case-sensitive. So when some C++ file contains #include "windows.h" and your filesystem actually contains Windows.h, things don't align right. Likewise when the linker wants kernel32.lib and you have kernel32.Lib.
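To see which behaviour a given directory has, one can create a file under one spelling and look for it under another. The following is a small Python sketch of that probe; the function name is_case_insensitive is hypothetical, and the file names are just the ones used as an example above.

```python
import os
import tempfile


def is_case_insensitive(directory):
    """Report whether files under `directory` can be found under a differently-cased name."""
    # Work in a scratch subdirectory so nothing is left behind.
    with tempfile.TemporaryDirectory(dir=directory) as scratch:
        with open(os.path.join(scratch, "Windows.h"), "w"):
            pass
        # Look the file up with the spelling an #include "windows.h" would use.
        return os.path.exists(os.path.join(scratch, "windows.h"))


if __name__ == "__main__":
    print(is_case_insensitive(tempfile.gettempdir()))
```

Pointing it at the directory that will hold the SDK copies tells you whether lookups there ignore case.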
Ext4 recently gained some optional case-insensitivity, but it requires a very recent kernel, and doesn't work on existing filesystems. VFAT, however, as supported by Linux, has always(?) been case-insensitive. It is the simpler choice.
There's another option, though, in the form of FUSE filesystems that wrap an existing directory to expose it as case-insensitive. That's what I tried first, actually. CIOPFS does just that, with the caveat that you need to start from an empty directory, or an all-lowercase directory, because files with any uppercase characters in their name in the original directory don't appear in the mountpoint at all. Unfortunately, the last version, from almost 9 years ago, doesn't withstand parallelism: when several processes access files under the mountpoint, one or several of them get failures they wouldn't otherwise get if they were working alone. So during my first attempts cross-building Firefox I was actually using -j1. Needless to say, the build took a while, but it also made it more obvious when I hit something broken that needed fixing.
Now, on Mozilla CI, we can't really mount a VFAT filesystem or use FUSE filesystems that easily. Which brings us to the next option: LD_PRELOAD. LD_PRELOAD is an environment variable that can be set to tell the dynamic loader (ld.so) to load a specified library when loading programs. That in itself doesn't do much, but the symbols the library exposes will take precedence over similarly named symbols from other libraries, such as libc.so symbols. This allows diverting e.g. open, opendir, etc. See where this is going? The library can divert the functions programs use to access files and change the paths the programs are trying to use on the fly.
Such libraries do exist, but I had issues with the few I tried. The most promising one was libcasefold, but building its dependencies turned out to be more work than it should have been, and the hooking it does via libsyscall_intercept is more hardcore than what I'm talking about above, and I wasn't sure we wanted to support something that hotpatches libc.so machine code at runtime rather than divert it.
The result is that we now use our own, written in Rust (because who wants to write bullet-proof path munging code in C?). It can be used instead of a VFAT filesystem in the setup described above, but, being a hack, is not guaranteed to work in all setups.
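The path munging such a library has to do can be illustrated independently of the hooking itself. The sketch below is Python and is not the Rust library mentioned above, nor a description of how it works internally: it only shows one plausible way to map a requested path to whatever spelling exists on disk, and the function name resolve_case_insensitive is made up for the example. A real preloaded library would apply this kind of logic inside its replacements for open, opendir and friends.

```python
import os


def resolve_case_insensitive(root, relative_path):
    """Map `relative_path` to the spelling that exists under `root`, or return None.

    Each component is first tried verbatim, then matched against the
    directory listing while ignoring case.
    """
    current = root
    # Accept both separators, since Windows tools may use backslashes.
    for component in relative_path.replace("\\", "/").split("/"):
        if component in ("", "."):
            continue
        candidate = os.path.join(current, component)
        if not os.path.lexists(candidate):
            try:
                entries = os.listdir(current)
            except OSError:
                return None
            wanted = component.casefold()
            matches = [entry for entry in entries if entry.casefold() == wanted]
            if not matches:
                return None
            # If several names differ only by case, pick one deterministically.
            candidate = os.path.join(current, sorted(matches)[0])
        current = candidate
    return current
```

For instance, with a tree containing Include/um/Windows.h, asking for include/UM/windows.h yields the path with the on-disk spelling. Listing a directory for every mismatched component is slow, so a real implementation would likely want some caching; that is left out here.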
So what's up with needing clang-cl.exe?
One of the tools Firefox needs to build is the MIDL compiler. To do its work, the MIDL compiler uses a C preprocessor, and the Firefox build system makes it use clang-cl. Something amazing that I discovered while working on this is that Wine actually supports executing Linux programs from Windows programs. So it looked like it was going to be possible to use the Linux clang-cl for that. Unfortunately, that doesn't quite work the same way executing a Windows program does from the parent process's perspective, and the MIDL compiler ended up being unable to read the output from the preprocessor.
Technically speaking, we could have made the MIDL compiler use MSVC's cl.exe as a preprocessor, since it conveniently is in the same directory as ml64.exe, meaning it is already in $PATH. But that would have been a step backwards, since we specifically moved off cl.exe.
Alternatively, it is also theoretically possible to compile with --disable-accessibility to avoid requiring the MIDL compiler at all, but that currently doesn't work in practice. And while that would help for local builds, we still want to ship Firefox with accessibility on.
What about those compilation times, then?
Past my first attempts at -j1, I was able to get a Windows build on my Linux machine in slightly less than twice the time for a Linux build, which doesn't sound great. Several things factor into this:
- the build system isn't parallelizing many of the calls to the MIDL compiler, and in practice that means the build sits there doing only that and nothing else (there are some known inefficiencies in the phase where this runs).
- the build system isn't parallelizing the calls to the Effect compiler (FXC), and this has the same effect on build times as the MIDL compiler above.
- the above two wouldn't actually be that much of a problem if... Wine weren't slow. When running full-fledged applications or games, it really isn't, but there is a very noticeable overhead when running a lot of short-lived processes. That accumulates to several minutes over a full Firefox compilation.
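The per-process overhead mentioned in the last point is easy to get a feel for by timing how long it takes to start a trivial program many times. This is only a rough Python sketch; the helper name mean_spawn_seconds is made up, and the baseline it measures is a native process, not Wine.

```python
import subprocess
import sys
import time


def mean_spawn_seconds(command, runs=20):
    """Average wall-clock time to start `command` and wait for it to exit."""
    start = time.perf_counter()
    for _ in range(runs):
        subprocess.run(
            command,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    return (time.perf_counter() - start) / runs


if __name__ == "__main__":
    # A do-nothing native process, as a baseline.
    baseline = mean_spawn_seconds([sys.executable, "-c", "pass"])
    print(f"native baseline: {baseline * 1000:.1f} ms per process")
```

Passing a command that goes through wine64 instead (assuming it is installed, and picking any Windows program that exits immediately) gives a number to compare against the baseline. Multiplied by the number of short-lived processes a build spawns, a small per-process difference adds up to minutes.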
That third point may or may not be related to the version of Wine available in Debian stable (what I was compiling on), or how it's compiled, but some lucky accident made things much faster on my machine.
See, we actually already have some Windows cross-compilation of Firefox on Mozilla CI, using mingw. Those were put in place to avoid breaking Tor Browser, because that's how they build for Windows, and because not breaking the Tor Browser is important to us. And those builds are already using Wine for the Effect compiler (FXC).
But the Wine they use doesn't support WOW64. So one of the first things necessary to set up 64-bit Windows cross-builds with clang-cl on Mozilla CI was to get a WOW64-supporting Wine. Following the Wine build instructions was more or less straightforward, but I hit a snag: it wasn't possible to install the freetype development files for both the 32-bit version and the 64-bit version because the docker images where we build Wine are still based on Debian 9 for reasons, and the freetype development package was not multi-arch ready on Debian 9, while it now is on Debian 10.
Upgrading to Debian 10 is most certainly possible, but that has a ton more implications than what I was trying to achieve is supposed to. You might ask "why are you building Wine anyways, you could use the Debian package", to which I'd answer "it's a good question, and I actually don't know. I presume the version in Debian 9 was too old (it is older than the one we build)".
Anyways, in the moment, while I happened to be reading Wine's configure script to get things working, I noticed the option --without-x and thought "well, we're not running Wine for any GUI stuff, how about I try that, that certainly would make things easy". YOLO, right?
Not only did it work, but testing the resulting Wine on my machine, compilation times were now down to only 1 minute slower than a Linux build, rather than 4.5 minutes! That was surely good enough to go ahead and try to get something running on CI.
Tell us about those compilation times already!
I haven't given absolute values so far, mainly because my machine is not representative (I'll have a blog post about that soon enough; you may have heard about it on Twitter, IRC or Slack, but I won't give more details here), and because the end goal here is Mozilla automation, for both the actual release of Firefox (still a long way to go there), and the Try server. Those are what matters more to my fellow developers. Also, I actually haven't built under Windows on my machine for a fair comparison.
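So here it comes:
[Graph: build times over time for native and cross-compiled Windows builds on Mozilla CI, by AWS instance type]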
Let's unwrap a little:
- The yellowish and magenta points are native Windows "opt" builds, on two different kinds of AWS instances.
- The other points are Cross-Compilations with the same "opt" configuration on three different kinds of AWS instances, one of which is the same as one used for Windows, and another one having better I/O than all the others (the cyan circles).
- We use a tool to share a compilation cache between builds on automation (sccache), which explains the very noisy nature of the build times, because they depend on the amount of source code changes and of the cache misses they induce.
- The Cross-Compiled builds were turned on around the 27th of February and started about as fast as the native Windows builds were at the beginning of the graph, but the native builds had just seen a regression.
- The regression was due to a recent change that made the clang plugin change in every build, which led to large numbers of cache misses.
- After fixing the regression, the build times came back to their previous level on the native jobs.
- Sccache handled clang-cl arguments in a way that broke cross-compilation, so when we turned on the cross-compiled jobs on automation, they actually had the cache turned off!
- Let me state this explicitly because that wasn't expected at all: the cross-compiled jobs WITHOUT a cache were as fast as native jobs WITH a cache!
- A day later, after fixing sccache, we turned it on for the cross-compiled jobs, and build times dropped.
- The week-end passed, and with more realistic work loads where actual changes to compiled code happen and invalidate parts of the cache, build times get more noisy but stay well under what they are on native Windows.
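Why a clang plugin that changes in every build ruins the cache can be shown with a toy model of a compilation cache key. The sketch below is Python and is explicitly not sccache's actual key derivation: it is a simplified illustration in which the key is a hash over everything that can influence the output, and the function name cache_key and its parameters are made up.

```python
import hashlib


def cache_key(compiler_digest, plugin_digest, arguments, preprocessed_source):
    """Hash every input that can influence the object file into a single key."""
    hasher = hashlib.sha256()
    fields = (compiler_digest, plugin_digest, "\0".join(arguments), preprocessed_source)
    for field in fields:
        data = field.encode("utf-8")
        # Length-prefix each field so adjacent fields cannot blur into each other.
        hasher.update(len(data).to_bytes(8, "little"))
        hasher.update(data)
    return hasher.hexdigest()
```

In this model, two builds of the same source with the same arguments share a key only if the plugin digest is also identical. If the plugin binary differs from one build to the next, every key differs and every lookup is a miss, even though no source file changed.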
But the above only captures build times. On automation, a job actually does more than build. It also needs to get the source code, and install the tools needed to build. The latter is unfortunately not tracked at the moment, but the former is:
[Graph: time spent getting the source code, by AWS instance type]
Now, for some explanation of the above graph:
- The colors don't match the previous graph. Sorry about that.
- The colors vary by AWS instance type, and there is no actual distinction between Windows and Linux, so the instance type that is shared between them has values for both, which explains why it now looks bimodal.
- It can be seen that the ones with better I/O (in red) are largely faster to get the source code, but also that for the shared instance type, Linux is noticeably faster.
It would be fair to say that independently of Windows vs. Linux, way too much time is spent getting the source code, and there's other ongoing work to make things better.
TL;DR
Overall, the fast end of native Windows builds on Mozilla CI, including Try server, is currently around 45 minutes. That is the time taken by the entire job, and the minimum time between a developer pushing and Windows tests starting to run.
With Cross-Compilation, the fast end is, as of writing, 13 minutes, and can improve further.
As of writing, no actual Windows build job has switched over to Cross-compilation yet. Only an experimental, tier 2, job has been added. But the main jobs developers rely on on the Try server are going to switch real soon now™ (opt and debug for 32-bit, 64-bit and aarch64). Running all the test suites on Try against them yields successful results (modulo the usual known intermittent failures).
Actually shipping Cross-compiled builds will take longer. We first need to understand the extent of the differences with the native builds and be confident that no subtle breakage happens. Also, PGO and LTO haven't been tested so far. Everything will come in time.
What about Windows Subsystem for Linux (WSL)?
The idea of allowing developers on Windows to build Firefox from WSL has been floating around for a while. The work to stand up Cross-compiled builds on automation has brought us the closest ever to actually being able to do it! If you're interested in making it pass the finish line, please come talk to me in #build:mozilla.org on Matrix; there shouldn't be much work left and we can figure it out (essentially, all the places using Wine would need to do something else, and... that's it(?)). That should yield faster build times than natively with MozillaBuild.
2020-03-05 15:31:45+0900