Archive for October, 2011

Improving binary layout for progressive decompression

As I mentioned in my previous post, we recently built a prototype that would only uncompress the bits we do require from our libraries, as opposed to the complete library. Or at least try to.

The idea

The original idea was to do selective decompression of blocks as they are required. Such that instead of uncompressing the whole file, and then use it, we'd only uncompress blocks we do need, at the moment we need them. So, for example, instead of uncompressing 32KB, and then use the first and last 4KB (completely ignoring 24KB in the middle), we'd only uncompress and use the first and last 4KB.

Unfortunately, compressed streams in ZIP files are not seekable, so with the compressed data we have currently, we just can't do that. Which is why we're going to work on figuring out how to best do it with an index and how the zlib library will allow us to do so. Fortunately, one of the zlib author already has some working code doing something similar, though it's probably not usable as-is.

Anyways, as we needed to put things together for a prototype quite rapidly and had experience with binary reordering in the past, we figured a way to get the same kind of benefit would be to improve the binary layout so that we can uncompress subsets of the file progressively, as it is used. So for example, if the order in which we need blocks looks like [10, 11, 7, 13, 17], we'd first uncompress up to the 10th blocks, then the 11th, then use the 7th, that was uncompressed earlier, then uncompress up to the 13th, and so on.

In fact, the way the prototype worked was by having a background thread doing decompression while other thread(s) would do the actual work, and when it hits a block that hasn't been uncompressed yet, wait for the required data to be reached by the background thread. The implementation was however probably racy, which led to multiple random crashes. There may also be some issues with the signal handler being used. These are details that we are expecting to figure in time.

Relocated data

The first obstacle is that when a library is loaded, relocations are performed. Relocations are basically what allows pointers in a library to point to the right place in memory. They usually apply to data sections. They can apply to code sections when the code is not PIC (Position Independent Code), but this is fortunately not the case anymore on Firefox for Android. So they apply to data sections, and data sections are at the end of libraries. Which means, under our progressive decompression scheme as explained above, we'd still have to uncompress almost the entire file.

On the long term, we should be able to apply these relocations as we uncompress each blocks, but that requires our linker to handle elfhack relocations itself, and some more tweaks to access the appropriate relocations without having to scan the entire list. But even then, data is accessed during startup. Thus until we can seek in the compressed data stream, data needs to be placed before code.

As I was using the gold linker for various other reasons, and as gold doesn't have much flexibility for these things, I added a little something to elfhack to rewrite the ELF file such that data sections would be before code sections in the file, but still after when loaded in memory. Nice trick, but unfortunately, a lot of tools are not very happy with it (even less than what elfhack already does).

Reordering functions

Once data is dealt with by being placed before code, we need to make sure code required during startup is grouped together. The linker, however, is not able to reorder functions without some special treatment when compiling the source to object files: it requires the -ffunction-sections option to be given to GCC.

Normally, when compiling a source file to an object file, GCC creates a unique .text section containing all the code. If in a given source file, there are two functions, foo and bar, and foo calls bar, the function call from foo to bar will be hard-coded to use the distance from the call location to the start of bar.

When using -ffunction-sections, each function is given a separate .text.function_name section, and the compiler places relocations for calls between them. Simply put, instead of having recorded the distance from the call location to the start of bar, we'd have a relocation, applying on the call location, with a symbolic information telling the destination is the start of bar.

The linker will then apply the relocations, and the resulting code will be using the distance between the functions, as without -ffunction-sections. The difference comes from the fact that the linker now is able to move these function sections independently and adjust the various calls accordingly.

Unfortunately, there are several parts that -ffunction-sections doesn't affect, and thus can't be reordered without some effort:

Assembler source
(Static) system libraries (e.g. libgcc, libstlport)

For the former, the various inline assembly and separate assembler sources must be annotated with individual sections for each function defined. For the latter, the static libraries can be rebuilt with -ffunction-sections.

Another problem that was spotted was that the javascript engine wasn't compiled with -ffunction-sections.

Thumb-2/ARM trampolines

ARM processors we target with Firefox for Android have two instruction sets: ARM and Thumb-2. Switching from one of them to the other is called interwork. While we build Firefox for Android with the Thumb-2 instruction set, system libraries may use the ARM instruction set. As such, they must be able to call system library functions with interworking.

Technically, interworking can be triggered from Thumb-2 or ARM, to call either one. However, the Procedure Linkage Table, containing code that either calls the linker to resolve symbols, or the system library functions, contains ARM code. Theoretically, in a program or library that is entirely Thumb-2, it should be possible to use Thumb-2 code, but neither GNU ld nor gold do that. They however both (currently) have a different behavior.

GNU ld generates PLT entries that start as Thumb-2, switch to ARM and then call whatever they need to call.

Gold generates PLT entries that are fully ARM, but adds trampolines into which Thumb-2 code jumps to interwork and jump back to the ARM PLT.

These trampolines are unfortunately "randomly" placed in the resulting binary. They're not actually randomly positioned, but the result is that these trampolines end up in places that are definitely not near the beginning of the file. In my Firefox for Android builds, they were usually grouped in various places, some of which were at the end of the code section, which went completely against our goal. For the purpose of making the prototype quickly available, we had to compile Firefox for Android as ARM, which effectively makes the binary larger (ARM instructions are larger than Thumb-2 instructions), thus longer to uncompress.

Identical Code Folding

Identical Code Folding is one of the features we were using gold for. It allows to replace various identical implementations of functions with a single occurrence. This is particularly interesting on C++ code, where templates are heavily used. For example, a templated function used for bools and for ints will likely result in the same machine code.

In some other cases, simple enough functions can end up completely identical at the machine code level despite not looking as such in C++.

The problem is that gold handles function (sections) reordering after ICF. Once the ICF pass is done, only one of the original functions is known of gold for function reordering (even if in the end it does expose all the symbols appropriately). And it may well be the variant that does not appear in the function order list.

The current workaround is to link libxul.so once, with an option printing the functions being folded together, and adding all of them in the order list, at the position where it makes most sense. For instance, if foo and bar are folded and bar appears in the order list, foo and bar would be put in place of bar in the list.

Section names

As it may have transpired from above explanations, gold doesn't actually do function reordering. It does section reordering. In the simple cases, sections for functions are just named .text.function_name. However, there are also other variants, such as .text.startup.function_name, .text.hot.function_name or .text.unlikely.function_name. Even more subtle, there can also be .text.function_name.part.number sections, but for these, the .part.number is actually part of the function name as seen in the symbols table.

Tools that do the kind of profiling required to do function ordering like we wanted to do will give function names, not section names, because the section names are only present in the object files, not in the final libraries and programs. Thus after getting a list of function names as they are called during startup, cross-referencing with the object files to get the corresponding section names is required.

Valgrind

We used Valgrind in order to get the order in which functions were called during startup. While our icegrind plugin would have allowed it, various changes in Valgrind in the past year apparently broke icegrind. But ARM support for Valgrind is work in progress and is based on current trunk, so we needed a recent build, which excludes using icegrind, except if we update it.

But it also turns out Valgrind has a built-in feature that more or less allows to get the order in which functions are called. Valgrind can show a trace of each basic blocks the first time it sees them executed (--trace-flags=10000000, and to be on the safe side, --vex-guest-chase-thresh=0). Basic blocks are subsets of functions, so Valgrind likely will report the same function several times, but it's pretty easy to filter out those duplicates.

Unfortunately, we encountered a couple problems with this built-in tracing, not reporting names for some functions when the binary is compiled as Thumb-2, and not reporting at all some functions that we know are being called when the binary is compiled as ARM (it may also be the case for Thumb-2, haven't verified). In the end, the order list was updated manually for the prototype.

Hardware-specific code paths

There are various places in the Mozilla code base where different implementations of the same functions are available for different kind of hardware, and the most suitable one is chosen at runtime, depending on the hardware detected. On x86, we have some implementations using SSE2 instructions, and fall-back implementations in plain x86. On ARM, there are implementations using NEON instructions, and fall-back implementations in plain ARM or Thumb-2. Sometimes, there aren't even fall-backs.

This means that which function is going to be used at runtime will very much depend on the hardware it will be executed on. So if you profile Firefox for Android on a tablet with a Tegra-2 CPU (which doesn't have NEON), you won't be exercising the same code paths as when running the same Firefox for Android build on a Nexus S (which CPU does have NEON).

If we get the order of functions called at startup on the Tegra-2 tablet, we won't be getting any of the NEON-specific functions in the functions list, and the resulting reordered binary may have the NEON-specific functions at an inconvenient position for progressive decompression. So the runtime benefits we'd see on the Tegra-2 tablet would go away on the Nexus S. And vice-versa.

Callgraph-based reordering

The GCC google branch has patches adding a new interesting feature: callgraph-based reordering. It makes the compiler add call graph information extracted from PGO/FDO profile data to object files, and the linker use that information to reorder the binary. Theoretically, this would allow us to reorder our binaries without relying on valgrind or something similar, and even take advantage of the fact that we already do PGO/FDO (except we currently don't, for Android builds).

The main drawback, however, is exactly the same as for the -ffunction-sections option: assembler sources and static (system) libraries are not going to be covered. And unlike with -ffunction-sections, it's hard to work around.

It also most likely suffers the same problem as section ordering w.r.t identical code folding, and also has the same problem with ARM/Thumb-2 trampolines and hardware-specific code paths.

Conclusion

It's still a long way before having efficiently reordered binaries for Android builds. It involves toolchain fixes and tweaks, and build engineering tricks to allow running the build on some device and during the cross-compiled build. We should however be able to benefit from incremental decompression with seekable compressed streams well before that. Stay tuned.

2011-10-31 18:53:23+0900

p.m.o | 7 Comments »

Making Firefox start faster on Android

On my Nexus S, the time line when starting up current nightly looks like this:

0ms		Icon tapped
550ms	(+550ms)	The application entry point in Java code is reached
1350ms	(+800ms)	The application entry point in native code is reached
1700ms	(+350ms)	Start creating the top level window (i.e. no XUL has been loaded yet)
2500ms	(+800ms)	Something is painted on screen

That's a quite good scenario, currently. On the very first run after the installation, even more time is spent extracting a bunch of files and creating the profile. And when opening an url, it also is slower because of the time that's needed to initialize the content process: tapping the icon will bring about:home, which doesn't start a separate content process.
The Nexus S also isn't exactly a slow device, even if there are much better devices nowadays. This means there are devices out there where startup is even slower than that.

Considering Android devices have a much more limited amount of memory than most Desktop computers, and that as a result switching between applications is very likely to get some background applications stopped, this makes startup time a particularly important issue on Android.

The Mobile team is working on making most of the last 800ms go away. The Performance team is working on the other bits.

Reaching Java code

550ms to reach our code sounds like both outrageous, and impossible to solve: we don't control neither how nor when our code is called by Android after the user has tapped on our icon. In fact, we actually have some control.

Not all Firefox builds are made equal, and it turns out some builds don't have the outrageous 550ms overhead. They instead have a 300ms overhead. The former builds contain multiple languages. The latter only contain english.

Android applications come in the form of APK files, which really are ZIP files. It turns out Android is taking a whole lot of time to read the entire list of files in the APK and/or handling the corresponding manifest file at startup, and the more files there are, the more time is spent. A multi-languages build contains 3651 files. A single language build contains "only" 1428 files.

We expect getting down to about 100ms by packing chrome, components and some other resources together in a single file, effectively storing a ZIP archive (omni.jar) inside the APK. We (obviously) won't compress at both levels.

Reaching native code

Native code, like on other platforms, comes in the form of libraries. These libraries are stored in the APK, and uncompressed in memory when Firefox starts. Not so long ago, when space permitted, Firefox would actually store the uncompressed libraries on the internal flash, but as this wasn't a clear win in actual use cases (on top of making first startup abysmally slow), it was disabled. We however still have the possibility to enable it for use with various tools such as valgrind or oprofile, which don't support the way we load libraries directly off the APK.

These 800ms between reaching Java code and reaching native code are spent uncompressing the libraries, applying relocations and running static initialization code. Most of these 800ms are actually spent on uncompressing the main library file (libxul.so) alone.

But we don't actually need all that code from the very start. We don't even need all that code to display the user interface and the web page we first load. The analysis I conducted last year on desktop Firefox very much applies on mobile too.

By only loading the pieces that are needed, we expect to cut the loading time at least in half, and even more on multi-core devices. Last week, we built a prototype with promising results, but the experience also uncovered difficulties. More details will follow in a subsequent post.

Initializing Gecko

Another significant part of the startup phase is initializing Gecko. This includes, but is not limited to:

Initializing the Javascript engine
Registering and initializing components
Initializing the various subsystems using sqlite databases

The switch to a native Android user interface is going to significantly alter how filesystem accesses (most notably sqlite) are going to happen. While we have tested asynchronous I/O for sqlite (and having it break Places), we can't currently go much further until native UI reaches a certain level of maturity.

We however need to identify what particular pieces of initialization are contributing the most to the 350ms between reaching native code and starting to create the top level window.

2011-10-24 15:59:09+0900

p.m.o | 5 Comments »

Firefox Mobile debugging on Android got a bit easier

Two changes landed recently that are going to make Firefox Mobile debugging on Android a bit easier:

Debug info files are now available for nightly builds since friday, aurora builds since saturday, and beta builds starting from 8.0b3. Until now, debugging required to use custom builds, as the debug info wasn't available anywhere.
Our custom Android linker now makes GDB happy and properly reports our libraries to GDB, without requiring to extract them. This means you don't need to run with the debug intent to be able to attach and use GDB. This is only on mozilla-central for now. Update: unfortunately, it also only works on "debuggable" builds, so only custom builds :(

For convenience, I modified the fetch-symbols script to allow to download the debug info files for Android builds (Note this version of the script also solves the problem of getting debug info files for x86 builds on a x86-64 host). Instead of giving it the Firefox directory like for desktop builds, give it the APK file:

$ python fetch-symbols.py fennec-10.0a1.en-US.android-arm.apk http://symbols.mozilla.org/ /path/for/symbols

2011-10-10 10:51:12+0900

p.m.o | 3 Comments »