Archive for the 'p.m.o' Category

faulty.lib vs. Firefox for Android

As mentioned in the previous post, Firefox for Android now has a fresh new linker that allows on-demand decompression. All the necessary code is currently in mozilla-central, but on-demand decompression is disabled by default. For good reason, see below.

Building with on-demand decompression only requires adding the following to your .mozconfig:

export MOZ_ENABLE_SZIP=1

This will enable on-demand decompression for libxul.so only.

With this, you can build Firefox for Android normally and ... not feel much of an improvement. The main reason is that the compressed libxul.so library is bigger (because of the chunk compression, see the previous post), and is read almost entirely. That, in turn, is due to how the code and data are laid out in the library.

Since this on-demand decompression of libraries relies on chunks, each time you need even a few bytes at some not-yet-decompressed location, you end up decompressing a complete chunk. And since the static linker (the one used when building libraries) is not smart enough to lay everything out in an order optimal for startup, we end up reading a lot of things at random locations.

There are essentially two phases of Firefox startup related to libraries that are important to us: loading/linking the library, and using it. The sooner the former finishes, the sooner the latter can start.

With a freshly built mozilla-central, this is what the map of decompressed chunks looks like after loading and linking the library:

Each red square is a decompressed chunk. In the above map, 348 chunks are decompressed, out of 998. The main reason for this happening is static initializers.

Then by the time Gecko has been initialized and a web page loaded and started painting, here is what the map looks like:

That is a lot of red: 715 decompressed chunks, out of 998.

Educating the static linker

As explained above, the static linker does not lay out code and data in an on-demand decompression friendly way. We've had various experiences with binary reordering in the past. This unfortunately currently involves some manual work, but we hope our release engineering team will be able to automate the process, and I sure hope we'll be able to make it more reliable and faster. I will detail the current state of the art further below.

If you're not interested in technical details, you can skip to the results.

The mozilla build system parts are almost there. One part of binary reordering involves building with the right flags. It so happens that we're currently not building the javascript engine with the necessary -ffunction-sections -fdata-sections flags. I hope to be able to land that patch later this week.

Another part involves telling the linker the order in which we want it to put symbols. Unfortunately, neither GNU ld nor gold allows ordering by symbol name. Gold can order by section with the --section-ordering-file argument (hence the -ffunction-sections -fdata-sections flags), and GNU ld... not quite. However, GNU ld has a very powerful scripting facility, and with a little bit of effort, one can reorder sections with it. So the remaining grunt work is to map symbols to sections.

It would be nice if that was enough, but it turns out it is not. I haven't tried with gold, but GNU ld, or at least the version shipped in the Android NDK (which, by the way, doesn't come with gold, which is the main reason I made the effort to support GNU ld linker scripts), fails to link when non-virtual thunks are too far away from the corresponding function, even when the distance is not large enough to warrant such a failure (I think that's a known bug of older GNU ld versions). So I went around grouping such related functions together, which is a good idea anyway.

For those interested, the trick for GNU ld to reorder functions is to add -Wl,-T,filename on the link command, with the file filename containing the following:

SECTIONS {
  .text : {
    *(.text.symbol1)
    *(.text.symbol2)
(...)
  }
}
INSERT BEFORE .fini

Not all symbols need to be present in that file. The same can be done with other sections such as .data, provided INSERT BEFORE is adjusted, and several sections can be put in the same file.

The patch integrating all this should hopefully land soon (please note that it requires a backout of changeset 283408b8d8a3).

Profiling environment

Reordering symbols when linking requires that we know in which order symbols are accessed during startup. As far as I know, there is no existing tool to do that besides valgrind. Julian Seward has been working on getting valgrind to work on Android devices, and it currently works well on the Pandaboard, Nexus S and Xoom.

While a full-featured valgrind requires debug info for some system libraries and a lot of memory, for the purpose of binary reordering we actually won't be using any valgrind tool; only the valgrind core. As such, there are far fewer constraints on the Android builds required. The only constraint is to have an Android build that has the zygote hooks to wrap applications when starting them. Recent AOSP and Linaro Android builds (those based on Ice Cream Sandwich) have them. Memory might be a problem on the Nexus S, but not on the Pandaboard or the Xoom.

Pick your weapon of choice. Mine was a Pandaboard running Linaro Android 12.02, which works much better than the Gingerbread-based version I was using before. I had a few glitches, though. Since it didn't want to use my mouse, I couldn't click through to enable ethernet networking, so I had to enable it manually through adb shell with the following commands:

# netcfg eth0 dhcp
# setprop net.dns1 x.x.x.x

Where x.x.x.x, obviously, is the DNS server IP address.

Another glitch I ran into is that after some time not using it, it would go to sleep, and fail to resume from it. This is a known issue and has a workaround.

Building Valgrind

All the tweaks we've needed to get useful data from valgrind have landed on valgrind trunk. So all that is needed to build valgrind is to check out svn://svn.valgrind.org/valgrind/trunk and follow the instructions in README.android. Just make sure you have at least revision 12408.

Setup to profile Firefox

After building Firefox with on-demand decompression enabled, we need to create an APK for use with valgrind. This requires that elfhack is disabled, because it confuses valgrind. Disabling it can be done at packaging time; just make sure you didn't already create a package, because packaging leaves the libraries in an elfhacked state.

$ make -f client.mk package USE_ELF_HACK=

Once you have an APK, you can install it on the Android device, but make sure you have no leftovers first:

$ adb uninstall org.mozilla.fennec_$USER
$ make -f client.mk install

At this point, we're only doing on-demand decompression for libxul.so; there isn't much benefit from on-demand decompression for the other, smaller libraries. Valgrind is going to require debug information for that library, so we need to push an unstripped libxul.so onto the device:

$ adb shell mkdir -p /sdcard/symbols/data/data/org.mozilla.fennec_$USER/cache
$ adb push $objdir/dist/lib/libxul.so /sdcard/symbols/data/data/org.mozilla.fennec_$USER/cache/

Finally, we'll need to setup the hook:

$ cat << EOF > /tmp/start_valgrind_fennec
#!/system/bin/sh
export TMPDIR=/data/data/org.mozilla.fennec_$USER
exec /data/local/Inst/bin/valgrind --log-file=/sdcard/valgrind.log --tool=none --vex-guest-chase-thresh=0 --trace-flags=10000000 --demangle=no \$*
EOF
$ adb push /tmp/start_valgrind_fennec /data/local/
$ adb shell chmod 755 /data/local/start_valgrind_fennec
$ adb shell setprop wrap.org.mozilla.fennec_$USER /data/local/start_valgrind_fennec

Profiling Firefox

As mentioned earlier and in the previous post, any time we go through an unusual code path, we're at risk of jumping randomly in the code. The more we do so, the more chunks we're going to have to decompress. We want to avoid that as much as we can during startup. I got the best results from running three different profiles and aggregating them:

  • Starting Firefox for the very first time (no existing profile on internal flash, etc.)
  • Starting Firefox again
  • And finally starting it yet again with a URL to open

Each time, we need to wait for Firefox to settle, then close it, adb pull /sdcard/valgrind.log, and keep each log separate.

But since we built with on-demand decompression enabled, valgrind is not going to be happy, so we also need to tell the dynamic linker to extract the libraries to internal flash instead.

Starting Firefox can thus be done with the following command:

$ adb shell am start -n org.mozilla.fennec_$USER/.App --es env0 MOZ_LINKER_EXTRACT=1

And starting it with a URL:

$ adb shell am start -n org.mozilla.fennec_$USER/.App -d http://www.mozilla.org/ --es env0 MOZ_LINKER_EXTRACT=1

As far as I know, we don't have a nice command line to close Firefox. What I was able to do on the Pandaboard was to open the application menu with the Menu key on a USB keyboard (the mouse didn't work, but the keyboard did) and select Quit from there. Another approach I tested was to add a Quit intent to our Java code base, but it didn't seem to work properly, even though it did bring the process down and left zygote happy.

Relinking

At this point, we have three valgrind logs. We need to transform them into an aggregated symbols list:

$ awk -F'[ +]' '$10 ~ /libxul.so/ { print $9 }' log1 log2 log3 > /tmp/list

The script dealing with symbol ordering in the mozilla build system will take care of deduplicating entries. Note that merging the symbol lists more cleverly may give better results.

Now we can use that list of symbols to relink libxul.so:

$ rm $objdir/toolkit/library/libxul.so
$ make -C $objdir/toolkit/library SYMBOL_ORDER=/tmp/list

And finalize by packaging it, with elfhack not disabled, then installing:

$ make -f client.mk package install

Results

After loading and linking the library: (146 decompressed chunks out of 998)

A large part of the remaining decompressed chunks above is due to relocations. It is on our roadmap to do them on demand as well. When that comes, we'll probably need to reorder data as we currently reorder functions.

Some of the randomly placed red squares are, I think, static initializers from statically linked libraries such as stlport, which aren't built with -ffunction-sections.

After starting to display a web page: (356 decompressed chunks out of 998)

The result is pretty good: we've cut the number of decompressed chunks in half. There is still a factor of randomness, which has to be investigated, whether this comes from valgrind missing things, the linker not being able to move some symbols, the linker generating things at inconvenient locations, or something else.

Less decompression means less time spent decompressing, and thus faster startup. Unfortunately, many other things happen on the Java side during startup that make the win from doing less decompression smaller than it should be. We're going to work on reducing or delaying these things, whatever they are. Especially when starting with a URL.

2012-03-02 11:38:08+0900

faulty.lib, p.m.o | 1 Comment »

Introducing faulty.lib

TL;DR link: faulty.lib is on github.

Android applications are essentially zip archives, with the APK extension instead of ZIP. While most of Android is Java, and Java classes are usually loaded from a ZIP archive (usually with the JAR extension), Android applications using native code need to have their native libraries on the file system. These native libraries are found under /data/data/$appid/lib, where $appid is the package name, as defined in the AndroidManifest.xml file.

So, when Android installs an application, it puts that APK file under /data/app. Then, if the APK contains native libraries under a lib/$ABI subdirectory (where $ABI is armeabi, armeabi-v7a or x86), it also decompresses the files and places them under /data/data/$appid/lib. This means native libraries are actually stored twice on internal flash: once compressed and once decompressed.

This is why Chrome for Android takes almost 50MB of internal flash space after installation.

Firefox for Android used to have that problem, and we decided we should stop doing that. Michael Wu thus implemented a custom dynamic linker, which loads most of the Firefox libraries directly off the APK. This involves decompressing the zipped data somewhere in memory, and doing ld.so's job to make the library usable (note that on Android, ld.so is actually named linker). There were initially circumstances under which we would decompress into a file and reuse it the next time Firefox starts, but we subsequently removed that possibility (except for debugging purposes) because it ended up being slower than decompressing each time (thanks to internal flash being so slow).

Anyways, in order to do ld.so's job, our custom linker was directly derived from Android's system linker, with many tweaks. This custom linker has done its job quite well for some time now, but it has recently been replaced; see further below.

Considering Firefox can't do anything useful involving Gecko until its libraries are loaded, in practice, this means Firefox can't display a web page faster than completely decompressing the libraries. Or can it?

Just don’t sit down 'cause I’ve moved your chair

We know that a lot of code and data is not used during Firefox startup. Based on that knowledge, we started working on only loading the necessary bits. The core of the idea is, when a library is requested to be loaded, to reserve anonymous memory for its decompressed size, and... that's all. That memory is protected such that any access to it triggers a segmentation fault. When a segmentation fault happens, the required bits are decompressed, and execution is resumed where it was before the segmentation fault.

The original prototype was decompressing from a normal zip deflated stream, which means it was impossible to seek in it. So, if an access was made at the end of the library, it was necessary to decompress the whole library. With some nasty binary reordering, and some difficulty, it was possible to avoid accessing the end of the library, but the approach is very fragile. It only takes an unexpected code path to make things much slower than they should be.

Consequently, for the past months, I've been working on improving the original idea and, with some assistance from Julian Seward, implemented the scheme with seekable compressed streams. Instead of letting the zip archive creation tool deflate libraries, we store specially crafted files. Essentially, files are cut into small chunks, and each chunk is compressed individually. This means less efficient compression, but it also means random access to chunks is possible.

However, instead of stacking on top of our existing custom linker, I started over from the ground up. First, because it needed a serious clean-up (a good part of linker.c is leftovers from the Android linker that we don't use, and APKOpen.cpp is a messy mix of JNI stubs, library decompression handling (which in itself was also a mess) and Gecko initialization code), and most importantly, because it relied on some Android system linker internals and thus required binary compatibility with the system linker. Which, according to Google engineers who contacted us a few months ago, was going to break in what we now know will be called Android Jelly Bean.

The benefit of the clean slate approach is that the new code is not tied to Gecko at all and was designed to work on Android as well as on desktop Linux systems (which made debugging much much easier). We're thus releasing the code as a separate project: faulty.lib. It is licensed under the Mozilla Public License version 2.0. Please feel free to try, contribute, and/or fork it.

This dynamic linker is not meant to completely follow standard ELF rules (most notably for symbol resolution), and as a result makes some assumptions. It's also still a work in progress, with some obvious optimizations pending (like avoiding resolving the same symbols again and again during relocations) and some features missing (for example, symbol versioning).

The next blog post will give some information about how to build Firefox for Android to benefit from on-demand decompression. I will also detail a few of the tricks involved in this dynamic linker in subsequent blog posts.

2012-03-01 17:10:11+0900

faulty.lib, p.d.o, p.m.o | 5 Comments »

Boot to Gecko: more than open web, open source

By now, most tech sites have talked about Boot to Gecko, following the MWC announcement and demos, and how it's based on standards from the open web to build the entire user interface.

One thing I would like to stress is that more than being tied to the open web, Boot to Gecko is also open source at its core. If you want to build Boot to Gecko today and use it on your phone, you can. Sure, you may have some difficulties if your phone is not a Galaxy S2, but I'm pretty sure the community at large (XDA and others) will soon have a solution for a lot of other phones; or, if you are skilled enough, you can do it yourself. And that is the great strength of Boot to Gecko's open source nature.

And that strength is also a major difference with Android. Android is open source, but it also is developed behind closed doors. Boot to Gecko code has been available on github for months. Android 5.0 Jelly Bean, which is expected for the end of the year, has been in development at Google for months too. Want to test it? Want to see the source code? Want to participate? Well, you'll have to wait for it to be shipped on an actual phone, since Google doesn't release the source code until then.

2012-02-29 11:11:33+0900

p.m.o | 2 Comments »

Fun with weak symbols

Consider the following foo.c source file:

extern int bar() __attribute__((weak));
int foo() {
  return bar();
}

And the following bar.c source file:

int bar() {
  return 42;
}

Compile both sources:

$ gcc -o foo.o -c foo.c -fPIC
$ gcc -o bar.o -c bar.c -fPIC

In the resulting object for foo.c, we have an undefined symbol reference to bar. That symbol is marked weak.

In the resulting object for bar.c, the bar symbol is defined and not weak.

What we expect from linking both objects is that the weak reference is fulfilled by the existence of the bar symbol in the second object, and that in the resulting binary, the foo function calls bar.

$ ld -shared -o test1.so foo.o bar.o

And indeed, this is what happens.

$ objdump -T test1.so | grep "\(foo\|bar\)"
0000000000000260 g    DF .text  0000000000000007 foo
0000000000000270 g    DF .text  0000000000000006 bar

What do you think happens if the bar.o object file is embedded in a static library?

$ ar cr libbar.a bar.o
$ ld -shared -o test2.so foo.o libbar.a
$ objdump -T test2.so | grep "\(foo\|bar\)"
0000000000000260 g    DF .text  0000000000000007 foo
0000000000000000  w   D  *UND*  0000000000000000 bar

The bar function is now undefined and there is a weak reference for the symbol. Calling foo will crash at runtime.

This is apparently a feature of the linker. If anyone knows why, I would be interested to hear about it. Good to know, though.

2012-02-23 10:46:50+0900

p.d.o, p.m.o | 10 Comments »

How to waste a lot of space without knowing

const char *foo = "foo";

This was recently mentioned on bugzilla, and the problem is usually underestimated, so I thought I would give some details about what is wrong with the code above.

The first common mistake here is to believe foo is a constant. It is a pointer to a constant. In practical ELF terms, this means the pointer lives in the .data section, and the string constant in .rodata. The following code defines a constant pointer to a constant:

const char * const foo = "foo";

The above code will put both the pointer and the string constant in .rodata. But keeping a constant pointer to a constant string is pointless. In the above examples, the string itself is 4 bytes (3 characters and a zero termination). On 32-bit architectures, a pointer is 4 bytes, so storing the pointer and the string takes 8 bytes: a 100% overhead. On 64-bit architectures, a pointer is 8 bytes, putting the total weight at 12 bytes: a 200% overhead.

The overhead is always the same size, though, so the longer the string, the smaller the overhead, relatively to the string size.

But there is another, not well known, hidden overhead: relocations. When loading a library in memory, its base address varies depending on how many other libraries were loaded beforehand, or depending on the use of address space layout randomization (ASLR). This also applies to programs built as position independent executables (PIE). For pointers embedded in the library or program image to point to the appropriate place, they need to be adjusted to the base address where the program or library is loaded. This process is called relocation.

The relocation process requires information which is stored in .rel.* or .rela.* ELF sections. Each pointer needs one relocation. The relocation overhead varies depending on the relocation type and the architecture. REL relocations use 2 words, and RELA relocations use 3 words, where a word is 4 bytes on 32-bit architectures and 8 bytes on 64-bit architectures.

On x86 and ARM, to mention the most popular 32-bit architectures nowadays, REL relocations are used, which makes a relocation weigh 8 bytes. This puts the pointer overhead for our example string at 12 bytes, or 300% of the string size.

On x86-64, RELA relocations are used, making a relocation weigh 24 bytes! This puts the pointer overhead for our example string at 32 bytes, or 800% of the string size!

Another hidden cost of using a pointer to a constant is that every time it is used in the code, there will be a pointer dereference. A function as simple as

const char *bar() { return foo; }

weighs one instruction more when foo is defined as const char *. On x86, that instruction weighs 2 bytes. On x86-64, 3 bytes. On ARM, 4 bytes (or 2 in Thumb). That weight can vary depending on the additional instructions required, but you get the idea: using a pointer to a constant also adds overhead to the code, both in time and space. Also, if the string is defined as a constant instead of being used as a literal in the code, chances are it's used several times, multiplying the number of such instructions. Update: Note that in the case of const char * const, the compiler will optimize these instructions away and avoid reading the pointer, since it's never going to change.

The symbol for foo is also exported, making it available to other libraries or programs, which might not be required, but also adds its own overhead: an entry in the symbol table (5 words), an entry in the string table for the symbol name (strlen("foo") + 1 bytes), an entry in the symbol hash chain table (4 bytes if only one type of hash table (sysv or GNU) is present, 8 if both are), and possibly an entry in the symbol hash bucket table, depending on the other exported symbols (4 or 8 bytes, like the chain table). It can also affect the size of the bloom filter table in the GNU symbol hash table.

So here we are, with a seemingly tiny 3 character string possibly taking 64 bytes or more! Now imagine what happens when you have an array of such tiny strings. This also doesn't only apply to strings, it applies to any kind of global pointer to constants.

In conclusion, using a definition like

const char *foo = "foo";

is almost never what you want. Instead, you want to use one of the following forms:

  • For a string meant to be exported:

    const char foo[] = "foo";

  • For a string meant to be used in the same source file:

    static const char foo[] = "foo";

  • For a string meant to be used across several source files for the same library:

    __attribute__((visibility("hidden"))) const char foo[] = "foo";

2012-02-18 09:17:21+0900

p.d.o, p.m.o | 15 Comments »

Some more about the progressive decompression experiment

There are a few things I forgot to mention in my previous post about the progressive decompression experiment, so here they are.

Implicit section grouping

As I mentioned, GCC may now generate function sections with names starting with .text.hot., .text.unlikely., etc. It also expects the linker to group each category together. GNU ld does, but gold doesn't yet.

While grouping these can be interesting for runtime performance, it can also go against startup performance, or only go half-way.

The .text.unlikely. sections being grouped together means that code that is not unlikely will also be grouped together, which helps reduce cache pressure at runtime. But something unlikely to be called is not something that is never called, and a lot of unlikely functions are called during startup. Fortunately, GNU ld places them first.

The .text.startup. sections being grouped together means that static initializers are all stacked together. This is good, because it helps a lot with the awful I/O patterns that can be observed during startup without this grouping. But it only goes half-way, because functions called by these static initializers (think class constructors, for example) are not marked .text.startup. and can still be placed anywhere.

Profiling the right thing

Like with hardware-specific code paths, there are various code paths that can be triggered at startup during the profiling phase that may or may not be triggered in real-world scenarios. Having or not having a pre-existing profile, having or not having extensions installed, opening the default home page or opening a URL, etc. all have an impact on which code paths are exercised during startup.

Anything that happens in actual use cases that weren't part of the profiling scenario may call code that was not covered by any reordering, and go against the startup time improvements progressive decompression can get us. This is why, more than any kind of binary layout tweaks, being able to uncompress any random part of the binary at any time is critical to bring startup improvements to everyone, in every case.

2011-11-02 08:46:10+0900

p.m.o | No Comments »

Improving binary layout for progressive decompression

As I mentioned in my previous post, we recently built a prototype that would only uncompress the bits we do require from our libraries, as opposed to the complete library. Or at least try to.

The idea

The original idea was to do selective decompression of blocks as they are required: instead of uncompressing the whole file and then using it, we'd only uncompress the blocks we need, at the moment we need them. So, for example, instead of uncompressing 32KB and then using the first and last 4KB (completely ignoring the 24KB in the middle), we'd only uncompress and use the first and last 4KB.

Unfortunately, compressed streams in ZIP files are not seekable, so with the compressed data we have currently, we just can't do that. Which is why we're going to work on figuring out how to best do it with an index, and how the zlib library will allow us to do so. Fortunately, one of the zlib authors already has some working code doing something similar, though it's probably not usable as-is.

Anyways, as we needed to put things together for a prototype quite rapidly, and had experience with binary reordering in the past, we figured a way to get the same kind of benefit would be to improve the binary layout so that we can uncompress subsets of the file progressively, as it is used. So, for example, if the order in which we need blocks looks like [10, 11, 7, 13, 17], we'd first uncompress up to the 10th block, then the 11th, then use the 7th, which was uncompressed earlier, then uncompress up to the 13th, and so on.

In fact, the way the prototype worked was by having a background thread doing decompression while other thread(s) did the actual work; when one of them hit a block that hadn't been uncompressed yet, it waited for the background thread to reach the required data. The implementation was, however, probably racy, which led to multiple random crashes. There may also be some issues with the signal handler being used. These are details that we expect to figure out in time.

Relocated data

The first obstacle is that when a library is loaded, relocations are performed. Relocations are basically what allows pointers in a library to point to the right place in memory. They usually apply to data sections. They can apply to code sections when the code is not PIC (Position Independent Code), but this is fortunately not the case anymore on Firefox for Android. So they apply to data sections, and data sections are at the end of libraries. Which means, under our progressive decompression scheme as explained above, we'd still have to uncompress almost the entire file.

In the long term, we should be able to apply these relocations as we uncompress each block, but that requires our linker to handle elfhack relocations itself, and some more tweaks to access the appropriate relocations without having to scan the entire list. But even then, data is accessed during startup. Thus, until we can seek in the compressed data stream, data needs to be placed before code.

As I was using the gold linker for various other reasons, and as gold doesn't have much flexibility for these things, I added a little something to elfhack to rewrite the ELF file such that data sections come before code sections in the file, but still after them when loaded in memory. Nice trick, but unfortunately, a lot of tools are not very happy with it (even less than with what elfhack already does).

Reordering functions

Once data is dealt with by being placed before code, we need to make sure code required during startup is grouped together. The linker, however, is not able to reorder functions without some special treatment when compiling the source to object files: it requires the -ffunction-sections option to be given to GCC.

Normally, when compiling a source file to an object file, GCC creates a unique .text section containing all the code. If in a given source file, there are two functions, foo and bar, and foo calls bar, the function call from foo to bar will be hard-coded to use the distance from the call location to the start of bar.

When using -ffunction-sections, each function is given a separate .text.function_name section, and the compiler places relocations for calls between them. Simply put, instead of having recorded the distance from the call location to the start of bar, we'd have a relocation, applying on the call location, with a symbolic information telling the destination is the start of bar.

The linker will then apply the relocations, and the resulting code will be using the distance between the functions, as without -ffunction-sections. The difference comes from the fact that the linker now is able to move these function sections independently and adjust the various calls accordingly.
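To make that concrete, here is a minimal (hypothetical) translation unit; built with gcc -c -ffunction-sections, foo and bar each get their own .text.foo / .text.bar section, and the call from foo to bar becomes a relocation the linker resolves once it has decided where each section goes:

```c
/* foobar.c: compile with `gcc -c -ffunction-sections foobar.c`.
 * Without the flag, both functions share one .text section and the
 * call below is a fixed relative offset; with it, each function gets
 * its own section and the call is left as a relocation. */
int bar(void) {
    return 42;
}

int foo(void) {
    return bar();  /* distance to bar is filled in at link time */
}
```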

Unfortunately, there are several parts that -ffunction-sections doesn't affect, and thus can't be reordered without some effort:

  • Assembler source
  • (Static) system libraries (e.g. libgcc, libstlport)

For the former, the various inline assembly and separate assembler sources must be annotated with individual sections for each function defined. For the latter, the static libraries can be rebuilt with -ffunction-sections.
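For code the flag doesn't cover, the same effect can also be obtained by hand; from C (and similarly with a per-function .section directive in assembler sources), a function can be pinned to its own named section with a section attribute (the function name here is made up):

```c
/* Place the function in its own .text.* section explicitly, as
 * -ffunction-sections would have done, so the linker can order it. */
__attribute__((section(".text.my_startup_helper")))
int my_startup_helper(void) {
    return 1;
}
```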

Another problem that was spotted was that the javascript engine wasn't compiled with -ffunction-sections.

Thumb-2/ARM trampolines

ARM processors we target with Firefox for Android have two instruction sets: ARM and Thumb-2. Switching from one of them to the other is called interwork. While we build Firefox for Android with the Thumb-2 instruction set, system libraries may use the ARM instruction set. As such, they must be able to call system library functions with interworking.

Technically, interworking can be triggered from either Thumb-2 or ARM code, to call either one. However, the Procedure Linkage Table, which contains code that either calls the dynamic linker to resolve symbols or jumps to the system library functions, contains ARM code. Theoretically, in a program or library that is entirely Thumb-2, it should be possible to use Thumb-2 code there, but neither GNU ld nor gold does that. They do, however, (currently) behave differently from each other.

GNU ld generates PLT entries that start as Thumb-2, switch to ARM and then call whatever they need to call.

Gold generates PLT entries that are fully ARM, but adds trampolines into which Thumb-2 code jumps to interwork and jump back to the ARM PLT.

These trampolines are unfortunately "randomly" placed in the resulting binary. They're not actually randomly positioned, but the net result is that they end up in places that are definitely not near the beginning of the file. In my Firefox for Android builds, they were usually grouped in various places, some of which were at the end of the code section, which went completely against our goal. For the purpose of making the prototype quickly available, we had to compile Firefox for Android as ARM, which effectively makes the binary larger (ARM instructions are larger than Thumb-2 instructions), and thus longer to decompress.

Identical Code Folding

Identical Code Folding is one of the features we were using gold for. It makes it possible to replace several identical implementations of functions with a single occurrence. This is particularly interesting for C++ code, where templates are heavily used. For example, a templated function instantiated for bools and for ints will likely result in the same machine code.

In some other cases, simple enough functions can end up completely identical at the machine code level despite not looking identical at the C++ level.

The problem is that gold handles function (section) reordering after ICF. Once the ICF pass is done, only one of the original functions is known to gold for the purpose of reordering (even if, in the end, it does expose all the symbols appropriately). And it may well be the variant that does not appear in the function order list.

The current workaround is to link libxul.so once with an option that prints the functions being folded together, and to add all of them to the order list, at the position where it makes most sense. For instance, if foo and bar are folded together and bar appears in the order list, foo and bar would both be put in place of bar in the list.
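
The list surgery itself is trivial. Here is a sketch with made-up section names, assuming we learned from a first link (e.g. with gold's --print-icf-sections option) that foo was folded with bar:

```shell
cat > order.txt <<'EOF'
.text.init_stuff
.text.bar
.text.main_loop
EOF
# Insert foo right where bar is, keeping both names in the list.
sed -i 's/^\.text\.bar$/.text.foo\n.text.bar/' order.txt
cat order.txt
```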

Section names

As may have transpired from the explanations above, gold doesn't actually do function reordering; it does section reordering. In the simple cases, sections for functions are just named .text.function_name. However, there are also other variants, such as .text.startup.function_name, .text.hot.function_name or .text.unlikely.function_name. Even more subtle, there can also be .text.function_name.part.number sections, but for these, the .part.number is actually part of the function name as seen in the symbols table.

Tools that do the kind of profiling required to do function ordering like we wanted to do will give function names, not section names, because the section names are only present in the object files, not in the final libraries and programs. Thus after getting a list of function names as they are called during startup, cross-referencing with the object files to get the corresponding section names is required.

Valgrind

We used Valgrind to get the order in which functions are called during startup. Our icegrind plugin would have allowed this, but various changes in Valgrind over the past year apparently broke it. Moreover, ARM support for Valgrind is a work in progress based on current trunk, so we needed a recent build, which rules out using icegrind unless we update it.

It also turns out Valgrind has a built-in feature that more or less gives the order in which functions are called: it can show a trace of each basic block the first time it sees it executed (--trace-flags=10000000, and to be on the safe side, --vex-guest-chase-thresh=0). Basic blocks are subsets of functions, so Valgrind will likely report the same function several times, but it's pretty easy to filter out the duplicates.
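
Filtering the duplicates is a one-liner. The trace below is a made-up stand-in for the real Valgrind output, which has a different format and needs some massaging to get down to one function name per line:

```shell
cat > trace.log <<'EOF'
XRE_main
NS_InitXPCOM2
XRE_main
nsAppShell_Init
NS_InitXPCOM2
EOF
# Keep only the first occurrence of each name, preserving order.
awk '!seen[$0]++' trace.log
```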

Unfortunately, we encountered a couple of problems with this built-in tracing: it doesn't report names for some functions when the binary is compiled as Thumb-2, and it doesn't report at all some functions that we know are being called when the binary is compiled as ARM (this may also be the case for Thumb-2; we haven't verified). In the end, the order list was updated manually for the prototype.

Hardware-specific code paths

There are various places in the Mozilla code base where different implementations of the same functions are available for different kinds of hardware, and the most suitable one is chosen at runtime, depending on the hardware detected. On x86, we have some implementations using SSE2 instructions, with fall-back implementations in plain x86. On ARM, there are implementations using NEON instructions, with fall-back implementations in plain ARM or Thumb-2. Sometimes, there aren't even fall-backs.

This means that which function is going to be used at runtime very much depends on the hardware it is executed on. So if you profile Firefox for Android on a tablet with a Tegra-2 CPU (which doesn't have NEON), you won't be exercising the same code paths as when running the same Firefox for Android build on a Nexus S (whose CPU does have NEON).

If we get the order of functions called at startup on the Tegra-2 tablet, we won't be getting any of the NEON-specific functions in the functions list, and the resulting reordered binary may have the NEON-specific functions at an inconvenient position for progressive decompression. So the runtime benefits we'd see on the Tegra-2 tablet would go away on the Nexus S. And vice-versa.

Callgraph-based reordering

The GCC google branch has patches adding a new interesting feature: callgraph-based reordering. It makes the compiler add call graph information extracted from PGO/FDO profile data to object files, and the linker use that information to reorder the binary. Theoretically, this would allow us to reorder our binaries without relying on valgrind or something similar, and even take advantage of the fact that we already do PGO/FDO (except we currently don't, for Android builds).

The main drawback, however, is exactly the same as for the -ffunction-sections option: assembler sources and static (system) libraries are not going to be covered. And unlike with -ffunction-sections, it's hard to work around.

It also most likely suffers the same problem as section ordering w.r.t identical code folding, and also has the same problem with ARM/Thumb-2 trampolines and hardware-specific code paths.

Conclusion

It's still a long way before we have efficiently reordered binaries for Android builds. It involves toolchain fixes and tweaks, as well as build engineering tricks to allow running parts of the build on an actual device in the middle of the cross-compiled build. We should however be able to benefit from incremental decompression with seekable compressed streams well before that. Stay tuned.

2011-10-31 18:53:23+0900

p.m.o | 7 Comments »

Making Firefox start faster on Android

On my Nexus S, the time line when starting up current nightly looks like this:

0ms Icon tapped
550ms (+550ms) The application entry point in Java code is reached
1350ms (+800ms) The application entry point in native code is reached
1700ms (+350ms) Start creating the top level window (i.e. no XUL has been loaded yet)
2500ms (+800ms) Something is painted on screen

That's quite a good scenario, currently. On the very first run after installation, even more time is spent extracting a bunch of files and creating the profile. And when opening a URL, startup is also slower because of the time needed to initialize the content process: tapping the icon brings up about:home, which doesn't start a separate content process.
The Nexus S also isn't exactly a slow device, even if there are much better devices nowadays. This means there are devices out there where startup is even slower than that.

Considering Android devices have a much more limited amount of memory than most Desktop computers, and that as a result switching between applications is very likely to get some background applications stopped, this makes startup time a particularly important issue on Android.

The Mobile team is working on making most of the last 800ms go away. The Performance team is working on the other bits.

Reaching Java code

550ms to reach our code sounds both outrageous and impossible to solve: we control neither how nor when our code is called by Android after the user has tapped our icon. In fact, we actually have some control.

Not all Firefox builds are made equal, and it turns out some builds don't have the outrageous 550ms overhead. They instead have a 300ms overhead. The former builds contain multiple languages. The latter only contain English.

Android applications come in the form of APK files, which really are ZIP files. It turns out Android takes a whole lot of time at startup to read the entire list of files in the APK and/or handle the corresponding manifest file, and the more files there are, the more time is spent. A multi-language build contains 3651 files. A single-language build contains "only" 1428 files.

We expect to get down to about 100ms by packing chrome, components and some other resources together in a single file, effectively storing a ZIP archive (omni.jar) inside the APK. We (obviously) won't compress at both levels.

Reaching native code

Native code, like on other platforms, comes in the form of libraries. These libraries are stored in the APK, and decompressed into memory when Firefox starts. Not so long ago, when space permitted, Firefox would actually store the decompressed libraries on the internal flash, but as this wasn't a clear win in actual use cases (on top of making the first startup abysmally slow), it was disabled. We however still have the possibility to enable it for use with various tools such as valgrind or oprofile, which don't support the way we load libraries directly off the APK.

These 800ms between reaching Java code and reaching native code are spent uncompressing the libraries, applying relocations and running static initialization code. Most of these 800ms are actually spent on uncompressing the main library file (libxul.so) alone.

But we don't actually need all that code from the very start. We don't even need all that code to display the user interface and the web page we first load. The analysis I conducted last year on desktop Firefox very much applies on mobile too.

By only loading the pieces that are needed, we expect to cut the loading time at least in half, and even more on multi-core devices. Last week, we built a prototype with promising results, but the experience also uncovered difficulties. More details will follow in a subsequent post.

Initializing Gecko

Another significant part of the startup phase is initializing Gecko. This includes, but is not limited to:

  • Initializing the Javascript engine
  • Registering and initializing components
  • Initializing the various subsystems using sqlite databases

The switch to a native Android user interface is going to significantly alter how filesystem accesses (most notably sqlite) happen. While we have tested asynchronous I/O for sqlite (and seen it break Places), we can't currently go much further until the native UI reaches a certain level of maturity.

We however need to identify what particular pieces of initialization are contributing the most to the 350ms between reaching native code and starting to create the top level window.

2011-10-24 15:59:09+0900

p.m.o | 5 Comments »

Firefox Mobile debugging on Android got a bit easier

Two changes landed recently that are going to make Firefox Mobile debugging on Android a bit easier:

  • Debug info files are now available for nightly builds since Friday, aurora builds since Saturday, and beta builds starting from 8.0b3. Until now, debugging required using custom builds, as the debug info wasn't available anywhere.
  • Our custom Android linker now makes GDB happy and properly reports our libraries to GDB, without requiring them to be extracted. This means you don't need to run with the debug intent to be able to attach and use GDB. This is only on mozilla-central for now. Update: unfortunately, it also only works on "debuggable" builds, so only custom builds :(

For convenience, I modified the fetch-symbols script to allow downloading the debug info files for Android builds (note this version of the script also solves the problem of getting debug info files for x86 builds on an x86-64 host). Instead of giving it the Firefox directory like for desktop builds, give it the APK file:

$ python fetch-symbols.py fennec-10.0a1.en-US.android-arm.apk http://symbols.mozilla.org/ /path/for/symbols

2011-10-10 10:51:12+0900

p.m.o | 3 Comments »

Building a custom kernel for the Nexus S

There are several reasons why someone would want to build a custom kernel for their Android phone. In my case, this is because I wanted performance counters (those used by the perf tool that comes with the kernel source). In Julian Seward's case, he wanted swap support to overcome the limited memory amount on these devices in order to run valgrind. In both cases, the usual suspects (AOSP, CyanogenMod) don't provide the wanted features in prebuilt ROMs.

There are also several reasons why someone would NOT want to build a complete ROM for their Android phone. In my case, the Nexus S is what I use to work on Firefox Mobile, but it is also my actual mobile phone. It's quite a painful and long process to create a custom ROM, and another long (but arguably less painful, thanks to ROM manager) process to back up the phone data, install the ROM, and restore the phone data. And if you happen to like or use the proprietary Google Apps that don't come with the AOSP sources, you need to add even more steps.

There are however tricks that allow building a custom kernel for the Nexus S and using it with the system already on the phone. Please note that the following procedure has only been tested on two Nexus S devices with a 2.6.35.7-something kernel (one with a stock ROM, but unlocked, and another one with an AOSP build). Also please note that there are various ways to achieve many of the steps in this procedure, but I'll only mention one (or two in a few cases). Finally, please note some steps rely on your device being rooted. There may be ways to do without, but I'm pretty sure it requires an unlocked device at the very least. This post covers neither rooting nor unlocking.

Preparing a build environment

To build an Android kernel, you need a cross-compiling toolchain. Theoretically, any will do, provided it targets ARM. I just used the one coming in the Android NDK:

$ wget http://dl.google.com/android/ndk/android-ndk-r6b-linux-x86.tar.bz2
$ tar -jxf android-ndk-r6b-linux-x86.tar.bz2
$ export ARCH=arm
$ export CROSS_COMPILE=$(pwd)/android-ndk-r6b/toolchains/arm-linux-androideabi-4.4.3/prebuilt/linux-x86/bin/arm-linux-androideabi-

For the latter, you need to use a directory path containing prefixed versions (such as arm-eabi-gcc or arm-linux-androideabi-gcc), and include the prefix, but not "gcc".

You will also need the adb tool coming from the Android SDK. You can install it this way:

$ wget http://dl.google.com/android/android-sdk_r12-linux_x86.tgz
$ tar -zxf android-sdk_r12-linux_x86.tgz
$ android-sdk-linux_x86/tools/android update sdk -u -t platform-tool
$ export PATH=$PATH:$(pwd)/android-sdk-linux_x86/platform-tools

Building the kernel

For the Nexus S, one needs to use the Samsung Android kernel tree, which happens to be unavailable at the moment of writing due to the kernel.org outage. Fortunately, there is a clone used for the B2G project, which also happens to contain the necessary cherry-picked patch to add support for the PMU registers on the Nexus S CPU that are needed for the performance counters.

$ git clone -b devrom-2.6.35 https://github.com/cgjones/samsung-android-kernel
$ cd samsung-android-kernel

You can then either start from the default kernel configuration:

$ make herring_defconfig

or use the one from the B2G project, which enables interesting features such as oprofile:

$ wget -O .config https://raw.github.com/cgjones/B2G/master/config/kernel-nexuss4g

From there, you can use make menuconfig or similar commands to further configure your kernel.

One of the problems you'd first encounter when booting such a custom kernel image is that the bcm4329 driver module that is shipped in the system partition (and not in the boot image) won't match the kernel, and won't be loaded. The unfortunate consequence is the lack of WiFi support.

One way to overcome this problem is to overwrite the kernel module in the system partition, but I didn't want to have to deal with switching modules when switching kernels.

There is however a trick allowing the existing module to be loaded by the kernel: compiling a kernel with the same version string as the one already on the phone. Please note this only really works if the new kernel really is about the same as the original one. If there are differences in the binary interface between the kernel and the modules, it will fail in possibly dangerous ways.

To use that trick, you first need to know what kernel version is running on your device. Settings > About phone > Kernel version will give you that information on the device itself. You can also retrieve that information with the following command:

$ adb shell cat /proc/version

With my stock ROM, this looks like the following:

Linux version 2.6.35.7-ge382d80 (android-build@apa28.mtv.corp.google.com) (gcc version 4.4.3 (GCC) ) #1 PREEMPT Thu Mar 31 21:11:55 PDT 2011

In the About phone information, it looks like:

2.6.35.7-ge382d80
android-build@apa28

The important part above is -ge382d80, and that is what we will be using in our kernel build. Make sure the part preceding -ge382d80 does match the output of the following command:

$ make kernelversion

The trick is to write that -ge382d80 in a .scmversion file in the kernel source tree (obviously, you need to replace -ge382d80 with whatever your device has):

$ echo -ge382d80 > .scmversion
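
Putting it together, the check and the write can be sketched as follows (values hard-coded from the example above; a real script would take them from adb shell cat /proc/version and make kernelversion):

```shell
device_release=2.6.35.7-ge382d80   # from: adb shell cat /proc/version
base=2.6.35.7                      # from: make kernelversion
case "$device_release" in
  "$base"-*)
    suffix=${device_release#"$base"}   # -> -ge382d80
    echo "$suffix" > .scmversion
    echo "using local version $suffix"
    ;;
  *)
    echo "base version mismatch, pick a matching kernel tree" >&2
    ;;
esac
```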

The kernel can now be built:

$ make -j$(($(grep -c processor /proc/cpuinfo) * 3 / 2))

The -j... part is the general rule I use when choosing the number of parallel processes make can use at the same time. You can pick whatever suits you better.

Before going further, we need to get back to the main directory:

$ cd ..

Getting the current boot image

The Android boot image living in the device doesn't contain only a kernel. It also contains a ramdisk containing a few scripts and binaries, that starts the system initialization. As we will be using the ramdisk coming with the existing kernel, we need to get that ramdisk from the device flash memory:

$ adb shell cat /proc/mtd | awk -F'[:"]' '$3 == "boot" {print $1}'

The above command will print the mtd device name corresponding to the "boot" partition. On the Nexus S, this should be mtd2.
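
To see what the command does, here it is run against a sample /proc/mtd (the layout is a plausible sketch, not an exact dump of a Nexus S):

```shell
cat > proc-mtd.sample <<'EOF'
dev:    size   erasesize  name
mtd0: 00040000 00020000 "bootloader"
mtd1: 00140000 00020000 "misc"
mtd2: 00800000 00020000 "boot"
mtd3: 00800000 00020000 "recovery"
EOF
# Splitting on ':' and '"' puts the partition name in field 3
# and the mtd device name in field 1.
awk -F'[:"]' '$3 == "boot" {print $1}' proc-mtd.sample
```

This prints mtd2.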

$ adb shell
$ su
# dd if=/dev/mtd/mtd2 of=/sdcard/boot.img bs=4096
2048+0 records in
2048+0 records out
8388608 bytes transferred in x.xxx secs (xxxxxxxx bytes/sec)
# exit
$ exit

In the above command sequence, replace mtd2 with whatever the previous command did output for you. Now, you can retrieve the boot image:

$ adb pull /sdcard/boot.img

Creating the new boot image

We first want to extract the ramdisk from that boot image. There are various tools to do so, but for convenience, I took unbootimg from GitHub, and modified it slightly to seamlessly support the page size on the Nexus S. For convenience as well, we'll use mkbootimg even though fastboot is able to create boot images.

Building unbootimg, as well as the other tools, relies on the Android build system, but since I didn't want to go through setting it up, I figured out a minimal way to build the tools:

$ git clone https://github.com/glandium/unbootimg.git
$ git clone git://git.linaro.org/android/platform/system/core.git

The latter is a clone of git://android.git.kernel.org/platform/system/core.git, which is down at the moment.

$ gcc -o unbootimg/unbootimg unbootimg/unbootimg.c core/libmincrypt/sha.c -Icore/include -Icore/mkbootimg
$ gcc -o mkbootimg core/mkbootimg/mkbootimg.c core/libmincrypt/sha.c -Icore/include
$ gcc -o fastboot core/fastboot/{protocol,engine,bootimg,fastboot,usb_linux,util_linux}.c core/libzipfile/{centraldir,zipfile}.c -Icore/mkbootimg -Icore/include -lz

Once the tools are built, we can extract the various data from the boot image:

$ unbootimg/unbootimg boot.img
section sizes incorrect
kernel 1000 2b1b84
ramdisk 2b3000 22d55
second 2d6000 0
total 2d6000 800000
...but we can still continue

Don't worry about the error messages about incorrect section sizes if it tells you "we can still continue". The unbootimg program creates three files:

  • boot.img-mk, containing the mkbootimg options required to produce a working boot image,
  • boot.img-kernel, containing the kernel image,
  • boot.img-ramdisk.cpio.gz, containing the gzipped ramdisk, which we will reuse as-is.

All that is left to do is to generate the new boot image:

$ eval ./mkbootimg $(sed s,boot.img-kernel,samsung-android-kernel/arch/arm/boot/zImage, boot.img-mk)
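
To make the eval less magic: boot.img-mk holds the mkbootimg command line that reproduces the original boot image, and sed swaps in our freshly built zImage. A sketch with made-up file contents (the real options are written by unbootimg and differ per device):

```shell
cat > boot.img-mk <<'EOF'
--kernel boot.img-kernel --ramdisk boot.img-ramdisk.cpio.gz --base 0x30000000 --pagesize 4096 -o boot.img
EOF
# The command line mkbootimg will actually be run with:
sed s,boot.img-kernel,samsung-android-kernel/arch/arm/boot/zImage, boot.img-mk
```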

Booting the image

There are two ways you can use the resulting boot image: a one-time boot, or flashing. If you want to go for the latter, it is best to actually do both, starting with the one-time boot, to be sure you won't be leaving your phone useless (though recovery is there to the rescue; it is not covered here).

First, you need to get your device in the "fastboot" mode, a.k.a. boot-loader:

$ adb reboot bootloader

Alternatively, you can power it off, and power it back on while pressing the volume up button.

Once you see the boot-loader screen, you can test the boot image with a one-time boot:

$ ./fastboot boot boot.img
downloading 'boot.img'...
OKAY [ 0.xxxs]
booting...
OKAY [ 0.xxxs]
finished. total time: 0.xxxs

As a side note, if fastboot sits "waiting for device", it either means your device is not in fastboot mode (or is not connected), or that you have permissions issues on the corresponding USB device in /dev.

Your device should now be starting up, and eventually be usable under your brand new kernel (and WiFi should be working, too). Congratulations.

If you want to use that kernel permanently, you can now flash it after going back in the bootloader:

$ adb reboot bootloader
$ ./fastboot flash boot boot.img
sending 'boot' (2904 KB)...
OKAY [ 0.xxxs]
writing 'boot'...
OKAY [ 0.xxxs]
finished. total time: 0.xxxs
$ ./fastboot reboot

Voilà.

2011-09-14 09:23:47+0900

p.d.o, p.m.o | 8 Comments »