Author Archive

Changes to the Debian Mozilla team APT repository

It's been a while.

I have changed how packages that can't yet be distributed in the Debian archives are distributed. From now on, to use the 4.0 beta, please add the following APT source:

deb http://mozilla.debian.net/ experimental iceweasel-4.0

You will also need the experimental repository, so add it to your APT sources.list. With that set up, installing the 4.0 beta should be as simple as:

# apt-get install -t experimental iceweasel

This should work on both Squeeze and unstable.

Iceweasel 3.6 backports for Debian Lenny are also distributed. To install them, add the following APT source:

deb http://mozilla.debian.net/ lenny-backports iceweasel-3.6

You will also need the lenny-backports repository, so add it to your APT sources.list. As with experimental, installation should be as simple as:

# apt-get install -t lenny-backports iceweasel

If APT complains that it can't use the archive's public key, please read the instructions (in English) on adding the key to your APT keyring.

2011-01-14 17:36:41+0900

mozilla | No Comments »

Changes to the Debian Mozilla team APT archive

I made some changes to how packages from the Debian Mozilla team that can't yet be distributed in the Debian archives are distributed to users. Please update your APT sources and use the following for 4.0 beta packages from now on:

deb http://mozilla.debian.net/ experimental iceweasel-4.0

You'll also need the experimental repository in your sources, but the overall installation is much easier now:

# apt-get install -t experimental iceweasel

This should work for squeeze and unstable users.

I also added Iceweasel 3.6 backports for Debian Lenny users. For these, add the following APT source:

deb http://mozilla.debian.net/ lenny-backports iceweasel-3.6

You'll also need the lenny-backports repository in your sources. As with the experimental packages above, installation should be as easy as:

# apt-get install -t lenny-backports iceweasel

If your APT complains about the archive key, please check the instructions to add the key to your APT keyring.

2011-01-14 10:25:48+0900

mozilla | 27 Comments »

Attempting to track I/O with systemtap

There are several ways a program can hit the disk, and it can be hard to know exactly what's going on, especially when you want to take into account the kernel caches. These include, but are not limited to:

  • any access to a file, which may lead to I/O to read its parent directories if they are not already in the inode or dentry caches
  • enumerating a directory with readdir(), which may lead to I/O on the directory for the same reason
  • read()/write() on a file, which may lead to I/O on the file if it is not in the page cache
  • accesses in a mmap()ed area of memory, which may lead to I/O on the underlying file if it is not in the page cache
  • etc.

There are various ways to track system calls (e.g. strace), which let you see what a program is doing with files and directories, but that doesn't tell you whether you're actually hitting the disk. There are also various ways to track block I/O (e.g. blktrace), which let you see the actual I/O on the block devices, but it is then hard to trace back which parts of which files or directories these I/Os relate to. To the best of my knowledge, there are unfortunately no tools to do such tracking easily.

Systemtap, however, lets you access the kernel's internals and gather almost any kind of information from any place in a running kernel. The downside is that you need to know how the kernel works internally to gather the data you need; that will limit the focus of this post.

I had been playing, in the past, with Taras' script, which he used a while ago to track I/O during startup. Unfortunately, it became clear that something was missing from the picture, so I had to investigate what was going on in the kernel.

Setting up systemtap

On Debian systems, you need to install the following packages:

  • systemtap
  • linux-image-2.6.x-y-$arch-dbg (where x, y, and $arch correspond to the kernel package you are using)
  • linux-headers-2.6.x-y-$arch (likewise)
  • make

That should be enough to pull in all the required dependencies. You may want to add yourself to the stapdev and stapusr groups if you don't want to run systemtap as root. If, as in my case, you don't have enough space left for all the files in /usr/lib/debug, you can trick dpkg into not unpacking files you don't need:

# echo path-exclude /usr/lib/debug/lib/modules/*/kernel/drivers/* > /etc/dpkg/dpkg.cfg.d/kernel-dbg

The file in /etc/dpkg/dpkg.cfg.d can obviously be named as you like, and you can adjust the path-exclude pattern to match what you (don't) want. In the above case, the debugging symbols for kernel drivers won't be unpacked. Please note that this feature requires dpkg version 1.15.8 or greater.

Small digression

One of the first problems I had with Taras' script is that systemtap would complain that it doesn't know the kernel.function("ext4_get_block") probe. This is due to a very unfortunate misfeature of systemtap: the kernel.* probes refer to whatever is in the vmlinux image, while module probes have a separate namespace, namely module("name").*.

So for the ext4_get_block() function, this means you need to set a probe on either kernel.function("ext4_get_block") or module("ext4").function("ext4_get_block"), depending on how your kernel was compiled. And you can't even use both in your script, because systemtap will complain that whichever one doesn't apply is an unknown probe...

Tracking the right thing

I was very recently pointed to a Red Hat document containing 4 useful systemtap scripts ported from dtrace, which gives me a good opportunity to explain the issue at hand with the first of these scripts.

This script attempts to track I/O by following read() and write() system calls. That is not tracking I/O; it is merely tracking some system calls (Taras' script had the same kind of problem with read()/write()-induced I/O). You could do the very same thing with existing tools like strace, and that wouldn't even require system privileges.

To demonstrate that the script doesn't actually track the use of storage devices as the document claims, consider the following source code:

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Ensure the resulting binary is decently sized */
const char dummy[1024 * 1024] = "a";

int main(int argc, char *argv[]) {
  struct stat st;
  char buf[65536];
  int i, fd = open(argv[0], O_RDONLY);
  for (i = 0; i < 100000; i++) {
    read(fd, buf, 65536);
    lseek(fd, 0, SEEK_SET);
    read(fd, buf, 65536);
    lseek(fd, 512 * 1024, SEEK_SET);
  }
  close(fd);
  return 0;
}

All it does is read some parts of the executable a lot of times (notice the trick used to make the executable at least 1MB in size). Not many programs will actually do something as bold as reading the same data again and again (though we could probably be surprised), but this easily exposes the problem. Here is what the output of the systemtap script looks like for this program (stripping other, irrelevant processes):

         process     read   KB tot    write   KB tot
            test   400002 25600001        0        0

Now, do you really believe 25MB have actually been read from the disk? (Note that the read() count seems odd as well, as there should only be around 200000 calls)

Read-ahead, page cache and mmap()

What the kernel actually does is that a read() on a file first checks the page cache. If nothing in the page cache corresponds to the read() request, it goes down to the disk and fills the page cache. But as loading only a few bytes or kilobytes from the disk could be wasteful, the kernel also reads a few more blocks ahead, apparently following some heuristic.

But read() and write() aren't the only ways a program may hit the disk. On UNIX systems, a file can be mapped in memory with the mmap() system call, and all accesses in the memory range corresponding to this mapping will be reflected on the file. There are exceptions, depending on how the mapping is established, but let's keep it simple. There is a lot of literature on the subject if you want to learn more about mmap().
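
As a side note, here is a minimal sketch (not taken from any of the tools discussed here) showing how merely touching a mmap()ed range pulls data from the file through the page cache, without any read() call being involved:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
  if (argc < 2)
    return 1;
  int fd = open(argv[1], O_RDONLY);
  struct stat st;
  fstat(fd, &st);
  /* Mapping the file doesn't read anything from disk by itself. */
  unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
  unsigned long sum = 0;
  /* Touching one byte per page faults each page in, going through the
     page cache (and read-ahead), just like accesses to the mapped code
     or data of a shared library would. */
  for (off_t off = 0; off < st.st_size; off += 4096)
    sum += p[off];
  printf("%lu\n", sum);
  munmap(p, st.st_size);
  close(fd);
  return 0;
}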

The way the kernel reads from the file, however, is quite similar to that of read(), and uses the page cache and read-ahead. The systemtap script debunked above doesn't track these accesses at all.

I'll skip write accesses, because they have been out of my scope so far.

Tracking some I/O with systemtap

What I've been trying to track so far has been limited to disk reads, which happen to be the only accesses occurring on shared library files. Programs and shared libraries are first read() from, so that the dynamic linker gets the ELF headers and knows what to load, and then are mmap()ed following the PT_LOAD entries in these headers.
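
To make that concrete, here is a rough sketch (assuming a 64-bit ELF file; this is not the dynamic linker's actual code) of what that initial read() gives access to, and where the PT_LOAD entries that end up mmap()ed come from:

#include <elf.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
  if (argc < 2)
    return 1;
  int fd = open(argv[1], O_RDONLY);
  Elf64_Ehdr ehdr;
  /* The ELF header sits at offset 0 and points to the program headers. */
  read(fd, &ehdr, sizeof(ehdr));
  for (int i = 0; i < ehdr.e_phnum; i++) {
    Elf64_Phdr phdr;
    pread(fd, &phdr, sizeof(phdr), ehdr.e_phoff + i * ehdr.e_phentsize);
    if (phdr.p_type == PT_LOAD) /* these are what the linker mmap()s */
      printf("PT_LOAD offset=%#llx filesz=%#llx\n",
             (unsigned long long)phdr.p_offset,
             (unsigned long long)phdr.p_filesz);
  }
  close(fd);
  return 0;
}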

As far as my investigation into the Linux kernel code goes, fortunately, both kinds of accesses go through the __do_page_cache_readahead kernel function before they end up actually hitting the disk (this is the function tracked in Taras' script). Unfortunately, while it is called with an offset and a number of pages to read for a given file, it turns out the last pages in that number are not necessarily read from the disk. I don't know for sure, but some of them might already be in the page cache, which would explain the effect I saw in my observations.

Going further down, we reach the VFS layer, which ends up being file-system specific. But fortunately, a bunch of (common) file-systems actually share page mapping code, and commonly use the mpage_readpage and mpage_readpages functions, both of which call do_mpage_readpage to do the actual work. And this function seems to be properly called only once for each page not already in the page cache.

If my reading of the kernel source is right, this however is not really where it ends, and do_mpage_readpage doesn't actually hit the disk. It seems to only gather some information (basically, a mapping between storage blocks and memory) that is going to be submitted to the block I/O layer, which itself may do some fancy stuff with it, such as reordering the I/Os depending on other requests it got from other processes, the position of the disk heads, etc.

And when I say do_mpage_readpage doesn't actually hit the disk, I'm again simplifying the issue, because it actually might, as there might be a need to read some metadata from the disk to know where some blocks are located. But tracking metadata reads is much harder, and I haven't investigated it.

Anyways, skipping metadata, going further down after do_mpage_readpage is hard, because it's difficult to trace back which block I/O is related to which read-ahead, corresponding to what read at what position in which file. do_mpage_readpage already has part of this problem, because it is not called with any reference to the corresponding file. But __do_page_cache_readahead is.

So knowing all the above, here is my script, the one I used to get the most recent Firefox startup data you can find in my last posts:

global targetpid;
global file_path;

probe begin {
  targetpid = target();
}

probe kernel.function("__do_page_cache_readahead") {
  if (targetpid == pid())
    file_path[tid()] = d_path(&$filp->f_path);
}

probe kernel.function("do_mpage_readpage") {
  if (targetpid == pid() && (tid() in file_path)) {
    now = gettimeofday_us();
    printf("%d %s %d\n", now, file_path[tid()], $page->index*4096);
  }
}

probe kernel.function("__do_page_cache_readahead").return {
  if (targetpid == pid())
    delete file_path[tid()];
}

This script needs to be used with a command given to systemtap with the -c option, as in the following command line:

# stap readpage.stp -c firefox

Each line of output represents a page (i.e. 4,096 bytes) being read, and contains a timestamp, the name of the file being read, and the offset within the file. As discussed above, do_mpage_readpage is not really where the I/O actually occurs, so the timestamps are not entirely accurate, and the actual read order from disk might be slightly different. It is still a quite reliable view, though, in that the result should be reproducible with the same files even when they're not located on the same blocks on disk, provided their page cache status is the same when starting the program.

This systemtap script ignores writes, as well as metadata accesses (including but not limited to inodes, dentries, bitmap blocks and indirect blocks). It also doesn't account for accesses to files opened with the O_DIRECT flag or similar constructs (raw devices, etc.).

Read-ahead in action

Back to the small example program: my systemtap script records 102 page accesses, that is, 417,792 bytes, much less than the actual binary size on my system (1,055,747 bytes). We are still far from the 25MB figure from the other systemtap script, but we are also far from the 128KiB the program actively reads (64KiB twice, leaving a hole between the two blocks).

At this point, it is important to note that the ELF headers and program code all fit within a single page (4 KiB), and following the 1MiB section corresponding to the dummy variable, there are only 5,219 bytes of other ELF sections, including the .dynamic section. So even counting everything the dynamic linker needs to read, and the program code itself, we're still far from what my systemtap script records.

Grouping consecutive blocks with consecutive timestamps, here is what can be observed:

Offset (bytes)   Length (bytes)
0                16,384
983,040          73,728
16,384           262,144
524,288          65,536

(By now, you should have guessed why I wanted that big hole between the read()s; if you want to reproduce this at home, I suggest you also use my page cache flush helper.)

As earlier investigations showed, the first accesses, from the dynamic loader, are to read the ELF headers and the .dynamic section. As mentioned above, these are really small. Yet the kernel actually reads much more: 16KiB at the beginning of the file when reading the ELF headers, and 72KiB at the end when reading the .dynamic section. Subsequent accesses from the dynamic loader are obviously already covered by these reads.

The next accesses are those due to the program itself. The program actively reads 64KiB at the beginning of the file, then 64KiB starting at offset 524,288. For the first read, the kernel already had 16KiB in the page cache, so it didn't read them again, but instead of reading just the remainder, it read 256KiB. For the second read, however, it only read the requested 64KiB.

As you can see, this is far from being something like "you wanted n KiB, I'll read that fixed amount now".

Further testing with different read patterns (e.g. changing the hole size, the read size, or reading from the dummy variable directly instead of read()ing the binary) is left as an exercise for the reader.

2011-01-12 20:55:56+0900

p.d.o, p.m.o | 8 Comments »

Reducing Firefox startup I/O using system libraries

After establishing how I/O affects startup and exploring some ways to improve the situation, I wanted to check another obvious, yet not so practical for Mozilla, way of reducing startup I/O.

Firefox is quite a full-featured web browser and, fortunately, not all the code in the Firefox tree comes from Mozilla (though a large part does). Some parts of the code actually come from third-party libraries, such as libpng for PNG support, libjpeg for (obviously) JPEG support, libsqlite3 for SQLite support (used, for example, for the awesomebar), or libcairo for part of the layout rendering.

Some of these libraries are also used in other software. For instance Gtk+, the widget toolkit Firefox uses on UNIX systems and which is also used by the GNOME desktop environment, needs, directly or indirectly, libcairo for rendering and libpng for icons. But these other programs don't embed their own copies of the libraries: they use shared, system libraries.

There are several advantages to doing so: it reduces memory footprint, because the library is loaded only once in memory (modulo relocations on data sections); it reduces maintenance overhead (you only need to update one copy for stability or security issues); and, you should have guessed where I'm going by now, it helps reduce startup I/O: a system library that has already been loaded by some other software doesn't need to be read from disk again when you start another application using it.

Unfortunately, Mozilla distributes binaries that can run on a large variety of systems, which don't necessarily ship the right versions of the libraries, and sometimes Mozilla even patches some of the libraries it embeds to fix bugs or add features. One such example, the worst case actually, is libpng, which is patched to handle APNG files. Libcairo also comes with some bug fixes, most of which, after a while, end up in the original libcairo. Anyways, as Mozilla can't really control which versions of cairo, sqlite, etc. are installed on the systems where Firefox is going to be used, the only way that works for us is to ship binaries that link against these internal (possibly modified) libraries.

The same logic could be applied to some other libraries that we do link against system versions of (such as Gtk+), but in practice those are not as much of a problem as sqlite and cairo, which are the most problematic libraries at the moment.

On non-GNU/Linux platforms (leaving aside some unix variants that are uncommon among Firefox users), system libraries are unfortunately not a common practice, and even if they were, most of the libraries Firefox uses internally are not shipped with the system. In (more or less) controlled environments, however, the story is slightly different. When you know which version of which library is shipped, it's easier to start using system libraries. GNU/Linux distributions are such environments; mobile OSes (Android, Maemo/MeeGo) are, too.

I'm not interested in discussing here the support implications of GNU/Linux distributions shipping Firefox binaries linked against system libraries, though in practice it creates fewer problems than some people may think. This post only aims at showing how using system libraries impacts startup times, how it may or may not be an interesting thing to investigate on mobile, and how GNU/Linux distributions' practices are actually not hurting.

As a reminder, here are the startup times with a plain Firefox 4.0b8, as collected in previous posts:

Average startup time (ms)
x86 3,228.76 ± 0.57%
x86-64 3,382.0 ± 0.51%

The first set of system libraries I enabled are those used by the default GNOME desktop environment as it comes in Debian Squeeze, except libpng, because the system libpng doesn't support APNG files (note that I installed libcairo from Debian experimental, as Squeeze doesn't have 1.10):

ac_add_options --with-system-zlib
ac_add_options --with-system-bz2
ac_add_options --enable-system-sqlite
ac_add_options --enable-system-cairo

Average startup time (ms)
x86 3,179.86 ± 0.44%
x86-64 3,326.5 ± 0.51%

It turns out there is not much gain, but at least it's not a regression. Please note that libbz2 is actually not used in Firefox itself; it's only used in the upgrade program.

Let's go further: the clock applet that comes with gnome-panel uses a system copy of libnspr4 and libnss3, through libedataserver. Provided these libraries are recent enough on the system, Firefox can actually use them, so let's add these:

ac_add_options --with-system-nspr
ac_add_options --with-system-nss

Average startup time (ms)
x86 3,086.74 ± 0.48%
x86-64 3,226.3 ± 0.59%

This had a much more noticeable positive impact, most probably due to the size and number of the libraries involved: currently, nspr and nss are respectively 3 and 8 separate library files.

Let's now go even further and enable all the remaining supported flags, except libpng, for the reason already explained:

ac_add_options --with-system-libevent
ac_add_options --with-system-libvpx
ac_add_options --with-system-jpeg
ac_add_options --enable-system-hunspell

Average startup time (ms)
x86 3,152.26 ± 0.74%
x86-64 3,270.9 ± 0.52%

Unsurprisingly, this has a negative impact, most probably due to the additional disk seeks that each of these libraries induces. The good news is that it doesn't get slower than the original Firefox build, and is actually still slightly faster.

There are more libraries that Firefox embeds and could be using from the system instead, but there are no configure flags for them yet. At the very least, if I recall correctly, libogg is loaded by the GNOME desktop environment through libcanberra for event sounds, so there could be some more possible gain there.

However, it's not entirely clear from the above figures whether that would be worth it on mobile OSes, and I'm not putting too much hope there. I'll need to collect some more data first, such as which libraries are used after a system boot, and which versions are provided.

On desktops, though, we can go even further. In the 3.x days and earlier, Firefox's javascript engine was shipped as a separate library in the Firefox directory. In recent 4.0 betas, it is statically linked into libxul.so. But the future GNOME desktop, based on GNOME Shell, will be using Firefox's javascript engine. Some distributions plan to use a separately packaged javascript library for that purpose. In Debian, I'm still planning to have Firefox ship the javascript engine as a separate, system library, so let's see what that means for startup time:

ac_add_options --enable-shared-js

Average startup time (ms)
x86 2,887.48 ± 0.55%
x86-64 2,990.64 ± 0.65%

(Disclaimer: this was not measured with an actual GNOME Shell, but with the current GNOME desktop and a script reading the javascript library file to force it into the page cache. Arguably, an actual use of the library may load less of it into the page cache.)

That's totally worth it.

2011-01-10 16:03:04+0900

p.m.o | 18 Comments »

Reducing Firefox startup I/O in libxul.so

I've been working, during the past few months, on Firefox startup performance, with a particular focus on I/O, and even more specifically on the core library: libxul. As I've been saying all along, my current work is focused on GNU/Linux systems, though some of it has a direct impact on OSX, and similar work is most probably applicable on Windows.

The data in this post has been gathered with the same setup as described in the introduction of the previous post. That post also gives a view of how much impact I/O has on startup. Unfortunately, the data gathered here pre-dates the findings about the combined effect of I/O and CPU scaling, making the results here completely dependent on my system, though the figures are still interesting. For comparison, when I was doing preliminary timings on the Core 2 Duo instead of the i7 system, I was seeing greater improvements as a percentage of startup time.

All startup times in this post are cold cache (after boot) startup times. For reference, here are the startup times for a vanilla Firefox 4.0b8 build on both x86 and x86-64:

Average startup time (ms)
x86 3,228.76 ± 0.57%
x86-64 3,382.0 ± 0.51%

Thanks to a systemtap script I will detail in a subsequent post, I also traced what is being accessed from libxul.so and when, and obtained the following access pattern graphs (click to see them at bigger sizes):


libxul.so access patterns on x86

libxul.so access patterns on x86-64

For those following this blog, it might look kind of familiar, though the graphs here have been simplified for the purpose of this post.

  • The red dots represent pages read from disk.
  • The light red lines roughly represent disk seeks within the file, though not always, as sometimes the disk seeks to entirely different locations for other files. They nevertheless help visualize how things are going.
  • The coloured areas respectively represent the .rel.dyn/.rela.dyn, .text, .rodata, .data.rel.ro and .data sections of the libxul.so file.

Please note that these graphs are not necessarily in full correlation with the startup times: first because I didn't average over 50 startups like for the timings, and second because systemtap might be introducing some noise. They are still helpful to visualize what is happening with the various patches I'll be testing below. Please also note that all graphs for a given architecture have the same horizontal axis scale, though the scale itself is not specified.

Another interesting metric to follow along the way is the amount of data the kernel reads from the libxul.so file during startup. These are obtained from the same systemtap script. For the same vanilla 4.0b8 builds as above, this is what we get:

Pages read Bytes read
x86 4,787 19,607,552
x86-64 5,874 24,059,904

Static Initializers

One big part of the startup time in the above graphs looks, in the thumbnails, like a decreasing straight line; it is the part of the library initialization that runs static initializers. One way to decrease I/O at startup could thus be to reduce the number of static initializers.

When I started investigating them, there were 237 static initializers, 147 of which were due to cycle collection globals. We landed some patches from bug 569629 and bug 502176, which got the overall number down, but in 4.0b8 it jumped to 316 while the number of cycle collection globals went up to 181... which is kind of depressing, and means I'll have to go back to my patch queue to see what can be done about these. I will also do some writing about how innocent-looking constructs can create static initializers, and how to avoid them.

I did, however, come up with a hack to reduce the number of static initializers due to cycle collection to only one, which already gives a good reduction. Here are the startup results for 4.0b8 with the patches attached to that bug applied (plus a two-line patch for two additional cycle collected classes):

Average (ms) Pages read Bytes read
x86 3,216.1 ± 0.59% 4,656 19,070,976
x86-64 3,488.14 ± 0.75% 5,759 23,588,864

The results are not exactly thrilling, and there actually is a regression on x86-64. While the amount of data read slightly decreased, the startup time regressed because of the I/O patterns.


libxul.so access patterns on x86

libxul.so access patterns on x86-64

While the I/O due to static initializers was reduced (and now has a tortured shape), the remaining I/O got much worse, because some things that were previously read during static initialization no longer are. Static initializer reduction thus looks like a worthwhile goal, because it represents a large portion of the startup time, but it needs some help in order to be more effective.

Reordering objects

Another way to reduce I/O at startup is to arrange the binary such that a read-ahead from the kernel grabs something that is going to be required soon. Taras investigated this a long time ago, but it unfortunately requires a toolchain change. Well, technically it doesn't, but reordering functions with GNU ld is so much of a pain that it's simpler to wait for Mozilla to switch to gold. In the interim, there is still a way to get some nice improvements in the binary layout: reordering objects.

On both OSX and GNU/Linux, the static linker doesn't do much in terms of binary layout, and just takes the files in the order they come on the command line. The OSX linker has some optimizations for data sections, and newer versions of the GNU linker have some other niceties, but overall, the order on the command line still prevails.

Using Taras' icegrind, I gathered the order in which object files' code sections are accessed during startup, and hacked the build system to reorder the object files on the static linkage command line.

The improvements look quite good:

Average (ms) Pages read Bytes read
x86 2,939.18 ± 0.81% 4,129 16,912,384
x86-64 3,247.64 ± 0.68% 5,254 21,520,384

I/O patterns look better, too:


libxul.so access patterns on x86

libxul.so access patterns on x86-64

The profile used to get the object order here was actually quite old, which means getting a new profile from 4.0b8 would have given better results. Further tweaking would make them look even better.

I only tested a couple of times, but this reordering also seems to improve OSX startup times, though an OSX-specific profile would be better.

Combining both

We saw earlier that the reduction of static initializers led to more I/O later on, making it not that interesting on its own. But what happens when the binary layout is improved as above? It turns out the combination gives a bigger win:

Average (ms) Pages read Bytes read
x86 2,851.06 ± 0.54% 4,125 16,896,000
x86-64 3,085.04 ± 0.76% 5,230 21,422,080

As above, the profile is quite old, but the I/O patterns still look much better:


libxul.so access patterns on x86

libxul.so access patterns on x86-64

There is other pending work to reverse the order of the static initializers, which I expect would improve things slightly. I didn't time it, however.

Packing relocations

Yet another way to reduce I/O at startup is to avoid going back and forth between the data sections and the relocation sections of libxul, which can be seen on the left side of all the graphs above. While it doesn't entirely eliminate them, the tool I came up with (corresponding bug, and some explanations) reduces these zigzags and also significantly reduces the size of the dynamic relocations section (reducing the binary size is also an obvious way to reduce I/O).

Average (ms) Pages read Bytes read
x86 3,149.32 ± 0.62% 4,443 18,198,528
x86-64 3,191.58 ± 0.62% 4,733 19,386,368

The improvement is greater on x86-64 because of how much bigger the relocation section is on this architecture, as explained earlier.


libxul.so access patterns on x86

libxul.so access patterns on x86-64

The I/O patterns are not quite optimal yet, which can be seen when looking closely at the above graphs, but they still give a nice improvement while only altering the envelope. There are plans to make the tool even better in the future.

All of the above

Overall, we get a substantial startup time improvement by looking at only one file, while there is still room for further improvement.

Average (ms) Pages read Bytes read
Before
x86 3,228.76 ± 0.57% 4,787 19,607,552
x86-64 3,382.0 ± 0.51% 5,874 24,059,904
After
x86 2,820.84 ± 0.83% 3,782 15,491,072
x86-64 2,884.36 ± 0.52% 4,090 16,752,640

libxul.so access patterns on x86 (before)

libxul.so access patterns on x86 (after)

libxul.so access patterns on x86-64 (before)

libxul.so access patterns on x86-64 (after)

2011-01-04 20:06:58+0900

p.m.o | 9 Comments »

The measured effect of I/O on application startup

I did some experiments with the small tool I wrote in the previous post in order to gather some startup data about Firefox. It turns out it can't flush directories and other metadata from the page cache, which unfortunately makes it useless for what I'm most interested in.

So I gathered various startup information about Firefox, showing how the page cache (thus I/O) has a huge influence on startup. The data in this post are mean times and 95% confidence intervals for 50 startups with an existing but fresh profile, in a mono-processor GNU/Linux x86-64 virtual machine (using kvm) with 1GB RAM and a 10GB raw hard drive partition over USB, running, except where stated otherwise, on an i7 870 (up to 3.6GHz with Intel Turbo Boost). The operating system itself is an up-to-date Debian Squeeze running the default GNOME environment.

Firefox startup time is measured as the difference between the time in ms right before starting Firefox and the time in ms as returned by javascript in a data:text/html page used as the home page.
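
For reference, the "time right before starting Firefox" side of that measurement could be as simple as the following sketch (this is not the actual harness used here; the home page's javascript records its own timestamp, and the difference between the two numbers is the reported startup time):

#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

/* Print a millisecond timestamp, then exec the given command. */
int main(int argc, char *argv[]) {
  if (argc < 2)
    return 1;
  struct timeval tv;
  gettimeofday(&tv, NULL);
  printf("%llu\n", (unsigned long long)tv.tv_sec * 1000 + tv.tv_usec / 1000);
  fflush(stdout);
  execvp(argv[1], &argv[1]);
  return 1; /* only reached if execvp() failed */
}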

Startup vs. page cache

Average startup time (ms)
Entirely cold cache (drop_caches) 5887.62 ± 0.88%
Cold cache after boot 3382.0 ± 0.51%
Selectively cold cache (see below) 2911.46 ± 0.48%
Hot cache (everything previously in memory) 250.74 ± 0.18%

The selectively cold cache case makes use of the flush program from previous post and a systemtap script used to get the list of files read during startup. This script will be described in a separate post.

As you can see, profiling startup after echo 3 > /proc/sys/vm/drop_caches takes significantly more time than under the conditions users would normally experience, because all the system libraries that would normally be in the page cache have been flushed, hence biasing the view one can have of the actual startup performance. Mozilla build bots were running, until recently, a ts_cold startup test that, as I understand it, had this bias (which is part of why it was stopped).

The hot cache value is also interesting because it shows that the vast majority of cold startup time is due to hard disk I/O (and no, there is no difference in the number of page faults).

I/O vs CPU

Interestingly, testing on a less beefy machine (Core 2 Duo 2.2GHz) with the same USB disk and kvm setup shows something not entirely intuitive:

Average (ms)
Entirely cold cache 6748.42 ± 1.01%
Cold cache after boot 3973.64 ± 0.53%
Selectively cold cache 3445.7 ± 0.43%
Hot cache 570.58 ± 0.70%

I, for one, would have expected I/O-bound startups to only be slower by around 320ms, which is roughly the hot cache startup difference, or, in other words, the CPU-bound startup difference. But then I figured out I was forgetting an important factor.

I/O vs. CPU scaling

Modern processors do frequency scaling, which allows the processor to run slowly when underused and faster when needed, thus saving power. It was first used on laptop processors to reduce the power drawn from the battery, allowing batteries to last longer, and is now used on desktop processors to reduce power consumption. It unfortunately has a drawback: it introduces some latency when the scaling kicks in.

A not-so-nice side effect of frequency scaling is that when a process is waiting for I/O, the CPU is underused, which usually makes it run at its slowest frequency. When the I/O ends and the process runs again, the CPU can go back to full speed. This means every I/O can induce, on top of the latency of e.g. disk seeks, some CPU scaling latency. And it actually has much more impact than I would have thought.

Here are the results on the same Core 2 Duo, with frequency scaling disabled and the CPU forced to its top speed:

Average (ms)
Entirely cold cache 5824.1 ± 1.13%
Cold cache after boot 3441.8 ± 0.43%
Selectively cold cache 3025.72 ± 0.29%
Hot cache 576.88 ± 0.98%

(I would have liked to do the same on the i7, but Intel Turbo Boost complicates things and I would have needed to get two new sets of data)

Update: I actually found a way to force one core to its max frequency and run the kvm processes on it, giving the following results:

Average (ms)
Entirely cold cache 5395.94 ± 0.83%
Cold cache after boot 3087.47 ± 0.31%
Selectively cold cache 2673.64 ± 0.21%
Hot cache 258.52 ± 0.35%

I haven't gathered enough data to have accurate figures, but it also seems that forcing the CPU frequency to a fraction of the fastest supported frequency gives the intuitive result, where the difference between all the I/O-bound startup times is the same as the difference between the hot cache startup times. As such, I/O-bound startup improvements would be best measured as an improvement in the difference between cold and hot cache startup times, i.e. (cold2 - hot2) - (cold1 - hot1), at a fixed CPU frequency.

Startup vs. desktop environment

We saw above that the amount of system libraries in the page cache directly influences application startup times. And not all GNU/Linux systems are made equal. While the above times were obtained under a GNOME environment, some other desktop environments don't use the same base libraries, which can force Firefox to load more of them at cold startup. The most used environment besides GNOME is KDE, and here is what cold startup looks like under KDE:

Average startup time (ms)
GNOME cold cache 3382.0 ± 0.51%
KDE cold cache 4031.9 ± 0.48%

It's significantly slower, yet not as slow as the entirely cold cache case. This is due to KDE not using (thus not pre-loading) some of the GNOME core libraries, while still using some libraries in common, such as libc (obviously) or dbus.

2011-01-03 16:59:17+0900

firefox, p.m.o | 4 Comments »

Efficient way to get files off the page cache

There's this great feature in modern operating systems called the page cache. Simply put, it keeps in memory what is normally stored on disk, and helps both read and write performance. While it's all nice for day-to-day use, it often gets in the way when one wants to track performance issues with a "cold cache" (when the files you need to access are not in the page cache yet).

A commonly used command to flush the Linux page cache is the following:

# echo 3 > /proc/sys/vm/drop_caches

Unfortunately, its effect is broad, and it flushes the whole page cache. When working on cold startup performance of an application like Firefox, what you really want is to have your page cache in a state close to what it was before you started the application.

One way to get in a better position than flushing the entire page cache is to reboot: the page cache will be filled with system and desktop environment libraries during the boot process, making the application startup closer to what you want. But it takes time. A whole lot of it.

In one of my "what if" moments, I wondered what happens to the page cache when using posix_fadvise with the POSIX_FADV_DONTNEED hint. Guess what? It actually reliably flushes the page cache for the given range in the given file. At least it does so with Debian Squeeze's Linux kernel. Provided you have a list of files your application loads that aren't already in the page cache, you can flush these files, and only these, from the page cache.

The following source code compiles to a tool to which you give a list of files as arguments, and that flushes these files:

#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>
int main(int argc, char *argv[]) {
  int i, fd;
  for (i = 1; i < argc; i++) {
    if ((fd = open(argv[i], O_RDONLY)) != -1) {
      struct stat st;
      fstat(fd, &st);
      posix_fadvise(fd, 0, st.st_size, POSIX_FADV_DONTNEED);
      close(fd);
    }
  }
  return 0;
}
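
To check the effect of the flush, one can count how many pages of a file are currently resident in the page cache; here is a small companion sketch (not part of the tool above) using mincore():

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Print how many pages of the given file are in the page cache. */
int main(int argc, char *argv[]) {
  if (argc < 2)
    return 1;
  int fd = open(argv[1], O_RDONLY);
  struct stat st;
  fstat(fd, &st);
  size_t pagesize = sysconf(_SC_PAGESIZE);
  size_t pages = (st.st_size + pagesize - 1) / pagesize;
  /* mmap() alone doesn't fault anything in; mincore() then reports
     which pages of the mapping are resident. */
  void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
  unsigned char *vec = malloc(pages);
  mincore(map, st.st_size, vec);
  size_t resident = 0;
  for (size_t i = 0; i < pages; i++)
    resident += vec[i] & 1;
  printf("%zu/%zu pages resident\n", resident, pages);
  free(vec);
  munmap(map, st.st_size);
  close(fd);
  return 0;
}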

It's actually a pretty scary feature, especially in multi-user environments, because any user can flush any file she can open, repeatedly, possibly hurting system performance. By the way, on systems using lxc (and maybe other containers, I don't know), running the echo 3 > /proc/sys/vm/drop_caches command from a root shell in a container does flush the host page cache, which could be dangerous for VPS hosting services using these solutions.

Update: I have to revise my judgment; it appears posix_fadvise(,,,POSIX_FADV_DONTNEED) doesn't flush (parts of) files that are still in use by other processes. That still makes it very useful for my use case, but also makes it less dangerous than I thought. The drop_caches problem is still real with lxc, though.

2010-12-29 19:34:37+0900

miscellaneous, p.d.o, p.m.o | 11 Comments »

Easy Firefox-on-xulrunner builds

A lot of GNU/Linux distributions prefer to build Firefox as a xulrunner application, which works most of the time, except for a few breakages now and then, most usually because of the misuse of resource:/// URLs instead of resource://gre/.

While building distribution packages to check how Firefox-on-xulrunner behaves can be complicated, it's actually pretty straightforward using the Mozilla build system.

Just use the following .mozconfig, and start a build the usual way:

mk_add_options MOZ_BUILD_PROJECTS="xulrunner browser"
mk_add_options MOZ_MAKE_FLAGS=-j
ac_add_options --with-ccache
ac_add_app_options browser --enable-application=browser
ac_add_app_options browser --enable-default-toolkit=cairo-gtk2
ac_add_app_options browser --with-libxul-sdk=../xulrunner/dist
ac_add_app_options browser --enable-chrome-format=jar
ac_add_app_options xulrunner --enable-application=xulrunner
ac_add_app_options xulrunner --enable-url-classifier

A few notes about some of the options:

  • You may adjust MOZ_MAKE_FLAGS and remove --with-ccache
  • --enable-default-toolkit=cairo-gtk2 is obviously a GNU/Linux option. It may be replaced by the adequate toolkit for the platform you want to build for. It's only there to work around some shortcomings of the configure script and SDK (both of which should eventually be fixed).
  • --enable-chrome-format=jar is to be used because xulrunner doesn't support omni.jar yet (but will soon).
  • --enable-url-classifier is necessary because xulrunner doesn't include it by default, though Firefox does require it.

Once your build is finished, you just run the Firefox executable in $objdir/browser/dist/bin.

PS: Maybe I should add that to the MDN page about build options, or some other place (?).

2010-12-20 18:35:56+0900

p.m.o | 1 Comment »

Improving libxul startup I/O by hacking the ELF format

If you have been following this blog during the past months, you might have gathered that I've been focusing on I/O patterns during library initialization, more specifically for libxul.so. There's not much to do about the patterns themselves, except enduring them, or hacking ld.so and/or the toolchain. The latter is not necessarily easy, and won't benefit everyone until all Linux distros use an improved glibc and/or toolchain. It could take years before any actual result reaches end users.

But with a little creativity, one can overcome these limitations and improve the situation today. Here is a summary of my recent hacks.

Backwards static initializers

As Taras showed in the past, static initializers are executed in reverse order. It turns out this is an unfortunate design choice in gcc, and it also turns out not to impact all ELF binary files.

As far as I know, there are currently two different implementations of the static initializer call loop: one provided by gcc, and one provided by ld.so.

In the gcc case, the compiler creates a .ctors section, starting with 0xffffffff, ending with 0x00000000, and filled with pointers to the various static initializer functions, in ascending order. The compiler also injects an initializer function, called from .init, that goes through this .ctors section starting from the end (__do_global_ctors_aux in crtend.o). A similar .dtors section is created for functions to run when the library is unloaded, and the pointers in it are called in ascending order. The order in which .ctors and .dtors are executed doesn't actually matter in itself; all that matters is that they are executed in reverse order from each other, because that limits the risks of bad ordering. It is only unfortunate that .ctors was chosen to be the one that runs backwards.
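
Schematically, the call loop gcc generates looks like the following simplified sketch (not the actual crtend.o code), which is why the initializers end up running backwards:

/* Simplified sketch of __do_global_ctors_aux: .ctors is an array of
   function pointers starting with a -1 marker and ending with a 0
   terminator, and it is walked from the end back to the start. */
typedef void (*ctor_t)(void);
extern ctor_t __CTOR_LIST__[]; /* { (ctor_t)-1, ctor1, ctor2, ..., 0 } */

static void do_global_ctors_aux(void) {
  ctor_t *p = __CTOR_LIST__;
  while (*p)      /* skip to the 0 terminator... */
    p++;
  for (p--; *p != (ctor_t)-1; p--)
    (*p)();       /* ...then call the entries in reverse order */
}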

In the ld.so case, the compiler creates a .init_array section, filled with pointers to the various static initializer functions, in ascending order. It also stores the location and the size of this section in the PT_DYNAMIC ELF header. At runtime, ld.so itself reads this information and executes the static initializers in ascending order. A similar .fini_array section is created for functions to run when the library is unloaded, and ld.so goes through it in descending order.

The .init_array section is newer and, for backwards compatibility reasons, is not used on systems that have been using the .ctors approach forever. On the other hand, some newer ABIs, such as ARM EABI, use the new facility, which means they are not victims of backwards static initialization. I however wonder if backwards compatibility is still needed, considering how long .init_array support has been around (more than 11 years, according to the glibc git repository).

While we could obviously bug the gcc maintainers to reverse their .ctors execution order, or ditch .ctors support on glibc systems, neither of which would bring any immediate global benefit, we can also just reverse the .ctors section content in our own binaries. Please note the source code out there doesn't take care of reversing the .dtors section accordingly, because it is currently empty in libxul.so.
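
In essence, the transformation just swaps the pointers between the leading 0xffffffff marker and the trailing 0. A minimal sketch of that reversal, operating on the raw section content once it has been located in the file (assuming a 32-bit ELF with REL relocations, where the addends live in the section itself; with RELA relocations, the addends stored in the relocation entries would need the same swap):

#include <stddef.h>
#include <stdint.h>

/* Reverse the entries of a .ctors section image: ctors[0] is the
   0xffffffff marker, ctors[n - 1] the 0 terminator, and everything in
   between gets swapped in place. */
static void reverse_ctors(uint32_t *ctors, size_t n) {
  for (size_t lo = 1, hi = n - 2; lo < hi; lo++, hi--) {
    uint32_t tmp = ctors[lo];
    ctors[lo] = ctors[hi];
    ctors[hi] = tmp;
  }
}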

Relocations

As I mentioned in an earlier post, relocations are what makes it possible to load libraries at arbitrary locations. Unfortunately, a great number of relocations means awful I/O patterns: ld.so reads and applies them one by one, and while the kernel reads ahead in 128 KB chunks, when your relocation section exceeds that amount by more than an order of magnitude, I/O goes back and forth a lot between the relocation section and the places where the relocations are applied.

There are two classes of relocations: REL and RELA. In both cases, there are several types of relocations, some of which require an addend. The difference between the two classes is that in REL relocations, the addend is stored at the offset where the relocation takes place, while in RELA relocations, the addend is stored in the relocation entry itself. This means RELA relocations effectively waste space, as the addend ends up being stored twice: it is also stored at the offset where the relocation takes place, like with REL relocations, and though it could technically be nulled out there, the space is wasted in any case.

REL relocations thus take 2 words of storage, and RELA relocations 3. On x86, the ELF ABI only uses REL relocations and words are 32 bits, making each relocation take 8 bytes; on x86-64, the ELF ABI only uses RELA relocations and words are 64 bits, making each relocation take 24 bytes, 3 times the size of an x86 relocation. On ARM, both may be used, but apparently the compiler and the linker only use REL relocations.

These words of storage are used to hold 3 or 4 pieces of information:

  • r_offset: the offset at which the relocation takes place
  • r_info: contains both the relocation type and a symbol reference for the relocation. On 32-bit systems, the symbol reference is stored on 24 bits and the relocation type on 8 bits. On 64-bit systems, the symbol reference is stored on 32 bits and the relocation type on 32 bits (which means at least 24 bits are never going to be used: no current ABI uses more than 256 relocation types).
  • r_addend: The addend, only in RELA relocations
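
As a quick sanity check of the sizes above, the standard <elf.h> definitions can be used directly; here is a small throwaway program (an illustration, not part of any tool discussed here):

#include <elf.h>
#include <stdio.h>

int main(void) {
  /* REL entries (x86) are 2 words, RELA entries (x86-64) are 3. */
  printf("sizeof(Elf32_Rel)  = %zu\n", sizeof(Elf32_Rel));  /* 8 bytes */
  printf("sizeof(Elf64_Rela) = %zu\n", sizeof(Elf64_Rela)); /* 24 bytes */
  /* r_info packs the symbol reference together with the relocation type. */
  Elf32_Word info32 = ELF32_R_INFO(42, R_386_RELATIVE);
  Elf64_Xword info64 = ELF64_R_INFO(42, R_X86_64_RELATIVE);
  printf("sym=%u type=%u\n", (unsigned)ELF32_R_SYM(info32),
         (unsigned)ELF32_R_TYPE(info32));
  printf("sym=%lu type=%lu\n", (unsigned long)ELF64_R_SYM(info64),
         (unsigned long)ELF64_R_TYPE(info64));
  return 0;
}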

The linker splits relocations into two main categories: those for dynamic linking at startup, and those for procedure linkage table (PLT) resolution at runtime. Here is what the breakdown looks like in libxul.so (taken from the mozilla-central nightly from November 22nd):

               dynamic relocations       PLT relocations
Architecture   number     size (bytes)   number   size (bytes)
x86            231,836    1,854,688      3,772    30,176
x86-64         235,893    5,661,432      3,773    90,528

So, PLT relocations are quite negligible, and dynamic relocations are just huge, especially on x86-64. Compared to the binary size, it's pretty scary:

Architecture   libxul.so size (bytes)   relocations size (bytes)   %
x86            21,869,684               1,884,864                  8.61%
x86-64         29,629,040               5,751,984                  19.41%

Let me remind you that not only do we need to read these relocations at startup (at least the dynamic ones; the PLT ones are just negligible), but we only read them in chunks, and need to read other data in between.

Of the many existing types of relocations, only a few are actually used in dynamic libraries. Here is a summary of those in use in libxul.so:

x86 relocation type   number    x86-64 relocation type   number
dynamic relocations
R_386_TLS_DTPMOD32    1         R_X86_64_DTPMOD64        1
R_386_GLOB_DAT        213       R_X86_64_GLOB_DAT        238
R_386_32              27,643    R_X86_64_64              27,611
R_386_RELATIVE        203,979   R_X86_64_RELATIVE        208,043
PLT relocations
R_386_JUMP_SLOT       3,772     R_X86_64_JUMP_SLOT       3,773

The above types can be separated into two distinct categories:

  • Those requiring symbol resolution (R_*_GLOB_DAT, R_*_JUMP_SLOT, R_386_32, R_X86_64_64)
  • Those not requiring symbol resolution (R_*_RELATIVE, R_386_TLS_DTPMOD32, R_X86_64_DTPMOD64)

In the second category, we are wasting all the bits for the symbol reference in each entry: 24 on x86 and 32 on x86-64. Unfortunately, the vast majority of the relocations in libxul.so fall into that category. At this point, this is how much waste we have identified:

description                          num. entries   waste per entry (bytes)   total (bytes)   % of relocation size
x86
Relocation without symbol ref.       203,980        3                         611,937         32.46%
x86-64
Duplicated addend                    239,666        8                         1,917,328       33.33%
Relocation type encoded on 32 bits   239,666        3                         718,998         12.49%
Relocation without symbol ref.       208,044        4                         832,176         14.46%
Total                                                                         3,468,502       60.30%

On top of that, one could argue that we don't need 24 or 32 bits for the symbol reference (which is the entry number in the symbol table), and that we're really far from requiring 64 bits for r_offset on 64-bit systems, as libxul.so is far from being bigger than 4GB. Anyways, that means we have room for improvement here, and it means that the ELF format, especially on 64-bit systems with RELA relocations, really doesn't help with big programs or libraries that have a lot of relocations. (For that matter, the Mach-O format used on OSX has somewhat similar issues; I don't know about the DLL format.)

Another interesting data point is how relocations' r_offset values follow each other once the relocations are reordered (to avoid too much symbol resolution, they are actually grouped by symbol). It turns out 206,739 relocations have an r_offset 4 bytes after the previous one on x86 (i.e. strictly following it), and 210,364 are 8 bytes after the previous one on x86-64. The main reason this happens is that most relocations occur in pointer tables, and libxul.so has lots of them:

  • That's what you get with virtual methods in C++.
  • That's also what you get in .ctors and .dtors sections, which are relocated as well (but not .init_array and .fini_array sections).
  • That's what you get when you use function tables.
  • Obviously, that's what you get when you use pointer tables.

And when I say libxul.so has lots of them, that's not an understatement: there are 184,320 relocations due to vtables (the function tables you get when using virtual methods in C++) on x86, and 184,098 on x86-64. That's respectively 78.23% and 76.81% of the total number of relocations. Interestingly, while most of these relocations are R_*_RELATIVE, 26,853 are R_386_32 and 26,829 are R_X86_64_64. Except for a very few, these non-relative relocations are for the __cxa_pure_virtual symbol, which is a dummy function that does nothing, used for pure virtual methods such as the following:

class Abstract {
public:
virtual void pure_virtual() = 0;
};

As for the remaining relative relocations, a few pointer tables are responsible for several thousand relocations each, such as JSString::length2StringTable, gEntries, GkAtoms_info, Html5Atoms_info, kUnixCharsets, nsIDOMCSS2Properties_properties, etc.
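
To make the pointer-table point concrete, here is a trivial made-up example. Built into a shared library (e.g. with gcc -shared -fPIC), each entry of the table gets its own R_386_RELATIVE/R_X86_64_RELATIVE relocation, since ld.so has to rebase every stored pointer to wherever the library was actually loaded (readelf -r shows them):

/* Hypothetical example, not from the Mozilla code base. */
static const char a[] = "a", b[] = "b", c[] = "c";

/* Three stored pointers, thus three relative relocations in the
   resulting shared library (they typically end up in .data.rel.ro,
   one of the coloured areas in the graphs of the previous posts). */
const char *const table[] = { a, b, c };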

Packing relocations

While they are not detrimental to most binaries, we saw that relocations can have a bad effect on startup I/O for big binaries. Reducing how much space they take would have a direct impact on that startup I/O. One way to do so is obviously to reduce the number of relocations, but as we saw above, the vast majority are due to vtables, and while reducing vtables would be a nice goal, it would take a long while before that could have a significant impact on the number of relocations. So another possible way is to try to avoid wasting so much space in the relocations themselves.

The main problem is that relocations are applied by ld.so and, as mentioned in the introduction, while we can hack ld.so to do what we want, it won't work everywhere for a long time. Moreover, it's difficult to test various implementations, and binary compatibility needs to be guaranteed once something lands in the upstream libc.

The idea I started to test a few weeks ago is to effectively apply relocations by hand instead of relying on ld.so to do it. Considering that the first thing ld.so does with a library after applying relocations is to call its DT_INIT function, the idea was to replace the dynamic relocations we can handle with a new, injected DT_INIT function that applies those relocations itself, with the relocations stored in a more efficient manner. The function obviously needs to call the original DT_INIT function once done, and some parts of the PT_DYNAMIC section also need to be adjusted accordingly.

The first iteration I tested a few weeks ago helped validate the idea, but didn't change much in the binary: the dynamic relocation section still took all the space it took before, though its content was mostly empty (we'll see further below how efficient even the current dumb format is). Once the theory was validated that an injected DT_INIT function could apply the relocations without breaking everything, I went on to implement a program that would actually strip the ELF binary. I won't get too much into the gory details, but it required PT_LOAD splitting, thus program header growth, which means moving the following sections, then removing a huge part of the relocations, shrinking the relocation section, injecting our code and the packed relocations as sections, and moving the remaining sections after these new sections, to avoid empty space while still respecting the proper offsets for both PT_LOAD and the code itself.

So far, I have only taken care of the R_*_RELATIVE relocations. Some, but not all, other relocations could be taken care of too, but anything involving symbol resolution from other libraries would be really difficult to handle from the injected code. The relocations are currently stored in a quite dumb format (a sketch of how such entries are applied follows the list):

  • a 32-bit word containing r_offset
  • a 32-bit word containing the number of relocations to apply consecutively
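
Applying entries in that format boils down to something like the following minimal sketch (this is not the actual injected code; the names are made up, only R_*_RELATIVE relocations with their addend stored in place are considered, and the real injected DT_INIT would then call the original DT_INIT):

#include <stdint.h>

/* One packed entry: a starting r_offset and a count of relocations
   applying to consecutive pointer-sized words from there. */
struct packed_rel {
  uint32_t offset; /* r_offset of the first relocation of the run */
  uint32_t count;  /* number of consecutive relocations */
};

/* Apply relative relocations: add the library's load address to the
   addend already stored at each target location. */
static void apply_packed_relative(const struct packed_rel *rel,
                                  const struct packed_rel *end,
                                  uintptr_t base) {
  for (; rel < end; rel++) {
    uintptr_t *target = (uintptr_t *)(base + rel->offset);
    for (uint32_t i = 0; i < rel->count; i++)
      target[i] += base;
  }
}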

As we only take care of one type of relocation, we can skip any information about the type. Also, the addend already being stored at r_offset, we just strip it from RELA relocations; and the binary being far from 4GB big, 32 bits is enough for offsets on 64-bit systems too. We also saw that a lot of relocations directly follow the preceding one, so instead of storing r_offset for each one, storing some kind of range is more efficient. As I've so far been focusing on getting something that actually works more than on getting something really efficient, this is the only format I tested, but the results are already nice:

                                      x86                   x86-64
before
dynamic relocations (bytes)           1,854,688             5,661,432
libxul.so size (bytes)                21,869,684            29,629,040
after
dynamic relocations (bytes)           222,856               668,400
injected code size (bytes)            129                   84
packed relative relocations (bytes)   224,816               228,568
libxul.so size (bytes)                20,464,836 (-6.42%)   24,865,520 (-16.07%)

Getting rid of the __cxa_pure_virtual relocations would help getting this down even further, and an even better storage format could be used too. Gathering new startup coverage data should be interesting as well.

The code is available via mercurial. Please note the Makefile deeply sucks and that the code is in the middle of its third refactoring. Anyways, once you get the code to compile, you just need to run test-x86 or test-x64, depending on your architecture, from the build directory, and give it the libxul.so file as an argument. It will create a libxul.so.new file that you can use to replace libxul.so. You may successfully use the tool on other libraries, but chances are you'll bump into an assert(), or, in the worst case, the resulting library will end up broken. The tool might work on PIE executables, but most probably not on other executables. If you wonder, Chrome is not PIE. It's not even relocatable (which means ASLR won't work with it); or at least the x86-64 binaries I just got aren't.

More possible ELF hacking

The code I wrote to pack relocations allows various tricks to be played on ELF files, and it opens up various opportunities, such as:

  • Aligning .bss. As I wrote on this blog a while ago, ld.so fills the last page of the .data section with zeroes, starting at the offset of the .bss section. Unfortunately, this means reading ahead at most 128KB of data nobody needs, and the corresponding disk seek. Inserting a section full of zeroes between .data and .bss would get rid of that, while wasting only minimal space (less than 4KB in a 20MB+ binary).
  • Trying to move the PT_DYNAMIC section. It is one of the first bunch of data to be read from programs and libraries, yet it is near the end of the binary.
  • Replacing .ctors/.dtors with .init_array/.fini_array (thus removing the corresponding relocations, though there's only around 300 of them).
  • Replacing the __cxa_pure_virtual relocations with relative relocations pointing to an injected dummy function (it could even just point to any ret code in any existing function).
  • Replacing some internal symbol resolutions with relative relocations (which should be equivalent to building with -Bsymbolic)
  • etc.

Some of the above could probably be done with linker scripts, but using linker scripts is unnecessarily complicated: as soon as you start using a linker script to fiddle with sections, you need to reimplement the entire default linker script (the one you can see with ld --verbose). And it's not guaranteed to work (Taras told me he tried hard, and failed, to align the .bss section). And as far as I know, you can't use the same script for both ld and gold (as I understand it, gold doesn't support the whole ld linker script syntax).

2010-11-23 15:44:19+0900

firefox, p.m.o | 24 Comments »

Debian Mozilla team APT archive now signed

You may have been using the repository described in a previous blog post. You may also have noticed that, despite a lack of announcements, all beta releases but 4.0beta6 have subsequently been made available in that repository (moreover, within 24 hours of the upstream release). I just made a change to that repository which will most likely make your APT complain, but all the necessary instructions are given on the repository itself.

2010-11-20 14:45:57+0900

firefox | 5 Comments »