Archive for January, 2011

Reducing Firefox startup I/O in libxul.so

I've been working, during the past few months, on Firefox startup performance, with a particular focus on I/O, even more particularly focusing on the core library: libxul. As I've been saying all along, my current work is focused on GNU/Linux systems, though some of it has a direct impact on OSX, and similar work is most probably applicable on Windows.

The data in this post has been gathered with the same setup as described in the introduction of the previous post. That post also contains a view of how much impact I/O has on startup. Unfortunately, the data gathered here pre-dates the findings about the combined effect of I/O and CPU scaling, making the results here completely dependent on my system, though the figures are still interesting. For comparison, when I was doing preliminary timings using the Core 2 Duo instead of the i7 system, I was seeing greater improvements in terms of percentage of startup time.

All startup times in this post are cold cache (after boot) startup times. For reference, here are the startup times for a vanilla Firefox 4.0b8 build on both x86 and x86-64:

          Average startup time (ms)
x86       3,228.76 ± 0.57%
x86-64    3,382.0 ± 0.51%

Thanks to a systemtap script I will detail in a subsequent post, I was also able to record what is being accessed from libxul.so and when, giving the following access pattern graphs:


libxul.so access patterns on x86

libxul.so access patterns on x86-64

For those following this blog, these graphs might look kind of familiar, though they have been simplified for the purpose of this post.

  • The red dots represent pages read from disk.
  • The light red lines roughly represent disk seeks within the file, though not exactly, as the disk sometimes seeks to entirely different locations for other files. They nevertheless help visualize what is going on.
  • The coloured areas represent, respectively, the .rel.dyn/.rela.dyn, .text, .rodata, .data.rel.ro and .data sections of the libxul.so file.

Please note that these graphs are not necessarily fully correlated with the startup times, first because I didn't average over 50 startups as for the timings, and second because systemtap might be introducing some noise. They are still helpful to visualize what is happening with the various patches I'll be testing below. Please also note that all graphs for a given architecture share the same horizontal axis scale, though the scale itself is not specified.

Another interesting metric to follow along the way is the amount of data the kernel reads from the libxul.so file during startup. These are obtained from the same systemtap script. For the same vanilla 4.0b8 builds as above, this is what we get:

          Pages read    Bytes read
x86       4,787         19,607,552
x86-64    5,874         24,059,904
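
As an aside, the figures above come from the systemtap script, but a rough approximation of the same metric can be obtained after a cold startup by checking how many pages of the file ended up in the page cache, for instance with mmap() and mincore(). Here is a minimal, hypothetical helper along those lines (not the tool actually used for the numbers in this post):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstdio>
    #include <vector>

    int main(int argc, char* argv[]) {
      if (argc != 2) {
        std::fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
      }
      int fd = open(argv[1], O_RDONLY);
      if (fd < 0) {
        std::perror("open");
        return 1;
      }
      struct stat st;
      if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return 1;
      }
      size_t page_size = sysconf(_SC_PAGESIZE);
      size_t pages = (st.st_size + page_size - 1) / page_size;
      void* map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
      if (map == MAP_FAILED) {
        std::perror("mmap");
        close(fd);
        return 1;
      }
      // One byte per page; bit 0 tells whether the page is in the page cache.
      std::vector<unsigned char> vec(pages);
      if (mincore(map, st.st_size, &vec[0]) == 0) {
        size_t resident = 0;
        for (size_t i = 0; i < pages; i++)
          resident += vec[i] & 1;
        std::printf("%zu/%zu pages in page cache (%zu bytes)\n",
                    resident, pages, resident * page_size);
      }
      munmap(map, st.st_size);
      close(fd);
      return 0;
    }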

Static Initializers

One big part of the startup time in the above graphs appears, in the thumbnails, as a decreasing straight line: it is the part of library initialization that runs static initializers. One way to decrease I/O at startup could thus be to reduce the number of static initializers.

When I started investigating them, there were 237 static initializers, 147 of which were due to cycle collection globals. We landed some patches from bug 569629 and bug 502176, which got the overall number down, but in 4.0b8 it jumped to 316, while the number of cycle collection globals went up to 181... which is kind of depressing. It means I'll have to go back to my patch queue to see what can be done about these, and I will write some more about how innocent-looking constructs can create static initializers, and how to avoid them.
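
To give an idea of what such innocent-looking constructs are, here is a minimal, made-up example (the names are hypothetical, this is not actual Mozilla code):

    #include <string>

    // This innocent-looking global has a non-trivial constructor, so the
    // compiler emits a static initializer that runs (and touches code and
    // data pages) at library load time, before main():
    static const std::string kPrefName("browser.startup.homepage");

    // A plain character array is pure .rodata and needs no initializer:
    static const char kFlatPrefName[] = "browser.startup.homepage";

    int main() {
      return kPrefName.size() == sizeof(kFlatPrefName) - 1 ? 0 : 1;
    }

The first global needs constructor code to run at load time, while the second is nothing more than read-only data.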

I did, however, come up with a hack to reduce the number of static initializers due to cycle collection to only one, which already gives a good reduction. Here are the startup results for 4.0b8 with the patches attached to that bug applied (plus a two-liner patch due to two additional cycle collected classes):

          Average (ms)        Pages read    Bytes read
x86       3,216.1 ± 0.59%     4,656         19,070,976
x86-64    3,488.14 ± 0.75%    5,759         23,588,864

The results are not exactly thrilling, and there actually is a regression on x86-64. While the amount of data read slightly decreased, the startup time regressed because of the I/O patterns.


libxul.so access patterns on x86

libxul.so access patterns on x86-64

While I/O due to static initializers was reduced (and now has a tortured shape), the remaining I/O got much worse, because some things that were previously read during static initialization aren't anymore, and end up being read later. Reducing static initializers thus looks like a worthwhile goal, because they represent a large portion of the startup time, but it needs some help to be more effective.
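
As an aside, the general shape of the hack mentioned above, turning many static initializers into a single one, can be sketched as follows. This is a hypothetical pattern, not the actual cycle collector patches:

    typedef void (*RegisterFunc)();

    void RegisterFoo();
    void RegisterBar();

    // A constant array of function pointers is plain data (it only needs
    // relocations), so it does not add any static initializer by itself.
    static const RegisterFunc kRegisterFuncs[] = { RegisterFoo, RegisterBar };

    // The one static initializer left: run all registrations from here.
    struct RunAllRegistrations {
      RunAllRegistrations() {
        for (size_t i = 0; i < sizeof(kRegisterFuncs) / sizeof(kRegisterFuncs[0]); ++i)
          kRegisterFuncs[i]();
      }
    };
    static RunAllRegistrations gRunAllRegistrations;

    void RegisterFoo() {}
    void RegisterBar() {}

    int main() { return 0; }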

Reordering objects

Another way to reduce I/O at startup is to arrange the binary such that a read-ahead from the kernel grabs something that is going to be required soon. Taras investigated this a long time ago, but it unfortunately requires a toolchain change. Well, technically, it doesn't, but reordering functions with GNU ld is so much of a pain that it's simpler to wait for Mozilla to switch to gold. In the interim, there is still a way to get some nice improvements in the binary layout: reordering objects.

On both OSX and GNU/Linux, the static linker doesn't do much in terms of binary layout, and just takes the files in the order they come on the command line. The OSX linker has some optimizations for data sections, and newer versions of the GNU linker have some other niceties, but overall, the order on the command line still prevails.

Using Taras' icegrind, I gathered the order in which object files' code sections are being accessed during startup, and hacked the build system to reorder the object files on the static linkage command line.
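
The build system hack itself is not worth showing, but conceptually the reordering boils down to something like the following sketch, which takes the access order recorded from the profile and the original link order, and prints a new link order. File names and formats here are assumptions, not the actual scripts used:

    #include <algorithm>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>

    // Read one object file name per line from the given file.
    static std::vector<std::string> read_list(const char* path) {
      std::vector<std::string> result;
      std::ifstream input(path);
      std::string line;
      while (std::getline(input, line))
        if (!line.empty())
          result.push_back(line);
      return result;
    }

    int main(int argc, char* argv[]) {
      if (argc != 3) {
        std::cerr << "usage: " << argv[0]
                  << " startup-order.txt original-link-order.txt" << std::endl;
        return 1;
      }
      std::vector<std::string> profiled = read_list(argv[1]);
      std::vector<std::string> original = read_list(argv[2]);

      // Objects seen during startup come first, in the order they were
      // first accessed, but only if they are part of this link.
      for (size_t i = 0; i < profiled.size(); ++i)
        if (std::find(original.begin(), original.end(), profiled[i]) != original.end())
          std::cout << profiled[i] << " ";

      // Everything else keeps its original relative order at the end.
      for (size_t i = 0; i < original.size(); ++i)
        if (std::find(profiled.begin(), profiled.end(), original[i]) == profiled.end())
          std::cout << original[i] << " ";

      std::cout << std::endl;
      return 0;
    }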

The improvements look quite good:

          Average (ms)        Pages read    Bytes read
x86       2,939.18 ± 0.81%    4,129         16,912,384
x86-64    3,247.64 ± 0.68%    5,254         21,520,384

I/O patterns look better, too:


libxul.so access patterns on x86

libxul.so access patterns on x86-64

The profile used to get the object order here was actually quite old, which means getting a new profile from 4.0b8 would have given better results. Further tweaking would make them look even better.

I only tested a couple of times, but this reordering also seems to improve OSX startup times, though an OSX-specific profile would be better.

Combining both

We saw earlier that the reduction of static initializers led to more I/O later on, making it not that interesting. But what happens when the binary layout is improved as above? It turns out that gives a bigger win:

          Average (ms)        Pages read    Bytes read
x86       2,851.06 ± 0.54%    4,125         16,896,000
x86-64    3,085.04 ± 0.76%    5,230         21,422,080

As above, the profile is quite old, but the I/O patterns still look much better:


libxul.so access patterns on x86

libxul.so access patterns on x86-64

There is other pending work to reverse the order in which the static initializers run, which I expect would improve things slightly. I didn't time it, however.

Packing relocations

Yet another way to reduce I/O at startup is to avoid going back and forth between data sections and relocation sections of libxul, which can be seen on the left side of all the graphs above. While it doesn't entirely do so, the tool I came up with (corresponding bug, and some explanations) reduces these zigzags and also reduces the dynamic relocations section size significantly (reducing the binary size is also an obvious way to reduce I/O).
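
To give an idea of the principle behind such packing, here is a conceptual sketch, not what the tool actually emits: relative relocations all do the same thing, so instead of storing one full ELF relocation record per address, compact deltas can be stored and applied by a small piece of code running very early at load time.

    #include <cstddef>
    #include <stdint.h>

    // A packed form of relative relocations: offsets are stored as deltas
    // from the previous one instead of full ELF relocation records.
    struct PackedRelativeRelocs {
      const uint32_t* deltas;
      size_t count;
    };

    // A relative relocation just adds the library's load address to the
    // value stored at the target location.
    void ApplyRelativeRelocs(char* base, const PackedRelativeRelocs& packed) {
      uintptr_t offset = 0;
      for (size_t i = 0; i < packed.count; i++) {
        offset += packed.deltas[i];
        *reinterpret_cast<uintptr_t*>(base + offset) +=
            reinterpret_cast<uintptr_t>(base);
      }
    }

    int main() {
      // Toy demonstration: "relocate" two slots of a fake image.
      uintptr_t image[4] = { 0, 0, 0, 0 };
      uint32_t deltas[] = { 1 * sizeof(uintptr_t), 2 * sizeof(uintptr_t) };
      PackedRelativeRelocs packed = { deltas, 2 };
      ApplyRelativeRelocs(reinterpret_cast<char*>(image), packed);
      return image[1] == reinterpret_cast<uintptr_t>(image) ? 0 : 1;
    }

Here are the corresponding startup figures: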

          Average (ms)        Pages read    Bytes read
x86       3,149.32 ± 0.62%    4,443         18,198,528
x86-64    3,191.58 ± 0.62%    4,733         19,386,368

The improvement is greater on x86-64 because of how much bigger the relocation section is on this architecture, as explained earlier.


libxul.so access patterns on x86

libxul.so access patterns on x86-64

The I/O patterns are not quite optimal there, as can be seen when looking closely at the above graphs, yet they still give a nice improvement while only altering the envelope. There are plans to make the hack tool even better in the future.

All of the above

Overall, we get a substantial startup time improvement by looking at only one file, while there is still room for further improvement.

          Average (ms)        Pages read    Bytes read
Before
x86       3,228.76 ± 0.57%    4,787         19,607,552
x86-64    3,382.0 ± 0.51%     5,874         24,059,904
After
x86       2,820.84 ± 0.83%    3,782         15,491,072
x86-64    2,884.36 ± 0.52%    4,090         16,752,640

libxul.so access patterns on x86 (before)

libxul.so access patterns on x86 (after)

libxul.so access patterns on x86-64 (before)

libxul.so access patterns on x86-64 (after)

2011-01-04 20:06:58+0900

p.m.o

The measured effect of I/O on application startup

I did some experiments with the small tool I wrote in the previous post in order to gather some startup data about Firefox. It turns out it can't flush directories and other metadata from the page cache, making it unfortunately useless for what I'm most interested in.
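
For reference, the per-file part of that flushing is the easy bit; a minimal sketch, assuming it is done with posix_fadvise() (which only drops a file's data pages, consistent with directories and other metadata staying cached), looks like this:

    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>

    // Drop the page cache contents for one file.
    static int flush_file(const char* path) {
      int fd = open(path, O_RDONLY);
      if (fd < 0) {
        std::perror(path);
        return 1;
      }
      // Make sure any dirty pages are on disk, then drop the cached pages.
      fdatasync(fd);
      int err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
      close(fd);
      return err ? 1 : 0;
    }

    int main(int argc, char* argv[]) {
      int ret = 0;
      for (int i = 1; i < argc; i++)
        ret |= flush_file(argv[i]);
      return ret;
    }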

So, I gathered various startup information about Firefox, showing how the page cache (thus I/O) has a huge influence on startup. The data in this post are mean times and 95% confidence intervals for 50 startups with an existing but fresh profile, in a single-processor GNU/Linux x86-64 virtual machine (using kvm) with 1GB RAM and a 10GB raw hard drive partition over USB, running, except where said otherwise, on an i7 870 (up to 3.6GHz with Intel Turbo Boost). The operating system itself is an up-to-date Debian Squeeze running the default GNOME environment.

Firefox startup time is measured as the difference between the time in ms right before starting Firefox and the time in ms as returned by JavaScript in a data:text/html page used as the home page.
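
In other words, the measurement boils down to something like this minimal sketch (the details here, like passing the URL on the command line instead of setting it as the home page, are assumptions):

    #include <sys/time.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
      // Record the wall clock (in ms) right before launching the browser.
      struct timeval tv;
      gettimeofday(&tv, NULL);
      long long before = tv.tv_sec * 1000LL + tv.tv_usec / 1000;
      std::fprintf(stderr, "launching at %lld ms\n", before);

      // The page's script puts Date.now() in the page title; subtracting
      // the value printed above from it gives the startup time.
      execlp("firefox", "firefox",
             "data:text/html,<script>document.title=Date.now()</script>",
             static_cast<char*>(NULL));
      std::perror("execlp");  // only reached if exec failed
      return 1;
    }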

Startup vs. page cache

                                              Average startup time (ms)
Entirely cold cache (drop_caches)             5887.62 ± 0.88%
Cold cache after boot                         3382.0 ± 0.51%
Selectively cold cache (see below)            2911.46 ± 0.48%
Hot cache (everything previously in memory)   250.74 ± 0.18%

The selectively cold cache case makes use of the flush program from the previous post and a systemtap script used to get the list of files read during startup. This script will be described in a separate post.

As you can see, profiling startup after echo 3 > /proc/sys/vm/drop_caches takes significantly more time than under the conditions users would normally experience, because all the system libraries that would normally be in the page cache have been flushed as well, which biases the view one gets of actual startup performance. Mozilla build bots were running, until recently, a ts_cold startup test that, as I understand it, had this bias (which is part of why it was stopped).

The Hot cache value is also interesting because it shows that the vast majority of cold startup time is due to hard disk I/O (and no, there is no difference in the number of page faults).

I/O vs. CPU

Interestingly, testing on a less beefy machine (Core 2 Duo 2.2GHz) with the same USB disk and kvm setup shows something not entirely intuitive:

                          Average (ms)
Entirely cold cache       6748.42 ± 1.01%
Cold cache after boot     3973.64 ± 0.53%
Selectively cold cache    3445.7 ± 0.43%
Hot cache                 570.58 ± 0.70%

I, for one, would have expected I/O bound startups to only be slower by around 320ms, which is roughly the hot cache startup difference, or, in other words, the CPU bound startup difference. But I figured I was forgetting an important factor.

I/O vs. CPU scaling

Modern processors do frequency scaling, which allows the processor to run slowly when underused, and faster when used, thus saving power. It was first used on laptop processors to reduce power drawn from the battery, allowing batteries to last longer, and is now used on desktop processors to reduce power consumption. It unfortunately has a drawback, in that it introduces some latency when the scaling kicks in.

A not so nice side effect of frequency scaling is that when a process is waiting for I/O, the CPU is underused, which usually makes it run at its slowest frequency. When the I/O ends and the process runs again, the CPU can go back to full speed. This means every I/O can induce CPU scaling latency, on top of the latency due to e.g. disk seeks. And it actually has much more impact than I would have thought.
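
For the record, here is the kind of knob involved: a minimal sketch that forces every core's cpufreq governor to "performance", assuming the usual sysfs paths (whether this exact method was used for the numbers below is an assumption):

    #include <fstream>
    #include <sstream>
    #include <string>

    int main() {
      // Needs root. Stops at the first CPU without a cpufreq directory.
      for (int cpu = 0; ; cpu++) {
        std::ostringstream path;
        path << "/sys/devices/system/cpu/cpu" << cpu
             << "/cpufreq/scaling_governor";
        std::ofstream governor(path.str().c_str());
        if (!governor)
          break;
        governor << "performance" << std::endl;
      }
      return 0;
    }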

Here are results on the same Core 2 Duo, with frequency scaling disabled and the CPU forced to its top speed:

                          Average (ms)
Entirely cold cache       5824.1 ± 1.13%
Cold cache after boot     3441.8 ± 0.43%
Selectively cold cache    3025.72 ± 0.29%
Hot cache                 576.88 ± 0.98%

(I would have liked to do the same on the i7, but Intel Turbo Boost complicates things and I would have needed to get two new sets of data)

Update: I actually found a way to force one core to its maximum frequency and run the kvm processes on it, giving the following results:

                          Average (ms)
Entirely cold cache       5395.94 ± 0.83%
Cold cache after boot     3087.47 ± 0.31%
Selectively cold cache    2673.64 ± 0.21%
Hot cache                 258.52 ± 0.35%

I haven't gathered enough data to have accurate figures, but it also seems that forcing the CPU frequency to a fraction of the fastest supported frequency gives the intuitive results where the difference between all I/O bound startup times is the same as the difference for hot cache startup times. As such, I/O bound startup improvements would be best measured as an improvement in the difference between cold and hot cache startup times, i.e. (cold2 - hot2) - (cold1 - hot1), at a fixed CPU frequency.

Startup vs. desktop environment

We saw above that the amount of system libraries in the page cache directly influences application startup times. And not all GNU/Linux systems are created equal. While the above times were obtained under a GNOME environment, some other desktop environments don't use the same base libraries, which can mean Firefox has to load more of them at cold startup. The most used environment besides GNOME is KDE, and here is what cold startup looks like under KDE:

                     Average startup time (ms)
GNOME cold cache     3382.0 ± 0.51%
KDE cold cache       4031.9 ± 0.48%

It's significantly slower, yet not as slow as the entirely cold cache case. This is due to KDE not using (thus not pre-loading) some of the GNOME core libraries, while still having libraries in common with it, like e.g. libc (obviously) or dbus.

2011-01-03 16:59:17+0900

firefox, p.m.o