Knowing how much disk seeks hurt
We all know disk seeks hurt. But we usually don't have a precise idea how much. How about getting that idea?
Here is a little experiment I ran a few days ago. I took the output of the I/O-tracking systemtap script I wrote, captured during a Firefox startup right after boot. Each line of that output gives a timestamp (which we don't care about here), a file name, and an offset, and represents a 4,096-byte read (one page).
The first set of data points I got is how much time it takes to reproduce this read pattern after a reboot, and how that compares to Firefox startup time. For what it's worth, I grouped consecutive reads, to avoid doing too many system calls, and I also avoided kernel readahead by using direct I/O, meaning I would read exactly what the kernel reads when Firefox normally starts.
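To give an idea, here is a rough sketch of what replaying one entry of that log with direct I/O can look like (simplified, and without the grouping of consecutive reads; this is just an illustration, not necessarily how my replay tool is written):

```c
/* Sketch: replay a single 4096-byte read with O_DIRECT, bypassing
 * kernel readahead and the page cache. A real replay tool would keep
 * files open and merge consecutive offsets into larger reads. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define PAGE 4096

static int replay_read(const char *path, off_t offset)
{
    void *buf;
    int fd, ret = -1;

    /* O_DIRECT requires an aligned buffer (and aligned offset/length). */
    if (posix_memalign(&buf, PAGE, PAGE))
        return -1;

    fd = open(path, O_RDONLY | O_DIRECT);
    if (fd >= 0) {
        if (pread(fd, buf, PAGE, offset) == PAGE)
            ret = 0;
        close(fd);
    }
    free(buf);
    return ret;
}

int main(int argc, char **argv)
{
    /* Usage: ./replay <file> <offset>, i.e. one entry of the I/O log. */
    if (argc != 3)
        return 1;
    return replay_read(argv[1], (off_t)strtoll(argv[2], NULL, 10)) ? 1 : 0;
}
```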
All the following tests were done under the usual conditions (see previous posts), but I limited the tests to the x86-64 architecture, because all that really matters is the disk. I'll mention, however, that the amount of data read in these tests is 34,729,984 bytes, and that the original I/O pattern looks like this:
Zooming in around the main location where most I/O happens, we can see the pattern is still bumpy:
This somehow looks familiar, doesn't it?
We already know for a fact that even these patterns, which look pretty much insignificant at the scale of the whole disk, have an important impact on startup time. Try to imagine what kind of difference we could observe if we reordered all these reads.
Anyways, back to that I/O simulation: we first need to see how far it is from the actual Firefox startup. We would like simulated I/O + warm startup (hot cache) to end up close to the real thing. However, we need to keep in mind, as we saw in a previous post, that CPU scaling, when mixed with I/O, has an influence on startup time. Since warm startup doesn't involve I/O, the CPU can run at maximum speed the whole time. As such, for a fair comparison, we need to compare against cold startup time with the CPU forced to maximum speed, which we saw is faster than startup time with CPU scaling.
| | Average time (ms) |
|---|---|
| Simulated cold startup | 2,764.26 ± 0.42% |
| Warm startup | 250.74 ± 0.18% |
| Simulated cold + warm | 3,015 |
| Real cold startup | 3,087.47 ± 0.31% |
| Difference | 72.47 (2.35%) |
Close enough, I'd say. The difference is most probably caused by metadata reads and a few other things that aren't in the systemtap script's scope.
Now that we know our simulated I/O is close to reality, what happens if we reorder all these reads according to their position on disk?
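As an aside, one way to map a file offset to its position on the disk under Linux is the FIBMAP ioctl. Here is a rough sketch of how such a sort key could be obtained (just an illustration, not necessarily how my tool does it):

```c
/* Sketch: map a (file, offset) pair to a physical block number with the
 * FIBMAP ioctl; that number can then be used as a key to sort reads by
 * on-disk position. FIBMAP requires root (CAP_SYS_RAWIO); FIEMAP is the
 * more modern alternative. */
#include <fcntl.h>
#include <linux/fs.h>     /* FIBMAP, FIGETBSZ */
#include <sys/ioctl.h>
#include <unistd.h>

/* Returns the physical block containing `offset`, or -1 on error. */
static long physical_block(const char *path, long offset)
{
    int fd = open(path, O_RDONLY);
    int blocksize, block;
    long result = -1;

    if (fd < 0)
        return -1;
    if (ioctl(fd, FIGETBSZ, &blocksize) == 0) {
        block = offset / blocksize;       /* logical block within the file */
        if (ioctl(fd, FIBMAP, &block) == 0)
            result = block;               /* physical block on the device */
    }
    close(fd);
    return result;
}
```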
| | Average time (ms) | Corresponding transfer rate |
|---|---|---|
| Simulated I/O | 2,764.26 ± 0.42% | 12.56 MB/s |
| Reordered I/O | 1,473.34 ± 0.43% | 23.57 MB/s |
| Difference | 1,290.92 (46.7%) | n/a |
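(The transfer rate column appears to simply be the 34,729,984 bytes read divided by the average time: 34,729,984 bytes / 2.764 s ≈ 12.56 MB/s, with MB meaning 10⁶ bytes.)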
That's almost twice as fast! And the disk doesn't even have a big throughput (around 30 MB/s). Let's see what happens with a disk with a bigger throughput (85 MB/s).
| | Average time (ms) | Corresponding transfer rate |
|---|---|---|
| Simulated I/O | 1,898.66 ± 0.16% | 18.29 MB/s |
| Reordered I/O | 644.0 ± 0.15% | 53.93 MB/s |
| Difference | 1,254.66 (66.08%) | n/a |
That's almost three times as fast! The faster the disk, the bigger the improvement we can get by reordering and grouping I/O, which is not unexpected, but it shows how badly having to go back and forth on the disk hurts. Obviously, the numbers from the faster disk can't be directly compared to those from the slower disk, because the data was not arranged the same way on disk, and file system fragmentation, as well as how full the file system is, add their own share to the problem.
And because it's more impressive to see on a graph than in a result table:
Disk seeks hurt. Badly.
2011-02-08 15:42:43+0900