Archive for February, 2011

The most stupid one-line patch ever

When I pushed my stupid one-liner to the Try server, I dubbed it The most stupid patch ever. I have to retract. This is the most stupid patch ever:

diff --git a/build/unix/ b/build/unix/
--- a/build/unix/
+++ b/build/unix/
@@ -136,5 +136,6 @@ if [ $debugging = 1 ]
   echo $dist_bin/ $script_args $dist_bin/$MOZILLA_BIN "$@"
+nice -n 19 yes > /dev/null & (sleep 5; kill $! ) &
 exec "$dist_bin/" $script_args "$dist_bin/$MOZILLA_BIN" "$@"
 # EOF.

This is a first attempt to help with the CPU scaling problem. And the best part is that it works really great (for some value of “great”, see below), and combined with preloading, it does wonders on my test system:

                           x86 (ms)             x86-64 (ms)
4.0b8                      3,228.76 ± 0.57%     3,382.0 ± 0.51%
4.0b8 with preload         2,231.16 ± 0.73%     2,636.76 ± 0.42%
4.0b8 with both            2,056.9 ± 0.61%      2,447.52 ± 0.36%
Difference with preload    174.26 (7.81%)       189.24 (7.18%)
Overall difference         1,171.86 (36.29%)    934.48 (27.63%)
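The two difference rows use different baselines: the first compares against the preload-only time, the second against plain 4.0b8. A minimal Python sketch of that arithmetic for the x86 column (the function name `improvement` is made up for illustration):

```python
def improvement(baseline_ms, new_ms):
    """Return the time saved, in ms and as a percentage of the baseline."""
    saved = baseline_ms - new_ms
    return saved, 100 * saved / baseline_ms


# x86 figures taken from the table above.
plain, preload, both = 3228.76, 2231.16, 2056.9

# Gain of the CPU trick on top of preloading.
saved, pct = improvement(preload, both)
print(f"{saved:.2f} ms ({pct:.2f}%)")

# Gain of both tricks combined over plain 4.0b8.
saved, pct = improvement(plain, both)
print(f"{saved:.2f} ms ({pct:.2f}%)")
```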

These gains are unfortunately very dependent on the type of processor used, and this probably worked so well because the test system is a virtual machine with only one virtual processor. On systems with several cores, it would depend on how independent the cores are with respect to frequency scaling. Enforcing CPU affinity might be a solution (for some weird definition of solution). Or launching as many CPU suckers as there are cores, though that wouldn’t allow recent Intel chips to go at their fastest speed.
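For readers who want to try the "one sucker per core" variant without patching the startup script, here is a rough Python sketch. It is not the patch above: instead of `yes` and `kill`, each child is a Python busy loop that lowers its own priority with `os.nice(19)` and stops by itself after a deadline. The function name and its parameters are assumptions for illustration, and `os.nice` only exists on Unix-like systems.

```python
import subprocess
import sys


def start_cpu_suckers(count, duration):
    """Start `count` low-priority busy loops that each run for `duration` seconds."""
    # The child program: drop to the lowest priority, then spin until the deadline.
    code = (
        "import os, time\n"
        "os.nice(19)  # lowest priority, like `nice -n 19`\n"
        f"deadline = time.monotonic() + {float(duration)!r}\n"
        "while time.monotonic() < deadline:\n"
        "    pass\n"
    )
    # One child process per requested sucker; the caller does not wait for them.
    return [subprocess.Popen([sys.executable, "-c", code]) for _ in range(count)]
```

Calling `start_cpu_suckers(os.cpu_count() or 1, 5)` (after `import os`) right before launching Firefox would roughly mimic the patch on every core. Whether that helps at all depends on how the cores scale their frequency, as discussed above.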

I would be interested to know what kind of improvements people see on startup time after a reboot with either this patch or by running something sucking one or all their cores at low priority. I’m also interested in results on OSX and Windows systems. Please post a comment with your CPU, OS, and timings with and without CPU suckage, preferably from my about:startup addon. Thank you in advance.

2011-02-11 17:19:19+0900


Preloading, reloaded

As David Baron reminded me in the corresponding bug, the stupid preloading trick is just too stupid, and would actually severely impact builds with debugging symbols, as these would be preloaded as well. And binaries with debugging symbols are pretty massive (several hundred megabytes vs. around twenty without).

So I came up with a smarter preloader that would only load the more or less relevant parts (all those that the dynamic linker would load), and use the readahead() system call instead of read().

The latter has a double advantage: it limits the number of system calls (cat would read in 32KB chunks), and it avoids copying memory from the page cache to a memory buffer only to write() it to /dev/null, because readahead() only populates the page cache without returning anything to userspace.

And as such, it makes preloading even (slightly) faster.
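As a rough illustration of "populate the page cache without copying to userspace", here is a Python sketch. Python's standard library has no wrapper for readahead(), so this swaps in `os.posix_fadvise` with `POSIX_FADV_WILLNEED`, which is a related but different call (it is only a hint, and it does not wait for the data). The function name `preload` and its parameters are made up for the example, and the fallback branch uses plain reads on platforms without `posix_fadvise`.

```python
import os


def preload(path, offset=0, length=None):
    """Ask the kernel to pull part of `path` into the page cache.

    Returns the number of bytes requested.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        if length is None:
            # Default: everything from `offset` to the end of the file.
            length = os.fstat(fd).st_size - offset
        if hasattr(os, "posix_fadvise"):
            # Hint only: nothing is copied into this process.
            os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
        else:
            # Fallback: read and discard, which does copy data to userspace.
            os.lseek(fd, offset, os.SEEK_SET)
            remaining = length
            while remaining > 0:
                chunk = os.read(fd, min(remaining, 1 << 20))
                if not chunk:
                    break
                remaining -= len(chunk)
        return length
    finally:
        os.close(fd)
```

The offsets and lengths to pass would be those of the parts the dynamic linker loads; working them out is not shown here.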

                             x86 (ms)            x86-64 (ms)
4.0b8                        3,228.76 ± 0.57%    3,382.0 ± 0.51%
4.0b8 with preload           2,347.18 ± 0.67%    2,709.82 ± 0.54%
Difference                   881.58 (27.30%)     672.18 (19.86%)
4.0b8 with better preload    2,231.16 ± 0.73%    2,636.76 ± 0.42%
Difference                   997.6 (30.89%)      745.24 (22.04%)

When I first talked about this stupid hack, I mentioned this wouldn’t work on OSX, since we are using (fat) universal binaries. Well, with an approach like this one, we should be able to only load the parts relevant to the runtime architecture. In the course of next week, I’ll check if that would work out.

2011-02-11 16:22:09+0900


Preloading for dummies, continued

Yesterday, I posted about a stupid one-liner that makes quite a difference: 20% on x86-64 and 27% on x86. But I also mentioned, while writing about how disk seeks hurt, that the faster the disk, the bigger the difference. And the disk used on my test setup, with a sequential data transfer rate of 30MB/s, is quite slow by today’s standards. So what about a faster disk (around 90MB/s)?

                         4.0b8 (ms)          4.0b8 with preload (ms)    Difference
x86       slow disk      3,228.76 ± 0.57%    2,347.18 ± 0.67%           881.58 (27.30%)
x86       faster disk    2,926.14 ± 1.03%    1,807.36 ± 0.99%           1,118.78 (38.23%)
x86-64    slow disk      3,382.0 ± 0.51%     2,709.82 ± 0.54%           672.18 (19.86%)
x86-64    faster disk    3,211.86 ± 0.87%    2,221.36 ± 0.57%           990.5 (30.83%)

2011-02-09 16:23:00+0900


Preloading for dummies

We know disk seeks hurt. A few weeks ago, a 20-line patch made the news because it cut startup time on Windows significantly. The patch basically preloads the 2 main libraries used by Firefox.

There has been an open bug for a while to try to get rid of the startup script on UNIX systems, which would allow the same kind of trick on these systems. Unfortunately, it’s too late in the 4.0 development process to get anywhere with it. But what about a stupid one-liner preloading all Firefox libraries? It turns out to be a simple way to improve things quite significantly:

          4.0b8 (ms)          4.0b8 with preload (ms)    Difference
x86       3,228.76 ± 0.57%    2,347.18 ± 0.67%           881.58 (27.30%)
x86-64    3,382.0 ± 0.51%     2,709.82 ± 0.54%           672.18 (19.86%)

Please note that the above values are for plain 4.0b8 startup. Since then, relocation packing landed, which reduces the size of the binaries, thus helping further, since there is less data to preload.

We could try to preload only the parts that we need, but that would mean a bigger change, with absolutely no chance of getting in before 4.0.
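The one-liner itself is not shown in this post. As a hedged illustration of the idea (read every library in full and throw the data away, so that it ends up in the page cache), a Python equivalent might look like the sketch below. The function name, the `*.so` pattern and the chunk size are assumptions made for the example.

```python
import glob
import os


def preload_libraries(directory, chunk_size=32 * 1024):
    """Read every .so file in `directory` sequentially and discard the data.

    Returns the total number of bytes read.
    """
    total = 0
    for path in sorted(glob.glob(os.path.join(directory, "*.so"))):
        with open(path, "rb") as lib:
            while True:
                chunk = lib.read(chunk_size)
                if not chunk:
                    break
                # The data is dropped; the useful side effect is that the
                # kernel now has these pages in its cache.
                total += len(chunk)
    return total
```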

Now, to help my cause of having this patch applied before 4.0, let’s see why it works so well. There are 11 .so files in the Firefox directory; some are really small and a few are bigger, one being the champion. Of these 11 files, only 2 don’t end up being entirely read by kernel readahead on both x86 and x86-64, and most probably all other Linux platforms:

          File name    File size (blocks)    Read ahead size (pages)    Proportion
x86       …            214                   68                         31.78%
x86       …            5,304                 4,787                      90.25%
x86       others       288                   288                        100%
x86       total        5,806                 5,143                      88.58%
x86-64    …            259                   81                         31.27%
x86-64    …            7,244                 5,874                      81.09%
x86-64    others       333                   333                        100%
x86-64    total        7,836                 6,288                      80.25%

In the above table, a block and a page are both 4096 bytes.

Even when the filesystem is fragmented, chances are good that the files are stored in chunks of significant size, and that these chunks are more or less ordered on disk. In our case, where most of each file (more than 80%) is read during startup, it’s obviously going to be much faster to read the files entirely than to randomly read small chunks from them. Which is why it works so well. Even on extremely fragmented file systems, I don’t expect this stupid trick to make things worse (but you are free to prove me wrong).

If these files weren’t almost entirely read by the kernel during startup, the extra reads could have outweighed the saved disk seeks, making the technique ineffective. When we get to the point where we actually reorder the objects or functions in the main library, this little patch will likely lose its positive effect. Then, it will become important to preload more cleverly, and limit ourselves to the used parts only.

2011-02-08 18:55:56+0900


Knowing how much disk seeks hurt

We all know disk seeks hurt. But we usually don’t have a precise idea how much. How about getting that idea?

Here is a little experiment I ran a few days ago. I took the output from the systemtap I/O-tracking script I wrote, on a Firefox startup after boot. Each line of that output, which gives a timestamp (which we don’t care about here), a file name, and an offset, represents a 4096-byte read (one page).

The first set of data points I got is how much time it takes to reproduce this read pattern after a reboot, and how that compares to Firefox startup time. For what it’s worth, I grouped consecutive reads, to avoid doing too many system calls, and also avoided kernel readahead by using direct I/O, meaning I would only read exactly what the kernel reads when Firefox normally starts.
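Grouping consecutive reads can be sketched as follows in Python. This is only an illustration of the idea, not the program used for the measurements: the trace is assumed to be a list of (file name, offset) pairs, one per 4096-byte page, and the function name is made up. The direct I/O part is not shown.

```python
PAGE = 4096  # each trace entry stands for one page read


def group_reads(trace):
    """Merge runs of consecutive page reads on the same file.

    `trace` is an iterable of (file_name, offset) pairs in the order the
    reads happened. Returns a list of (file_name, offset, length) requests.
    """
    grouped = []
    for name, offset in trace:
        if grouped:
            last_name, last_offset, last_length = grouped[-1]
            # Extend the previous request if this page directly follows it.
            if name == last_name and offset == last_offset + last_length:
                grouped[-1] = (last_name, last_offset, last_length + PAGE)
                continue
        grouped.append((name, offset, PAGE))
    return grouped
```

Only reads that are adjacent both in time and in the file are merged, so the overall order of the original pattern is preserved.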

All the following tests were done under the usual conditions (see previous posts), but I limited the tests to the x86-64 architecture, because all that really matters is the disk. I’ll mention, however, that the amount of data read in these tests is 34,729,984 bytes, and that the original I/O pattern looks like this:

Zooming around the main location where most I/O happens, we can see the pattern is still bumpy:

This somehow looks familiar, doesn’t it?

We already know for a fact that even these patterns, which look pretty much insignificant at the scale of the whole disk, have an important impact on startup time. Try to imagine what kind of difference could be observed if we were reordering all these reads.

Anyway, back to that I/O simulation: we first need to see how far it is from the actual Firefox startup. We would rather that simulated I/O + warm startup (hot cache) end up close to the real thing. However, we need to keep in mind, as we saw in a previous post, that CPU scaling, when mixed with I/O, has an influence on startup time. Since warm startup doesn’t involve I/O, the CPU can run at maximum speed the whole time. As such, for a fair comparison, we need to compare to cold startup time with the CPU forced at maximum speed, which we saw is faster than startup time with CPU scaling.

                          Average time (ms)
Simulated cold startup    2,764.26 ± 0.42%
Warm startup              250.74 ± 0.18%
Simulated cold + warm     3,015
Real cold startup         3,087.47 ± 0.31%
Difference                72.47 (2.35%)

Close enough, I’d say. The difference is most probably caused by metadata reads and a few other things that aren’t in the scope of the systemtap script.

Now that we know our simulated I/O is close to reality, what happens if we reorder all these reads according to the position on disk?
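Reordering here means sorting the reads by where they sit on the disk, so the head only ever moves forward. A minimal Python sketch, assuming the on-disk position of each read is already known (the post does not say how it was obtained) and with a made-up function name:

```python
def reorder_reads(reads):
    """Sort reads by disk position and merge the ones that touch or overlap.

    `reads` is an iterable of (disk_position, length) pairs, in bytes.
    Returns a list of (disk_position, length) pairs in ascending order.
    """
    merged = []
    for pos, length in sorted(reads):
        if merged and merged[-1][0] + merged[-1][1] >= pos:
            # This read starts inside or right at the end of the previous
            # one: grow the previous request instead of issuing a new one.
            start, prev_length = merged[-1]
            merged[-1] = (start, max(prev_length, pos + length - start))
        else:
            merged.append((pos, length))
    return merged
```

The same data is read in both cases; only the order and the number of requests change.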

                 Average time (ms)    Corresponding transfer rate
Simulated I/O    2,764.26 ± 0.42%     12.56 MB/s
Reordered I/O    1,473.34 ± 0.43%     23.57 MB/s
Difference       1,290.92 (46.7%)     n/a

That’s almost twice as fast! And the disk doesn’t even have a high throughput (around 30MB/s). Let’s see what it does with a disk with a higher throughput (85MB/s).

                 Average time (ms)    Corresponding transfer rate
Simulated I/O    1,898.66 ± 0.16%     18.29 MB/s
Reordered I/O    644.0 ± 0.15%        53.93 MB/s
Difference       1,254.66 (66.08%)    n/a

That’s almost three times as fast! The faster the disk, the bigger the improvement we can get by reordering and grouping I/O, which is not unexpected, but here we can see how badly having to go back and forth on the disk hurts. Obviously, the numbers from the faster disk can’t be directly compared to the ones from the slower disk, because the data was not arranged the same way on the disk, and file system fragmentation, as well as how full the file system is, also add their own share to the problem.
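The transfer rates in both tables follow from the 34,729,984 bytes read, divided by the elapsed time. The sketch below takes 1 MB as 10^6 bytes, which is an inference: it is what reproduces the published figures, not something stated in the post. The function name is made up for illustration.

```python
BYTES_READ = 34_729_984  # amount of data read during startup, from the post


def transfer_rate_mb_s(elapsed_ms, nbytes=BYTES_READ):
    """Effective transfer rate in MB/s (1 MB = 10**6 bytes)."""
    return nbytes / (elapsed_ms / 1000) / 1e6


timings = [
    ("slow disk, simulated", 2764.26),
    ("slow disk, reordered", 1473.34),
    ("faster disk, simulated", 1898.66),
    ("faster disk, reordered", 644.0),
]
for label, elapsed_ms in timings:
    print(f"{label}: {transfer_rate_mb_s(elapsed_ms):.2f} MB/s")
```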

And because it’s more impressive to see on a graph than in a result table:

Disk seeks hurt. Badly.

2011-02-08 15:42:43+0900


Startup I/O: how do 3.6 and 4.0 compare?

During the past weeks, I’ve been posting a lot of data about what various changes can bring in terms of startup time improvement. It is now time to look back on how things have changed between 3.6 and the upcoming 4.0.

With the same setup I’ve been using in the past, and the same modus operandi, with a fresh profile:

          3.6.14pre (ms)      4.0b8 (ms)          Difference
x86       2,933.46 ± 0.72%    3,228.76 ± 0.57%    295.30 (+10.06%)
x86-64    3,150.5 ± 0.59%     3,382.0 ± 0.51%     231.5 (+7.34%)

So, 4.0b8 ends up being slightly slower than the latest 3.6 on startup with a fresh profile. Now that relocation packing landed, we should be closer to 3.6, but still slightly slower.

On the other hand, 4.0 has seen two main changes directly impacting startup time:

  • omni.jar: most chrome, preferences, javascript modules, and components are now all packed in a single file in the Firefox directory.
  • packed extensions: most extensions are not unpacked anymore in the profile directory.

Let’s thus see how each of these is making a difference, starting with omni.jar:

          4.0b8 without omni.jar (ms)    4.0b8 (ms)          Difference
x86       3,420.4 ± 0.92%                3,228.76 ± 0.57%    191.64 (-5.60%)
x86-64    3,554.22 ± 0.82%               3,382.0 ± 0.51%     172.22 (-4.85%)

As can be seen here, packing most files in the Firefox directory did bring roughly a 5% speedup on cold startup, which helped keep 4.0b8 somewhat close to 3.6 in startup time on fresh profiles.

3.6.14pre with extensions (ms)    4.0b8 with extensions (ms)    Difference
4,457.18 ± 1.79%                  4,235.14 ± 0.60%              222.04 (-4.98%)

Taking six of the most popular extensions that work in both 3.6 and 4.0, we can see the positive effect of keeping extensions packed, especially considering a fresh profile is slower with 4.0. It has to be noted, though, that one of these extensions required being unpacked (which is still a possibility for extensions that need it), so the 4.0 profile effectively had five packed extensions and one unpacked.

2011-02-08 11:49:33+0900


Yet another clarification about Iceweasel

I’m glad that 5 years after the facts, people are still not getting them straight.

The Firefox logo was not under a free copyright license. Therefore, Debian was using the Firefox name with the “earth” logo (without the fox), which was and still is under a free copyright license. Then Mozilla didn’t want the Firefox name associated with an icon that is not the Firefox icon, for trademark reasons. Fair enough.

Although at the time Debian had concerns with the trademark policy, there was no point arguing over it, since Debian was not going to use the logo under a non-free copyright license anyway.

Now, it happens that the logo has moved to a free copyright license. A request for a trademark license was filed a few weeks after we found out about the good news, and we are still waiting for an agreement draft from Mozilla to hopefully go forward.

It is still not certain that this will actually lead to Debian shipping something called Firefox some day, but things are progressing, even if at a rather slow pace, and I am hopeful (discussions are promising).

By the way, thank you for the nice words, Daniel.

2011-02-07 20:42:47+0900


Backwards I/O vs. Forward I/O

I mentioned it in the past, and so did Taras: static initializers are currently called in reverse order of their location in a library. This can be seen, for example, in the various graphs I gathered about startup I/O. I also mentioned that I had written a small tool reversing these static initializers in ELF binaries. I however hadn’t checked the impact on startup. Until today.
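To illustrate what reversing static initializers means at the data level, here is a toy Python sketch that reverses a packed table of little-endian pointers. It is not the tool mentioned above: the function name, the assumption that the table holds nothing but fixed-size pointers, and the little-endian layout are all illustrative choices. A real tool also has to locate the table inside the ELF file and keep anything that refers to its entries consistent, none of which is shown here.

```python
import struct


def reverse_pointer_table(blob, pointer_size=8):
    """Return `blob` with its fixed-size little-endian pointers in reverse order."""
    if pointer_size not in (4, 8):
        raise ValueError("pointer_size must be 4 or 8")
    if len(blob) % pointer_size:
        raise ValueError("table size is not a multiple of the pointer size")
    count = len(blob) // pointer_size
    # "I" is a 4-byte unsigned integer, "Q" an 8-byte one; "<" means little-endian.
    fmt = "<%d%s" % (count, "Q" if pointer_size == 8 else "I")
    pointers = struct.unpack(fmt, blob)
    return struct.pack(fmt, *reversed(pointers))
```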

The testing setup still remains the same as in previous posts and the results are still the average and 95% confidence interval for 50 startups of an unmodified Firefox 4.0b8 build.
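The post does not say how the interval was computed. One common way to get a mean and a 95% confidence interval expressed as a percentage of the mean is sketched below in Python, using a normal approximation (z = 1.96) as a stated assumption; for 50 samples, a Student's t value would be slightly larger. The function name is made up for illustration.

```python
import math
import statistics


def mean_with_ci(samples, z=1.96):
    """Return (mean, half-width of the confidence interval as a % of the mean)."""
    mean = statistics.mean(samples)
    # Standard error of the mean, from the sample standard deviation.
    std_error = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, 100 * z * std_error / mean


# Made-up timings in ms, just to show the call; not measurements from the post.
mean, pct = mean_with_ci([100, 102, 98, 101, 99])
print(f"{mean:,.2f} ± {pct:.2f}%")
```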

          With backwards static initializers (ms)    With forward static initializers (ms)    Difference
x86       3,228.76 ± 0.57%                           2,888.44 ± 0.55%                         340.32 (10.5%)
x86-64    3,382.0 ± 0.51%                            3,102.46 ± 0.51%                         279.54 (8.26%)

I’m actually surprised by the result. I did expect that forward reads would be slightly faster than backwards reads, but I wasn’t expecting that much difference.

I guess I should work on bug 606137, then. Combined with the relocation packing that landed after beta 10, it should have a nice startup impact.

2011-02-01 17:45:23+0900
