Archive for September, 2010

Getting figures on startup libxul I/O, and static initializers

As we saw in my previous analysis, there are several parts of the library initialization process that have a direct impact on startup performance, especially on big applications like Firefox. Understanding the various issues involved surely is important, and visualizing them and their evolution (through logs or graphs, the latter being more appealing to the eye) may be helpful.

The above graph represents a whole lot of information about how is accessed during startup. [Disclaimer: there are problems with the data collection that don’t help for the best accuracy. I’ll talk about them in a subsequent post.]

  • The horizontal axis represents time, though as the data was gathered while the system and applications are loaded off a fast SSD drive, it does not show how some I/O can have a big impact on the startup time. The timeline also stops at the first access to an html test file given on the command line, which actually (and unfortunately) doesn’t account for the whole startup.
  • The vertical axis is the offset within where I/O occur, start of the file being at the bottom.
  • The red dots/bubbles represent each I/O. Access length is actually not represented for reasons I’ll give in the subsequent post.
  • The horizontal coloured bars represent various important sections of the file. From bottom to top, .rela.dyn, .text, .rodata,, and .data. Note the huge gap between .rodata and is .eh_frame (the data was gathered on x86-64, but x86 patterns are close enough that I didn’t bother)
  • The vertical grayed areas represent time where I/O occur to other files than The lightest shade represents I/O on system files, the slightly darker one I/O on files in the Firefox directory, and the darkest one (almost black) I/O on profile files. Each of these areas starts on the beginning of the first access to a file in a given category and ends on the end of last access to a file within the same category. They obviously don’t relate to where code is being executed, but somehow to where a rotating hard disk head might be (though fragmentation makes that more or less irrelevant). The data was gathered after a total flush of the page cache, which is biased because on a typical desktop, the page cache will already be populated with most system libraries Firefox requires.

The leftmost part of the graph is really nothing new or surprising if you read Taras’ blog and my previous post: Initialization of other (system) libraries, followed by relocations, then static initializers.

One of the things this graph shows that was only a hunch before is that the relocation process ends up reading the whole and .data sections as a byproduct. Or at least, it reads enough of them that it’s not read again later (even on longer sessions, accesses to and .data don’t show up).

It also appears that .rodata is accessed in an unordered manner from the static initializers, but not enough to cover the whole section. Also, as Taras explained, static initializers are run backwards, starting with the last one. Some of them also call some functions within, which explains why the static initialization phase accesses in .text are not entirely sequentially backwards.

Follows accesses to a bunch of different files, and calls to functions spread accross the .text section.

There are several places that can help improving startup time (some are Linux only, but others echo on Windows and Mac ; please note none of the following is really new):

  • Get rid of static initializers. More on this below.
  • Avoid going back and forth during relocation process. We’re going to test how much good madvise() will do.
  • Improve the relocation process. We are planning to experiment relocation sections packing/compression. The ongoing deCOMtamination process may help reducing the amount of relocations, too, by removing the need for some vtables (around 75% of the relocations are for vtables entries).
  • Improve code layout. Taras initialized this with icegrind, aiming at reordering functions, which requires help from the linker. Before that, I’d like to try reordering object files and see how it helps. This should be made easier by Kyle’s fakelibs.
  • Reduce I/O to other files. A great part of that is being addressed with omnijar and jar reordering, and more attempts at reducing I/O from e.g. profile files are in progress.

I’m currently working on reducing the number of static initializers, and wanted to have a more accurate view on how this reduction affects startup. Unfortunately, mozilla-central is a moving target, and some patches I was working on already don’t apply anymore, so I only gathered data after the first (big) batch of static initializer removal: cycle collection induced ones (60% of all the static initializers).

There are obvious upside and downside to static initializers removal: the upside is that backwards I/O tending to be suboptimal, reducing the amount of backwards seek should help. The downside is that anything that is not read during the static initialization phase and thus put in page cache may be required at a later time. For example, if a function is a few bytes or kilobytes after a static initializer and is used during startup, its code will be cached when reading the static initializer because of readahead. If the static initializer is removed, this function will still be accessed, thus itself triggering an I/O, but later in the startup process. The main question is whether the upside overtakes the downside.

The above graph represents the same kind of data as the first graph (be aware the timescale is most probably different, and irrelevant anyways), with 179 static initializers less, out of 296 (when I started, there were “only” 237 static initializers, sigh ; sure cycle collection induced ones grew, but there’s also 27 more that are not cycle collection). If you take a closer look and compare both graphs, you’ll see that apart from the obvious static initializers reduction, there are, as expected, much more I/O later. I won’t ask you to count how many more, fortunately, computers and scripts are our friends here.

Initially, there were 176 accessed in the leftmost part, and 56 subsequent accesses. After the static initializer reduction, the leftmost part got reduced to 136, and the subsequent part went up to 85. Overall, I now get 221 accesses instead of 232, which may or may not be an actual (direct) win, depending on disk fragmentation, and depending how spread the accesses are.

One figure that may help getting an idea how spread they are is to add all the different seeks happening within alone. Doing so gives the following results:

  • Initially: 815,951,659 forward ; 620,941,312 backwards ; 1,436,892,971 overall
  • After static initializers reduction: 828,703,162 forward ; 617,279,488 backwards ; 1,445,982,650 overall

Overall, the after static initializers reduction result might be worse than before. Not even counting on the fact that backwards seeks are supposedly worse than forward seeks (I seem to remember reading something like this a while ago, factual information would be appreciated).

All in all, I’m afraid the static initializers reduction will only have a visible impact when combined with functions or objects reordering, and static initializers grouping.

I will follow up when I get the number further of static initializers down, in other words, once I rebased my work on the revision I was testing. I’d also like to make the graphs less static.

2010-09-30 16:30:28+0900

p.m.o | 3 Comments »

Don’t trust btrfs-show

I recently started using btrfs, in both raid0 and raid1 setups, and migrated some data there.

Once all the data was copied, I wanted to check how much overhead moving to btrfs meant, space-wise.

# df -h /mnt/share Filesystem Size Used Avail Use% Mounted on /dev/sdb3 932G 682G 250G 74% /mnt/share # btrfs-show /dev/sdb3 Label: none uuid: e06484de-fb18-4564-b6e6-adbaed8bebec Total devices 2 FS bytes used 681.61GB devid 2 size 465.66GB used 427.13GB path /dev/sdb3 devid 1 size 465.66GB used 427.15GB path /dev/sda3

The filesystem was freshly created, in a raid0 fashion, so both 465GB partitions are used to form (roughly) a 930GB volume. So, I had filled it with 682GB of data, and df was telling me 250GB were left, which is not unexpected. What I didn’t expect, though, is btrfs-show telling me 427GB were used on each partition. Again, the filesystem was freshly created and the data had just been copied over, so I wasn’t expecting these numbers to be off.

Which meant 854GB were being allocated on disk for 681GB of data, 25% overhead !

Curious to know how much left that really meant, I just tried a stupid thing: filling as much as btrfs permits:

# dd if=/dev/zero of=/mnt/share/zero bs=1M dd: writing `/mnt/share/zero': No space left on device 253359+0 records in 253358+0 records out 265665118208 bytes (266 GB) copied, 1144.41 s, 232 MB/s # du -sh /mnt/share/zero 248G /mnt/share/zero

It turns out I actually could write for the amount of free space df was claiming was free (modulo GiB-GB conversions).

# df -h /mnt/share Filesystem Size Used Avail Use% Mounted on /dev/sdb3 932G 930G 2.0G 100% /mnt/share # btrfs-show /dev/sdb3 Label: none uuid: e06484de-fb18-4564-b6e6-adbaed8bebec Total devices 2 FS bytes used 929.36GB devid 2 size 465.66GB used 465.13GB path /dev/sdb3 devid 1 size 465.66GB used 465.15GB path /dev/sda3

So, this time df is slightly off, and btrfs-show somehow agrees the filesystem is full.

But surprises don’t stop there:

# rm /mnt/share/zero # df -h /mnt/share Filesystem Size Used Avail Use% Mounted on /dev/sdb3 932G 682G 250G 74% /mnt/share # btrfs-show /dev/sda3 Label: none uuid: e06484de-fb18-4564-b6e6-adbaed8bebec Total devices 2 FS bytes used 929.36GB devid 2 size 465.66GB used 465.13GB path /dev/sdb3 devid 1 size 465.66GB used 465.15GB path /dev/sda3

That’s right, even after the file removal, for btrfs-show, everything is still in use. Even after several sync or several hours later.

All in all, just don’t trust what btrfs-show says.

NB: one also needs to know that df reports wrong sizes for raid1 setups, but that’s another story.

2010-09-25 17:53:53+0900

p.d.o | 1 Comment »


I’ve had external disks supporting all of firewire, usb and eSATA for a while now, but only recently got a machine supporting eSATA (as well as the two others). I was previously preferring firewire over usb, for good reasons:

  • usb: Timing buffered disk reads: 84 MB in 3.01 seconds = 27.89 MB/sec
  • firewire: Timing buffered disk reads: 112 MB in 3.00 seconds = 37.33 MB/sec

But I didn’t expect this 3 years old disk (possibly more, I don’t remember precisely) to perform that well over eSATA:

  • eSATA: Timing buffered disk reads: 216 MB in 3.01 seconds = 71.88 MB/sec

(All results as given by hdparm -t)

Interestingly, smartctl is unable to gather the vendor specific SMART attributes when the disk is attached on eSATA, while it can when on usb.

2010-09-25 17:26:45+0900

p.d.o | 3 Comments »