Archive for the 'debian' Category

The measured effect of I/O on application startup

I did some experiments with the small tool I wrote in the previous post in order to gather some startup data about Firefox. It turns out it can't flush directories and other metadata from the page cache, which unfortunately makes it useless for what I'm most interested in.

So, I gathered various startup information about Firefox, showing how the page cache (thus I/O) has a huge influence on startup. The data in this post are mean times and 95% confidence intervals for 50 startups with an existing but fresh profile, in a single-processor GNU/Linux x86-64 virtual machine (using kvm) with 1GB RAM and a 10GB raw hard drive partition over USB, running, except where noted otherwise, on an i7 870 (up to 3.6GHz with Intel Turbo Boost). The operating system itself is an up-to-date Debian Squeeze running the default GNOME environment.

Firefox startup time is measured as the difference between the time in ms recorded right before starting Firefox and the time in ms returned by JavaScript in a data:text/html page used as the home page.
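For illustration, here is a minimal sketch of that kind of harness (not the actual script I used; the profile name is made up): it prints the time in ms right before exec'ing Firefox on a data:text/html page whose script displays Date.now(), and the startup time is the difference between the two numbers.

// Minimal sketch: print the current time in ms, then start Firefox with a
// data: home page that displays Date.now(). The profile name is hypothetical.
#include <sys/time.h>
#include <unistd.h>
#include <cstdio>

int main()
{
  timeval tv;
  gettimeofday(&tv, NULL);
  printf("start: %llu\n",
         (unsigned long long)tv.tv_sec * 1000 + tv.tv_usec / 1000);
  fflush(stdout);
  execlp("firefox", "firefox", "-P", "test", "-no-remote",
         "data:text/html,<script>document.write(Date.now())</script>",
         (char *)NULL);
  perror("execlp");
  return 1;
}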

Startup vs. page cache

Average startup time (ms)
Entirely cold cache (drop_caches) 5887.62 ± 0.88%
Cold cache after boot 3382.0 ± 0.51%
Selectively cold cache (see below) 2911.46 ± 0.48%
Hot cache (everything previously in memory) 250.74 ± 0.18%

The selectively cold cache case makes use of the flush program from the previous post and a systemtap script used to get the list of files read during startup. That script will be described in a separate post.

As you can see, profiling startup after echo 3 > /proc/sys/vm/drop_caches takes significantly more time than under the conditions users would normally experience, because all the system libraries that would normally be in the page cache have been flushed, which biases the view one gets of the actual startup performance. Mozilla build bots were running, until recently, a ts_cold startup test that, as I understand it, had this bias (which is part of why it was stopped).

The Hot cache value is also interesting because it shows that the vast majority of cold startup time is due to hard disk I/O (and no, there is no difference in the number of page faults).

I/O vs. CPU

Interestingly, testing on a less beefy machine (Core 2 Duo 2.2GHz) with the same USB disk and kvm setup shows something not entirely intuitive:

Average (ms)
Entirely cold cache 6748.42 ± 1.01%
Cold cache after boot 3973.64 ± 0.53%
Selectively cold cache 3445.7 ± 0.43%
Hot cache 570.58 ± 0.70%

I, for one, would have expected the I/O-bound startups to only be slower by around 320 ms, which is roughly the hot cache startup difference, or, in other words, the CPU-bound startup difference. But I figured I was forgetting an important factor.

I/O vs. CPU scaling

Modern processors do frequency scaling, which allows the processor to run slower when underused and faster under load, thus saving power. It was first used on laptop processors to reduce power drawn from the battery, allowing batteries to last longer, and is now used on desktop processors to reduce power consumption. It unfortunately has a drawback, in that it introduces some latency when the scaling kicks in.

A not-so-nice side effect of frequency scaling is that when a process is waiting for I/O, the CPU is underused, so it usually runs at its slowest frequency. When the I/O ends and the process runs again, the CPU can go back to full speed. This means every I/O can add CPU scaling latency on top of the latency due to, e.g., disk seeks. And it actually has much more impact than I would have thought.

Here are the results on the same Core 2 Duo, with frequency scaling disabled and the CPU forced to its top speed:

Average (ms)
Entirely cold cache 5824.1 ± 1.13%
Cold cache after boot 3441.8 ± 0.43%
Selectively cold cache 3025.72 ± 0.29%
Hot cache 576.88 ± 0.98%

(I would have liked to do the same on the i7, but Intel Turbo Boost complicates things and I would have needed to get two new sets of data)

Update: I actually found a way to force one core to its maximum frequency and run the kvm processes on it, giving the following results:

Average (ms)
Entirely cold cache 5395.94 ± 0.83%
Cold cache after boot 3087.47 ± 0.31%
Selectively cold cache 2673.64 ± 0.21%
Hot cache 258.52 ± 0.35%

I haven't gathered enough data to have accurate figures, but it also seems that forcing the CPU frequency to a fraction of the fastest supported frequency gives the intuitive result where the difference between all I/O-bound startup times is the same as the difference between hot cache startup times. As such, I/O-bound startup improvements would be best measured as an improvement in the difference between cold and hot cache startup times, i.e. (cold2 - hot2) - (cold1 - hot1), at a fixed CPU frequency.
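As a toy illustration using the Core 2 Duo figures above, here is the cold minus hot difference, i.e. the I/O-bound part of startup, under both configurations (all values in ms):

// Toy illustration with the Core 2 Duo numbers from the tables above:
// the cold minus hot cache difference is the I/O-bound part of startup.
#include <cstdio>

int main()
{
  double cold_scaling = 6748.42, hot_scaling = 570.58; // frequency scaling enabled
  double cold_fixed   = 5824.10, hot_fixed   = 576.88; // CPU forced to top speed
  printf("I/O-bound part, scaling enabled: %.2f ms\n", cold_scaling - hot_scaling);
  printf("I/O-bound part, top speed:       %.2f ms\n", cold_fixed - hot_fixed);
  return 0;
}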

Startup vs. desktop environment

We saw above that the amount of system libraries in the page cache directly influences application startup times. And not all GNU/Linux systems are made equal. While the above times were obtained under a GNOME environment, other desktop environments don't use the same base libraries, which can force Firefox to load more of them at cold startup. The most used environment besides GNOME is KDE, and here is what cold startup looks like under KDE:

Average startup time (ms)
GNOME cold cache 3382.0 ± 0.51%
KDE cold cache 4031.9 ± 0.48%

It's significantly slower, yet not as slow as the entirely cold cache case. This is because KDE doesn't use (and thus doesn't pre-load) some of the GNOME core libraries, while still sharing others, such as libc (obviously) or dbus.

2011-01-03 16:59:17+0900

firefox, p.m.o | 4 Comments »

Improving libxul startup I/O by hacking the ELF format

If you have been following this blog during the past months, you might have gathered that I've been focusing on I/O patterns during library initialization, more specifically that of libxul.so. There is not much to be done about the patterns themselves, short of enduring them, or hacking ld.so and/or the toolchain. The latter is not necessarily easy, and won't benefit everyone until all Linux distros ship an improved glibc and/or toolchain; it could take years before any actual result reaches end users.

But with a little creativity, one can overcome these limitations and improve the situation today. Here is a summary of my recent hacks.

Backwards static initializers

As Taras showed in the past, static initializers are executed in reverse order. It turns out this is an unfortunate design choice in gcc, and it also turns out not to affect all ELF binaries.

As far as I know, there are currently two different implementations of the static initializer call loop, one provided by gcc, and one provided by ld.so.

In the gcc case, the compiler creates a .ctors section, starting with 0xffffffff, ending with 0x00000000, and filled with pointers to the various static initializer functions, in ascending order. The compiler also injects an initializer function, called from .init, that goes through this .ctors section starting from the end (__do_global_ctors_aux in crtend.o). A similar .dtors section is created for functions to run when the library is unloaded, and the pointers in it are called in ascending order. The order in which .ctors and .dtors are executed doesn't actually matter in itself; all that matters is that they are executed in reverse order from each other, because that limits the risks of bad ordering. It is only unfortunate that .ctors was chosen to be the one run backwards.
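For reference, here is roughly what that initializer does, in simplified form, assuming the { -1, ctor_0, ..., ctor_n, 0 } layout described above:

// Simplified sketch of __do_global_ctors_aux: walk .ctors from its end back
// to the 0xffffffff sentinel, so the last emitted constructor runs first.
typedef void (*func_ptr)();

static void run_ctors_backwards(func_ptr *ctors_end) // points at the trailing 0 entry
{
  for (func_ptr *p = ctors_end - 1; *p != (func_ptr)-1; p--)
    (*p)();
}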

In the ld.so case, the compiler creates a .init_array section, filled with pointers to the various static initializer functions, in ascending order. It also stores the location and the size of this section in the PT_DYNAMIC ELF header. At runtime, ld.so itself reads this information and executes the static initializers in ascending order. A similar .fini_array section is created for functions to run when the library is unloaded, and ld.so goes through it in descending order.

The .init_array section is newer and, for backwards compatibility reasons, is not used on systems that have been using the .ctors approach forever. On the other hand, some newer ABIs, such as the ARM EABI, use the new facility, which means they are not victims of backwards static initialization. I do wonder, however, whether backwards compatibility is still needed, considering how long .init_array support has been around (more than 11 years, according to the glibc git repository).

While we could obviously bug the gcc maintainers to reverse their .ctors execution order, or ditch .ctors support on glibc systems, neither of which would give any immediate, global benefit, we can also just reverse the .ctors section content ourselves. Please note the source code out there doesn't take care of reversing the .dtors section accordingly, because it is currently empty in libxul.so.
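A minimal sketch of what reversing the section content amounts to, assuming a 64-bit target and the { -1, ctor_0, ..., ctor_n, 0 } layout (the real code obviously operates on the bytes of the .ctors section inside the ELF file):

// Reverse only the pointers between the two sentinels, so that the backwards
// walk performed at runtime ends up calling them in their original order.
#include <algorithm>
#include <cstddef>
#include <cstdint>

void reverse_ctors_section(uint64_t *entries, size_t count) // count includes both sentinels
{
  if (count > 3)
    std::reverse(entries + 1, entries + count - 1);
}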

Relocations

As I mentioned in an earlier post, relocations are what make it possible to load libraries at arbitrary locations. Unfortunately, a great number of relocations means awful I/O patterns: ld.so reads and applies them one by one, and while the kernel reads ahead by 128 KB chunks, when your relocation section exceeds that amount by more than an order of magnitude, it means I/O goes back and forth a lot between the relocation section and where the relocations take place.

There are two classes of relocations: REL and RELA. In both cases, there are several types of relocations, some of which require an addend. The difference between the two classes is that in REL relocations, the addend is stored at the offset where the relocation takes place, while in RELA relocations, it is stored in the relocation entry itself. This means RELA relocations effectively waste space, as the addend ends up being stored twice: it is also stored at the offset where the relocation takes place, as with REL relocations (though it could technically be nulled out there); either way, the space is wasted.

REL relocations thus take 2 words of storage, and RELA relocations, 3. On x86, the ELF ABI only uses REL relocations, and words are 32 bits, making each relocation take 8 bytes; on x86-64, the ELF ABI only uses RELA relocations, and words are 64 bits, making each relocation take 24 bytes, three times the size of x86 relocations. On ARM, both may be used, but apparently the compiler and the linker only use REL relocations.

These words of storage hold three or four pieces of information (see the size check after this list):

  • r_offset: the offset at which the relocation takes place
  • r_info: containing both the relocation type and a symbol reference for the relocation. On 32-bit systems, the symbol reference is stored on 24 bits, and the relocation type on 8 bits. On 64-bit systems, the symbol reference is stored on 32 bits, and the relocation type on 32 bits (which means at least 24 bits are never going to be used: there are currently no ABIs using more than 256 relocation types).
  • r_addend: the addend, present only in RELA relocations
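The corresponding structures are defined in <elf.h>, and a trivial program confirms the sizes mentioned above:

// Quick check of the entry sizes, using the definitions from <elf.h>.
#include <elf.h>
#include <cstdio>

int main()
{
  printf("Elf32_Rel:  %zu bytes\n", sizeof(Elf32_Rel));  // 2 x 32-bit words = 8
  printf("Elf32_Rela: %zu bytes\n", sizeof(Elf32_Rela)); // 3 x 32-bit words = 12
  printf("Elf64_Rel:  %zu bytes\n", sizeof(Elf64_Rel));  // 2 x 64-bit words = 16
  printf("Elf64_Rela: %zu bytes\n", sizeof(Elf64_Rela)); // 3 x 64-bit words = 24
  return 0;
}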

The linker splits relocations into two main categories: those for dynamic linking at startup and those for procedure linkage table (PLT) resolution at runtime. Here is what the breakdown looks like for libxul.so (taken from the mozilla-central nightly from November 22nd):

                dynamic relocations         PLT relocations
Architecture    number      size (bytes)    number    size (bytes)
x86             231,836     1,854,688       3,772     30,176
x86-64          235,893     5,661,432       3,773     90,528

So, PLT relocations are quite negligible, and dynamic relocations are just huge, especially on x86-64. Compared to the binary size, it's pretty scary:

Architecture    libxul.so size    relocations size    %
x86             21,869,684        1,884,864           8.61%
x86-64          29,629,040        5,751,984           19.41%

Let me remind you that not only do we need to read these relocations at startup (at least the dynamic ones; the PLT ones are negligible), but we only read them in chunks, and need to read other data in between.

Of the many existing types of relocations, only a few are actually used in dynamic libraries. Here is a summary of those in use in libxul.so:

x86 relocation type    number     x86-64 relocation type   number
dynamic relocations
R_386_TLS_DTPMOD32     1          R_X86_64_DTPMOD64        1
R_386_GLOB_DAT         213        R_X86_64_GLOB_DAT        238
R_386_32               27,643     R_X86_64_64              27,611
R_386_RELATIVE         203,979    R_X86_64_RELATIVE        208,043
PLT relocations
R_386_JUMP_SLOT        3,772      R_X86_64_JUMP_SLOT       3,773

The above types can be separated into two distinct categories:

  • Requiring symbol resolution (R_*_GLOB_DAT, R_*_JUMP_SLOT, R_386_32, R_X86_64_64)
  • Not requiring symbol resolution (R_*_RELATIVE, R_386_TLS_DTPMOD32, R_X86_64_DTPMOD64)

For the second category, it means we are wasting all the bits of the symbol reference in each entry: 24 on x86 and 32 on x86-64. Unfortunately, the vast majority of the relocations in libxul.so fall into that category. At this point, this is how much waste we have identified:

description                          num. entries   waste per entry (bytes)   total       % of relocation size
x86
Relocation without symbol ref.       203,980        3                         611,937     32.46%
x86-64
Duplicated addend                    239,666        8                         1,917,328   33.33%
Relocation type encoded on 32 bits   239,666        3                         718,998     12.49%
Relocation without symbol ref.       208,044        4                         832,176     14.46%
Total                                                                         3,468,502   60.30%

On top of that, one could argue that we don't need 24 or 32 bits for the symbol reference (which is the entry number in the symbol table), and that we're really far from requiring 64 bits for r_offset on 64-bit systems, as libxul.so is nowhere near 4GB big. Anyways, that means we have room for improvement here, and it means that the ELF format, especially on 64-bit systems with RELA relocations, really doesn't help with big programs or libraries with a lot of relocations. (For that matter, the Mach-O format used on OS X has similar issues; I don't know about the DLL format.)

Another interesting data point is how relocations' r_offset values follow each other, once relocations are reordered (to avoid too much symbol resolution, they are actually grouped by symbol). It turns out 206,739 relocations' r_offset are 4 bytes from the previous one on x86 (i.e. strictly following it), and 210,364 are 8 bytes from the previous one on x86-64. The main reason is that most relocations happen in pointer tables, and libxul.so has lots of them:

  • That's what you get with virtual methods in C++.
  • That's also what you get in .ctors and .dtors sections, which are relocated as well (but not .init_array and .fini_array sections).
  • That's what you get when you use function tables.
  • Obviously, that's what you get when you use pointer tables.

And when I say libxul.so has lots of them, that's not an understatement: there are 184,320 relocations due to vtables (the function tables you get when using virtual methods in C++) on x86 and 184,098 on x86-64. That's respectively 78.23% and 76.81% of the total number of relocations. Interestingly, while most of these relocations are R_*_RELATIVE, 26,853 are R_386_32, and 26,829 are R_X86_64_64. Except for a very few, these non-relative relocations are for the __cxa_pure_virtual symbol, which is a dummy function that does nothing, used for pure virtual methods, such as the following:

class Abstract {
public:
  virtual void pure_virtual() = 0;
};

As for the remaining relative relocations, there are a few pointer tables responsible for several thousand relocations, such as JSString::length2StringTable, gEntries, GkAtoms_info, Html5Atoms_info, kUnixCharsets, nsIDOMCSS2Properties_properties, etc.

Packing relocations

While they are not detrimental to most binaries, we saw that relocations can have a bad effect on startup I/O for big binaries. Reducing how much space they take would have a direct impact on that startup I/O. One way to do so is obviously to reduce the number of relocations. But as we saw above, the vast majority are due to vtables, and while reducing vtables would be a nice goal, it will be a long time before that can significantly reduce the number of relocations. So another possible way is to avoid wasting so much space in each relocation.

The main problem is that relocations are applied by ld.so, and, as mentioned in the introduction, while we can hack ld.so to do what we want, it won't work everywhere for a long time. Moreover, it's difficult to test various implementations, and binary compatibility needs to be guaranteed once something lands in the upstream libc.

The idea I started to test a few weeks ago is to effectively apply relocations by hand instead of relying on ld.so to do it. Considering that the first thing ld.so does with a library after applying relocations is to call its DT_INIT function, the idea was to remove whichever dynamic relocations we can, store them in a more efficient form, and inject a new DT_INIT function that applies them. This function obviously needs to call the original DT_INIT function once done, and some parts of the PT_DYNAMIC section also need to be adjusted accordingly.

The first iteration I tested a few weeks ago helped validate the idea, but didn't change much of the binary: the dynamic relocation section still took all the space it took before, though its content was mostly empty (we'll see further below how efficient even the current dumb format is). Once the theory that an injected DT_INIT function could apply the relocations without breaking everything was validated, I went on to implement a program that would actually strip the ELF binary. I won't get too much into the gory details, but it required PT_LOAD splitting, thus program header growth, which means moving the following sections, then removing a huge part of the relocations, shrinking the relocation section, injecting our code and the packed relocations as sections, and moving the remaining sections after these new sections to avoid empty space, while still respecting the proper offsets for both PT_LOAD and the code itself.

So far, I have only taken care of the R_*_RELATIVE relocations. Some, but not all, other relocations could be taken care of too, but everything involving symbol resolution from other libraries would be really difficult to handle from the injected code. The relocations are currently stored in a quite dumb format:

  • a 32-bit word containing r_offset
  • a 32-bit word containing the number of relocations to apply consecutively

As we only take care of one type of relocation, we can skip any information about the type. Also, since the addend is already stored at r_offset, we just strip it from RELA relocations; and since the binary is far from 4GB big, 32 bits is enough for the offset on 64-bit systems too. We also saw that a lot of relocations immediately follow the previous one, so instead of storing r_offset for each one, storing some kind of range is more efficient. As I've so far been focusing on getting something that actually works rather than something really efficient, I have only tested this format, but the results are already nice:
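To give an idea, here is a sketch of what applying this format from the injected code boils down to. The names are mine, not those of the actual tool; it assumes only R_*_RELATIVE relocations, with the addend left in place at r_offset:

#include <stdint.h>

struct packed_reloc {
  uint32_t offset; // r_offset of the first relocation of the run
  uint32_t count;  // number of consecutive, pointer-sized relocations
};

extern "C" void apply_packed_relative(uintptr_t base, const packed_reloc *relocs,
                                      uint32_t nrelocs)
{
  for (uint32_t i = 0; i < nrelocs; i++) {
    uintptr_t *where = (uintptr_t *)(base + relocs[i].offset);
    for (uint32_t j = 0; j < relocs[i].count; j++)
      where[j] += base; // rebase the pointer whose addend is stored in place
  }
}

The injected DT_INIT function would run something like this with the library's load address before chaining to the original DT_INIT.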

                                       x86                    x86-64
before
dynamic relocations (bytes)            1,854,688              5,661,432
libxul.so size                         21,869,684             29,629,040
after
dynamic relocations (bytes)            222,856                668,400
injected code size                     129                    84
packed relative relocations (bytes)    224,816                228,568
libxul.so size                         20,464,836 (-6.42%)    24,865,520 (-16.07%)

Getting rid of the __cxa_pure_virtual relocations would help bring this down even further, and an even better storage format could be used, too. Gathering new startup coverage data should be interesting as well.

The code is available via mercurial. Please note that the Makefile deeply sucks and that the code is in the middle of its third refactoring. Anyways, once you get the code to compile, you just need to run test-x86 or test-x64, depending on your architecture, from the build directory, and give it the libxul.so file as an argument. It will create a libxul.so.new file that you can use to replace libxul.so. You may successfully use the tool on other libraries, but chances are you'll bump into an assert(), or, in the worst case, the resulting library will end up broken. The tool might work on PIE executables, but most probably not on other executables. If you wonder, Chrome is not PIE. It's not even relocatable (which means ASLR won't work with it); or at least the x86-64 binaries I just got aren't.

More possible ELF hacking

The code I wrote to pack relocations makes it possible to do various tricks on ELF files, and it opens up various opportunities, such as:

  • Aligning .bss. As I wrote on this blog a while ago, ld.so fills the last page from the .data section with zeroes, starting at the offset of the .bss section. Unfortunately, this means reading ahead at most 128KB of data nobody needs, and the corresponding disk seek. Inserting a section full of zeroes between .data and .bss would get rid of that, while minimally wasting space (less than 4KB in a 20MB+ binary).
  • Trying to move the PT_DYNAMIC section. It is among the first pieces of data read from programs and libraries, yet it sits near the end of the binary.
  • Replacing .ctors/.dtors with .init_array/.fini_array (thus removing the corresponding relocations, though there are only around 300 of them).
  • Replacing the __cxa_pure_virtual relocations with relative relocations pointing to an injected dummy function (it could even just point to any ret code in any existing function).
  • Replacing some internal symbol resolutions with relative relocations (which should be equivalent to building with -Bsymbolic)
  • etc.

Some of the above could probably be done with linker scripts, but using linker scripts is unnecessarily overcomplicated: as soon as you start using a linker script to fiddle with sections, you need to reimplement the entire default linker script (the one you can see with ld --verbose). It's not guaranteed to work either (Taras told me he tried and failed to align the .bss section that way). And as far as I know, you can't use the same script regardless of whether you use ld or gold (AIUI, gold doesn't support the whole ld linker script syntax).

2010-11-23 15:44:19+0900

firefox, p.m.o | 24 Comments »

Debian Mozilla team APT archive now signed

You may have been using the repository described in a previous blog post. You may also have noticed that, despite the lack of an announcement, all beta releases but 4.0beta6 have subsequently been made available in that repository (moreover, within 24 hours of the upstream release). I just made a change to that repository which will most likely make your APT complain, but all the necessary instructions are given on the repository itself.

2010-11-20 14:45:57+0900

firefox | 5 Comments »

I/O patterns on ELF binary initialization

[ I meant to finish this article much earlier, but I had to leave it in a half-written draft state for quite some time due to other activities ]

One particular issue that arises with big binary files is that of I/O patterns. It seems to be something neither ld.so nor programs/libraries usually pay attention to, but it can have a dramatic influence on startup times. My analysis is far from complete, would need to be improved by actually investigating the ld.so code further, and so far has been limited to libxul.so built from mozilla-central, with the help of icegrind, on x86-64 Linux.

As I wrote recently, icegrind makes it possible to track memory accesses within mmap()ed memory ranges. As ld.so does mmap() the program and library binaries, tracking the memory accesses within these mmap()s gives a better idea of the I/O patterns when an ELF binary is initialized. I only focused on local accesses within a single given binary file (libxul.so) because the case of Firefox loading many files at startup is already being scrutinized, and because most of the non-Firefox files read at Firefox startup (system and GNOME libraries) are most likely already in the page cache. I also focused on it because it is an interesting case that can help understand how big (and maybe less big) binaries may be affected by the toolchain (compiler, linker and dynamic linker) and possibly by some coding practices.

For the record, what I did to get these results was to use icegrind and elflog, adding further entries to the "sections" file from some "objdump -h" output: I added sections that elflog wouldn't give (such as .plt, .hash, .dynsym, etc.), and even went down to the entry level for the .rela.dyn section. The latter is particularly interesting because icegrind only outputs the first access for any given section. To see sequential accesses within that section, you need to split it into smaller pieces, which I did for .rela.dyn entries (each one being 24 bytes on x86-64). Uncommenting some parts of the icegrind code was also useful to track where some accesses were made from, code-wise.

Now, the interesting data:

One of the first accesses is a nullification of a few variables within the .bss section. The .bss section is usually an anonymous piece of memory, that is, a range of mmap()ed memory that is not backed by a file, and filled with zeroes by the kernel (I think it even does that lazily). It is used for e.g. variables that are initialized with zeroes in the code, and obviously, any code accessing these variables would be addressing some offset within the .bss section. This means the section needs to start in memory at the offset the linker assigned it at build time. This is actually where the problems begin.

When the .bss section offset assigned by the linker isn't aligned on a page (usually 4KB), the anonymous mmap()ed .bss can't start at that location; it really starts on the next page. The remainder is still mmap()ed from the binary file, and ld.so itself fills that part with zeroes. As the .bss section doesn't start on a page boundary, any write at this location triggers the kernel reading the entire page. This means one of the first sets of data being read from the file is the end of the section preceding .bss and the beginning of the following one: most likely the .data and .comment sections, respectively.
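For reference, here is a simplified sketch of that zero-filling (not the actual ld.so code): the first write into the tail of that last file-backed page is what forces the page to be read from disk.

#include <stdint.h>
#include <string.h>

// Clear the tail of the last file-backed page, from the .bss start offset up
// to the next page boundary; writing there faults the page in from the file.
static void zero_bss_head(uintptr_t bss_start, uintptr_t pagesize)
{
  uintptr_t page_end = (bss_start + pagesize - 1) & ~(pagesize - 1);
  if (page_end > bss_start)
    memset((void *)bss_start, 0, page_end - bss_start);
}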

While this probably doesn't matter much when the binary file is small, when it is big enough, reading in a non-sequential manner triggers hard disk seeks, and we all know how they hurt performance. Thankfully, cheap SSDs should be coming some day, but in the meantime, we still need to cope with the bad performance. The interesting part is that the .bss section is really empty in the binary file, so its "virtual" memory address could be anywhere. Why the linker won't align it to a page boundary without having to resort to a linker script is beyond me.

The next accesses go back and forth between .dynamic, .gnu.hash, .dynstr, .gnu.version_r, .gnu.version and .dynsym. These are probably all related to symbol version resolution and DT_NEEDED library loading. While most of these sections are at the beginning of the file (though not necessarily in the order in which they are read), the .dynamic section is much nearer the end, but way before .data, so it won't even have been loaded as a by-product of the .bss section handling.

After that, the .rela.dyn section is read, and each relocation entry it contains is applied. Relocation is one of the mechanisms by which position independent code (PIC) is made possible. When a library is loaded in memory, it is not necessarily loaded at the same address every time. The code and data contained in the library thus need to cope with that constraint. Fortunately, while the base address where the library is mmap()ed in memory is not necessarily constant, the offsets of the various sections are still as codified in the binary. The library code can thus directly access data at the right offset if it knows the base address of the library (which is what is done in the x86 ABI), or if it knows where the current instruction is located (which is what is done in the x86-64 ABI).

"Static" data (initialized in e.g. a const or a global variable, or, in C++ case, vtables), on the other hand, may contain pointers to other locations. This is where relocation enters the scene. The .rela.dyn section contains a set of rules describing where in the binary some pointers need to be adjusted depending on the base library offset (or some other information), and how they should be updated. ld.so thus reads all .rela.dyn entries, and applies each relocation, which means that while .rela.dyn is being read sequentially, reads and writes are also performed at various places of the binary, depending on the content of the .rela.dyn entries.

This is where things get ugly for Firefox: there are nearly 200,000 such relocations. On x86-64 an entry is 24 bytes (8 on x86), and each of these is going to read/write a pointer (8 bytes on x86-64, 4 on x86) at some random (though mostly ordered) location. The whole .rela.dyn section not being read ahead, what actually happens is that it is read in small batches, with seeks and reads of other data at various locations in between. In libxul.so's case, this spreads over .ctors, .data.rel.ro, .got, and .data. The relocation entries are somewhat ordered by the address to be rewritten, though they occasionally jump backwards. Some of these relocations also appear to touch .gnu.version, .dynsym and .dynstr, because their type involves symbol resolution.

Once the .rela.dyn relocations have been dealt with, it is .rela.plt's turn. The principle is the same for this section: entries describe what kind of relocation must be done where, and how the result must be calculated. The scope of this section, though, is apparently limited to .got.plt. But before explaining these relocations, I'll explain what happens with the PLT.

The PLT (Procedure Linkage Table) is used when calling functions from external libraries. For example, in a Hello World example, the PLT would be used for calls to the puts function. The function making the call would in fact call the corresponding PLT entry. Each PLT entry, on x86 and x86-64 at least, consists of 3 instructions (I'll skip the gory details, especially for x86, where the caller also needs to set a register before calling the PLT). The first instruction is the only one executed most of the time: it reads the final destination from the .got.plt section and jumps there. That final destination, obviously, is not fixed in the library file, since it needs to be resolved through its symbol. This is why the two subsequent instructions exist: originally, the destination "stored" in the .got.plt section points back to the second instruction; the first instruction is thus effectively a nop (no operation), and the following instructions are executed. They jump into code responsible for resolving the symbol, updating the .got.plt entry for the next call, and calling the real function.

But pointing back to the second instruction is like the pointers in static data we saw above: it's not possible in position independent code. So the .rela.plt relocations actually fill the .got.plt section with these pointers back into the PLT. There are a tad more than 3,000 such relocations.

All these relocations should go away when prelinking the binaries, but from my several experiments, it looks like prelinking only avoids relative relocations, not the others, while it technically could skip all of them. prelink even properly applies all the relocations in the binary, but executing the binary rewrites the same information at startup for all but the relative relocations. That could well be a bug, either in ld.so not skipping enough relocations or in prelink not marking enough relocations to be skipped. I haven't dug deep enough into the code to know how prelinking works exactly. Anyways, prelinking is not a perfect solution, as it also breaks ASLR. Sure, prelink can randomize library locations, but it relies on the user or a cron job doing so at arbitrary times, which is far from satisfying.

An interesting thing to note, though, is that a good part of the relocations prelinking doesn't rid us of in libxul.so (more than 25,000) are due to the __cxa_pure_virtual symbol, which is used for, well, pure virtual methods, in other words, virtual methods that don't have an implementation in a given class. The __cxa_pure_virtual function is set as the method in the corresponding field(s) of the class vtable in the .data.rel.ro section. This function is provided by libstdc++ and, as such, is dynamically linked. But it is just a dummy function, doing nothing. Defining an empty __cxa_pure_virtual function to be included in libxul.so makes these relocations relative, and thus taken care of by prelinking.
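A minimal sketch of what such a definition looks like:

// Local dummy definition: references to __cxa_pure_virtual now resolve within
// libxul.so itself, so the corresponding relocations become relative ones.
extern "C" void __cxa_pure_virtual()
{
  // Does nothing, just like the libstdc++ version; it should never be called.
}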

After all the relocations have been applied, the library initialization itself can begin, and the content of the .init section is executed. That section, indirectly, executes all the functions stored in the .ctors section. This includes static initializers, which are unfortunately called backwards, as Taras already pointed out. Each of these static initializers also accesses various locations in the .data or .bss sections, which may or may not have already been loaded during the relocation phase. The execution of these initializers also (obviously) reads various pieces of the .text section (despite its name, it contains the executable code, i.e. the functions' code).

The initialization of the library ends there, and no access should happen until a function from the library is called. In libxul.so's case, XRE_main is called first, then many other functions, but that's another story. All that needs to be remembered about the startup process past this point is that the .text section will be read heavily, as well as the various data, GOT, PLT and dynamic symbol sections, in a very scattered way. While most of these sections may already have been brought into memory as a byproduct of the various initialization steps described above, some may not have been, increasing even further the need to seek all over the binary file.

Now, the main problem with all these I/O patterns at startup is that it seems reorganizing the binary layout can only have a visible impact if it takes all of the above into account, not just a part of it, because addressing only a part is very likely to merely move the problem to a different layer.

All in all, making sure the relevant sections of libxul.so are read by the kernel before ld.so enters the game is a good short-term solution to avoid many seeks at startup.

2010-07-14 18:37:13+0900

mozilla, p.m.o | 29 Comments »

Iceweasel 4.0 beta 1 released

Since the latest Firefox beta has been released, I uploaded the 4.0beta1 version of Iceweasel, not to unstable or experimental, but to a separate repository. So please add the following line to your sources.list:

deb http://mozilla.debian.net/packages/ ./

For now, only x86 and x86-64 (amd64) packages are available.

If you want to keep using the unstable or experimental version (the non-beta one), but still want to try this beta, here is the trick: instead of installing iceweasel from the repository above,

  • Install xulrunner-2.0,
  • Download the iceweasel package (don't install it),
  • Run the following command: dpkg-deb -x iceweasel_4.0b1-0_*.deb /some/directory,
  • Create the necessary symbolic link: ln -s /usr/lib/xulrunner-2.0 /some/directory/usr/lib.

You can then start the Iceweasel beta with /some/directory/usr/bin/iceweasel. This trick should work with older versions of Iceweasel as well.

2010-07-09 21:34:41+0900

firefox, xulrunner | No Comments »

Iceweasel 4.0 beta 1 preliminary packages for Debian, and a nice trick

Since this blog is now syndicated on Planet Mozilla, a little background for its readers: Iceweasel is the name under which Firefox is distributed in Debian. This is an unfortunate detail due to copyright issues with the Firefox logos, which have been solved recently. Work is in progress to get Firefox back into Debian (I have good hope this will be possible).

Anyways, I've started working on getting Iceweasel in shape for the 4.0 release in a few months. The packaging being in good enough shape, I am hereby making available some preliminary packages (meaning there is still some work needed for proper packaging) for the first beta release.

The packages are available in a separate repository, which you need to add to your sources.list:

deb http://mozilla.debian.net/packages/ ./

Only x86 and x86-64 (amd64) packages are available for now. As far as I have tested, they work as well as the official Firefox 4.0beta1 binaries, though some xpcshell tests failed, and I haven't got the reftests working yet.

If you still wish to use a current version of iceweasel (i.e. non-beta), but want to try this beta, here is the nice trick: instead of installing iceweasel from the above-mentioned repository,

  • Install xulrunner-2.0,
  • Download the iceweasel package (don't install),
  • Run the following command: dpkg-deb -x iceweasel_4.0b1-0_*.deb /some/directory,
  • Create a necessary symbolic link: ln -s /usr/lib/xulrunner-2.0 /some/directory/usr/lib.

Now you can start the Iceweasel beta with /some/directory/usr/bin/iceweasel. This trick should work with older versions of iceweasel, too.

2010-07-09 20:59:08+0900

firefox, xulrunner | 46 Comments »

Playing with icegrind

In the past few days, I've been playing with icegrind, a little valgrind plugin that Taras Glek wrote to investigate how Firefox startup spends its time, mostly in reading and initializing libxul.so. I invite you to check out his various posts on the subject; they make for an interesting read.

After some fiddling with the original icegrind patch in the bug linked above, I got a better understanding of what we could call the binary initialization phase (more on that in another post), and hit some issues due to problems in icegrind and elflog. The latest version of the icegrind patch as of this writing solved my main problem halfway (though it needs further modifications to work properly, see below).

What icegrind actually does is track memory accesses in mmap()ed memory. When the program being profiled mmap()s a file, icegrind looks for a corresponding "sections" file next to it (e.g. "libxul.so.sections" for "libxul.so"). The "sections" file contains offsets, sizes and section names, which icegrind uses to report the accesses. Whenever the profiled program makes memory accesses (whether for instructions or data) within the mmap()ed range, icegrind adds the corresponding section name to a "log" file (e.g. "libxul.so.log" for "libxul.so"). Please note it only does so for the first access in a given section.

To generate section information for ELF files, Taras also hacked up an elflog program that scans an ELF binary (program or library) and outputs sections for the symbols present in its symtab. This means the binary must not be stripped, though full-fledged debugging symbols are not needed. It outputs the section names in a form that would be reusable in a linker script after a build with the "-ffunction-sections" and "-fdata-sections" compiler options, but that's mostly irrelevant when all you are after is seeing what is going on, not reordering the sections at link time. There are glitches in the way elflog works that make the section names for .bss symbols wrong (they will start with .symtab, .shstrtab, .comment, etc. instead of .bss). Also, technically, the .bss section is not mapped from the file.

Note that icegrind has rough edges, is still experimental, and isn't very user friendly when it comes to unix permissions, because its input and output files need to be next to the file being mmap()ed; when that is a library in /usr/lib, well, you need write access there, or you need to copy the library to a different directory and adjust LD_LIBRARY_PATH. Anyways, it's already useful as it is.

If you want to play with it yourself, here are steps that worked well for me:

  • svn co svn://svn.valgrind.org/valgrind/trunk valgrind -r 11100
    When I first tried icegrind, valgrind trunk was broken; revision 11100 is known to work properly, at least for icegrind.
  • cd valgrind
  • wget -O - -q https://bug549749.bugzilla.mozilla.org/attachment.cgi?id=449748 | patch -p0
    You need to open icegrind/ice_main.c, go to the last line of the track_mmap function and replace mmaps with dest_mmaps.
  • mkdir icegrind/tests ; touch icegrind/tests/Makefile.am
    The valgrind build system is a bit lunatic.
  • libtoolize ; aclocal ; automake --add-missing ; autoreconf
  • ./configure; make
  • wget http://hg.mozilla.org/users/tglek_mozilla.com/startup/raw-file/6453ad2a7906/elflog.cpp
  • g++ -o elflog elflog.cpp -lelf
    You will need the libelf-dev package.
  • make install ; install -m 755 elflog /usr/local/bin
    as root, to install in /usr/local.

Once you're done with this setup, you are ready to start playing:

  • elflog --contents libxul.so > libxul.so.sections
    Be aware that if libxul.so is a symbolic link pointing to a file in another directory, as it happens in the dist/bin directory in mozilla builds, the "sections" file needs to be in that other directory.
  • LD_LIBRARY_PATH=. valgrind --tool=icegrind ./firefox-bin -P my-profile -no-remote about:blank

Obviously, you don't have to play around with Firefox. You can try with other libraries, or even programs. You'll see in a subsequent post, however, that to get more interesting information, the elflog output is unfortunately not enough.

2010-06-08 18:46:42+0900

mozilla, p.m.o | 1 Comment »

Important milestone for Iceweasel in Debian

I just uploaded a pre-release of Iceweasel 3.6.4 to experimental. This is an important milestone for several reasons:

  • The last time I was able to push a pre-release was almost 2 years ago. This means the work of keeping up with upstream is mostly done. It also means I should be ready to create a 3.7alpha5 package when upstream releases it.
  • It includes a new feature: Out of Process Plugins. This means plugin instances are sandboxed, and if they crash, they won't take the browser down with them (finally). In this release, only the Adobe Flash plugin is sandboxed, but that should cover the vast majority of plugin-induced crashes.
  • Upstream is particularly interested in feedback from Debian (experimental) users for this pre-release, as they usually don't get much feedback from their own linux pre-release builds except from the test suite.

2010-05-03 14:49:12+0900

firefox, xulrunner | 10 Comments »

A new world of possibilities is opening up

Account created for myaddress<at>glandium.org with hg_mozilla and mg_mozsrc bits enabled.

Yes, this means I now have commit access on hg.mozilla.org. Thanks to those who vouched for me, namely Justin Wood, Christian Biesinger and Benjamin Smedberg.

And thanks to the Mozilla governance for having changed the commit access requirements just in time for me to take advantage of the new ones: under the old rules, it was required that the vouching superreviewer had never reviewed a patch from the person applying for commit access. Of the 31 superreviewers, 14 (almost half of them) already had reviewed one or more patches from me (which is not really unexpected, considering the number of patches I sent in the past).

One could wonder why I never applied for access earlier, though.

2010-04-19 23:21:57+0900

firefox, xulrunner | 5 Comments »

Re: git reflex

Joey Hess writes about putting snapshot.debian.org into git, which is something I already mentioned on this blog. It turns out I used the snapshot.debian.org data a few months ago to try the idea again, and while I haven't done the math on the data I got in the end, the data gathered during the experiment tells me we shouldn't expect much more than a 50% space reduction.

There are two main reasons for that: a small number of packages actually take up a whole lot of the archive space, and a whole lot of packages are simply data that doesn't benefit from being put in a git repository.

2010-04-15 08:13:13+0900

debian | 3 Comments »