[ I meant to finish this article much earlier, but I had to leave it as a half-written draft for quite some time due to other activities ]
One particular issue that arises with big binary files is that of I/O patterns. It does not seem to be something that either ld.so or programs and libraries usually care about, but it can have a dramatic influence on startup times. My analysis is far from complete and would need to be improved by actually investigating the ld.so code further. So far it has been limited to libxul.so built from mozilla-central, with the help of icegrind, on x86-64 Linux.
As I wrote recently, icegrind allows tracking memory accesses within mmap()ed memory ranges. As ld.so does mmap() the program and library binaries, tracking the memory accesses within these mmap()s gives a better idea of the I/O patterns when an ELF binary is initialized. I only focused on local accesses within a single given binary file (libxul.so) because the case of Firefox loading many files at startup is already being scrutinized, and because most of the non-Firefox files being read at Firefox startup (system and GNOME libraries) are most likely already in the page cache. I also focused on it because it is an interesting case that could help understand how big (and maybe less big) binaries may be affected by the toolchain (compiler, linker and dynamic linker) and possibly by some coding practices.
For the record, to get these results I used icegrind and elflog, and extended the "sections" file with some output from "objdump -h": I added sections that elflog wouldn't give (such as .plt, .hash, .dynsym, etc.), and even went down to the entry level for the .rela.dyn section. The latter is particularly interesting because icegrind only outputs the first access for any given section. To see sequential accesses within that section, you need to split it into smaller pieces, which I did for .rela.dyn entries (each one being 24 bytes on x86-64). Uncommenting some parts of the icegrind code was also useful to track where in the code some accesses were made from.
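To illustrate the splitting of .rela.dyn into per-entry pieces, here is a minimal sketch. The function name, the labels and the printed layout are hypothetical: the exact format icegrind expects in its "sections" file is not reproduced here, and the offset and size are made-up values.

```python
def split_section(name, offset, size, entry_size=24):
    """Yield (offset, size, label) for each fixed-size entry of a section.

    entry_size defaults to 24, the size of a .rela.dyn entry on x86-64.
    """
    for index in range(size // entry_size):
        yield (offset + index * entry_size, entry_size, f"{name}.{index}")


# Made-up example: a .rela.dyn at file offset 0x1000 holding 3 entries.
entries = list(split_section(".rela.dyn", 0x1000, 72))
for entry_offset, entry_size, label in entries:
    # Hypothetical output layout, one line per entry.
    print(f"{entry_offset:#x} {entry_size} {label}")
```

The real offset and size of the section would come from the "objdump -h" output.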
Now, the interesting data:
One of the first accesses is the zeroing of a few variables within the .bss section. The .bss section is usually an anonymous piece of memory, that is, a range of mmap()ed memory that is not backed by a file, and is filled with zeroes by the kernel (I think it even does that lazily). It is used for e.g. variables that are initialized with zeroes in the code, and obviously, any code accessing these variables addresses them at some offset within the .bss section. This means the section needs to start in memory at the address it was assigned by the linker at build time. This is actually where problems begin.
When the .bss section offset as assigned by the linker isn't aligned on a page boundary (pages are usually 4KB), the anonymous mmap()ed .bss can't start at that location, and really starts on the next page. The remainder, before that page boundary, is still mmap()ed from the binary file, and ld.so itself fills that part with zeroes. As the .bss section doesn't start on a page boundary, any write at this location triggers the kernel reading the entire page from the file. This means one of the first sets of data read from the file is the end of the section preceding .bss and the beginning of the one following it in the file: most likely the .data and .comment sections, respectively.
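The page arithmetic behind this can be sketched as follows. It is a simplified model, not what ld.so literally does: the function names and the example address are made up, and the page size is assumed to be 4KB.

```python
PAGE_SIZE = 4096  # assumed page size


def page_floor(addr):
    """Round an address down to the start of its page."""
    return addr & ~(PAGE_SIZE - 1)


def page_ceil(addr):
    """Round an address up to the next page boundary."""
    return page_floor(addr + PAGE_SIZE - 1)


def bss_layout(bss_start, bss_size):
    """Describe how an unaligned .bss is split between file and anonymous memory."""
    bss_end = bss_start + bss_size
    anon_start = page_ceil(bss_start)
    aligned = bss_start == anon_start
    return {
        # File-backed page shared with the preceding section, if any.
        "shared_page": None if aligned else page_floor(bss_start),
        # Range that has to be zeroed by hand inside the file-backed page.
        "zeroed_by_ldso": (bss_start, min(anon_start, bss_end)),
        # Range that can be served by anonymous, zero-filled memory.
        "anonymous": (anon_start, max(anon_start, bss_end)),
    }


# Made-up example: .bss assigned to an address that is not page-aligned.
layout = bss_layout(0x1A2B3C8, 0x5000)
for key, value in layout.items():
    print(key, value)
```

Writing to the "zeroed_by_ldso" range is what forces the kernel to read the whole shared page from the file.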
While this probably doesn't matter much when the binary file is small, when it is big enough, reading in a non-sequential manner triggers hard disk seeks, and we all know how much they can hurt performance. Thankfully, cheap SSDs should be coming some day, but in the meantime, we still need to cope with the bad performance. The interesting part is that the .bss section is really empty in the binary file, so its "virtual" memory address could be anywhere. Why the linker doesn't align it on a page boundary without having to resort to a linker script is beyond me.
The next accesses go back and forth between .dynamic, .gnu.hash, .dynstr, .gnu.version_r, .gnu.version and .dynsym. These are probably all related to symbol version resolution and DT_NEEDED library loading. While most of these sections are at the beginning of the file (not necessarily in the order they are read in), the .dynamic section is much nearer to the end, but way before .data, so it won't even have been loaded as a by-product of the .bss section loading.
After that, the .rela.dyn section is read, and each relocation entry it contains is applied. Relocations are one of the mechanisms that make position independent code (PIC) possible. When a library is loaded in memory, it is not necessarily loaded at the same address every time. The code and data contained in the library thus need to cope with that constraint. Fortunately, while the base address where the library is mmap()ed in memory is not necessarily constant, the offsets of the various sections relative to that base are still as recorded in the binary. The library code can thus directly access data at the right offset if it knows the base address of the library (which is what is done on the x86 ABI), or if it knows where the current instruction is located (which is what is done on the x86-64 ABI).
"Static" data (initialized in e.g. a const or a global variable, or, in the C++ case, vtables), on the other hand, may contain pointers to other locations. This is where relocations enter the scene. The .rela.dyn section contains a set of rules describing where in the binary some pointers need to be adjusted depending on the base address of the library (or some other information), and how they should be updated. ld.so thus reads all .rela.dyn entries and applies each relocation, which means that while .rela.dyn is being read sequentially, reads and writes are also performed at various places in the binary, depending on the content of the .rela.dyn entries.
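As an illustration of what applying a relocation means, here is a minimal sketch for the simplest kind, relative relocations. It is a toy model operating on a bytearray, not ld.so's code: the function names, the fake image and the base address are made up. The entry layout (r_offset, r_info, r_addend) and the R_X86_64_RELATIVE type value are those of the ELF64 format on x86-64.

```python
import struct

# Elf64_Rela: r_offset (u64), r_info (u64), r_addend (s64), 24 bytes in total.
ELF64_RELA = struct.Struct("<QQq")
R_X86_64_RELATIVE = 8


def parse_rela(data):
    """Yield (offset, symbol index, type, addend) for each entry."""
    for r_offset, r_info, r_addend in ELF64_RELA.iter_unpack(data):
        yield r_offset, r_info >> 32, r_info & 0xFFFFFFFF, r_addend


def apply_relative(image, rela_data, base):
    """Apply relative relocations: the pointer becomes base + addend."""
    applied = 0
    for r_offset, _symbol, r_type, r_addend in parse_rela(rela_data):
        if r_type == R_X86_64_RELATIVE:
            value = (base + r_addend) & 0xFFFFFFFFFFFFFFFF
            # A scattered 8-byte write, somewhere else in the binary.
            struct.pack_into("<Q", image, r_offset, value)
            applied += 1
    return applied


# Toy image and two made-up relocation entries.
image = bytearray(64)
rela = (ELF64_RELA.pack(16, R_X86_64_RELATIVE, 0x40)
        + ELF64_RELA.pack(32, R_X86_64_RELATIVE, 0x100))
count = apply_relative(image, rela, base=0x7F0000000000)
print(count, hex(struct.unpack_from("<Q", image, 16)[0]))
```

Other relocation types additionally need a symbol lookup, which this sketch leaves out.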
This is where it gets ugly for Firefox: there are nearly 200,000 such relocations. On x86-64 an entry is 24 bytes (8 on x86, where entries carry no explicit addend), and each of these is going to read/write a pointer (8 bytes on x86-64, 4 on x86) at some random (though mostly ordered) location. As the whole .rela.dyn section is not read ahead, what actually happens is that it is read in small batches, with seeks and reads of other data at various locations in between. In libxul.so's case, these are spread over .ctors, .data.rel.ro, .got, and .data. The relocation entries are more or less ordered by the address to be rewritten, though they occasionally jump backwards. Some of these relocations also appear to touch .gnu.version, .dynsym and .dynstr, because their type involves a symbol resolution.
Once the .rela.dyn relocations have been dealt with comes .rela.plt's turn. The principle is the same for this section: entries describe what kind of relocation must be done where, and how the result must be calculated. The scope of this section, though, is apparently limited to .got.plt. But before explaining these relocations, I'll explain what happens with the PLT.
The PLT (Procedure Linkage Table) is used when calling functions from external libraries. For example, in a Hello World program, the PLT would be used for calls to the puts function. The function making the call in fact calls the corresponding PLT entry. A PLT entry, on x86 and x86-64 at least, consists of 3 instructions (I'll skip the gory details, especially for x86, where the caller also needs to set some register before calling the PLT). The first instruction is the only one to be executed most of the time: it reads the final destination from the .got.plt section, and jumps there. That final destination, obviously, is not fixed in the library file, since it needs to be resolved by its symbol. This is why the two subsequent instructions exist: originally, the destination "stored" in the .got.plt section points back to the second instruction, so the first instruction is effectively a nop (no operation), and the following instructions are executed. They jump into code responsible for resolving the symbol, updating the .got.plt entry for the next call, and calling the real function.
But pointing back to the second instruction is like the pointers in static data we saw above: it's not possible in position independent code without a relocation. So, the .rela.plt relocations are actually filling the .got.plt section with these pointers back to the PLT. There are a tad more than 3,000 such relocations.
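The lazy resolution described above can be modelled with a toy class. This is only a model of the control flow, with made-up names (LazyPlt, fake_puts); real PLT entries are machine code, and the slots are addresses rather than Python callables.

```python
class LazyPlt:
    """Toy model of PLT entries backed by .got.plt slots."""

    def __init__(self, resolver):
        self.resolver = resolver  # maps a symbol name to the real function
        self.got_plt = {}         # one slot per symbol: current destination
        self.resolutions = 0

    def _slow_path(self, name):
        """Model the second and third instructions of a PLT entry."""
        def stub(*args):
            self.resolutions += 1
            target = self.resolver(name)
            self.got_plt[name] = target  # next call jumps straight there
            return target(*args)
        return stub

    def relocate(self, names):
        """Model .rela.plt: point every slot back into its own PLT entry."""
        for name in names:
            self.got_plt[name] = self._slow_path(name)

    def call(self, name, *args):
        """Model the first instruction: jump to whatever the slot holds."""
        return self.got_plt[name](*args)


def fake_puts(text):
    """Stand-in for the real puts, so the example has something to call."""
    return len(text)


symbols = {"puts": fake_puts}
plt = LazyPlt(symbols.__getitem__)
plt.relocate(["puts"])
first = plt.call("puts", "hello")   # goes through the slow path
second = plt.call("puts", "world")  # jumps directly to fake_puts
print(first, second, plt.resolutions)
```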
All these relocations should go away when prelinking the binaries, but from my several experiments, it looks like prelinking only avoids relative relocations, and not the others, while it technically could skip all of them. prelink even properly applies all the relocations in the binary, but executing the binary rewrites the same information at startup for all but relative relocations. That could well be a bug, with either ld.so not skipping enough relocations or prelink not marking enough relocations to be skipped. I haven't dug deep enough in the code to know how prelinking works exactly. Anyway, prelinking is not a perfect solution, as it also breaks ASLR. Sure, prelink can randomize library locations, but it relies on the user or a cron job doing so at arbitrary times, which is far from satisfying.
An interesting thing to note, though, is that a good part of the relocations prelinking doesn't rid us of in libxul.so (more than 25,000) are due to the __cxa_pure_virtual symbol, which is used for, well, pure virtual methods. In other words, virtual methods that don't have an implementation in a given class. The __cxa_pure_virtual function is set as the method in the corresponding field(s) of the class vtable in the .data.rel.ro section. This function is provided by libstdc++, and as such, is dynamically linked. But this function is just a placeholder that is never supposed to be called. Defining an empty __cxa_pure_virtual function to be included in libxul.so makes these relocations become relative, and thus taken care of by prelinking.
After all relocations have been applied, the library initialization itself can begin, and the content of the .init section is executed. That section, indirectly, executes all the functions stored in the .ctors section. This includes static initializers, which are unfortunately called backwards, as Taras already pointed out. Each of these static initializers also accesses various locations in the .data or .bss sections, which may or may not have already been loaded during the relocation phase. The execution of these initializers will also (obviously) read various pieces of the .text section (which, despite its name, contains executable code, i.e. the functions themselves).
The initialization of the library ends there, and no further access should happen until a function from the library is called. In libxul.so's case, XRE_main is called first, then many other functions, but that's another story. All that needs to be remembered about the startup process past this point is that the .text section will be read heavily, as well as the various data, got, plt and dynamic symbol sections, in a very scattered way. While most of these sections may already have been brought into memory as a by-product of the various initialization processes described above, some may not have been, which further increases the need to seek all over the binary file.
Now, the main problem with all these I/O patterns at startup is that reorganizing the binary layout seems likely to have a visible impact only if it takes all of the above into account: addressing only a part of it is very likely to just move part of the problem to a different layer.
All in all, making sure the relevant sections of libxul.so are read by the kernel before ld.so enters the game is a good short-term solution to avoid many seeks at startup.
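One simple way to get the kernel to read the file ahead of time is to read it sequentially before it is loaded. The sketch below is a simplification under stated assumptions: it reads the whole file rather than only the relevant sections, the function name is mine, and the posix_fadvise() hint is optional and only used where the platform provides it.

```python
import os


def preload(path, chunk_size=1 << 20):
    """Read a file sequentially so its pages end up in the page cache.

    Returns the number of bytes read.
    """
    total = 0
    with open(path, "rb", buffering=0) as binary:
        if hasattr(os, "posix_fadvise"):
            # Hint that the whole file will be read sequentially.
            os.posix_fadvise(binary.fileno(), 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while True:
            chunk = binary.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total
```

It would be called with the path to libxul.so before the library gets loaded, for example from a small launcher. Where that path is depends on the installation.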