Archive for August, 2023

Hacking the ELF format for Firefox, 12 years later ; doing better with less

(I haven't posted a lot in the past couple years, except for git-cinnabar announcements. This is going to be a long one, hold tight)

This is quite the cryptic title, isn't it? What is this all about? ELF (Executable and Linkable Format) is a file format used for binary files (e.g. executables, shared libraries, object files, and even core dumps) on some Unix systems (Linux, Solaris, BSD, etc.). A little over 12 years ago, I wrote a blog post about improving libxul startup I/O by hacking the ELF format. For context, libxul is the shared library, shipped with Firefox, that contains most of its code.

Let me spare you the read. Back then I was looking at I/O patterns during Firefox startup on Linux, and sought ways to reduce disk seeks that were related to loading libxul. One particular pattern was caused by relocations, and the way we alleviated it was through elfhack.

Relocations are necessary in order for executables to work when they are loaded in memory at a location that is not always the same (because of e.g. ASLR). Applying them requires reading the section containing the relocations, and adjusting the pieces of code or data that are described by the relocations. When the relocation section is very large (and that was the case on libxul back then, and more so now), that means going back and forth (via disk seeks) between the relocation section and the pieces to adjust.

Elfhack to the rescue

Shortly after the aforementioned blog post, the elfhack tool was born and made its way into the Firefox code base.

The main idea behind elfhack was to reduce the size of the relocation section. How? By storing it in a more compact form. But how? By taking the executable apart, rewriting its relocation section, injecting code to apply those relocations, moving sections around, and adjusting the ELF program header, section header, section table, and string table accordingly. I will spare you the gory details (especially the part about splitting segments or the hack to use .bss section as a temporary Global Offset Table). Elfhack itself is essentially a minimalist linker that works on already linked executables. That has caused us a number of issues over the years (and much more). In fact, it's known not to work on binaries created with lld (the linker from the LLVM project) because the way lld lays things out does not provide space for the tricks we pull (although it seems to be working with the latest version of lld. But who knows what will happen with next version).

Hindsight is 20/20, and if I were to redo it, I'd take a different route. Wait, I'm actually kind of doing that! But let me fill you in with what happened in the past 12 years, first.

Android packed relocations

In 2014, Chrome started using a similar-ish approach for Android on ARM with an even more compact format, compared to the crude packing elfhack was doing. Instead of injecting initialization code in the executable, it would use a custom dynamic loader/linker to handle the packed relocations (that loader/linker was forked from the one in the Android NDK, which solved similar problems to what our own custom linker had, but that's another story).

That approach eventually made its way into Android itself, in 2015, with support from the dynamic loader in bionic (the Android libc), and later support for emitting those packed relocations was added to lld in October 2017. Interestingly, the packer added to lld created smaller packed relocations than the packer in Android (for the same format).

The road to standardization

Shortly after bionic got its native packed relocation support, a conversation started on the gnu-gabi mailing list related to the general problem of relocations representing a large portion of Position Independent Executable. What we observed on a shared library had started to creep into programs as well because PIE binaries started to be prominent around that time, with some compilers and linkers starting to default to that for hardening reasons. Both Chrome's and Firefox prior art were mentioned. This was April 2017.

A few months went by, and a simpler format was put forward, with great results, which led to, a few days later, a formal proposal for RELR relocations in the Generic System V Application Binary Interface.

More widespread availability

Shortly after the proposal, Android got experimental support for it, and a few months later, in July 2018, lld gained experimental support as well.

The Linux kernel got support for it too, for KASLR relocations, but for arm64 only (I suppose this was for Android kernels. It still is the only architecture it has support for to this day).

GNU binutils gained support for the proposal (via a -z pack-relative-relocs flag) at the end of 2021, and glibc eventually caught up in 2022, and this shipped respectively in binutils 2.38 and glibc 2.36. These versions should now have reached most latest releases of major Linux distros.

Lld thereafter got support for the same flag as binutils's, with the same side effect of adding a version dependency on GLIBC_ABI_DT_RELR, to avoid crashes when running executables with packed relocations against an older glibc.

What about Firefox?

Elfhack was updated to use the format from the proposal at the very end of 2021 (or rather, close enough to that format). More recently (as in, two months ago), support for the -z pack-relative-relocs flag was added, so that when building Firefox against a recent enough glibc and with a recent enough linker, it will use that instead of elfhack automatically. This means in some cases, Firefox packages in Linux distros will be using those relocations (for instance, that's the case since Firefox 116 in Debian unstable).

Which (finally) brings us to the next step, and the meat of this post.

Retiring Elfhack

It's actually still too early for that. The Firefox binaries Mozilla provides need to run on a broad variety of systems, including many that don't support those new packed relocations. That includes Android systems older than Red Velvet Cake (11), and not necessarily very old desktop systems.

Android Pie (9) shipped with experimental, but incompatible, support for the same packed relocation format, but using different constants. Hacking the PT_DYNAMIC segment (the segment containing metadata for dynamic linking) for compatibility with all Android versions >= 9 would technically be possible, but again, Mozilla needs to support even older versions of Android.

There comes the idea behind what I've now called relrhack: injecting code that can apply the packed relocations created by the linker if the system dynamic loader hasn't.

To some extent, that sounds similar to what elfhack does, doesn't it? But elfhack packs the relocations itself. And because its input is a fully linked binary, it has to do complex things that we know don't always work reliably.

In the past few years, an idea was floating in the back of my mind to change elfhack to start off a relocatable binary (also known as partially linked). It would then rewrite the sections it needs to, and invoke the linker to link that to its initialization code and produce the final binary. That would theoretically avoid all the kinds of problems we've hit, and work more reliably with lld.

The idea I've toyed with more recently, though, is even simpler: Use the -z pack-relative-relocs linker support, and add the initialization code on the linker command line so that it does everything in one go. We're at this sweet spot in time where we can actually start doing this.

Testing the idea

My first attempts were with a small executable, and linking with lld's older --pack-dyn-relocs=relr flag, which does the same as -z pack-relative-relocs but skips adding the GLIBC_ABI_DT_RELR version dependency. That allowed to avoid having to do post-processing of the binary in this first experimentation step.

I quickly got something working on a Debian Bullseye system (using an older glibc that doesn't support the packed relocations). Here's how it goes:

// Compile with: clang -fuse-ld=lld -Wl,--pack-dyn-relocs=relr,--entry=my_start,-z,norelro -o relr-test
#include <stdio.h>

char *helloworld[] = {"Hello, world"};

int main(void) {
  printf("%s\n", helloworld[0]);
  return 0;
}

This is a minimal Hello world program that contains a relative relocation: the helloworld variable is an array of pointers, and those pointers need to be relocated. Optimizations would get rid of the array but we don't enable optimizations specifically for that. We also disable "Relocation Read-Only", which is a protection that makes the dynamic loader relocated sections read-only after it's done applying relocations. That would prevent us from applying the missing relocations on our own. We're just testing, we'll deal with that later.

Compiling just this without --entry=my_start (because we haven't defined that yet), and running it yields a segmentation fault. We don't even reach main because there actually is an initialization function section that runs before that, and its location, defined in the .init_array section, is behind a relative relocation, which --pack-dyn-relocs=relr packed. This is exactly why -z pack-relative-relocs adds a dependency on a symbol version that doesn't exist in older glibcs. With that flag, the error becomes:

/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_ABI_DT_RELR' not found

which is more user-friendly than a plain crash.

At this point, what do we want? Well, we want to apply the relocations ourselves, as early as possible. The first thing that will run in an executable is its "entry point", that defaults to _start (provided by the C runtime, aka CRT). As hinted in the code snippet above, we can set our own with --entry.

static void real_init();
extern void _start();

void my_start() {
  real_init();
  _start();
}

Here's our own entry point. It will start by calling the "real" initialization function we forward declare here. Let's see if that actually works. Let's add the following temporarily and see how things go.

void real_init() {
  printf("Early Hello world\n");
}

Running the program now yields:

$ ./relr-test
Early Hello world
Segmentation fault

There we go, we've executed code before anything relies on the relative relocations being applied. By the way, adding functions calls like this printf, that early, with elfhack, was an interesting challenge. This is pleasantly much simpler.

Applying the relocations for real

Let's replace that real_init function with some boilerplate for the upcoming real real_init:

#include <link.h>

#ifndef DT_RELRSZ
#define DT_RELRSZ 35
#endif
#ifndef DT_RELR
#define DT_RELR 36
#endif

extern ElfW(Dyn) _DYNAMIC[];
extern ElfW(Ehdr) __executable_start;

The defines are there because older systems don't have them in link.h. _DYNAMIC is a symbol that gives access to the PT_DYNAMIC segment at runtime, and the __executable_start symbol gives access to the base address of the program, which non-relocated addresses in the binary are relative to.

Now we're ready for the real work:

void real_init() {
  // Find the relocations section.
  ElfW(Addr) relr;
  ElfW(Word) size = 0;
  for (ElfW(Dyn) *dyn = _DYNAMIC; dyn->d_tag != DT_NULL; dyn++) {
    if (dyn->d_tag == DT_RELR) {
      relr = dyn->d_un.d_ptr;
    }
    if (dyn->d_tag == DT_RELRSZ) {
      size = dyn->d_un.d_val;
    }
  }
  uintptr_t elf_header = (uintptr_t)&__executable_start;

  // Apply the relocations.
  ElfW(Addr) *ptr, *start, *end;
  start = (ElfW(Addr) *)(elf_header + relr);
  end = (ElfW(Addr) *)(elf_header + relr + size);
  for (ElfW(Addr) *entry = start; entry < end; entry++) {
    if ((*entry & 1) == 0) {
      ptr = (ElfW(Addr) *)(elf_header + *entry);
      *ptr += elf_header;
    } else {
      size_t remaining = 8 * sizeof(ElfW(Addr)) - 1;
      ElfW(Addr) bits = *entry;
      do {
        bits >>= 1;
        remaining--;
        ptr++;
        if (bits & 1) {
          *ptr += elf_header;
        }
      } while (bits);
      ptr += remaining;
    }
  }
}

It's all kind of boring here. We scan the PT_DYNAMIC segment to get the location and size of the packed relocations section, and then read and apply them.

And does it work?

$ ./relr-test
Hello, world

It does! Mission accomplished? If only...

The devil is in the details

Let's try running this same binary on a system with a more recent glibc:

$ ./relr-test 
./relr-test: error while loading shared libraries: ./relr-test: DT_RELR without GLIBC_ABI_DT_RELR dependency

Oh come on! Yes, glibc insists that when the PT_DYNAMIC segment contains these types of relocations, the binary must have that symbol version dependency. That same symbol version dependency we need to avoid in order to work on older systems. I have no idea why the glibc developers went all their way to prevent that. Someone even asked when this was all at the patch stage, with no answer.

We'll figure out a workaround later. Let's use -Wl,-z,pack-relative-relocs for now and see how it goes.

$ ./relr-test 
Segmentation fault

Oops. Well, that actually didn't happen when I was first testing, but for the purpose of this post, I didn't want to touch this topic before strictly necessary. Because we're now running on a system that does support the packed relocations, when our initialization code is reached, relocations are already applied, and we're applying them again. That overcompensates every relocated address, and leads to accesses to unmapped memory.

But how can we know whether relocations were applied? Well, conveniently, the address of a function, from within that function, doesn't need a relative relocation to be known. That's one half. The other half requires "something" that uses a relative relocation to know that same address. We insert this before real_init, but after its forward declaration:

void (*__real_init)() = real_init;

Because it's a global variable that points to the address of the function, it requires a relocation. And because the function is static and in the compilation unit, it needs a relative relocation, not one that would require symbol resolution.

Now we can add this at the beginning of real_init:

  // Don't apply relocations when the dynamic loader has applied them already.
  if (__real_init == real_init) {
    return;
  }

And we're done. This works:

$ ./relr-test 
Hello, world

Unfortunately, we're back to square one on an older system:

$ ./relr-test 
./relr-test: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_ABI_DT_RELR' not found (required by ./relr-test)

Hacking the ELF format, again

And here we go again, having to post-process a binary. So what do we need this time around? Well, starting from a binary linked with --pack-dyn-relocs=relr, we need to avoid the "DT_RELR without GLIBC_ABI_DT_RELR" check. If we change the PT_DYNAMIC segment such that it doesn't contain DT_RELR-related tags, the error will be avoided. Sadly, that means we'll always apply relocations ourselves, but so be it.

How do we do that? Open the file, find the PT_DYNAMIC segment, scan it, overwrite a few tags with a different value, and done. Damn, that's much less work than everything elfhack was doing. I will spare you the code required to do that. Heck, that can trivially be done in a hex editor. Hey, you know what? That would actually be less stuff to write here than ELF parsing code, and would still allow you to follow at home.

Let's start from that binary we built earlier with --pack-dyn-relocs=relr.

$ objcopy --dump-section .dynamic=dyn relr-test

We now have a dyn file with the contents of the PT_DYNAMIC segment.

In that segment, each block of 16 bytes (assuming a 64-bits system) stores a 8-byte tag and a 8-byte value. We want to change the DT_RELR, DT_RELRSZ and DT_RELRENT tags. Their hex value are, respectively, 0x24, 0x23 and 0x25.

$ xxd dyn | grep 2[345]00
00000060: 2400 0000 0000 0000 6804 0000 0000 0000  $.......h.......
00000070: 2300 0000 0000 0000 1000 0000 0000 0000  #...............
00000080: 2500 0000 0000 0000 0800 0000 0000 0000  %...............

(got lucky a bit here, not matching anywhere else than in the tag)

Let's set an extra arbitrary high-ish bit.

$ xxd dyn | sed -n '/: 2[345]00/s/ 0000/ 0080/p'
00000060: 2400 0080 0000 0000 6804 0000 0000 0000  $.......h.......
00000070: 2300 0080 0000 0000 1000 0000 0000 0000  #...............
00000080: 2500 0080 0000 0000 0800 0000 0000 0000  %...............

This went well, let's do it for real.

$ xxd dyn | sed '/: 2[345]00/s/ 0000/ 0080/' | xxd -r > dyn.new
$ objcopy --update-section .dynamic=dyn.new relr-test

Let me tell you I'm glad we're in 2023, because these objcopy options we just used didn't exist 12+ years ago.

So, how did it go?

$ ./relr-test 
Segmentation fault

Uh oh. Well duh, we didn't change the code that applies the relocations, so it can't find the packed relocation section.

Let's edit the loop to use this:

    if (dyn->d_tag == (DT_RELR | 0x80000000)) {
      relr = dyn->d_un.d_ptr;
    }
    if (dyn->d_tag == (DT_RELRSZ | 0x80000000)) {
      size = dyn->d_un.d_val;
    }

And start over:

$ clang -fuse-ld=lld -Wl,--pack-dyn-relocs=relr,--entry=my_start,-z,norelro -o relr-test relr-test.c
$ objcopy --dump-section .dynamic=dyn relr-test
$ xxd dyn | sed '/: 2[345]00/s/ 0000/ 0080/' | xxd -r > dyn.new
$ objcopy --update-section .dynamic=dyn.new relr-test
$ ./relr-test
Hello, world

Copy over to the newer system, and try:

$ ./relr-test
Hello, world

Flawless victory. We now have a binary that works on both old and new systems, using packed relocations created by the linker, and barely post-processing the binary (and we don't need that if (__real_init == real_init) anymore).

Generalizing a little

Okay, so while we're here, we'd rather use -z packed-relative-relocs because it works across more linkers, so we need to get rid of that GLIBC_ABI_DT_RELR symbol version dependency it adds, in order for the output to be more or less equivalent to what --pack-dyn-relocs=relr would produce.

$ clang -fuse-ld=lld -Wl,-z,pack-relative-relocs,--entry=my_start,-z,norelro -o relr-test relr-test.c

You know what, we might as well learn new things. Objcopy is nice, but as I was starting to write this section, I figured it was going to be annoying to do in the same style as above.

Have you heard of GNU poke? I saw a presentation about it at FOSDEM 2023, and haven't had the occasion to try it, I guess this is the day to do that. We'll be using GNU poke 3.2 (latest version as of writing).

Of course, that version doesn't contain the necessary bits. But this is Free Software, right? After a few patches, we're all set.

$ git clone https://git.savannah.gnu.org/git/poke/poke-elf
$ POKE_LOAD_PATH=poke-elf poke relr-test
(poke) load elf
(poke) var elf = Elf64_File @ 0#B

Let's get the section containing the symbol version information. It starts with a Verneed header.

(poke) var section = elf.get_sections_by_type(ELF_SHT_GNU_VERNEED)[0]
(poke) var verneed = Elf_Verneed @ section.sh_offset
(poke) verneed
Elf_Verneed {vn_version=1UH,vn_cnt=2UH,vn_file=110U,vn_aux=16U,vn_next=0U}

vn_file identifies the library file expected to contain those vn_cnt versions. Let's check this is about the libc. The section's sh_link will tell us which entry of the section header (shdr) corresponds to the string table that vn_file points into.

(poke) var strtab = elf.shdr[section.sh_link].sh_offset
(poke) string @ strtab + verneed.vn_file#B
"libc.so.6"

Bingo. Now let's scan the two (per vn_cnt) Vernaux entries that the Verneed header points to via vn_aux. The first one:

(poke) var off = section.sh_offset + verneed.vn_aux#B
(poke) var aux = Elf_Vernaux @ off
(poke) aux
Elf_Vernaux {vna_hash=157882997U,vna_flags=0UH,vna_other=2UH,vna_name=120U,vna_next=16U}
(poke) string @ strtab + aux.vna_name#B
"GLIBC_2.2.5"

And the second one, that vna_next points to.

(poke) var off = off + aux.vna_next#B
(poke) var aux2 = Elf_Vernaux @ off
(poke) aux2
Elf_Vernaux {vna_hash=16584258U,vna_flags=0UH,vna_other=3UH,vna_name=132U,vna_next=0U}
(poke) string @ strtab + aux2.vna_name#B
"GLIBC_ABI_DT_RELR"

This is it. This is the symbol version we want to get rid of. We could go on by adjusting vna_next in the first entry, and reducing vn_cnt in the header, but forward thinking to automating this for binaries that may contain more than two symbol versions from more than one dependency, it's just simpler to pretend this version is a repeat of the previous one. So we copy all its fields, except vna_next.

(poke) aux2.vna_hash = aux.vna_hash 
(poke) aux2.vna_flags = aux.vna_flags 
(poke) aux2.vna_other = aux.vna_other
(poke) aux2.vna_name = aux.vna_name

We could stop here and go back to the objcopy/xxd way of adjusting the PT_DYNAMIC segment, but while we're in poke, it can't hurt to try to do the adjustement with it.

(poke) var dyn = elf.get_sections_by_type(ELF_SHT_DYNAMIC)[0]
(poke) var dyn = Elf64_Dyn[dyn.sh_size / dyn.sh_entsize] @ dyn.sh_offset
(poke) for (d in dyn) if (d.d_tag in [ELF_DT_RELR,ELF_DT_RELRSZ,ELF_DT_RELRENT]) d.d_tag |= 0x80000000L
<stdin>:1:20: error: invalid operand in expression
</stdin><stdin>:1:20: error: expected uint<32>, got Elf64_Sxword

Gah, that seemed straightforward. It turns out in is not lenient about integer types. Let's just use the plain values.

(poke) for (d in dyn) if (d.d_tag in [0x23L,0x24L,0x25L]) d.d_tag |= 0x80000000L
unhandled constraint violation exception
failed expression
  elf_config.check_enum ("dynamic-tag-typ                       elf_mach, d_tag)
in field Elf64_Dyn.d_tag

This time, this is because poke is actually validating the tag values, which is both a blessing and a curse. It can avoid shooting yourself in the foot (after all, we're setting a non-existing value), but also hinder getting things done (because before I actually got here, many of the d_tag values in the binary straight out of the linker weren't even supported).

Let's make poke's validator know about the values we're about to set:

(poke) for (n in [0x23L,0x24L,0x25L]) elf_config.add_enum :class "dynamic-tag-types" :entries [Elf_Config_UInt { value = 0x80000000L | n }]
(poke) for (d in dyn) if (d.d_tag in [0x23L,0x24L,0x25L]) d.d_tag |= 0x80000000L
(poke) .exit
$ ./relr-test
Hello, world

And it works on the newer system too!

Repeating for a shared library

Let's set up a new testcase, using a shared library:

  • Take our previous testcase, and rename the main function to relr_test.
  • Compile it with clang -fuse-ld=lld -Wl,--pack-dyn-relocs=relr,--entry=my_start,-z,norelro -fPIC -shared -o librelr-test.so
  • Create a new file with the following content and compile it:
// Compile with: clang -o relr-test -L. -lrelr-test -Wl,-rpath,'$ORIGIN'
extern int relr_test(void);

int main(void) {
  return relr_test();
}
  • Apply the same GNU poke commands as before, on the librelr-test.so file.

So now, it should work, right?

$ ./relr-test
Segmentation fault

Oops. What's going on?

$ gdb -q -ex run -ex backtrace -ex detach -ex quit ./relr-test
Reading symbols from ./relr-test...
(No debugging symbols found in ./relr-test)
Starting program: /relr-test 
BFD: /librelr-test.so: unknown type [0x13] section `.relr.dyn'
warning: `/librelr-test.so': Shared library architecture unknown is not compatible with target architecture i386:x86-64.

Program received signal SIGSEGV, Segmentation fault.
0x00000000000016c0 in ?? ()
#0  0x00000000000016c0 in ?? ()
#1  0x00007ffff7fe1fe2 in call_init (l=<optimized out>, argc=argc@entry=1, argv=argv@entry=0x7fffffffdfc8, 
    env=env@entry=0x7fffffffdfd8) at dl-init.c:72
#2  0x00007ffff7fe20e9 in call_init (env=0x7fffffffdfd8, argv=0x7fffffffdfc8, argc=1, l=</optimized><optimized out>) at dl-init.c:30
#3  _dl_init (main_map=0x7ffff7ffe180, argc=1, argv=0x7fffffffdfc8, env=0x7fffffffdfd8) at dl-init.c:119
#4  0x00007ffff7fd30ca in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#5  0x0000000000000001 in ?? ()
#6  0x00007fffffffe236 in ?? ()
#7  0x0000000000000000 in ?? ()
Detaching from program: /relr-test, process 3104868
[Inferior 1 (process 3104868) detached]

Side note: it looks like we'll also need to change some section types if we want to keep tools like gdb happy.

So, this is crashing when doing what looks like a jump/call to an address that is not relocated (seeing how low it is). Let's pull the libc6 source and see what's around dl-init.c:72:

addrs = (ElfW(Addr) *) (init_array->d_un.d_ptr + l->l_addr);
for (j = 0; j < jm; ++j)
  ((init_t) addrs[j]) (argc, argv, env);

This is when it goes through .init_array and calls each of the functions in the table. So, .init_array is not relocated, which means our initialization code hasn't run. But why? Well, that's because the ELF entry point is not used for shared libraries. So, we need to execute our code some other way. What runs on shared library loading? Well, functions from the .init_array table... but they need to be relocated, we got ourselves a chicken and egg problem. Does something else run before that? It turns out that yes, right before that dl-init.c:72 code, there is this:

if (l->l_info[DT_INIT] != NULL)
  DL_CALL_DT_INIT(l, l->l_addr + l->l_info[DT_INIT]->d_un.d_ptr, argc, argv, env);

And the good news here is that it doesn't require DT_INIT to be relocated: that l_addr is the base address the loader used for the library, so it's relocating the address itself. Thank goodness.

So, how do we get a function in DT_INIT? Well... we already have one:

$ readelf -d librelr-test.so | grep '(INIT)'
 0x000000000000000c (INIT)               0x18a8
$ readelf -sW librelr-test.so | grep 18a8
     7: 00000000000018a8     0 FUNC    GLOBAL DEFAULT   13 _init
    20: 00000000000018a8     0 FUNC    GLOBAL DEFAULT   13 _init

So we want to wrap it similarly to what we did for _start, adding the following to the code of the library:

extern void _init();

void my_init() {
  real_init();
  _init();
}

And we replace --entry=my_start with --init=my_init when relinking librelr-test.so (while not forgetting all the GNU poke dance), and it finally works:

$ ./relr-test
Hello, world

(and obviously, it also works on the newer system too)

But does this work for Firefox?

We now have a manual procedure that gets us mostly what we want, that works with two tiny testcases. But does it scale to Firefox? Before implementing the whole thing, let's test a little more. First, let's build two .o files based on our code so far, without the relr_test function. One with the my_init wrapper, the other with the my_start wrapper. We'll call the former relr-test-lib.o and the latter relr-test-bin.o (Compile with clang -c -fPIC -O2).

Then, let's add the following to the .mozconfig we use to build Firefox:

export MOZ_PROGRAM_LDFLAGS="-Wl,-z,pack-relative-relocs,--entry=my_start,-z,norelro /path/to/relr-test-bin.o"
mk_add_options 'export EXTRA_DSO_LDOPTS="-Wl,-z,pack-relative-relocs,--init=my_init,-z,norelro /path/to/relr-test-lib.o"'

This leverages some arcane Firefox build system knowledge to have something minimally intrusive to use the flags we need and to inject our code. However, because of how the Firefox build system works, it also means some Rust build scripts will also be compiled with these flags (unfortunately). In turn, this means those build scripts won't run on a system without packed relocation support in glibc, so we need to build Firefox on the newer system.

And because we're on the newer system, running this freshly built Firefox will just work, because the init code is skipped and relocations applied by the dynamic loader. Things will only get spicy when we start applying our hack to make our initialization code handle the relocations itself. Because Firefox is bigger than our previous testcases, scanning through to find the right versioned symbol to remove is going to be cumbersome, so we'll just skip that part. In fact, we can just use our first approach with objcopy, because it's smaller. After a successful build, let's first do that for libxul.so, which is the largest binary in Firefox.

$ objcopy --dump-section .dynamic=dyn obj-x86_64-pc-linux-gnu/dist/bin/libxul.so
$ xxd dyn | sed '/: 2[345]00/s/ 0000/ 0080/' | xxd -r > dyn.new
$ objcopy --update-section .dynamic=dyn.new obj-x86_64-pc-linux-gnu/dist/bin/libxul.so
$ ./mach run
 0:00.15 /path/to/obj-x86_64-pc-linux-gnu/dist/bin/firefox -no-remote -profile /path/to/obj-x86_64-pc-linux-gnu/tmp/profile-default
$ echo $?
245

Aaaand... it doesn't start. Let's try again in a debugger.

$ ./mach run --debug
<snip>
(gdb) run
<snip>
Thread 1 "firefox" received signal SIGSEGV, Segmentation fault.
real_init () at /tmp/relr-test.c:55
55          if ((*entry & 1) == 0) {

It's crashing while applying the relocations?! But why?

(gdb) print entry
$1 = (Elf64_Addr *) 0x303c8

That's way too small to be a valid address. What's going on? Let's start looking where this value is and where it comes from.

(gdb) print &entry
Address requested for identifier "entry" which is in register $rax

So where does the value of the rax register come from?

(gdb) set pagination off
(gdb) disassemble/m
<snip>
41          if (dyn->d_tag == (DT_RELR | 0x80000000)) {
42            relr = dyn->d_un.d_ptr;
   0x00007ffff2289f47 <+71>:    mov    (%rcx),%rax
<snip>
52        start = (ElfW(Addr) *)(elf_header + relr);
   0x00007ffff2289f54 <+84>:    add    0x681185(%rip),%rax        # 0x7ffff290b0e0
<snip>

So rax starts with the value from DT_RELR, and the value stored at the address 0x7ffff290b0e0 is added to it. What's at that address?

(gdb) print *(void**)0x7ffff290b0e0
$1 = (void *) 0x0

Well, no surprise here. Wanna bet it's another chicken and egg problem?

(gdb) info files
<snip>
        0x00007ffff28eaed8 - 0x00007ffff290b0e8 is .got in /path/to/obj-x86_64-pc-linux-gnu/dist/bin/libxul.so
<snip>

It's in the Global Offset Table, that's typically something that will have been relocated. It smells like there's a packed relocation for this, which would confirm our new chicken and egg problem. First, we find the non-relocated virtual address of the .got section in libxul.so.

$ readelf -SW obj-x86_64-pc-linux-gnu/dist/bin/libxul.so | grep '.got '
  [28] .got              PROGBITS        000000000ab7aed8 ab78ed8 020210 00  WA  0   0  8

So that 0x000000000ab7aed8 is loaded at 0x00007ffff28eaed8. Then we check if there's a relocation for the non-relocated virtual address of 0x7ffff290b0e0.

$ readelf -r obj-x86_64-pc-linux-gnu/dist/bin/libxul.so | grep -e Relocation -e $(printf %x $((0x7ffff290b0e0 - 0x00007ffff28eaed8 + 0x000000000ab7aed8)))
Relocation section '.rela.dyn' at offset 0x28028 contains 1404 entries:
Relocation section '.relr.dyn' at offset 0x303c8 contains 13406 entries:
000000000ab9b0e0
Relocation section '.rela.plt' at offset 0x4a6b8 contains 2635 entries:

And there is, and it is a RELR one, one of those that we're supposed to apply ourselves... we're kind of doomed aren't we? But how come this wasn't a problem with librelr-test.so? Let's find out in the corresponding code there:

$ objdump -d librelr-test.so
<snip>
    11e1:       48 8b 05 30 21 00 00    mov    0x2130(%rip),%rax        # 3318 <__executable_start@Base>
<snip>
$ readelf -SW librelr-test.so
<snip>
  [20] .got              PROGBITS        0000000000003308 002308 000040 08  WA  0   0  8
<snip>
$ readelf -r librelr-test.so | grep -e Relocation -e 3318
Relocation section '.rela.dyn' at offset 0x450 contains 7 entries:
000000003318  000300000006 R_X86_64_GLOB_DAT 0000000000000000 __executable_start + 0
Relocation section '.rela.plt' at offset 0x4f8 contains 1 entry:
Relocation section '.relr.dyn' at offset 0x510 contains 3 entries:

We had a relocation through symbol resolution, which the dynamic loader applies before calling our initialization code. That's what saved us, but all things considered, that is not exactly great either.

How do we avoid this? Well, let's take a step back, and consider why the GOT is being used. Our code is just using the address of __executable_start, and the compiler doesn't know where it is (the symbol is extern). Since it doesn't know where it is, and whether it will be in the same binary, and because we are building Position Independent Code, it uses the GOT, and a relocation will put the right address in the GOT. At link time, when the linker knows the symbol is in the same binary, it ends up using a relative relocation, which causes our problem.

So, how do we avoid using the GOT? By making the compiler aware that the symbol is eventually going to be in the same binary, which we can do by marking it with the hidden visibility.

Replacing

extern ElfW(Ehdr) __executable_start;

with

extern __attribute__((visibility("hidden"))) ElfW(Ehdr) __executable_start;

will do that for us. And after rebuilding, and re-hacking, our Firefox works, yay!

Let's try other binaries

Let's now try with the main Firefox binary.

$ objcopy --dump-section .dynamic=dyn obj-x86_64-pc-linux-gnu/dist/bin/firefox
$ xxd dyn | sed '/: 2[345]00/s/ 0000/ 0080/' | xxd -r > dyn.new
$ objcopy --update-section .dynamic=dyn.new obj-x86_64-pc-linux-gnu/dist/bin/firefox
$ ./mach run
 0:00.15 /path/to/obj-x86_64-pc-linux-gnu/dist/bin/firefox -no-remote -profile /path/to/obj-x86_64-pc-linux-gnu/tmp/profile-default
$ echo $?
245

We crashed again. Come on! What is it this time?

$ ./mach run --debug
<snip>
(gdb) run
<snip>
Program received signal SIGSEGV, Segmentation fault.
0x0000000000032370 in ?? ()
(gdb) bt
#0  0x0000000000032370 in ?? ()
#1  0x00005555555977be in phc_init (aMallocTable=0x7fffffffdb38, aBridge=0x555555626778 <greplacemallocbridge>)
    at /path/to/memory/replace/phc/PHC.cpp:1700
#2  0x00005555555817c5 in init () at /path/to/memory/build/mozjemalloc.cpp:5213
#3  0x000055555558196c in Allocator<replacemallocbase>::malloc (arg1=72704) at /path/to/memory/build/malloc_decls.h:51
#4  malloc (arg1=72704) at /path/to/memory/build/malloc_decls.h:51
#5  0x00007ffff7ca57ba in (anonymous namespace)::pool::pool (this=0x7ffff7e162c0 <(anonymous namespace)::emergency_pool>)
    at ../../../../src/libstdc++-v3/libsupc++/eh_alloc.cc:123
#6  __static_initialization_and_destruction_0 (__priority=65535, __initialize_p=1)
    at ../../../../src/libstdc++-v3/libsupc++/eh_alloc.cc:262
#7  _GLOBAL__sub_I_eh_alloc.cc(void) () at ../../../../src/libstdc++-v3/libsupc++/eh_alloc.cc:338
#8  0x00007ffff7fcfabe in call_init (env=0x7fffffffdd00, argv=0x7fffffffdcd8, argc=4, l=<optimized out>) at ./elf/dl-init.c:70
#9  call_init (l=</optimized><optimized out>, argc=4, argv=0x7fffffffdcd8, env=0x7fffffffdd00) at ./elf/dl-init.c:26
#10 0x00007ffff7fcfba4 in _dl_init (main_map=0x7ffff7ffe2e0, argc=4, argv=0x7fffffffdcd8, env=0x7fffffffdd00) at ./elf/dl-init.c:117
#11 0x00007ffff7fe5a60 in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#12 0x0000000000000004 in ?? ()
#13 0x00007fffffffdfae in ?? ()
#14 0x00007fffffffdfe2 in ?? ()
#15 0x00007fffffffdfed in ?? ()
#16 0x00007fffffffdff6 in ?? ()
#17 0x0000000000000000 in ?? ()
(gdb) info symbol 0x00007ffff7ca57ba
_GLOBAL__sub_I_eh_alloc.cc + 58 in section .text of /lib/x86_64-linux-gnu/libstdc++.so.6

Oh boy! So here, what's going on is that the libstdc++ initializer is called before Firefox's, and that initializer calls malloc, which is provided by the Firefox binary, but because Firefox's initializer hasn't run yet, the code in its allocator that depends on relative relocations fails...

Let's... just workaround this by disabling the feature of the Firefox allocator that requires those relocations:

ac_add_options --disable-replace-malloc

Rebuild, re-hack, and... Victory is mine!

Getting this in production

So far, we've looked at how we can achieve the same as elfhack with a simpler and more reliable strategy, that will allow us to consistently use lld across platforms and build types. Now that the approach has been validated, we can proceed with writing the actual code and hooking it in the Firefox build system. Our strategy here will be for our new tool to act as the linker. It will take all the arguments the compiler passes it, and will itself call the real linker with all the required extra arguments, including the object file containing the code to apply the relocations.

Of course, I also encountered some more grievances. For example, GNU ld doesn't define the __executable_start symbol when linking shared libraries, contrary to lld. Thankfully, it defines __ehdr_start, with the same meaning (and so does lld). There are also some details I left out for the _init function, which normally takes 3 arguments, and that the actual solution will have to deal with. It will also have to deal with "Relocation Read-Only" (relro), but for that, we can just reuse the code from elfhack.

The code already exists, and is up for review (this post was written in large part to give reviewers some extra background). The code handles desktop Linux for now (Android support will come later ; it will require a couple adjustments), and is limited to shared libraries (until the allocator is changed to avoid using relative relocations). It's also significantly smaller than elfhack.

$ loc build/unix/elfhack/elf*
--------------------------------------------------------------------------------
 Language             Files        Lines        Blank      Comment         Code
--------------------------------------------------------------------------------
 C++                      2         2393          230          302         1861
 C/C++ Header             1          701          120           17          564
--------------------------------------------------------------------------------
 Total                    3         3094          350          319         2425
--------------------------------------------------------------------------------
$ loc build/unix/elfhack/relr* 
--------------------------------------------------------------------------------
 Language             Files        Lines        Blank      Comment         Code
--------------------------------------------------------------------------------
 C++                      1          443           32           62          349
 C/C++ Header             1           25            5            3           17
--------------------------------------------------------------------------------
 Total                    2          468           37           65          366
--------------------------------------------------------------------------------

(this excludes the code to apply relocations, which is shared between both)

This is the beginning of the end for elfhack. Once "relrhack" is enabled in its place, it will be left around for Firefox downstream builds on systems with older linkers that don't support the necessary flags. Elfhack will eventually be removed when support for those systems is dropped, in a few years. Further down the line, we'll be able to retire both tools, as support for RELR relocations become ubiquitous.

As anticipated, this was a long post. Thank you for sticking to the end.

2023-08-30 11:16:38+0900

p.m.o | 2 Comments »