February 1st, 2024

When undefined behavior causes a nonsensical error (in Rust)

This all started when I looked at whether it would be possible to build Firefox with Pointer Authentication Code for arm64 macOS. In case you're curious, the quick answer is no, because Apple essentially hasn't upstreamed the final ABI for it yet, only Xcode clang can produce it, and obviously Rust can't.

Anyways, the Rust compiler did recently add the arm64e-apple-darwin target (which, as mentioned above, turns out to be useless for now), albeit without a prebuilt libstd (so, requiring the use of the -Zbuild-std flag). And by recently, I mean in 1.76.0 (in beta as of writing).

So, after tricking the Firefox build system into building for that target, I ended up with a Firefox build that... crashed on startup, saying:

Hit MOZ_CRASH(unsafe precondition(s) violated: slice::from_raw_parts requires the pointer to be aligned and non-null, and the total size of the slice not to exceed isize::MAX) at /builds/worker/fetches/rustc/lib/rustlib/src/rust/library/core/src/panicking.rs:155

(MOZ_CRASH is what we get on explicit crashes, like MOZ_ASSERT in C++ code, or assert!() in Rust)

The caller of the crashing code was NS_InvokeByIndex, so at this point, I was thinking XPConnect might need some adjustment for arm64e.

But that was a build I had produced through the Mozilla try server. So I did a local non-optimized debug build to see what's up, which crashed with a different message:

Hit MOZ_CRASH(slice::get_unchecked requires that the index is within the slice) at /Users/glandium/.rustup/toolchains/nightly-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/slice/index.rs:228

This comes from this code in rust libstd:

    unsafe fn get_unchecked(self, slice: *const [T]) -> *const T {
        debug_assert_nounwind!(
            self < slice.len(),
            "slice::get_unchecked requires that the index is within the slice",
        );
        // SAFETY: the caller guarantees that `slice` is not dangling, so it
        // cannot be longer than `isize::MAX`. They also guarantee that
        // `self` is in bounds of `slice` so `self` cannot overflow an `isize`,
        // so the call to `add` is safe.
        unsafe {
            crate::hint::assert_unchecked(self < slice.len());
            slice.as_ptr().add(self)
        }
    }

(I'm pasting the whole thing because it will be important later)

We're hitting the debug_assert_nounwind.

The calling code looks like the following:

let end = atoms.get_unchecked(STATIC_ATOM_COUNT) as *const _;

And what the debug_assert_nounwind means is that STATIC_ATOM_COUNT is greater than or equal to the slice length (spoiler alert: it is equal).

At that point, I started to suspect this might be a more general issue with the new Rust version, rather than something limited to arm64e. And I was kind of right? Mozilla automation did show crashes on all platforms when building with Rust beta (currently 1.76.0). But that was a different, and nonsensical, crash:

Hit MOZ_CRASH(attempt to add with overflow) at servo/components/style/gecko_string_cache/mod.rs:77

But this time, it was in the same vicinity as the crash I was getting locally.

Since this was talking about an overflowing addition, I wrapped both terms in dbg!() to see the numbers and... the overflow disappeared but now I was getting a plain crash:

application crashed [@ <usize as core::slice::index::SliceIndex<[T]>>::get_unchecked]

(still from the same call to get_unchecked, at least)

The problem was fixed by essentially removing the entire code that was using get_unchecked. And they all lived happily ever after.
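
For reference, there is a sound way to get such a one-past-the-end pointer without ever indexing out of bounds; here's a minimal sketch (not necessarily what the actual fix looked like):

let end = atoms.as_ptr_range().end;

as_ptr_range() computes the half-open range covering the slice, and its end is exactly the one-past-the-end pointer, which is allowed to exist as long as it's not dereferenced.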

But this was too weird to leave it at that.

So what's going on?

Well, first is that despite there being a debug_assert, debug builds don't complain about the out-of-bounds use of get_unchecked. Only when using -Zbuild-std does it happen. I'm not sure whether that's intended, but I opened an issue about it to find out.

Second, in the code I pasted from get_unchecked, the hint::assert_unchecked is new in 1.76.0 (well, it was intrinsics::assume in 1.76.0 and became hint::assert_unchecked in 1.77.0, but it wasn't there before). This is why our broken code didn't cause actual problems until now.
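
To make what such a hint does more concrete, here is a small example of my own (not Firefox code; hint::assert_unchecked has since been stabilized): the hint is a promise to the optimizer, and a false promise is undefined behavior.

// The hint below licenses the optimizer to fold the bounds check to
// `true`, turning `v[i]` into an unchecked, possibly out-of-bounds, read.
fn get_or_zero(v: &[u8], i: usize) -> u8 {
    unsafe { std::hint::assert_unchecked(i < v.len()) };
    if i < v.len() { v[i] } else { 0 }
}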

What about the addition overflow?

Well, this is where undefined behavior leads the optimizer to do things the user might perceive as weird, but that actually make sense (as usual where undefined behavior is involved). Let's start with a standalone version of the original code, simplifying the types used originally:

#![allow(non_upper_case_globals, non_snake_case, dead_code)]

// The testcase needs these definitions to compile; the values below are
// recovered from the IR dumps further down (2601 entries of 12 bytes each,
// at offset 61744). The actual contents don't matter.
const STATIC_ATOM_COUNT: usize = 2601;
const kGkAtomsArrayOffset: u32 = 61744;
static gGkAtoms: [u8; kGkAtomsArrayOffset as usize + 12 * STATIC_ATOM_COUNT] =
    [0; kGkAtomsArrayOffset as usize + 12 * STATIC_ATOM_COUNT];

#[inline]
fn static_atoms() -> &'static [[u32; 3]; STATIC_ATOM_COUNT] {
    unsafe {
        let addr = &gGkAtoms as *const _ as usize + kGkAtomsArrayOffset as usize;
        &*(addr as *const _)
    }
}

#[inline]
fn valid_static_atom_addr(addr: usize) -> bool {
    unsafe {
        let atoms = static_atoms();
        let start = atoms.as_ptr();
        let end = atoms.get_unchecked(STATIC_ATOM_COUNT) as *const _;
        let in_range = addr >= start as usize && addr < end as usize;
        let aligned = addr % 4 == 0;
        in_range && aligned
    }
}

fn main() {
    println!("{:?}", valid_static_atom_addr(0));
}

Stick this code in a newly created crate (with e.g. cargo new testcase), and run it:

$ cargo +nightly run -q
false

Nothing obviously bad happened. So what went wrong in Firefox? In my first local attempt, I had -Zbuild-std, so let's try that:

$ cargo +nightly run -q -Zbuild-std --target=x86_64-unknown-linux-gnu
thread 'main' panicked at /home/glandium/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/slice/index.rs:228:9:
slice::get_unchecked requires that the index is within the slice
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread caused non-unwinding panic. aborting.

There we go, we hit that get_unchecked error. But what went bad in Firefox if the reduced testcase doesn't crash without -Zbuild-std? Well, Firefox is always built with optimizations on by default, even for debug builds.

$ RUSTFLAGS=-O cargo +nightly run -q
thread 'main' panicked at src/main.rs:10:20:
attempt to add with overflow
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Interestingly, though, changing the addition to

        let addr = dbg!(&gGkAtoms as *const _ as usize) + dbg!(kGkAtomsArrayOffset as usize);

doesn't "fix" it like it did with Firefox, but it shows:

[src/main.rs:10:20] &gGkAtoms as *const _ as usize = 94400145014784
[src/main.rs:10:59] kGkAtomsArrayOffset as usize = 61744
thread 'main' panicked at src/main.rs:10:20:
attempt to add with overflow
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

which is even funnier, because you can see that adding those two numbers is definitely not causing an overflow.

Let's take a look at what LLVM is doing with this code across optimization passes, with the following command (on the initial code without dbg!(), and with a #[inline(never)] on valid_static_atom_addr):

RUSTFLAGS="-C debuginfo=0 -O -Cllvm-args=-print-changed=quiet" cargo +nightly run -q 

Here is what's most relevant to us. First, what the valid_static_atom_addr function looks like after inlining as_ptr into it:

*** IR Dump After InlinerPass on (_ZN8testcase22valid_static_atom_addr17h778b64d644106c67E) ***
; Function Attrs: noinline nonlazybind uwtable
define internal fastcc noundef zeroext i1 @_ZN8testcase22valid_static_atom_addr17h778b64d644106c67E(i64 noundef %0) unnamed_addr #3 {
  %2 = call fastcc noundef align 4 dereferenceable(31212) ptr @_ZN8testcase12static_atoms17hde3e2dda1d3edc34E()
  call void @llvm.experimental.noalias.scope.decl(metadata !4)
  %3 = call fastcc noundef align 4 dereferenceable(12) ptr @"_ZN4core5slice29_$LT$impl$u20$$u5b$T$u5d$$GT$13get_unchecked17he5e8081ea9f9099dE"(ptr noalias noundef nonnull readonly align 4 %2, i64 noundef 2601, i64 noundef 2601)
  %4 = icmp eq ptr %2, null
  ret i1 %4 
}

At this point, we've already done some constant propagation, and we can see the call to get_unchecked is done with constants.

What comes next, after inlining both static_atoms and get_unchecked:

*** IR Dump After InlinerPass on (_ZN8testcase22valid_static_atom_addr17h778b64d644106c67E) ***
; Function Attrs: noinline nonlazybind uwtable
define internal fastcc noundef zeroext i1 @_ZN8testcase22valid_static_atom_addr17h778b64d644106c67E(i64 noundef %0) unnamed_addr #2 {
  %2 = call { i64, i1 } @llvm.uadd.with.overflow.i64(i64 ptrtoint (ptr @_ZN8testcase8gGkAtoms17h338a289876067f43E to i64), i64 61744)
  %3 = extractvalue { i64, i1 } %2, 1             
  br i1 %3, label %4, label %5, !prof !4

4:                                                ; preds = %1
  call void @_ZN4core9panicking5panic17hae453b53e597714dE(ptr noalias noundef nonnull readonly align 1 @str.0, i64 noundef 28, ptr noalias noundef nonnull readonly align 8 dereferenceable(24) @2) #9                      
  unreachable

5:                                                ; preds = %1
  %6 = extractvalue { i64, i1 } %2, 0
  %7 = inttoptr i64 %6 to ptr
  call void @llvm.experimental.noalias.scope.decl(metadata !5)
  unreachable

8:                                                ; No predecessors!
  %9 = icmp eq ptr %7, null
  ret i1 %9
}

The first basic block has two exits: 4 and 5, depending on how the add with overflow performed. Both of these basic blocks finish in... unreachable. The first one because it's the panic case for the overflow, and the second one because both values passed to get_unchecked are constants and equal, something the compiler has been hinted (via hint::assert_unchecked) cannot happen. Thus, once get_unchecked is inlined, what's left is unreachable code. And because we're not rebuilding libstd, the debug_assert is not there before the unreachable annotation. Finally, the last basic block is now orphaned.

Imagine you're an optimizer, and you want to optimize this code considering all its annotations. Well, you'll start by removing the orphan basic block. Then you see that basic block 5 doesn't do anything and doesn't have side effects, so you just remove it. Which means the branch leading to it can't be taken. Basic block 4? There's a function call, so it has to stay, and so does the first basic block.

Guess what the Control-Flow Graph pass did? Just that:

*** IR Dump After SimplifyCFGPass on _ZN8testcase22valid_static_atom_addr17h778b64d644106c67E ***
; Function Attrs: noinline nonlazybind uwtable
define internal fastcc noundef zeroext i1 @_ZN8testcase22valid_static_atom_addr17h778b64d644106c67E(i64 noundef %0) unnamed_addr #2 {
  %2 = call { i64, i1 } @llvm.uadd.with.overflow.i64(i64 ptrtoint (ptr @_ZN8testcase8gGkAtoms17h338a289876067f43E to i64), i64 61744)
  %3 = extractvalue { i64, i1 } %2, 1
  call void @llvm.assume(i1 %3)
  call void @_ZN4core9panicking5panic17hae453b53e597714dE(ptr noalias noundef nonnull readonly align 1 @str.0, i64 noundef 28, ptr noalias noundef nonnull readonly align 8 dereferenceable(24) @2) #9
  unreachable
}

Now, there's no point doing the addition at all, since we're not even looking at its result:

*** IR Dump After InstCombinePass on _ZN8testcase22valid_static_atom_addr17h778b64d644106c67E ***
; Function Attrs: noinline nonlazybind uwtable
define internal fastcc noundef zeroext i1 @_ZN8testcase22valid_static_atom_addr17h778b64d644106c67E(i64 noundef %0) unnamed_addr #2 {
  call void @llvm.assume(i1 icmp uge (i64 ptrtoint (ptr @_ZN8testcase8gGkAtoms17h338a289876067f43E to i64), i64 -61744))
  call void @_ZN4core9panicking5panic17hae453b53e597714dE(ptr noalias noundef nonnull readonly align 1 @str.0, i64 noundef 28, ptr noalias noundef nonnull readonly align 8 dereferenceable(24) @2) #9
  unreachable
}

And this is how a hint that undefined behavior can't happen transformed get_unchecked(STATIC_ATOM_COUNT) into an addition overflow that never happened.

Obviously, this all doesn't happen with -Zbuild-std, because in that case the get_unchecked branch has a panic call that is still relevant.

$ RUSTFLAGS=-O cargo +nightly run -q -Zbuild-std --target=x86_64-unknown-linux-gnu
thread 'main' panicked at /home/glandium/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/slice/index.rs:228:9:
slice::get_unchecked requires that the index is within the slice
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread caused non-unwinding panic. aborting.

What about non-debug builds?

$ cargo +nightly run --release -q
Illegal instruction

In those builds, because there is no call to display a panic, the entire function ends up unreachable:

define internal fastcc noundef zeroext i1 @_ZN8testcase22valid_static_atom_addr17h9d1fc9abb5e1cc3aE(i64 noundef %0) unnamed_addr #4 {
  unreachable
} 

So thanks to the magic of hints and compiler optimization, we have code that invokes undefined behavior that

  • crashes when built with cargo build --release
  • works when built with cargo build
  • says there's an addition overflow when built with RUSTFLAGS=-O cargo build.

And none of those give a hint as to what the real problem is.

2024-02-01 10:38:09+0900

p.m.o | No Comments »

November 22nd, 2023

How I (kind of) killed Mercurial at Mozilla

Did you hear the news? Firefox development is moving from Mercurial to Git. While the decision is far from being mine, and I was barely involved in the small incremental changes that ultimately led to this decision, I feel I have to take at least some responsibility. And if you are one of those who would rather use Mercurial than Git, you may direct all your ire at me.

But let's take a step back and review the past 25 years leading to this decision. You'll forgive me for skipping some details and any possible inaccuracies. This is already a long post; while I could have been more thorough, even I think that would have been too much. This is also not an official Mozilla position, only my personal perception and recollection as someone who was involved at times, but mostly an observer from a distance.

From CVS to DVCS

From its release in 1998, the Mozilla source code was kept in a CVS repository. If you're too young to know what CVS is, let's just say it's an old school version control system, with its set of problems. Back then, it was mostly ubiquitous in the Open Source world, as far as I remember.

In the early 2000s, the Subversion version control system gained some traction, solving some of the problems that came with CVS. Incidentally, Subversion was created by Jim Blandy, who now works at Mozilla on completely unrelated matters. In the same period, the Linux kernel development moved from CVS to BitKeeper, which was more suitable to the distributed nature of the Linux community. BitKeeper had its own problem, though: it was the opposite of Open Source, but for most pragmatic people, it wasn't a real concern because free access was provided. Until it became a problem: someone at OSDL developed an alternative client to BitKeeper, and licenses of BitKeeper were rescinded for OSDL members, including Linus Torvalds (they were even prohibited from purchasing one).

Following this fiasco, in April 2005, two weeks from each other, both Git and Mercurial were born. The former was created by Linus Torvalds himself, while the latter was developed by Olivia Mackall, who was a Linux kernel developer back then. And because they both came out of the same community for the same needs, and the same shared experience with BitKeeper, they both were similar distributed version control systems.

Interestingly enough, several other DVCSes existed:

  • SVK, a DVCS built on top of Subversion, allowing users to create local (offline) branches of remote Subversion repositories. It was also known for its powerful merging capabilities. I picked it at some point for my Debian work, mainly because I needed to interact with Subversion repositories.
  • Arch (tla), later known as GNU arch. From what I remember, it was awful to use. You think Git is complex or confusing? Arch was way worse. It was forked as "Bazaar", but the fork was abandoned in favor of "Bazaar-NG", now known as "Bazaar" or "bzr", a much more user-friendly DVCS. The first release of Bzr actually precedes Git's by two weeks. I guess it was too new to be considered by Linus Torvalds for the Linux kernel needs.
  • Monotone, which I don't know much about, but it was mentioned by Linus Torvalds two days before Git's initial commit. As far as I know, it was too slow for the Linux kernel's needs. I'll note in passing that Monotone is the creation of Graydon Hoare, who also created Rust.
  • Darcs, with its patch-based model, rather than the common snapshot-based model, allowed more flexible management of changes. This approach came, however, at the expense of performance.

In this landscape, the major difference Git was making at the time was that it was blazing fast. Almost incredibly so, at least on Linux systems. That was less true on other platforms (especially Windows). It was a game-changer for handling large codebases in a smooth manner.

Anyways, two years later, in 2007, Mozilla decided to move its source code not to Bzr, not to Git, not to Subversion (which, yes, was a contender), but to Mercurial. The decision "process" was laid down in two rather colorful blog posts. My memory is a bit fuzzy, but I don't recall that it was a particularly controversial choice. All of those DVCSes were still young, and there was no definite "winner" yet (GitHub hadn't even been founded). It made the most sense for Mozilla back then, mainly because the Git experience on Windows still wasn't there, and that mattered a lot for Mozilla, with its diverse platform support. As a contributor, I didn't think much of it, although to be fair, at the time, I was mostly consuming the source tarballs.

Personal preferences

Digging through my archives, I've unearthed a forgotten chapter: I did end up setting up both a Mercurial and a Git mirror of the Firefox source repository on alioth.debian.org. Alioth.debian.org was a FusionForge-based collaboration system for Debian developers, similar to SourceForge. It was the ancestor of salsa.debian.org. I used those mirrors for the Debian packaging of Firefox (cough cough Iceweasel). The Git mirror was created with hg-fast-export, and the Mercurial mirror was only a necessary step in the process. By that time, I had converted my Subversion repositories to Git, and switched off SVK. Incidentally, I started contributing to Git around that time as well.

I apparently did this not too long after Mozilla switched to Mercurial. As a Linux user, I think I just wanted the speed that Mercurial was not providing. Not that Mercurial was that slow, but the difference between a couple seconds and a couple hundred milliseconds was significant enough in terms of user experience for me to prefer Git (and Firefox was not the only thing I was using version control for).

Other people had similarly created their own mirrors, with other tools. But none of them were "compatible" with one another: their commit hashes were different. Hg-git, which some of them used, was putting extra information in commit messages that would make the conversion differ, and hg-fast-export would just not be consistent with itself! My mirror is long gone, and those others have not been updated in more than a decade.

I did end up using Mercurial, when I got commit access to the Firefox source repository in April 2010. I still kept using Git for my Debian activities, but I was now also using Mercurial to push to the Mozilla servers. I joined Mozilla as a contractor a few months after that, and kept using Mercurial for a while, but as a, by then, long-time Git user, it never really clicked for me. It turns out the sentiment was shared by several at Mozilla.

Git incursion

In the early 2010s, GitHub was becoming ubiquitous, and the Git mindshare was getting large. Multiple projects at Mozilla were already entirely hosted on GitHub. As for the Firefox source code base, Mozilla back then was kind of a Wild West, and engineers being engineers, multiple people had been using Git, with their own inconvenient workflows involving a local Mercurial clone. The most popular set of scripts was moz-git-tools, used to incorporate changes from a local Git repository into the local Mercurial copy, which would then be pushed to the Mozilla servers. I don't think a lot of people were doing that, though, probably a few handfuls. On my end, I was still keeping up with Mercurial.

I think at that time several engineers had their own unofficial Git mirrors on GitHub, and later on Ehsan Akhgari provided another mirror, with a twist: it also contained the full CVS history, which the canonical Mercurial repository didn't have. This was particularly interesting for engineers who needed to do some code archeology and couldn't get past the 2007 cutoff of the Mercurial repository. I think that mirror ultimately became the official-looking, but really unofficial, mozilla-central repository on GitHub. On a side note, a Mercurial repository containing the CVS history was also later set up, but that didn't lead to something officially supported on the Mercurial side.

Some time around 2011~2012, I started to more seriously consider using Git for work myself, but wasn't satisfied with the workflows others had set up for themselves. I really didn't like the idea of wasting extra disk space keeping a Mercurial clone around while using a Git mirror. I wrote a Python script that would use Mercurial as a library to access a remote repository and produce a git-fast-import stream. That allowed the creation of a git repository without a local Mercurial clone. It worked quite well, but it was not able to update incrementally. Other, more complete tools existed already, some of which I mentioned above. But as time passed and the size and depth of the Mercurial repository grew, these tools were showing their limits and were too slow for my taste, especially for the initial clone.

Boot to Git

In the same time frame, Mozilla ventured into the Mobile OS sphere with Boot to Gecko, later known as Firefox OS. What does that have to do with version control? The needs of third party collaborators in the mobile space led to the creation of what is now the gecko-dev repository on GitHub. As I remember it, it was challenging to create, but once it was there, Git users could just clone it and have a working, up-to-date local copy of the Firefox source code and its history... which they could already have, but this was the first officially supported way of doing so. Coincidentally, Ehsan's unofficial mirror was having trouble (to the point of GitHub closing the repository) and was ultimately shut down in December 2013.

You'll often find comments on the interwebs about how GitHub has become unreliable since the Microsoft acquisition. I can't really comment on that, but if you think GitHub is unreliable now, rest assured that it was worse in its beginning. And its sustainability as a platform also wasn't a given, being a rather new player. So on top of having this official mirror on GitHub, Mozilla also ventured into setting up its own Git server for greater control and reliability.

But the canonical repository was still the Mercurial one, and while Git users now had a supported mirror to pull from, they still had to somehow interact with Mercurial repositories, most notably for the Try server.

Git slowly creeping in Firefox build tooling

Still in the same time frame, tooling around building Firefox was improving drastically. For obvious reasons, when version control integration was needed in the tooling, Mercurial support was always a no-brainer.

The first explicit acknowledgement of a Git repository for the Firefox source code, other than the addition of the .gitignore file, was bug 774109. It added a script to install the prerequisites to build Firefox on macOS (still called OSX back then), and that would print a message inviting people to obtain a copy of the source code with either Mercurial or Git. That was a precursor to the current bootstrap.py, from September 2012.

Following that, as far as I can tell, the first real incursion of Git in the Firefox source tree tooling happened in bug 965120. A few days earlier, bug 952379 had added a mach clang-format command that would apply clang-format-diff to the output from hg diff. Obviously, running hg diff on a Git working tree didn't work, and bug 965120 was filed, and support for Git was added there. That was in January 2014.

A year later, when the initial implementation of mach artifact was added (which ultimately led to artifact builds), Git users were an immediate thought. But while they were considered, it was not to support them, but to avoid actively breaking their workflows. Git support for mach artifact was eventually added 14 months later, in March 2016.

From gecko-dev to git-cinnabar

Let's step back a little here, back to the end of 2014. My user experience with Mercurial had reached a level of dissatisfaction that was enough for me to decide to take that script from a couple years prior and make it work for incremental updates. That meant finding a way to store enough information locally to be able to reconstruct whatever the incremental updates would be relying on (guess why other tools hid a local Mercurial clone under the hood). I got something working rather quickly, and after talking to a few people about this side project at the Mozilla Portland All Hands and seeing their excitement, I published an initial git-remote-hg prototype on the last day of the All Hands.

Within weeks, the prototype gained the ability to directly push to Mercurial repositories, and a couple months later, was renamed to git-cinnabar. At that point, as a Git user, instead of cloning the gecko-dev repository from GitHub and switching to a local Mercurial repository whenever you needed to push to a Mercurial repository (i.e. the aforementioned Try server, or, at the time, for reviews), you could just clone and push directly from/to Mercurial, all within Git. And it was fast too. You could get a full clone of mozilla-central in less than half an hour, when at the time, other similar tools would take more than 10 hours (needless to say, it's even worse now).

Another couple months later (we're now at the end of April 2015), git-cinnabar became able to start off a local clone of the gecko-dev repository, rather than clone from scratch, which could be time consuming. But because git-cinnabar and the tool that was updating gecko-dev weren't producing the same commits, this setup was cumbersome and not really recommended. For instance, if you pushed something to mozilla-central with git-cinnabar from a gecko-dev clone, it would come back with a different commit hash in gecko-dev, and you'd have to deal with the divergence.

Eventually, in April 2020, the scripts updating gecko-dev were switched to git-cinnabar, making the use of gecko-dev alongside git-cinnabar a more viable option. Ironically(?), the switch occurred to ease collaboration with KaiOS (you know, the mobile OS born from the ashes of Firefox OS). Well, okay, in all honesty, when the need of syncing in both directions between Git and Mercurial (we only had ever synced from Mercurial to Git) came up, I nudged Mozilla in the direction of git-cinnabar, which, in my (biased but still honest) opinion, was the more reliable option for two-way synchronization (we did have regular conversion problems with hg-git, nothing of the sort has happened since the switch).

One Firefox repository to rule them all

For reasons I don't know, Mozilla decided to use separate Mercurial repositories as "branches". With the switch to the rapid release process in 2011, that meant one repository for nightly (mozilla-central), one for aurora, one for beta, and one for release. And with the addition of Extended Support Releases in 2012, we now add a new ESR repository every year. Boot to Gecko also had its own branches, and so did Fennec (Firefox for Mobile, before Android). There are a lot of them.

And then there are also integration branches, where developers' work lands before being merged in mozilla-central (or backed out if it breaks things), always leaving mozilla-central in a (hopefully) good state. Only one of them remains in use today, though.

I can only suppose that the way Mercurial branches work was not deemed practical. It is worth noting, though, that Mercurial branches are used in some cases, to branch off a dot-release when the next major release process has already started, so it's not a matter of not knowing the feature exists or some such.

In 2016, Gregory Szorc set up a new repository that would contain them all (or at least most of them), which eventually became what is now the mozilla-unified repository. This would e.g. simplify switching between branches when necessary.

Seven years later, for some reason, the other "branches" still exist, but most developers are expected to be using mozilla-unified. Mozilla's CI also switched to using mozilla-unified as its base repository.

Honestly, I'm not sure why the separate repositories are still the main entry point for pushes, rather than going directly to mozilla-unified, but it probably comes down to switching being work, and not being a top priority. Also, it probably doesn't help that working with multiple heads in Mercurial, even (especially?) with bookmarks, can be a source of confusion. To give an example, if you aren't careful, and do a plain clone of the mozilla-unified repository, you may not end up on the latest mozilla-central changeset, but rather, e.g. one from beta, or some other branch, depending on which one was last updated.

Hosting is simple, right?

Put your repository on a server, install hgweb or gitweb, and that's it? Maybe that works for... Mercurial itself, but that repository "only" has slightly over 50k changesets and less than 4k files. Mozilla-central has more than an order of magnitude more changesets (close to 700k) and two orders of magnitude more files (more than 700k if you count the deleted or moved files, 350k if you count the currently existing ones).

And remember, there are a lot of "duplicates" of this repository. And I didn't even mention user repositories and project branches.

Sure, it's a self-inflicted pain, and you'd think it could probably(?) be mitigated with shared repositories. But consider the simple case of two repositories: mozilla-central and autoland. You make autoland use mozilla-central as a shared repository. Now, you push something new to autoland, it's stored in the autoland datastore. Eventually, you merge to mozilla-central. Congratulations, it's now in both datastores, and you'd need to clean up autoland if you wanted to avoid the duplication.

Now, you'd think mozilla-unified would solve these issues, and it would... to some extent. Because that wouldn't cover user repositories and project branches briefly mentioned above, which in GitHub parlance would be considered as Forks. So you'd want a mega global datastore shared by all repositories, and repositories would need to only expose what they really contain. Does Mercurial support that? I don't think so (okay, I'll give you that: even if it doesn't, it could, but that's extra work). And since we're talking about a transition to Git, does Git support that? You may have read about how you can link to a commit from a fork and make-pretend that it comes from the main repository on GitHub? At least, it shows a warning, now. That's essentially the architectural reason why. So the actual answer is that Git doesn't support it out of the box, but GitHub has some backend magic to handle it somehow (and hopefully, other things like Gitea, Girocco, Gitlab, etc. have something similar).
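
For what it's worth, plain Git does ship a building block for sharing object storage between co-located repositories: alternates. A minimal sketch with hypothetical paths (note this only addresses local duplication, not the fork-visibility problem described above):

$ git clone --bare https://example.com/central central.git
$ git clone --shared central.git autoland
$ cat autoland/.git/objects/info/alternates
/path/to/central.git/objects

The second clone borrows all its objects from the first one's object store instead of copying them.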

Now, to come back to the size of the repository. A repository is not a static file. It's a server with which you negotiate what you have against what it has that you want. Then the server bundles what you asked for based on what you said you have. Or in the opposite direction, you negotiate what you have that it doesn't, you send it, and the server incorporates what you sent it. Fortunately the latter is less frequent and requires authentication. But the former is more frequent and CPU intensive. Especially when pulling a large number of changesets, which, incidentally, is what cloning does.

"But there is a solution for clones" you might say, which is true. That's clonebundles, which offload the CPU intensive part of cloning to a single job scheduled regularly. Guess who implemented it? Mozilla. But that only covers the cloning part. We actually had laid the ground to support offloading large incremental updates and split clones, but that never materialized. Even with all that, that still leaves you with a server that can display file contents, diffs, blames, provide zip archives of a revision, and more, all of which are CPU intensive in their own way.

And these endpoints are regularly abused, and cause extra load to your servers, yes plural, because of course a single server won't handle the load for the number of users of your big repositories. And because your endpoints are abused, you have to close some of them. And I'm not mentioning the Try repository with its tens of thousands of heads, which brings its own sets of problems (and it would have even more heads if we didn't fake-merge them once in a while).

Of course, all the above applies to Git (and it only gained support for something akin to clonebundles last year). So, when the Firefox OS project was stopped, there wasn't much motivation to continue supporting our own Git server, Mercurial still being the official point of entry, and git.mozilla.org was shut down in 2016.

The growing difficulty of maintaining the status quo

Slowly, but steadily in more recent years, as new tooling was added that needed some input from the source code manager, support for Git was more and more consistently added. But at the same time, as people left for other endeavors and weren't necessarily replaced, or more recently with layoffs, resources allocated to such tooling have been spread thin.

Meanwhile, the repository growth didn't take a break, and the Try repository was becoming an increasing pain, with push times quite often exceeding 10 minutes. The ongoing work to move Try pushes to Lando will hide the problem under the rug, but the underlying problem will still exist (although the last version of Mercurial seems to have improved things).

On the flip side, more and more people have been relying on Git for Firefox development, to my own surprise, as I didn't really push for that to happen. It just happened organically, by ways of git-cinnabar existing, providing a compelling experience to those who prefer Git, and, I guess, word of mouth. I was genuinely surprised when I recently heard the use of Git among moz-phab users had surpassed a third. I did, however, occasionally orient people who struggled with Mercurial and said they were more familiar with Git, towards git-cinnabar. I suspect there's a somewhat large number of people who never realized Git was a viable option.

But that, on its own, can come with its own challenges: if you use git-cinnabar without being backed by gecko-dev, you'll have a hard time sharing your branches on GitHub, because you can't push to a fork of gecko-dev without pushing your entire local repository, as they have different commit histories. And switching to gecko-dev when you weren't already using it requires some extra work to rebase all your local branches from the old commit history to the new one.
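
For what it's worth, that rebasing work is standard Git surgery; here's a sketch for transplanting one branch, assuming you've identified the commit your branch was based on in each of the two histories (the placeholders are to be filled in):

$ git rebase --onto <base-in-new-history> <base-in-old-history> my-branch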

Clone times with git-cinnabar have also started to go a little out of hand in the past few years, but this was mitigated in a similar manner as with the Mercurial cloning problem: with static files that are refreshed regularly. Ironically, that made cloning with git-cinnabar faster than cloning with Mercurial. But generating those static files is increasingly time-consuming. As of writing, generating those for mozilla-unified takes close to 7 hours. I was predicting clone times over 10 hours "in 5 years" in a post from 4 years ago; I wasn't too far off. With exponential growth, it could still happen, although to be fair, CPUs have improved since. I will explore the performance aspect in a subsequent blog post, alongside the upcoming release of git-cinnabar 0.7.0-b1. I don't even want to check how long it now takes with hg-git or git-remote-hg (they were already taking more than a day when git-cinnabar was taking a couple hours).

I suppose it's about time that I clarify that git-cinnabar has always been a side-project. It hasn't been part of my duties at Mozilla, and the extent to which Mozilla supports git-cinnabar is in the form of taskcluster workers on the community instance for both git-cinnabar CI and generating those clone bundles. Consequently, that makes the above git-cinnabar specific issues a Me problem, rather than a Mozilla problem.

Taking the leap

I can't talk for the people who made the proposal to move to Git, nor for the people who put a green light on it. But I can at least give my perspective.

Developers have regularly asked why Mozilla was still using Mercurial, but I think it was the first time that a formal proposal was laid out. And it came from the Engineering Workflow team, responsible for issue tracking, code reviews, source control, build and more.

It's easy to say "Mozilla should have chosen Git in the first place", but back in 2007, GitHub wasn't there, Bitbucket wasn't there, and all the available options were rather new (especially compared to the then 21-year-old CVS). I think Mozilla made the right choice, all things considered. Had they waited a couple years, the story might have been different.

You might say that Mozilla stayed with Mercurial for so long because of the sunk cost fallacy. I don't think that's true either. But after the biggest Mercurial repository hosting service turned off Mercurial support, and the main contributor to Mercurial went their own way, it's hard to ignore that the landscape has evolved.

And the problems that we regularly encounter with the Mercurial servers are not going to get any better as the repository continues to grow. As far as I know, all the Mercurial repositories bigger than Mozilla's are... not using Mercurial. Google has its own closed-source server, and Facebook has its own as well, which isn't really public either. With resources spread thin, I don't expect Mozilla to be able to continue supporting a Mercurial server indefinitely (although I guess Octobus could be contracted to give a hand, but is that sustainable?).

Mozilla, being a champion of Open Source, also doesn't live in a silo. At some point, you have to meet your contributors where they are. And the Open Source world now predominantly uses Git. I'm sure the vast majority of new hires at Mozilla in the past, say, 5 years, know Git and have had to learn Mercurial (although they arguably didn't need to). Even within Mozilla, with thousands(!) of repositories on GitHub, Firefox is now actually the exception rather than the norm. I should even actually say Desktop Firefox, because even Mobile Firefox lives on GitHub (although Fenix is moving back in together with Desktop Firefox, and the timing is such that that will probably happen before Firefox moves to Git).

Heck, even Microsoft moved to Git!

With a significant developer base already using Git thanks to git-cinnabar, and all the constraints and problems I mentioned previously, it actually seems natural that a transition (finally) happens. However, had git-cinnabar or something similarly viable not existed, I don't think Mozilla would be in a position to take this decision. On one hand, it probably wouldn't be in the current situation of having to support both Git and Mercurial in the tooling around Firefox, nor the resource constraints related to that. But on the other hand, it would be farther from supporting Git and being able to make the switch in order to address all the other problems.

But... GitHub?

I hope I made a compelling case that hosting is not as simple as it can seem, at the scale of the Firefox repository. It's also not Mozilla's main focus. Mozilla has enough on its plate with the migration of existing infrastructure that does rely on Mercurial to understandably not want to figure out the hosting part, especially with limited resources, and with the mixed experience hosting both Mercurial and Git has been so far.

After all, GitHub couldn't even display things like the contributors' graph on gecko-dev until recently, and hosting is literally their job! They still drop the ball on large blames (thankfully we have searchfox for those).

Where does that leave us? Gitlab? For those criticizing GitHub for being proprietary, that's probably not open enough. Cloud Source Repositories? "But GitHub is Microsoft" is a complaint I've read a lot after the announcement. Do you think Google hosting would have appealed to these people? Bitbucket? I'm kind of surprised it wasn't in the list of providers that were considered, but I'm also kind of glad it wasn't (and I'll leave it at that).

I think the only relatively big hosting provider that could have made the people criticizing the choice of GitHub happy is Codeberg, but I hadn't even heard of it before it was mentioned in response to Mozilla's announcement. But really, with literal thousands of Mozilla repositories already on GitHub, and literal tens of millions of repositories on the platform overall, the pragmatic in me can't deny that it's an attractive option (and I can't stress enough that I wasn't remotely close to the room where the discussion about what choice to make happened).

"But it's a slippery slope". I can see that being a real concern. LLVM also moved its repository to GitHub (from a (I think) self-hosted Subversion server), and ended up moving off Bugzilla and Phabricator to GitHub issues and PRs four years later. As an occasional contributor to LLVM, I hate this move. I hate the GitHub review UI with a passion.

At least, right now, GitHub PRs are not a viable option for Mozilla, for their lack of support for security-related PRs, and the more general shortcomings in the review UI. That doesn't mean things won't change in the future, but let's not get too far ahead of ourselves. The move to Git has just been announced, and the migration has not even begun yet. Just because Mozilla is moving the Firefox repository to GitHub doesn't mean it's locked in forever or that all the eggs are going to be thrown into one basket. If bridges need to be crossed in the future, we'll see then.

So, what's next?

The official announcement said we're not expecting the migration to really begin until six months from now. I'll swim against the current here, and say this: the earlier you can switch to git, the earlier you'll find out what works and what doesn't work for you, whether you already know Git or not.

While there is not one unique workflow, here's what I would recommend anyone who wants to take the leap off Mercurial right now:

  • Make sure git is installed. Chances are you already have it.

  • Install git-cinnabar where mach bootstrap would install it.

    $ mkdir -p ~/.mozbuild/git-cinnabar
    $ cd ~/.mozbuild/git-cinnabar
    $ curl -sOL https://raw.githubusercontent.com/glandium/git-cinnabar/master/download.py
    $ python3 download.py && rm download.py
  • Add git-cinnabar to your PATH. Make sure to also set that wherever you keep your PATH up-to-date (.bashrc or wherever else).

    $ PATH=$PATH:$HOME/.mozbuild/git-cinnabar
  • Enter your mozilla-central or mozilla-unified Mercurial working copy. We'll do an in-place conversion, so that you don't need to move your mozconfigs, objdirs and what not.

  • Initialize the git repository from GitHub.

    $ git init
    $ git remote add origin https://github.com/mozilla/gecko-dev
    $ git remote update origin
  • Switch to a Mercurial remote.

    $ git remote set-url origin hg::https://hg.mozilla.org/mozilla-unified
    $ git config --local remote.origin.cinnabar-refs bookmarks
    $ git remote update origin --prune
  • Fetch your local Mercurial heads.

    $ git -c cinnabar.refs=heads fetch hg::$PWD refs/heads/default/*:refs/heads/hg/*

    This will create a bunch of hg/<sha1> local branches, not all relevant to you (some come from old branches on mozilla-central). Note that if you're using Mercurial MQ, this will not pull your queues, as they don't exist as heads in the Mercurial repo. You'd need to apply your queues one by one and run the command above for each of them.
    Or, if you have bookmarks for your local Mercurial work, you can use this instead:

    $ git -c cinnabar.refs=bookmarks fetch hg::$PWD refs/heads/*:refs/heads/hg/*

    This will create hg/<bookmark_name> branches.

  • Now, make git know what commit your working tree is on.

    $ git reset $(git cinnabar hg2git $(hg log -r . -T '{node}'))

    This will take a little moment because Git is going to scan all the files in the tree for the first time. On the other hand, it won't touch their content or timestamps, so if you had a build around, it will still be valid, and mach build won't rebuild anything it doesn't have to.

As there is no one-size-fits-all workflow, I won't tell you how to organize yourself from there. I'll just say this: if you know the Mercurial sha1s of your previous local work, you can create branches for them with:

$ git branch <branch_name> $(git cinnabar hg2git <hg_sha1>)

At this point, you should have everything available on the Git side, and you can remove the .hg directory. Or move it into some empty directory somewhere else, just in case. But don't leave it here, it will only confuse the tooling. Artifact builds WILL be confused, though, and you'll have to ./mach configure before being able to do anything. You may also hit bug 1865299 if your working tree is older than this post.

If you have any problem or question, you can ping me on #git-cinnabar or #git on Matrix. I'll put the instructions above somewhere on wiki.mozilla.org, and we can collaboratively iterate on them.

Now, what the announcement didn't say is that the Git repository WILL NOT be gecko-dev, doesn't exist yet, and WON'T BE COMPATIBLE (trust me, it'll be for the better). Why did I make you do all the above, you ask? Because that won't be a problem. I'll have you covered, I promise. The upcoming release of git-cinnabar 0.7.0-b1 will have a way to smoothly switch between gecko-dev and the future repository (incidentally, that will also allow to switch from a pure git-cinnabar clone to a gecko-dev one, for the git-cinnabar users who have kept reading this far).

What about git-cinnabar?

With Mercurial going the way of the dodo at Mozilla, my own need for git-cinnabar will vanish. Legitimately, this raises the question of whether it will still be maintained.

I can't answer for sure. I don't have a crystal ball. However, the needs of the transition itself will motivate me to finish some long-standing things (like finalizing the support for pushing merges, which is currently behind an experimental flag) or implement some missing features (support for creating Mercurial branches).

Git-cinnabar started as a Python script, it grew a sidekick implemented in C, which then incorporated some Rust, which then cannibalized the Python script and took its place. It is now close to 90% Rust, and 10% C (if you don't count the code from Git that is statically linked to it), and has sort of become my Rust playground (it's also, I must admit, a mess, because of its history, but it's getting better). So day-to-day use with Mercurial is not my sole motivation to keep developing it. If it were, it would stay stagnant, because all the features I need are there, and the speed is not all that bad, although I know it could be better. Arguably, though, that is exactly why git-cinnabar has been relatively stagnant feature-wise.

So, no, I don't expect git-cinnabar to die along with Mercurial use at Mozilla, but I can't really promise anything either.

Final words

That was a long post. But there was a lot of ground to cover. And I still skipped over a bunch of things. I hope I didn't bore you to death. If I did and you're still reading... what's wrong with you? ;)

So this is the end of Mercurial at Mozilla. So long, and thanks for all the fish. But this is also the beginning of a transition that is not easy, and that will not be without hiccups, I'm sure. So fasten your seatbelts (plural), and welcome the change.

To circle back to the clickbait title, did I really kill Mercurial at Mozilla? Of course not. But it's like I stumbled upon a few sparks and tossed a can of gasoline on them. I didn't start the fire, but I sure made it into a proper bonfire... and now it has turned into a wildfire.

And who knows? 15 years from now, someone else might be looking back at how Mozilla picked Git at the wrong time, and that, had we waited a little longer, we would have picked some yet to come new horse. But hey, that's the tech cycle for you.

2023-11-22 04:49:47+0900

cinnabar, p.m.o | 8 Comments »

August 30th, 2023

Hacking the ELF format for Firefox, 12 years later; doing better with less

(I haven't posted a lot in the past couple years, except for git-cinnabar announcements. This is going to be a long one, hold tight)

This is quite the cryptic title, isn't it? What is this all about? ELF (Executable and Linkable Format) is a file format used for binary files (e.g. executables, shared libraries, object files, and even core dumps) on some Unix systems (Linux, Solaris, BSD, etc.). A little over 12 years ago, I wrote a blog post about improving libxul startup I/O by hacking the ELF format. For context, libxul is the shared library, shipped with Firefox, that contains most of its code.

Let me spare you the read. Back then I was looking at I/O patterns during Firefox startup on Linux, and sought ways to reduce disk seeks that were related to loading libxul. One particular pattern was caused by relocations, and the way we alleviated it was through elfhack.

Relocations are necessary in order for executables to work when they are loaded in memory at a location that is not always the same (because of e.g. ASLR). Applying them requires reading the section containing the relocations, and adjusting the pieces of code or data that are described by the relocations. When the relocation section is very large (and that was the case on libxul back then, and more so now), that means going back and forth (via disk seeks) between the relocation section and the pieces to adjust.

Elfhack to the rescue

Shortly after the aforementioned blog post, the elfhack tool was born and made its way into the Firefox code base.

The main idea behind elfhack was to reduce the size of the relocation section. How? By storing it in a more compact form. But how? By taking the executable apart, rewriting its relocation section, injecting code to apply those relocations, moving sections around, and adjusting the ELF program header table, section header table, and string table accordingly. I will spare you the gory details (especially the part about splitting segments or the hack to use the .bss section as a temporary Global Offset Table). Elfhack itself is essentially a minimalist linker that works on already linked executables. That has caused us a number of issues over the years (and much more). In fact, it's known not to work on binaries created with lld (the linker from the LLVM project) because the way lld lays things out does not provide space for the tricks we pull (although it seems to be working with the latest version of lld. But who knows what will happen with the next version).

Hindsight is 20/20, and if I were to redo it, I'd take a different route. Wait, I'm actually kind of doing that! But let me first fill you in on what happened in the past 12 years.

Android packed relocations

In 2014, Chrome started using a similar-ish approach for Android on ARM with an even more compact format, compared to the crude packing elfhack was doing. Instead of injecting initialization code in the executable, it would use a custom dynamic loader/linker to handle the packed relocations (that loader/linker was forked from the one in the Android NDK, which solved similar problems to what our own custom linker had, but that's another story).

That approach eventually made its way into Android itself, in 2015, with support from the dynamic loader in bionic (the Android libc), and later support for emitting those packed relocations was added to lld in October 2017. Interestingly, the packer added to lld created smaller packed relocations than the packer in Android (for the same format).

The road to standardization

Shortly after bionic got its native packed relocation support, a conversation started on the gnu-gabi mailing list related to the general problem of relocations representing a large portion of Position Independent Executables. What we observed on a shared library had started to creep into programs as well, because PIE binaries started to become prominent around that time, with some compilers and linkers starting to default to them for hardening reasons. Both Chrome's and Firefox's prior art were mentioned. This was April 2017.

A few months went by, and a simpler format was put forward, with great results, which led to, a few days later, a formal proposal for RELR relocations in the Generic System V Application Binary Interface.

More widespread availability

Shortly after the proposal, Android got experimental support for it, and a few months later, in July 2018, lld gained experimental support as well.

The Linux kernel got support for it too, for KASLR relocations, but for arm64 only (I suppose this was for Android kernels. It still is the only architecture it has support for to this day).

GNU binutils gained support for the proposal (via a -z pack-relative-relocs flag) at the end of 2021, and glibc eventually caught up in 2022, and this shipped respectively in binutils 2.38 and glibc 2.36. These versions should now have reached most latest releases of major Linux distros.

Lld thereafter got support for the same flag as binutils's, with the same side effect of adding a version dependency on GLIBC_ABI_DT_RELR, to avoid crashes when running executables with packed relocations against an older glibc.

What about Firefox?

Elfhack was updated to use the format from the proposal at the very end of 2021 (or rather, close enough to that format). More recently (as in, two months ago), support for the -z pack-relative-relocs flag was added, so that when building Firefox against a recent enough glibc and with a recent enough linker, it will use that instead of elfhack automatically. This means in some cases, Firefox packages in Linux distros will be using those relocations (for instance, that's the case since Firefox 116 in Debian unstable).
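
As an aside, you can check whether a given binary ended up with packed relocations by looking at its dynamic section with readelf (the libxul.so path here is whatever your build produced):

$ readelf -d libxul.so | grep RELR

A binary using the new format will show RELR and RELRSZ (and typically RELRENT) entries, and when it was linked with -z pack-relative-relocs, readelf -V will additionally show the GLIBC_ABI_DT_RELR version dependency.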

Which (finally) brings us to the next step, and the meat of this post.

Retiring Elfhack

It's actually still too early for that. The Firefox binaries Mozilla provides need to run on a broad variety of systems, including many that don't support those new packed relocations. That includes Android systems older than Red Velvet Cake (11), and not necessarily very old desktop systems.

Android Pie (9) shipped with experimental, but incompatible, support for the same packed relocation format, but using different constants. Hacking the PT_DYNAMIC segment (the segment containing metadata for dynamic linking) for compatibility with all Android versions >= 9 would technically be possible, but again, Mozilla needs to support even older versions of Android.

Hence the idea behind what I've now called relrhack: injecting code that can apply the packed relocations created by the linker, if the system dynamic loader hasn't.

To some extent, that sounds similar to what elfhack does, doesn't it? But elfhack packs the relocations itself. And because its input is a fully linked binary, it has to do complex things that we know don't always work reliably.

In the past few years, an idea was floating in the back of my mind to change elfhack to start off a relocatable binary (also known as partially linked). It would then rewrite the sections it needs to, and invoke the linker to link that to its initialization code and produce the final binary. That would theoretically avoid all the kinds of problems we've hit, and work more reliably with lld.

The idea I've toyed with more recently, though, is even simpler: Use the -z pack-relative-relocs linker support, and add the initialization code on the linker command line so that it does everything in one go. We're at this sweet spot in time where we can actually start doing this.

Testing the idea

My first attempts were with a small executable, linking with lld's older --pack-dyn-relocs=relr flag, which does the same as -z pack-relative-relocs but skips adding the GLIBC_ABI_DT_RELR version dependency. That allowed me to avoid post-processing the binary in this first experimentation step.

I quickly got something working on a Debian Bullseye system (using an older glibc that doesn't support the packed relocations). Here's how it goes:

// Compile with: clang -fuse-ld=lld -Wl,--pack-dyn-relocs=relr,--entry=my_start,-z,norelro -o relr-test relr-test.c
#include <stdio.h>

char *helloworld[] = {"Hello, world"};

int main(void) {
  printf("%s\n", helloworld[0]);
  return 0;
}

This is a minimal Hello world program that contains a relative relocation: the helloworld variable is an array of pointers, and those pointers need to be relocated. Optimizations would get rid of the array, so we specifically don't enable them. We also disable "Relocation Read-Only" (relro), a protection that makes the dynamic loader mark relocated sections read-only once it's done applying relocations, which would prevent us from applying the missing relocations ourselves. We're just testing; we'll deal with that later.

Compiling just this without --entry=my_start (because we haven't defined that function yet), and running it, yields a segmentation fault. We don't even reach main, because an initialization function runs before it, and its location, stored in the .init_array section, is behind a relative relocation, which --pack-dyn-relocs=relr packed. This is exactly why -z pack-relative-relocs adds a dependency on a symbol version that doesn't exist in older glibcs. With that flag, the error becomes:

/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_ABI_DT_RELR' not found

which is more user-friendly than a plain crash.
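
As an aside, you can verify that the relocations really did get packed with a recent enough readelf (older binutils don't know the SHT_RELR section type, as we'll see later with gdb). The offset and entry count here are illustrative:

$ readelf -r relr-test | grep relr
Relocation section '.relr.dyn' at offset 0x510 contains 3 entries: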

At this point, what do we want? Well, we want to apply the relocations ourselves, as early as possible. The first thing that runs in an executable is its "entry point", which defaults to _start (provided by the C runtime, aka CRT). As hinted in the code snippet above, we can set our own with --entry.

static void real_init();
extern void _start();

void my_start() {
  real_init();
  _start();
}

Here's our own entry point. It starts by calling the "real" initialization function we forward-declared above, then hands over to the normal C runtime entry point. Let's temporarily add the following to see if that actually works:

void real_init() {
  printf("Early Hello world\n");
}

Running the program now yields:

$ ./relr-test
Early Hello world
Segmentation fault

There we go: we've executed code before anything relies on the relative relocations being applied. By the way, adding function calls like this printf, that early, with elfhack, was an interesting challenge. This is pleasantly much simpler.

Applying the relocations for real

Let's replace that real_init function with some boilerplate for the upcoming real real_init:

#include <link.h>

#ifndef DT_RELRSZ
#define DT_RELRSZ 35
#endif
#ifndef DT_RELR
#define DT_RELR 36
#endif

extern ElfW(Dyn) _DYNAMIC[];
extern ElfW(Ehdr) __executable_start;

The defines are there because older systems don't have them in link.h. _DYNAMIC is a symbol that gives access to the PT_DYNAMIC segment at runtime, and the __executable_start symbol gives access to the base address of the program, which non-relocated addresses in the binary are relative to.
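
For reference, on a 64-bit system, ElfW(Dyn) expands to Elf64_Dyn, which elf.h defines along these lines; it's a simple tag/value pair, which is also why we'll later be able to read the PT_DYNAMIC segment as blocks of 16 bytes:

typedef struct {
  Elf64_Sxword d_tag;   /* DT_* tag, e.g. DT_RELR */
  union {
    Elf64_Xword d_val;  /* for integer values, e.g. DT_RELRSZ */
    Elf64_Addr d_ptr;   /* for address values, e.g. DT_RELR */
  } d_un;
} Elf64_Dyn;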

Now we're ready for the real work:

void real_init() {
  // Find the relocations section.
  ElfW(Addr) relr = 0;
  ElfW(Word) size = 0;
  for (ElfW(Dyn) *dyn = _DYNAMIC; dyn->d_tag != DT_NULL; dyn++) {
    if (dyn->d_tag == DT_RELR) {
      relr = dyn->d_un.d_ptr;
    }
    if (dyn->d_tag == DT_RELRSZ) {
      size = dyn->d_un.d_val;
    }
  }
  uintptr_t elf_header = (uintptr_t)&__executable_start;

  // Apply the relocations.
  ElfW(Addr) *ptr, *start, *end;
  start = (ElfW(Addr) *)(elf_header + relr);
  end = (ElfW(Addr) *)(elf_header + relr + size);
  for (ElfW(Addr) *entry = start; entry < end; entry++) {
    if ((*entry & 1) == 0) {
      ptr = (ElfW(Addr) *)(elf_header + *entry);
      *ptr += elf_header;
    } else {
      size_t remaining = 8 * sizeof(ElfW(Addr)) - 1;
      ElfW(Addr) bits = *entry;
      do {
        bits >>= 1;
        remaining--;
        ptr++;
        if (bits & 1) {
          *ptr += elf_header;
        }
      } while (bits);
      ptr += remaining;
    }
  }
}

It's all kind of boring here. We scan the PT_DYNAMIC segment to get the location and size of the packed relocations section, and then read and apply them.
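
To make the encoding concrete, here's what the loop does with a hypothetical pair of 64-bit entries (addresses made up):

// 0x3000: even entry: relocate the word at base + 0x3000.
// 0x0007: odd entry: bit 0 is the format marker; bits 1 and 2 being set
//         means the two words that follow, at base + 0x3008 and
//         base + 0x3010, get relocated too.

A single odd entry can thus cover the 63 words following the last relocated address, which is where the bulk of the size reduction comes from.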

And does it work?

$ ./relr-test
Hello, world

It does! Mission accomplished? If only...

The devil is in the details

Let's try running this same binary on a system with a more recent glibc:

$ ./relr-test 
./relr-test: error while loading shared libraries: ./relr-test: DT_RELR without GLIBC_ABI_DT_RELR dependency

Oh come on! Yes, glibc insists that when the PT_DYNAMIC segment contains these types of relocations, the binary must have that symbol version dependency. The same symbol version dependency we need to avoid in order to work on older systems. I have no idea why the glibc developers went out of their way to prevent that. Someone even asked about it while this was all at the patch stage, and got no answer.

We'll figure out a workaround later. Let's use -Wl,-z,pack-relative-relocs for now and see how it goes.

$ ./relr-test 
Segmentation fault

Oops. Well, that actually didn't happen when I was first testing, but for the purpose of this post, I didn't want to touch this topic before it was strictly necessary. Because we're now running on a system that does support the packed relocations, the relocations have already been applied by the time our initialization code runs, and we're applying them again. Every relocated address gets the base address added a second time, which points it at unmapped memory.

But how can we know whether relocations were applied? Well, conveniently, the address of a function, from within that function, doesn't need a relative relocation to be known. That's one half. The other half requires "something" that uses a relative relocation to know that same address. We insert this before real_init, but after its forward declaration:

void (*__real_init)() = real_init;

Because it's a global variable holding the address of the function, it requires a relocation. And because the function is static, within the same compilation unit, that relocation is a relative one, not one that would require symbol resolution.

Now we can add this at the beginning of real_init:

  // Don't apply relocations when the dynamic loader has applied them already.
  if (__real_init == real_init) {
    return;
  }

And we're done. This works:

$ ./relr-test 
Hello, world

Unfortunately, we're back to square one on an older system:

$ ./relr-test 
./relr-test: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_ABI_DT_RELR' not found (required by ./relr-test)

Hacking the ELF format, again

And here we go again, having to post-process a binary. So what do we need this time around? Well, starting from a binary linked with --pack-dyn-relocs=relr, we need to avoid the "DT_RELR without GLIBC_ABI_DT_RELR" check. If we change the PT_DYNAMIC segment such that it doesn't contain DT_RELR-related tags, the error will be avoided. Sadly, that means we'll always apply relocations ourselves, but so be it.

How do we do that? Open the file, find the PT_DYNAMIC segment, scan it, overwrite a few tags with a different value, and done. Damn, that's much less work than everything elfhack was doing. I will spare you the full ELF parsing code required to do that; it can trivially be done in a hex editor, which is less stuff to write here anyway, and still allows you to follow along at home.
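
(That said, if you'd rather script it than poke bytes by hand, a post-processor could look roughly like this; a minimal sketch only, assuming a 64-bit little-endian binary, with all error handling omitted:)

#include <elf.h>
#include <stdio.h>

#ifndef DT_RELRSZ
#define DT_RELRSZ 35
#endif
#ifndef DT_RELR
#define DT_RELR 36
#endif
#ifndef DT_RELRENT
#define DT_RELRENT 37
#endif

int main(int argc, char *argv[]) {
  FILE *f = fopen(argv[1], "r+b");
  Elf64_Ehdr ehdr;
  fread(&ehdr, sizeof(ehdr), 1, f);
  // Find the PT_DYNAMIC segment in the program headers.
  for (int i = 0; i < ehdr.e_phnum; i++) {
    Elf64_Phdr phdr;
    fseek(f, ehdr.e_phoff + i * sizeof(phdr), SEEK_SET);
    fread(&phdr, sizeof(phdr), 1, f);
    if (phdr.p_type != PT_DYNAMIC)
      continue;
    // Scan its entries, hiding the DT_RELR* tags behind a high bit.
    for (Elf64_Xword off = 0; off < phdr.p_filesz; off += sizeof(Elf64_Dyn)) {
      Elf64_Dyn dyn;
      fseek(f, phdr.p_offset + off, SEEK_SET);
      fread(&dyn, sizeof(dyn), 1, f);
      if (dyn.d_tag == DT_NULL)
        break;
      if (dyn.d_tag == DT_RELR || dyn.d_tag == DT_RELRSZ ||
          dyn.d_tag == DT_RELRENT) {
        dyn.d_tag |= 0x80000000;
        fseek(f, phdr.p_offset + off, SEEK_SET);
        fwrite(&dyn, sizeof(dyn), 1, f);
      }
    }
  }
  fclose(f);
  return 0;
}

Anyway, back to the hex editor approach.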

Let's start from that binary we built earlier with --pack-dyn-relocs=relr.

$ objcopy --dump-section .dynamic=dyn relr-test

We now have a dyn file with the contents of the PT_DYNAMIC segment.

In that segment, each block of 16 bytes (assuming a 64-bit system) stores an 8-byte tag and an 8-byte value. We want to change the DT_RELR, DT_RELRSZ and DT_RELRENT tags. Their hex values are, respectively, 0x24, 0x23 and 0x25.

$ xxd dyn | grep 2[345]00
00000060: 2400 0000 0000 0000 6804 0000 0000 0000  $.......h.......
00000070: 2300 0000 0000 0000 1000 0000 0000 0000  #...............
00000080: 2500 0000 0000 0000 0800 0000 0000 0000  %...............

(we got a little lucky here: the pattern doesn't match anything other than the tags)

Let's set an extra arbitrary high-ish bit.

$ xxd dyn | sed -n '/: 2[345]00/s/ 0000/ 0080/p'
00000060: 2400 0080 0000 0000 6804 0000 0000 0000  $.......h.......
00000070: 2300 0080 0000 0000 1000 0000 0000 0000  #...............
00000080: 2500 0080 0000 0000 0800 0000 0000 0000  %...............

This went well, let's do it for real.

$ xxd dyn | sed '/: 2[345]00/s/ 0000/ 0080/' | xxd -r > dyn.new
$ objcopy --update-section .dynamic=dyn.new relr-test

Let me tell you I'm glad we're in 2023, because these objcopy options we just used didn't exist 12+ years ago.

So, how did it go?

$ ./relr-test 
Segmentation fault

Uh oh. Well duh, we didn't change the code that applies the relocations, so it can't find the packed relocation section.

Let's edit the loop to use this:

    if (dyn->d_tag == (DT_RELR | 0x80000000)) {
      relr = dyn->d_un.d_ptr;
    }
    if (dyn->d_tag == (DT_RELRSZ | 0x80000000)) {
      size = dyn->d_un.d_val;
    }

And start over:

$ clang -fuse-ld=lld -Wl,--pack-dyn-relocs=relr,--entry=my_start,-z,norelro -o relr-test relr-test.c
$ objcopy --dump-section .dynamic=dyn relr-test
$ xxd dyn | sed '/: 2[345]00/s/ 0000/ 0080/' | xxd -r > dyn.new
$ objcopy --update-section .dynamic=dyn.new relr-test
$ ./relr-test
Hello, world

Copy over to the newer system, and try:

$ ./relr-test
Hello, world

Flawless victory. We now have a binary that works on both old and new systems, using packed relocations created by the linker, with barely any post-processing of the binary (and we don't need that if (__real_init == real_init) check anymore).

Generalizing a little

Okay, so while we're here, we'd rather use -z pack-relative-relocs because it works across more linkers, so we need to get rid of the GLIBC_ABI_DT_RELR symbol version dependency it adds, in order for the output to be more or less equivalent to what --pack-dyn-relocs=relr would produce.

$ clang -fuse-ld=lld -Wl,-z,pack-relative-relocs,--entry=my_start,-z,norelro -o relr-test relr-test.c

You know what, we might as well learn new things. Objcopy is nice, but as I was starting to write this section, I figured it was going to be annoying to do in the same style as above.

Have you heard of GNU poke? I saw a presentation about it at FOSDEM 2023 and hadn't had the occasion to try it; I guess this is the day to do that. We'll be using GNU poke 3.2 (the latest version as of writing).

Of course, that version doesn't contain the necessary bits. But this is Free Software, right? After a few patches, we're all set.

$ git clone https://git.savannah.gnu.org/git/poke/poke-elf
$ POKE_LOAD_PATH=poke-elf poke relr-test
(poke) load elf
(poke) var elf = Elf64_File @ 0#B

Let's get the section containing the symbol version information. It starts with a Verneed header.

(poke) var section = elf.get_sections_by_type(ELF_SHT_GNU_VERNEED)[0]
(poke) var verneed = Elf_Verneed @ section.sh_offset
(poke) verneed
Elf_Verneed {vn_version=1UH,vn_cnt=2UH,vn_file=110U,vn_aux=16U,vn_next=0U}

vn_file identifies the library file expected to contain those vn_cnt versions. Let's check that this is about libc. The section's sh_link tells us which entry of the section header table (shdr) corresponds to the string table that vn_file points into.

(poke) var strtab = elf.shdr[section.sh_link].sh_offset
(poke) string @ strtab + verneed.vn_file#B
"libc.so.6"

Bingo. Now let's scan the two (per vn_cnt) Vernaux entries that the Verneed header points to via vn_aux. The first one:

(poke) var off = section.sh_offset + verneed.vn_aux#B
(poke) var aux = Elf_Vernaux @ off
(poke) aux
Elf_Vernaux {vna_hash=157882997U,vna_flags=0UH,vna_other=2UH,vna_name=120U,vna_next=16U}
(poke) string @ strtab + aux.vna_name#B
"GLIBC_2.2.5"

And the second one, which vna_next points to.

(poke) var off = off + aux.vna_next#B
(poke) var aux2 = Elf_Vernaux @ off
(poke) aux2
Elf_Vernaux {vna_hash=16584258U,vna_flags=0UH,vna_other=3UH,vna_name=132U,vna_next=0U}
(poke) string @ strtab + aux2.vna_name#B
"GLIBC_ABI_DT_RELR"

This is it. This is the symbol version we want to get rid of. We could go on by adjusting vna_next in the first entry and reducing vn_cnt in the header, but thinking ahead to automating this for binaries that may contain more than two symbol versions from more than one dependency, it's just simpler to pretend this version is a repeat of the previous one. So we copy all its fields, except vna_next.

(poke) aux2.vna_hash = aux.vna_hash 
(poke) aux2.vna_flags = aux.vna_flags 
(poke) aux2.vna_other = aux.vna_other
(poke) aux2.vna_name = aux.vna_name

We could stop here and go back to the objcopy/xxd way of adjusting the PT_DYNAMIC segment, but while we're in poke, it can't hurt to try to do the adjustment with it.

(poke) var dyn = elf.get_sections_by_type(ELF_SHT_DYNAMIC)[0]
(poke) var dyn = Elf64_Dyn[dyn.sh_size / dyn.sh_entsize] @ dyn.sh_offset
(poke) for (d in dyn) if (d.d_tag in [ELF_DT_RELR,ELF_DT_RELRSZ,ELF_DT_RELRENT]) d.d_tag |= 0x80000000L
<stdin>:1:20: error: invalid operand in expression
<stdin>:1:20: error: expected uint<32>, got Elf64_Sxword

Gah, that seemed straightforward. It turns out the in operator is not lenient about integer types. Let's just use the plain values.

(poke) for (d in dyn) if (d.d_tag in [0x23L,0x24L,0x25L]) d.d_tag |= 0x80000000L
unhandled constraint violation exception
failed expression
  elf_config.check_enum ("dynamic-tag-typ                       elf_mach, d_tag)
in field Elf64_Dyn.d_tag

This time, it's because poke is actually validating the tag values, which is both a blessing and a curse. It can keep you from shooting yourself in the foot (after all, we're setting a non-existent value), but it can also hinder getting things done (before I even got this far, many of the d_tag values in the binary straight out of the linker weren't supported).

Let's make poke's validator know about the values we're about to set:

(poke) for (n in [0x23L,0x24L,0x25L]) elf_config.add_enum :class "dynamic-tag-types" :entries [Elf_Config_UInt { value = 0x80000000L | n }]
(poke) for (d in dyn) if (d.d_tag in [0x23L,0x24L,0x25L]) d.d_tag |= 0x80000000L
(poke) .exit
$ ./relr-test
Hello, world

And it works on the newer system too!

Repeating for a shared library

Let's set up a new testcase, using a shared library:

  • Take our previous testcase, and rename the main function to relr_test.
  • Compile it with clang -fuse-ld=lld -Wl,--pack-dyn-relocs=relr,--entry=my_start,-z,norelro -fPIC -shared -o librelr-test.so relr-test.c
  • Create a new file with the following content and compile it:
// Compile with: clang -o relr-test -L. -lrelr-test -Wl,-rpath,'$ORIGIN'
extern int relr_test(void);

int main(void) {
  return relr_test();
}
  • Apply the same GNU poke commands as before, on the librelr-test.so file.

So now, it should work, right?

$ ./relr-test
Segmentation fault

Oops. What's going on?

$ gdb -q -ex run -ex backtrace -ex detach -ex quit ./relr-test
Reading symbols from ./relr-test...
(No debugging symbols found in ./relr-test)
Starting program: /relr-test 
BFD: /librelr-test.so: unknown type [0x13] section `.relr.dyn'
warning: `/librelr-test.so': Shared library architecture unknown is not compatible with target architecture i386:x86-64.

Program received signal SIGSEGV, Segmentation fault.
0x00000000000016c0 in ?? ()
#0  0x00000000000016c0 in ?? ()
#1  0x00007ffff7fe1fe2 in call_init (l=<optimized out>, argc=argc@entry=1, argv=argv@entry=0x7fffffffdfc8, 
    env=env@entry=0x7fffffffdfd8) at dl-init.c:72
#2  0x00007ffff7fe20e9 in call_init (env=0x7fffffffdfd8, argv=0x7fffffffdfc8, argc=1, l=<optimized out>) at dl-init.c:30
#3  _dl_init (main_map=0x7ffff7ffe180, argc=1, argv=0x7fffffffdfc8, env=0x7fffffffdfd8) at dl-init.c:119
#4  0x00007ffff7fd30ca in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#5  0x0000000000000001 in ?? ()
#6  0x00007fffffffe236 in ?? ()
#7  0x0000000000000000 in ?? ()
Detaching from program: /relr-test, process 3104868
[Inferior 1 (process 3104868) detached]

Side note: it looks like we'll also need to change some section types if we want to keep tools like gdb happy.

So, this is crashing when doing what looks like a jump/call to an address that is not relocated (seeing how low it is). Let's pull the libc6 source and see what's around dl-init.c:72:

addrs = (ElfW(Addr) *) (init_array->d_un.d_ptr + l->l_addr);
for (j = 0; j < jm; ++j)
  ((init_t) addrs[j]) (argc, argv, env);

This is when it goes through .init_array and calls each of the functions in the table. So, .init_array is not relocated, which means our initialization code hasn't run. But why? Well, that's because the ELF entry point is not used for shared libraries. So, we need to execute our code some other way. What runs on shared library loading? Well, functions from the .init_array table... but they need to be relocated: we've got ourselves a chicken and egg problem. Does something else run before that? It turns out that yes, right before that dl-init.c:72 code, there is this:

if (l->l_info[DT_INIT] != NULL)
  DL_CALL_DT_INIT(l, l->l_addr + l->l_info[DT_INIT]->d_un.d_ptr, argc, argv, env);

And the good news here is that it doesn't require DT_INIT to be relocated: that l_addr is the base address the loader used for the library, so it's relocating the address itself. Thank goodness.

So, how do we get a function in DT_INIT? Well... we already have one:

$ readelf -d librelr-test.so | grep '(INIT)'
 0x000000000000000c (INIT)               0x18a8
$ readelf -sW librelr-test.so | grep 18a8
     7: 00000000000018a8     0 FUNC    GLOBAL DEFAULT   13 _init
    20: 00000000000018a8     0 FUNC    GLOBAL DEFAULT   13 _init

So we want to wrap it similarly to what we did for _start, adding the following to the code of the library:

extern void _init();

void my_init() {
  real_init();
  _init();
}

And we replace --entry=my_start with --init=my_init when relinking librelr-test.so (while not forgetting all the GNU poke dance), and it finally works:

$ ./relr-test
Hello, world

(and obviously, it also works on the newer system too)

But does this work for Firefox?

We now have a manual procedure that gets us mostly what we want, and that works on two tiny testcases. But does it scale to Firefox? Before implementing the whole thing, let's test a little more. First, let's build two .o files based on our code so far, without the relr_test function: one with the my_init wrapper, the other with the my_start wrapper. We'll call the former relr-test-lib.o and the latter relr-test-bin.o (compile with clang -c -fPIC -O2).

Then, let's add the following to the .mozconfig we use to build Firefox:

export MOZ_PROGRAM_LDFLAGS="-Wl,-z,pack-relative-relocs,--entry=my_start,-z,norelro /path/to/relr-test-bin.o"
mk_add_options 'export EXTRA_DSO_LDOPTS="-Wl,-z,pack-relative-relocs,--init=my_init,-z,norelro /path/to/relr-test-lib.o"'

This leverages some arcane Firefox build system knowledge to get something minimally intrusive that uses the flags we need and injects our code. However, because of how the Firefox build system works, it also means some Rust build scripts will be compiled with these flags (unfortunately). In turn, this means those build scripts won't run on a system without packed relocation support in glibc, so we need to build Firefox on the newer system.

And because we're on the newer system, running this freshly built Firefox will just work, because the init code is skipped and the relocations are applied by the dynamic loader. Things will only get spicy when we start applying our hack to make our initialization code handle the relocations itself. Because Firefox is bigger than our previous testcases, scanning through to find the right versioned symbol to remove would be cumbersome, so we'll just skip that part. In fact, we can just use our first approach with objcopy, since it involves less work. After a successful build, let's first do that to libxul.so, the largest binary in Firefox.

$ objcopy --dump-section .dynamic=dyn obj-x86_64-pc-linux-gnu/dist/bin/libxul.so
$ xxd dyn | sed '/: 2[345]00/s/ 0000/ 0080/' | xxd -r > dyn.new
$ objcopy --update-section .dynamic=dyn.new obj-x86_64-pc-linux-gnu/dist/bin/libxul.so
$ ./mach run
 0:00.15 /path/to/obj-x86_64-pc-linux-gnu/dist/bin/firefox -no-remote -profile /path/to/obj-x86_64-pc-linux-gnu/tmp/profile-default
$ echo $?
245

Aaaand... it doesn't start. Let's try again in a debugger.

$ ./mach run --debug
<snip>
(gdb) run
<snip>
Thread 1 "firefox" received signal SIGSEGV, Segmentation fault.
real_init () at /tmp/relr-test.c:55
55          if ((*entry & 1) == 0) {

It's crashing while applying the relocations?! But why?

(gdb) print entry
$1 = (Elf64_Addr *) 0x303c8

That's way too small to be a valid address. What's going on? Let's start looking where this value is and where it comes from.

(gdb) print &entry
Address requested for identifier "entry" which is in register $rax

So where does the value of the rax register come from?

(gdb) set pagination off
(gdb) disassemble/m
<snip>
41          if (dyn->d_tag == (DT_RELR | 0x80000000)) {
42            relr = dyn->d_un.d_ptr;
   0x00007ffff2289f47 <+71>:    mov    (%rcx),%rax
<snip>
52        start = (ElfW(Addr) *)(elf_header + relr);
   0x00007ffff2289f54 <+84>:    add    0x681185(%rip),%rax        # 0x7ffff290b0e0
<snip>

So rax starts with the value from DT_RELR, and the value stored at the address 0x7ffff290b0e0 is added to it. What's at that address?

(gdb) print *(void**)0x7ffff290b0e0
$1 = (void *) 0x0

Well, no surprise here. Wanna bet it's another chicken and egg problem?

(gdb) info files
<snip>
        0x00007ffff28eaed8 - 0x00007ffff290b0e8 is .got in /path/to/obj-x86_64-pc-linux-gnu/dist/bin/libxul.so
<snip>

It's in the Global Offset Table, which is typically something that will have been relocated. It smells like there's a packed relocation for this, which would confirm our new chicken and egg problem. First, we find the non-relocated virtual address of the .got section in libxul.so.

$ readelf -SW obj-x86_64-pc-linux-gnu/dist/bin/libxul.so | grep '.got '
  [28] .got              PROGBITS        000000000ab7aed8 ab78ed8 020210 00  WA  0   0  8

So that 0x000000000ab7aed8 is loaded at 0x00007ffff28eaed8. Then we check if there's a relocation for the non-relocated virtual address of 0x7ffff290b0e0.

$ readelf -r obj-x86_64-pc-linux-gnu/dist/bin/libxul.so | grep -e Relocation -e $(printf %x $((0x7ffff290b0e0 - 0x00007ffff28eaed8 + 0x000000000ab7aed8)))
Relocation section '.rela.dyn' at offset 0x28028 contains 1404 entries:
Relocation section '.relr.dyn' at offset 0x303c8 contains 13406 entries:
000000000ab9b0e0
Relocation section '.rela.plt' at offset 0x4a6b8 contains 2635 entries:

And there is, and it is a RELR one, one of those we're supposed to apply ourselves... we're kind of doomed, aren't we? But how come this wasn't a problem with librelr-test.so? Let's find out by looking at the corresponding code there:

$ objdump -d librelr-test.so
<snip>
    11e1:       48 8b 05 30 21 00 00    mov    0x2130(%rip),%rax        # 3318 <__executable_start@Base>
<snip>
$ readelf -SW librelr-test.so
<snip>
  [20] .got              PROGBITS        0000000000003308 002308 000040 08  WA  0   0  8
<snip>
$ readelf -r librelr-test.so | grep -e Relocation -e 3318
Relocation section '.rela.dyn' at offset 0x450 contains 7 entries:
000000003318  000300000006 R_X86_64_GLOB_DAT 0000000000000000 __executable_start + 0
Relocation section '.rela.plt' at offset 0x4f8 contains 1 entry:
Relocation section '.relr.dyn' at offset 0x510 contains 3 entries:

We had a relocation through symbol resolution, which the dynamic loader applies before calling our initialization code. That's what saved us, but all things considered, that is not exactly great either.

How do we avoid this? Well, let's take a step back and consider why the GOT is being used. Our code is just taking the address of __executable_start, and the compiler doesn't know where that symbol is (it's extern). Since it doesn't know where it is, nor whether it will end up in the same binary, and because we are building Position Independent Code, it goes through the GOT, and a relocation will put the right address in the GOT at load time. At link time, when the linker knows the symbol is in the same binary, that ends up being a relative relocation, which causes our problem.

So, how do we avoid using the GOT? By making the compiler aware that the symbol is eventually going to be in the same binary, which we can do by marking it with the hidden visibility.

Replacing

extern ElfW(Ehdr) __executable_start;

with

extern __attribute__((visibility("hidden"))) ElfW(Ehdr) __executable_start;

will do that for us. And after rebuilding, and re-hacking, our Firefox works, yay!
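
For the curious, the difference is visible in the generated code: instead of a mov loading the address from the GOT, as in the disassemblies above, the compiler can emit a PC-relative lea, whose offset is entirely resolved at link time, so no relocation is involved at all. Illustratively (the offset here is made up):

    lea    0x2130(%rip),%rax        # __executable_start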

Let's try other binaries

Let's now try with the main Firefox binary.

$ objcopy --dump-section .dynamic=dyn obj-x86_64-pc-linux-gnu/dist/bin/firefox
$ xxd dyn | sed '/: 2[345]00/s/ 0000/ 0080/' | xxd -r > dyn.new
$ objcopy --update-section .dynamic=dyn.new obj-x86_64-pc-linux-gnu/dist/bin/firefox
$ ./mach run
 0:00.15 /path/to/obj-x86_64-pc-linux-gnu/dist/bin/firefox -no-remote -profile /path/to/obj-x86_64-pc-linux-gnu/tmp/profile-default
$ echo $?
245

We crashed again. Come on! What is it this time?

$ ./mach run --debug
<snip>
(gdb) run
<snip>
Program received signal SIGSEGV, Segmentation fault.
0x0000000000032370 in ?? ()
(gdb) bt
#0  0x0000000000032370 in ?? ()
#1  0x00005555555977be in phc_init (aMallocTable=0x7fffffffdb38, aBridge=0x555555626778 <gReplaceMallocBridge>)
    at /path/to/memory/replace/phc/PHC.cpp:1700
#2  0x00005555555817c5 in init () at /path/to/memory/build/mozjemalloc.cpp:5213
#3  0x000055555558196c in Allocator<ReplaceMallocBase>::malloc (arg1=72704) at /path/to/memory/build/malloc_decls.h:51
#4  malloc (arg1=72704) at /path/to/memory/build/malloc_decls.h:51
#5  0x00007ffff7ca57ba in (anonymous namespace)::pool::pool (this=0x7ffff7e162c0 <(anonymous namespace)::emergency_pool>)
    at ../../../../src/libstdc++-v3/libsupc++/eh_alloc.cc:123
#6  __static_initialization_and_destruction_0 (__priority=65535, __initialize_p=1)
    at ../../../../src/libstdc++-v3/libsupc++/eh_alloc.cc:262
#7  _GLOBAL__sub_I_eh_alloc.cc(void) () at ../../../../src/libstdc++-v3/libsupc++/eh_alloc.cc:338
#8  0x00007ffff7fcfabe in call_init (env=0x7fffffffdd00, argv=0x7fffffffdcd8, argc=4, l=<optimized out>) at ./elf/dl-init.c:70
#9  call_init (l=<optimized out>, argc=4, argv=0x7fffffffdcd8, env=0x7fffffffdd00) at ./elf/dl-init.c:26
#10 0x00007ffff7fcfba4 in _dl_init (main_map=0x7ffff7ffe2e0, argc=4, argv=0x7fffffffdcd8, env=0x7fffffffdd00) at ./elf/dl-init.c:117
#11 0x00007ffff7fe5a60 in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#12 0x0000000000000004 in ?? ()
#13 0x00007fffffffdfae in ?? ()
#14 0x00007fffffffdfe2 in ?? ()
#15 0x00007fffffffdfed in ?? ()
#16 0x00007fffffffdff6 in ?? ()
#17 0x0000000000000000 in ?? ()
(gdb) info symbol 0x00007ffff7ca57ba
_GLOBAL__sub_I_eh_alloc.cc + 58 in section .text of /lib/x86_64-linux-gnu/libstdc++.so.6

Oh boy! So here, what's going on is that the libstdc++ initializer is called before Firefox's, and that initializer calls malloc, which is provided by the Firefox binary, but because Firefox's initializer hasn't run yet, the code in its allocator that depends on relative relocations fails...

Let's... just work around this by disabling the feature of the Firefox allocator that requires those relocations:

ac_add_options --disable-replace-malloc

Rebuild, re-hack, and... Victory is mine!

Getting this in production

So far, we've looked at how we can achieve the same as elfhack with a simpler and more reliable strategy, one that will allow us to consistently use lld across platforms and build types. Now that the approach has been validated, we can proceed with writing the actual code and hooking it into the Firefox build system. Our strategy here will be for our new tool to act as the linker: it will take all the arguments the compiler passes it, and will itself call the real linker with all the required extra arguments, including the object file containing the code to apply the relocations.

Of course, I also encountered some more grievances. For example, GNU ld doesn't define the __executable_start symbol when linking shared libraries, contrary to lld. Thankfully, it defines __ehdr_start, with the same meaning (and so does lld). There are also some details I left out about the _init function, which normally takes 3 arguments, and which the actual solution will have to deal with. It will also have to deal with "Relocation Read-Only" (relro), but for that, we can just reuse the code from elfhack.
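
To illustrate that last point about _init: the dl-init.c excerpt earlier showed glibc calling DT_INIT with (argc, argv, env), so the wrapper in the actual solution presumably ends up looking more like this sketch (reusing the names from our testcase):

extern void _init(int argc, char **argv, char **env);

void my_init(int argc, char **argv, char **env) {
  real_init();
  _init(argc, argv, env);
}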

The code already exists and is up for review (this post was written in large part to give reviewers some extra background). The code handles desktop Linux for now (Android support will come later; it will require a couple of adjustments), and is limited to shared libraries (until the allocator is changed to avoid relying on relative relocations). It's also significantly smaller than elfhack.

$ loc build/unix/elfhack/elf*
--------------------------------------------------------------------------------
 Language             Files        Lines        Blank      Comment         Code
--------------------------------------------------------------------------------
 C++                      2         2393          230          302         1861
 C/C++ Header             1          701          120           17          564
--------------------------------------------------------------------------------
 Total                    3         3094          350          319         2425
--------------------------------------------------------------------------------
$ loc build/unix/elfhack/relr* 
--------------------------------------------------------------------------------
 Language             Files        Lines        Blank      Comment         Code
--------------------------------------------------------------------------------
 C++                      1          443           32           62          349
 C/C++ Header             1           25            5            3           17
--------------------------------------------------------------------------------
 Total                    2          468           37           65          366
--------------------------------------------------------------------------------

(this excludes the code to apply relocations, which is shared between both)

This is the beginning of the end for elfhack. Once "relrhack" is enabled in its place, elfhack will be kept around for Firefox downstream builds on systems with older linkers that don't support the necessary flags, and will eventually be removed when support for those systems is dropped, in a few years. Further down the line, we'll be able to retire both tools, as support for RELR relocations becomes ubiquitous.

As anticipated, this was a long post. Thank you for sticking with it to the end.

2023-08-30 11:16:38+0900


April 1st, 2023

Announcing git-cinnabar 0.6.0

Git-cinnabar is a git remote helper to interact with mercurial repositories. It allows cloning, pulling and pushing from/to mercurial remote repositories, using git.

Get it on github.

These release notes are also available on the git-cinnabar wiki.

What's new since 0.5.11?

  • Full rewrite of the Python parts of git-cinnabar in Rust.
  • Push performance is between twice and 10 times faster than 0.5.x,
    depending on scenarios.
  • Based on git 2.38.0.
  • git cinnabar fetch now accepts a --tags flag to fetch tags.
  • git cinnabar bundle now accepts a -t flag to give a specific
    bundlespec.
  • git cinnabar rollback now accepts a --candidates flag to list the
    metadata sha1 that can be used as target of the rollback.
  • git cinnabar rollback now also accepts a --force flag to allow
    any commit sha1 as metadata.
  • git cinnabar now has a self-update subcommand that upgrades it
    when a new version is available. The subcommand is only available
    when building with the self-update feature (enabled on prebuilt
    versions of git-cinnabar).
  • Disabled inexact copy/rename detection, which was enabled by accident.

What's new since 0.6.0rc2?

  • Fixed use-after-free in metadata initialization.
  • Look for the new location of the CA bundle in git-windows 2.40.

2023-04-01 11:17:15+0900


October 30th, 2022

Announcing git-cinnabar 0.5.11 and 0.6.0rc2

Git-cinnabar is a git remote helper to interact with mercurial repositories. It allows cloning, pulling and pushing from/to mercurial remote repositories, using git.

Get version 0.5.11 on github. Or get version 0.6.0rc2 on github.

What's new in 0.5.11?

  • Fixed compatibility with python 3.11.
  • Disabled inexact copy/rename detection, which was enabled by accident.
  • Updated git to 2.38.1 for the helper.

What's new in 0.6.0rc2?

  • Improvements and bug fixes to git cinnabar self-update. Note: to upgrade
    from 0.6.0rc1, don't use the self-update command except on Windows. Please
    use the download.py script instead, or install from the release artifacts
    on https://github.com/glandium/git-cinnabar/releases/tag/0.6.0rc2.
  • Disabled inexact copy/rename detection, which was enabled by accident.
  • Removed dependencies on msys DLLs on Windows.
  • Based on git 2.38.1.
  • Other minor fixes.

2022-10-30 06:48:45+0900


October 4th, 2022

Announcing git-cinnabar 0.6.0rc1

Git-cinnabar is a git remote helper to interact with mercurial repositories. It allows cloning, pulling and pushing from/to mercurial remote repositories, using git.

Get it on github.

These release notes are also available on the git-cinnabar wiki.

What's new since 0.5.10?

  • Full rewrite of git-cinnabar in Rust.
  • Push performance is between twice and 10 times faster than 0.5.x, depending on scenarios.
  • Based on git 2.38.0.
  • git cinnabar fetch now accepts a --tags flag to fetch tags.
  • git cinnabar bundle now accepts a -t flag to give a specific bundlespec.
  • git cinnabar rollback now accepts a --candidates flag to list the metadata sha1 that can be used as target of the rollback.
  • git cinnabar rollback now also accepts a --force flag to allow any commit sha1 as metadata.
  • git cinnabar now has a self-update subcommand that upgrades it when a new version is available. The subcommand is only available when building with the self-update feature (enabled on prebuilt versions of git-cinnabar).

2022-10-04 07:26:05+0900


July 31st, 2022

Announcing git-cinnabar 0.5.10

Git-cinnabar is a git remote helper to interact with mercurial repositories. It allows cloning, pulling and pushing from/to mercurial remote repositories, using git.

Get it on github.

These release notes are also available on the git-cinnabar wiki.

What's new since 0.5.9?

  • Fixed exceptions during config initialization.
  • Fixed swapped error messages.
  • Fixed correctness issues with bundle chunks with no delta node.
  • This is probably the last 0.5.x release before 0.6.0.

2022-07-31 06:35:25+0900


July 16th, 2022

Announcing git-cinnabar 0.5.9

Git-cinnabar is a git remote helper to interact with mercurial repositories. It allows cloning, pulling and pushing from/to mercurial remote repositories, using git.

Get it on github.

These release notes are also available on the git-cinnabar wiki.

What's new since 0.5.8?

  • Updated git to 2.37.1 for the helper.
  • Various python 3 fixes.
  • Fixed stream bundle.
  • Added python and py.exe as executables tried on top of python3 and python2.
  • Improved handling of ill-formed local urls.
  • Fixed using old mercurial libraries that don't support bundlev2 with a server that does.
  • When fsck reports the metadata as broken, prevent further updates to the repo.
  • When issue #207 is detected, mark the metadata as broken.
  • Added support for logging redirection to a file.
  • Now ignore refs/cinnabar/replace/ refs, and always use the corresponding metadata instead.
  • Various git cinnabar fsck fixes.

2022-07-16 07:11:55+0900


November 20th, 2021

Announcing git-cinnabar 0.5.8

Git-cinnabar is a git remote helper to interact with mercurial repositories. It allows cloning, pulling and pushing from/to mercurial remote repositories, using git.

Get it on github.

These release notes are also available on the git-cinnabar wiki.

What's new since 0.5.7?

  • Updated git to 2.34.0 for the helper.
  • Python 3.5 and newer are now officially supported. Git-cinnabar will try to
    use the python3 program by default, but will fall back to python2.7 if
    that's where the Mercurial libraries are available. It is possible to pick
    a specific python with the GIT_CINNABAR_PYTHON environment variable.
  • Fixed compatibility with Mercurial 5.8 and newer.
  • The prebuilt binaries are now optimized on arm64 macOS and Windows.
  • git cinnabar download now properly returns an error code when failing to
    extract the prebuilt binaries.
  • Pushing to a non-empty Mercurial repository without having pulled at least
    once from it is now prevented.
  • Replaced the nagging about fsck with a smaller check always happening after
    pulling.
  • Fail earlier on git fetch hg::url <sha1> (it would properly fetch the
    Mercurial changeset and its ancestors, but git would fail at the end because
    the sha1 is not a git sha1; use git cinnabar fetch instead)
  • Minor fixes.

2021-11-20 07:05:57+0900


April 1st, 2021

Announcing git-cinnabar 0.5.7

Git-cinnabar is a git remote helper to interact with mercurial repositories. It allows cloning, pulling and pushing from/to mercurial remote repositories, using git.

Get it on github.

These release notes are also available on the git-cinnabar wiki.

What's new since 0.5.6?

  • Updated git to 2.31.1 for the helper.
  • When using git >= 2.31.0, git -c config=value ... works again.
  • Minor fixes.

2021-04-01 07:50:50+0900
