How I broke (some) crash reports

We recently figured out that elfhack has been responsible for broken crash reports from Fennec 4.0b5 and Firefox 4.0b11 and 4.0b12 Linux builds. We disabled elfhack by default as soon as it was noticed, meaning both Fennec and Firefox Release Candidates will be giving good crash reports. Unfortunately, this also means we lose the startup time improvements elfhack was giving.

Stack walking and memory mapping

Both the Fennec and the Firefox Linux crash report breakages are due to the same problem: Breakpad makes assumptions about memory mapping, and elfhack breaks them.

Usually, ELF binaries are loaded in memory in two segments, one for executable code, and one for data. Each segment has different access permissions, such that executable code can't be modified during execution, and such that data can't be executed as code.

When a crash occurs, the crash reporter stores a bunch of information in a minidump, most importantly a "Module" mapping (I'll come back to that later), as well as registers and stack contents for each thread. This minidump is then sent to us (if the user chooses to) for processing and the result is made available on crash-stats.

The most useful part of the processing is called stack walking. From the data stored in the minidump, correlated with symbol files we keep for each build, we can get a meaningful stack trace, which tells us where in our codebase the crash occurred, and what call path was taken to get there. Roughly, this is how it works (over-simplified):

  1. Take the current instruction pointer address
  2. Find the corresponding symbol
  3. Find the corresponding stack walking information
  4. From the stack walking information, compute the address where the code we're currently in was called from
  5. Repeat from step 2
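
To make the loop above concrete, here is a minimal sketch in Python. It is a toy model, not Breakpad's actual algorithm: the symbol table, the fixed frame sizes and the stack contents are all hypothetical, and the "stack walking information" is reduced to a single frame size per function.

```python
from bisect import bisect_right

# Toy symbol table: (start address, function name, frame size in bytes).
# Hypothetical values; real symbol files carry far richer unwind rules.
# A function is assumed to extend up to the start of the next one.
SYMBOLS = [
    (0x1000, "main", 16),
    (0x2000, "parse", 32),
    (0x3000, "crash_here", 8),
]
STARTS = [start for start, _, _ in SYMBOLS]


def find_symbol(address):
    """Steps 2 and 3: return (name, frame_size) for the function covering address."""
    index = bisect_right(STARTS, address) - 1
    if index < 0:
        return None
    _, name, frame_size = SYMBOLS[index]
    return name, frame_size


def walk_stack(ip, sp, stack):
    """Return function names from the crash point outwards."""
    trace = []
    while True:
        found = find_symbol(ip)
        if found is None:
            break
        name, frame_size = found
        trace.append(name)
        # Step 4: in this toy model the return address sits just above the frame.
        return_slot = sp + frame_size
        if return_slot not in stack:
            break  # nothing left to unwind
        ip = stack[return_slot]
        sp = return_slot + 8  # step 5: continue in the caller's frame
    return trace


# Stack memory as {address: value}; only return addresses matter here.
stack = {0x7008: 0x2040, 0x7030: 0x1010}
trace = walk_stack(ip=0x3004, sp=0x7000, stack=stack)
print(trace)
```

The point to notice is that every iteration depends on finding the right symbol for the current address: one wrong lookup and everything computed afterwards is wrong too.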

Steps 2 and 3 require that we can map a given memory address to a relative address in a given "Module". For the stack walking software, a "Module" corresponds to a given library or program loaded in memory. For the minidump creation software, a "Module" corresponds to a single memory segment. This is where problems arise.
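
As an illustration of that address-to-"Module" mapping, here is a minimal sketch in Python. The `Module` type, the `resolve` helper and all addresses are hypothetical and only meant to show the idea; they are not Breakpad's data structures.

```python
from typing import NamedTuple, Optional, Tuple


class Module(NamedTuple):
    name: str
    start: int  # address where the module begins in memory
    size: int   # number of bytes it covers


def resolve(modules, address) -> Optional[Tuple[str, int]]:
    """Return (module name, address relative to the module start), or None."""
    for module in modules:
        if module.start <= address < module.start + module.size:
            return module.name, address - module.start
    return None


# Hypothetical module list, as a minidump might describe it.
modules = [
    Module("libxul.so", 0x40000000, 0x1000000),
    Module("libc.so", 0x50000000, 0x100000),
]

hit = resolve(modules, 0x40001234)   # inside libxul.so
miss = resolve(modules, 0x60000000)  # not covered by any module
print(hit, miss)
```

The relative address is what gets looked up in the symbol file, so it only makes sense if the module start is the start of the whole library.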

As I wrote above, ELF binaries are usually loaded in two memory segments, so the minidump creation software is going to store each segment as a different "Module". Well, this is what it does on Android, because Fennec uses its own dynamic loader, and this custom dynamic loader, for unrelated reasons, was made to explicitly inform the minidump creation software about each segment.

In the desktop Linux case, the minidump creation software actually considers that segments which don't map the beginning of the underlying binary aren't to be stored at all. In practice, this means only the first segment is stored in the minidump for Firefox Linux builds, while all of them are stored for Android. In the common case where binaries are loaded in two segments, this isn't a problem at all: only the first segment contains code, so addresses we get during stack walking are always in that segment for each binary.

Enter elfhack

What elfhack does, on the other hand, is to split the code segment in two parts that end up being loaded in memory separately. Which means instead of two segments, we get three. Moreover, the first segment then only contains binary metadata (symbols, relocations, etc.), and the actual code is in the second segment.

  elfhack       normal
  Segment #1    Segment #1
  Segment #2    (Segment #1, continued)
  Segment #3    Segment #2

In the Linux case, where the minidump creation software only keeps the first segment, the addresses the stack walker gets actually won't map anywhere the minidump knows. As such, it can't know what function we're in, and it can't get the information required to walk the stack further.
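
The Linux failure can be reproduced with a small Python sketch. The mappings below are hypothetical, and the "keep only mappings at file offset 0" rule is a simplification of the behaviour described above, not the minidump writer's actual code.

```python
# Each mapping: (start address, end address, file offset, path).
# Hypothetical layouts for the same library, with and without elfhack.
NORMAL = [
    (0x40000000, 0x41000000, 0x0000000, "libxul.so"),  # metadata and code
    (0x41001000, 0x41100000, 0x1000000, "libxul.so"),  # data
]
ELFHACKED = [
    (0x40000000, 0x40100000, 0x0000000, "libxul.so"),  # metadata only
    (0x40100000, 0x41000000, 0x0100000, "libxul.so"),  # code
    (0x41001000, 0x41100000, 0x1000000, "libxul.so"),  # data
]


def recorded_modules(mappings):
    """Keep only the mappings that start at the beginning of the file."""
    return [m for m in mappings if m[2] == 0]


def is_known(modules, address):
    """True if the address falls inside one of the recorded modules."""
    return any(start <= address < end for start, end, _, _ in modules)


crash_address = 0x40234567  # somewhere in the code

normal_ok = is_known(recorded_modules(NORMAL), crash_address)
elfhack_ok = is_known(recorded_modules(ELFHACKED), crash_address)
print(normal_ok, elfhack_ok)
```

With the normal layout the crash address is covered by the one recorded module; with the elfhacked layout it falls in a segment that was never recorded.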

In the Android case, where all segments are considered separate "Modules", addresses it gets during stack walking do map somewhere the minidump knows. Except that when Breakpad resolves symbols it uses addresses relative to the start of each segment/"Module", while the correct behaviour would be to use addresses relative to the start of the first segment for a given library. Where it gets interesting is that since libxul.so is so big, the relative address within the second segment is very likely to hit a portion of code when taken relative to the start of the first segment.

So Breakpad is actually likely to find a symbol corresponding to the crash address, as well as stack walking information. Breakpad is happy to use these to compute the call location, but the result ends up being very wrong, since actual register and stack contents don't fit what should be there if the code where Breakpad thinks we are had really been executed. With some luck, the computed call location also ends up in libxul.so, at a virtually random location. And so on and so forth.

This is why some of the Fennec crash reports had impossible stack traces, with functions that never call each other.
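
The Android mix-up boils down to subtracting the wrong base address. Here is a minimal Python sketch of it; the symbol names, offsets and load addresses are all hypothetical.

```python
from bisect import bisect_right

# Hypothetical symbols, keyed by offset from the start of the library
# (that is, from the start of its first segment).
SYMBOLS = [(0x100000, "func_a"), (0x180000, "func_b"), (0x250000, "func_c")]
OFFSETS = [offset for offset, _ in SYMBOLS]


def symbol_at(offset):
    """Return the name of the last symbol starting at or before offset."""
    index = bisect_right(OFFSETS, offset) - 1
    return SYMBOLS[index][1] if index >= 0 else None


LIBRARY_START = 0x40000000       # start of the first segment
CODE_SEGMENT_START = 0x40100000  # start of the second segment (code under elfhack)

crash_address = 0x40260000

# Correct: offset relative to the start of the library.
correct = symbol_at(crash_address - LIBRARY_START)
# What happens when each segment is its own "Module": offset relative to the segment.
wrong = symbol_at(crash_address - CODE_SEGMENT_START)
print(correct, wrong)
```

Both lookups succeed, which is what makes the bug nasty: the wrong one still returns a plausible function name, just not the one that crashed.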

Fixing crash reports

While disabling elfhack made new builds send minidumps that the stack walking software can handle, it didn't solve the issue with existing crash reports.

Fortunately, in the set of information that the minidump writer software stores, there are the raw contents of the /proc/$pid/maps file. This file, specific to Linux systems, contains the memory mapping of a given process, displaying which parts of what files are mapped where in the address space. This is not used by the processing software, but it allows us to figure out what the "Module" mapping would have been had elfhack not been used on the binary.
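
For reference, each line of /proc/$pid/maps has the form "start-end perms offset dev inode pathname". A minimal Python sketch of a parser could look like the following. This is not the code of the tools mentioned in this post; the `parse_maps` helper and the sample content (paths, addresses, permissions) are made up for illustration.

```python
def parse_maps(text):
    """Parse /proc/<pid>/maps content into a list of dicts."""
    entries = []
    for line in text.splitlines():
        # The pathname is optional and may contain spaces, hence the split limit.
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        start, end = (int(part, 16) for part in fields[0].split("-"))
        entries.append({
            "start": start,
            "end": end,
            "perms": fields[1],
            "offset": int(fields[2], 16),
            "path": fields[5].strip() if len(fields) > 5 else "",
        })
    return entries


# Hypothetical sample, shaped like an elfhacked library plus the stack.
SAMPLE = """\
40000000-40100000 r-xp 00000000 08:01 1234 /usr/lib/libxul.so
40100000-41000000 r-xp 00100000 08:01 1234 /usr/lib/libxul.so
41001000-41100000 rw-p 01000000 08:01 1234 /usr/lib/libxul.so
bf000000-bf021000 rw-p 00000000 00:00 0 [stack]
"""

entries = parse_maps(SAMPLE)
for entry in entries:
    print(hex(entry["start"]), hex(entry["end"]), hex(entry["offset"]), entry["path"])
```

Consecutive entries sharing the same path are the segments of one library, which is the information needed to rebuild a sane "Module" list.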

There are two possible approaches to get meaningful crash reports out of these broken minidumps: either modify the processing software so that it can read the /proc/$pid/maps data, or fix the minidumps themselves. I went with the latter, as it required less work for coding, testing and deploying; the former would have required actually updating the stack walking software, with all the risks that entails. The latter only had the risk of further corrupting crash reports that were already corrupted in the first place.

Making some assumptions on the way libraries are loaded in memory, I wrote two tools, one for Linux and one for Android, that reconstruct "Module" mapping information from the /proc/$pid/maps data (warning: the code is a quick and dirty hack).

People from the Socorro team then took over to test the tools on sample crash reports, and once we verified we were getting good results, they went further and systematically applied the fix on broken crash reports (some details can be found in bug 637680). Fortunately, there were "only" around 8000 crash reports for Fennec and around 8000 more for Firefox Linux, so it didn't take long to update the databases.

As of now, the top crashers list, as well as individual crash reports, are all fixed, and incoming ones are fixed up every hour.

Re-enabling elfhack

Obviously, we still want the gains from elfhack in the future, so we need to address the issues on the Breakpad end before re-enabling elfhack. Here again we have several possible implementations, but there is one that stands out for being simpler (and not requiring changes on the crash-stats server end).

Despite ELF binaries being loaded in several separate segments of memory, what the dynamic loader actually does is first reserve the whole memory area that's going to be used for the binary, including areas between segments, and then map the binary segments at the appropriate places. The minidump writer can just record that whole memory area as a single "Module", making the stack walking software happy.
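
The idea can be sketched in a few lines of Python. This is only an illustration, not the actual Breakpad change: the `whole_library_modules` helper and the addresses are hypothetical, and it assumes each library is loaded only once in the process.

```python
def whole_library_modules(mappings):
    """Collapse (start, end, path) mappings into one (start, end) span per path."""
    spans = {}
    for start, end, path in mappings:
        if path not in spans:
            spans[path] = (start, end)
        else:
            low, high = spans[path]
            # Extend the span so it covers every segment and the gaps in between.
            spans[path] = (min(low, start), max(high, end))
    return spans


# Hypothetical elfhacked library loaded in three segments.
mappings = [
    (0x40000000, 0x40100000, "libxul.so"),
    (0x40100000, 0x41000000, "libxul.so"),
    (0x41001000, 0x41100000, "libxul.so"),
    (0x50000000, 0x50100000, "libc.so"),
]

spans = whole_library_modules(mappings)
for path, (start, end) in spans.items():
    print(path, hex(start), hex(end))
```

With a single span per library, any address within it is taken relative to the start of the first segment, whether the library went through elfhack or not.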

2011-03-09 20:18:36+0900
