March 12th, 2017

When the memory allocator works against you

Cloning mozilla-central with git-cinnabar requires a lot of memory. Too much memory, actually, to fit in a 32-bit address space.

I hadn’t optimized for memory use in the first place. For instance, git-cinnabar keeps sha-1s in memory as hex values (40 bytes) rather than raw values (20 bytes). When I wrote the initial prototype, that didn’t matter much, and while it was close(ish) to the tipping point, it didn’t require more than 2GB of memory at the time.
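
To give a rough idea of the per-object difference, here is a quick check one can do with sys.getsizeof (the sha-1 value below is made up, and the exact numbers are for a 64-bit python 2.7 build; they are only indicative):

import binascii
import sys

hex_sha1 = "0123456789abcdef0123456789abcdef01234567"  # 40 hex characters
raw_sha1 = binascii.unhexlify(hex_sha1)                 # 20 raw bytes

# A python 2.7 str has about 37 bytes of overhead on top of its payload on
# 64-bit builds, so this prints something like "77 vs 57" bytes per sha-1,
# before even counting the containers holding them.
print("%d vs %d" % (sys.getsizeof(hex_sha1), sys.getsizeof(raw_sha1)))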

Time passed, and mozilla-central grew. I suspect the recent addition of several thousand commits and files has made things worse.

In order to come up with a plan to make things better (short or longer term), I needed data. So I added some basic memory resource tracking, and collected data while cloning mozilla-central.
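
(The tracking code itself is not particularly interesting; a minimal sketch of the kind of thing it does, reading RSS and shared pages for the current process from /proc on Linux, would look like the following. This is not the exact code I used, just the general idea.)

import resource

def memory_usage():
    """Return (rss, rss_minus_shared) in bytes for the current process."""
    page_size = resource.getpagesize()
    with open('/proc/self/statm') as f:
        # Fields are: size resident shared text lib data dt (in pages).
        fields = f.read().split()
    resident = int(fields[1]) * page_size
    shared = int(fields[2]) * page_size
    return resident, resident - shared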

I must admit, I was not ready for what I witnessed. Follow me for a tale of frustrations (plural).

I was expecting things to have gotten worse on the master branch (which I used for the data collection) because I am in the middle of some refactoring and made many changes that I suspected might have affected memory usage. I wasn’t, however, expecting to see the clone command using 10GB(!) of memory at peak usage across all processes.

(Note, those memory sizes are RSS, minus “shared”)

It was also taking an unexpectedly long time, but then, I hadn’t cloned a large repository like mozilla-central from scratch in a while, so I wasn’t sure if that was just related to its recent growth in size or something else. So I collected data on 0.4.0 as well.

Less time spent, less memory usage… ok. There’s definitely something wrong on master. But wait a minute, that slope from ~2GB to ~4GB on the git-remote-hg process doesn’t actually make any kind of sense. I mean, I’d understand it if it were starting and finishing with the “Import manifest” phase, but it starts in the middle of it, and ends long before it finishes. WTH?

First things first, since RSS can be a variety of things, I checked /proc/$pid/smaps and confirmed that most of it was, indeed, the heap.
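
(Nothing fancy there either; summing up the Rss of the [heap] mapping and of the anonymous mappings in /proc/$pid/smaps gives the picture. A sketch of that, with a made-up helper name, not the exact commands I ran:)

def heap_rss_kb(pid):
    """Sum the Rss of the [heap] and anonymous mappings of a process, in kB."""
    total = 0
    counted = False
    with open('/proc/%d/smaps' % pid) as f:
        for line in f:
            fields = line.split()
            if '-' in fields[0]:
                # Mapping header: "start-end perms offset dev inode [name]".
                name = fields[5] if len(fields) > 5 else ''
                counted = name in ('[heap]', '')
            elif fields[0] == 'Rss:' and counted:
                total += int(fields[1])
    return total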

That’s the point where you reach for Google, type something like “python memory profile” and find various tools. One of the results that I remembered having used in the past is guppy’s heapy.

Armed with pdb, I broke execution in the middle of the slope, and tried to get memory stats with heapy. SIGSEGV. Ouch.

Let’s try something else. I reached for objgraph and pympler. SIGSEGV. Ouch again.
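
For reference, the kind of incantation involved, whether typed at a pdb prompt or added to the code, is roughly the following (from memory; check the respective documentation):

# guppy/heapy: summarize all the objects it can reach.
from guppy import hpy
print(hpy().heap())

# objgraph: show the most common object types.
import objgraph
objgraph.show_most_common_types()

# pympler: similar idea to heapy, different implementation.
from pympler import muppy, summary
summary.print_(summary.summarize(muppy.get_objects()))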

Tried working around the crashes for a while (too long a while, retrospectively; hindsight is 20/20), and was somehow successful at avoiding them by peeking at a smaller set of objects. But whatever I did, despite being attached to a process that had 2.6GB RSS, I wasn’t able to find more than 1.3GB of data. This wasn’t adding up.

It surely didn’t help that getting to that point took close to an hour each time. Retrospectively, I wish I had investigated using something like Checkpoint/Restore in Userspace.

Anyways, after a while, I decided that I really wanted to try to see the whole picture, not smaller peeks here and there that might be missing something. So I resolved to look at the SIGSEGV I was getting when using pympler, collecting a core dump when it happened.

Guess what? The Debian python-dbg package does not contain the debug symbols for the python package. The core dump was useless.

Since I was expecting I’d have to fix something in python, I just downloaded its source and built it. Ran the command again, waited, and finally got a backtrace. First Google hit for the crashing function? The exact (unfixed) crash reported on the python bug tracker. No patch.

Crashing code is doing:

((f)->f_builtins != (f)->f_tstate->interp->builtins)

And (f)->f_tstate is NULL. Classic NULL deref.

Added a guard (assessing it wouldn’t break anything). Ran the command again. Waited. Again. SIGSEGV.

Facedesk. Another crash on the same line. Did I really use the patched python? Yes. But this time (f)->f_tstate->interp is NULL. Sigh.

Same player, shoot again.

Finally, no crash… but still stuck on only 1.3GB accounted for. Ok, I know not all python memory profiling tools are entirely reliable, let’s try heapy again. SIGSEGV. Sigh. No debug info on the heapy module, where the crash happens. Sigh. Rebuild the module with debug info, try again. The backtrace looks like heapy is recursing a lot. Look at %rsp, compare with the address space from /proc/$pid/maps. Confirmed. A stack overflow. Let’s do ugly things and increase the stack size in brutal ways.

Woohoo! Now heapy tells me there’s even less memory used than the 1.3GB I found so far. Like, half as much. Yeah, right.

I’m not clear on how I got there, but that’s when I found gdb-heap, a tool from Red Hat’s David Malcolm, and the associated talk “Dude, where’s my RAM?” A deep dive into how Python uses memory (slides).

With gdb attached, I would finally be able to rip python’s guts out and find where all the memory went. Or so I thought. The gdb-heap tool only found about 600MB. About as much as heapy did, for that matter, but it could be coincidental. Oh. Kay.

I don’t remember exactly what went through my mind then, but, since I was attached to a running process with gdb, I typed the following on the gdb prompt:

gdb> call malloc_stats()

And that’s when the truth was finally unveiled: the memory allocator had been acting up the whole time. The output was something like:

Arena 0:
system bytes    =  some number above (but close to) 2GB
in use bytes    =  some number above (but close to) 600MB

Yes, the glibc allocator was telling me that only about 600MB were actually allocated, while it was holding onto 2GB from the system. I must have hit a really bad allocation pattern that causes massive fragmentation.

One thing that David Malcolm’s talk taught me, though, is that python uses its own allocator for small sizes, so the glibc allocator doesn’t know about those allocations. And, roughly, adding the “in use bytes” it reported to the difference between RSS and what glibc said it got from the system somehow matches the 1.3GB I had found so far.

So it was time to see how those things evolved in time, during the entire clone process. I grabbed some new data, tracking the evolution of “system bytes” and “in use bytes”.
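
The same numbers can be polled without gdb, through glibc’s mallinfo() via ctypes, for instance. A sketch of that approach follows; note that mallinfo’s fields are plain ints, so they wrap around above 2GiB, which is a real problem with the sizes involved here, and one reason parsing the output of malloc_stats() is more practical in this case.

import ctypes

class Mallinfo(ctypes.Structure):
    # struct mallinfo, as defined in glibc's malloc.h (all plain ints).
    _fields_ = [(name, ctypes.c_int) for name in (
        'arena', 'ordblks', 'smblks', 'hblks', 'hblkhd',
        'usmblks', 'fsmblks', 'uordblks', 'fordblks', 'keepcost')]

libc = ctypes.CDLL('libc.so.6')
libc.mallinfo.restype = Mallinfo

def malloc_usage():
    info = libc.mallinfo()
    system_bytes = info.arena + info.hblkhd     # obtained from the kernel
    in_use_bytes = info.uordblks + info.hblkhd  # actually handed out
    return system_bytes, in_use_bytes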

There are two things of note on this data:

  • There is a relatively large gap between what the glibc allocator says it got from the system and the RSS (minus “shared”) size, which I expect corresponds to the small allocations that python handles itself.
  • Actual memory use is going down during the “Import manifests” phase, contrary to what the evolution of RSS suggests.

In fact, the latter is exactly how git-cinnabar is supposed to work: it reads changeset and manifest chunks, and holds onto them while importing files. Then it throws away those manifest and changeset chunks one by one as it imports them. There is, however, some extra bookkeeping that requires additional memory, but it’s expected to be less memory-consuming than keeping all the changeset and manifest chunks in memory.

At this point, a possible explanation I came up with was that since both python and glibc mmap() their own arenas, they might be interleaved in a way that doesn’t play well with the allocation pattern happening during the “Import manifest” phase (which, in fact, allocates and frees increasingly large buffers for each manifest, as manifests grow in size over mozilla-central’s history).

To put that theory to the test, I patched the python interpreter again, making it use malloc() instead of mmap() for its arenas.

“Aha!” I thought. That definitely looks much better. Less gap between what glibc says it requested from the system and the RSS size. And, more importantly, no runaway increase of memory usage in the middle of nowhere.

I was preparing myself to write a post about how mixing allocators could have unintended consequences. As a comparison point, I went ahead and ran another test, this time with the python allocator entirely disabled.

Heh. It turns out glibc was acting up all alone. So much for my (plausible) theory. (I still think mixing allocators can have unintended consequences.)

(Note, however, that the reason why the python allocator exists is valid: without it, the overall clone took almost 10 more minutes)

And since I had been getting all this data with 0.4.0, I gathered new data without the python allocator with the master branch.

This paints a rather different picture than the original data on that branch, with much less memory use regression than one would think. In fact, there isn’t much difference, except for the spike at the end, which got worse, and some of the noise during the “Import manifests” phase that got bigger, implying larger amounts of temporary memory used. The latter may contribute to the allocation patterns that throw glibc’s memory allocator off.

It turns out tracking memory usage in python 2.7 is rather painful, and not all the tools paint a complete picture of it. I hear python 3.x is somewhat better in that regard, and I hope it’s true, but at the moment, I’m stuck with 2.7. The most reliable tool I’ve used here, it turns out, is pympler. Or rebuilding the python interpreter without its allocator, and asking the system allocator what is allocated.

With all this data, I now have some defined problems to tackle, some easy (the spike at the end of the clone), and some less easy (working around glibc allocator’s behavior). I have a few hunches as to what kind of allocations are causing the runaway increase of RSS. Coincidentally, I’m half-way through a refactor of the code dealing with manifests, and it should help dealing with the issue.
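
(For what it’s worth, one classic mitigation for glibc holding onto too much memory is to explicitly ask it to give free memory back to the kernel, which can be done from python through ctypes, as sketched below. It can’t do anything about pages pinned by live allocations, though, so it’s a band-aid at best, not a fix for the fragmentation itself.)

import ctypes

libc = ctypes.CDLL('libc.so.6')

# Ask glibc to return whatever free memory it can to the kernel.
libc.malloc_trim(0)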

But that will be the subject of a subsequent post.

2017-03-12 10:47:12+0900

January 18th, 2017

Announcing git-cinnabar 0.4.0

Git-cinnabar is a git remote helper to interact with mercurial repositories. It allows you to clone, pull and push from/to mercurial remote repositories, using git.

Get it on github.

These release notes are also available on the git-cinnabar wiki.

What’s new since 0.3.2?

  • Various bug fixes.
  • Updated git to 2.11.0 for cinnabar-helper.
  • Now supports bundle2 for both fetch/clone and push (https://www.mercurial-scm.org/wiki/BundleFormat2).
  • Now supports git credential for HTTP authentication.
  • Now supports git push --dry-run.
  • Added a new git cinnabar fetch command to fetch a specific revision that is not necessarily a head.
  • Added a new git cinnabar download command to download a helper on platforms where one is available.
  • Removed upgrade path from repositories used with version < 0.3.0.
  • Experimental (and partial) support for using git-cinnabar without having mercurial installed.
  • Use a mercurial subprocess to access local mercurial repositories.
  • Cinnabar-helper now handles fast-import, with workarounds for performance issues on macOS.
  • Fixed some corner cases involving empty files. This prevented cloning Mozilla’s stylo incubator repository.
  • Fixed some correctness issues in file parenting when pushing changesets pulled from one mercurial repository to another.
  • Various improvements to the rules to build the helper.
  • Experimental (and slow) support for pushing merges, with caveats. See issue #20 for details about the current status.
  • Fail graft earlier when no commit was found to graft.
  • Allow graft to work with git version < 1.9.
  • Allow git cinnabar bundle to do the same grafting as git push.

2017-01-18 18:21:45+0900

December 20th, 2016

Announcing git-cinnabar 0.4.0 release candidate 2

Git-cinnabar is a git remote helper to interact with mercurial repositories. It allows you to clone, pull and push from/to mercurial remote repositories, using git.

Get it on github.

These release notes are also available on the git-cinnabar wiki.

What’s new since 0.4.0rc?

  • /!\ Warning /!\ If you have been using a version of the release branch between 0.4.0rc and 0.4.0rc2 (more precisely, in the range 0335aa1432bdb0a8b5bdbefa98f7c2cd95fc72d2^..0.4.0rc2^), and used git cinnabar download and run on Mac or Windows, please run git cinnabar download again with this version and then ensure your mercurial clones have not been corrupted by case-sensitivity issues by running git cinnabar fsck --manifests. If they contain sha1 mismatches, please reclone.
  • Updated git to 2.11.0 for cinnabar-helper.
  • Improvements to the git cinnabar download command.
  • Various small code cleanups.
  • Improvement to the experimental support for pushing merges.

2016-12-20 18:06:49+0900

December 2nd, 2016

Faster git-cinnabar graft of gecko-dev

Cloning Mozilla repositories from scratch with git-cinnabar can be a long process. Grafting them to gecko-dev is an equally long process.

The metadata git-cinnabar keeps is such that it can be exchanged, but it’s also structured in a way that doesn’t allow git push and fetch to do that efficiently, and pushing refs/cinnabar/metadata to github fails because it wants to push more than 2GB of data, which github doesn’t allow.

But with some munging before a push, it is possible to limit the push to a fraction of that size and stay within github’s limits. And conversely, some munging after fetching makes it possible to produce the metadata git-cinnabar wants.

The news here is that there is now a cinnabar head on https://github.com/glandium/gecko-dev that contains the munged metadata, and a script that fetches it and produces the git-cinnabar metadata in an existing clone of gecko-dev. An easy way to run it is to use the following command from a gecko-dev clone:

$ curl -sL https://gist.github.com/glandium/56a61454b2c3a1ad2cc269cc91292a56/raw/bfb66d417cd1ab07d96ebe64cdb83a4217703db9/import.py | git cinnabar python

On my machine, the process takes 8 minutes instead of more than an hour. Make sure you use git-cinnabar 0.4.0rc for this.

Please note this doesn’t bring the full metadata for gecko-dev, just the metadata as of yesterday. This may be updated irregularly in the future, but don’t count on that.

So, from there, you still need to add mercurial remotes and pull from them, as per the original workflow.

Planned changes for version 0.5 of git-cinnabar will alter the metadata format such that it will be exchangeable without munging, making the process simpler and faster.

2016-12-02 14:31:28+0900

November 29th, 2016

Announcing git-cinnabar 0.4.0 release candidate

Git-cinnabar is a git remote helper to interact with mercurial repositories. It allows you to clone, pull and push from/to mercurial remote repositories, using git.

Get it on github.

These release notes are also available on the git-cinnabar wiki.

What’s new since 0.4.0b3?

  • Updated git to 2.10.2 for cinnabar-helper.
  • Added a new git cinnabar download command to download a helper on platforms where one is available.
  • Fixed some corner cases with pack windows in the helper. This prevented cloning mozilla-central with the helper.
  • Fixed bundle2 support that broke cloning from a mercurial 4.0 server in some cases.
  • Fixed some corner cases involving empty files. This prevented cloning Mozilla’s stylo incubator repository.
  • Fixed some correctness issues in file parenting when pushing changesets pulled from one mercurial repository to another.
  • Various improvements to the rules to build the helper.
  • Experimental (and slow) support for pushing merges, with caveats. See issue #20 for details about the current status.

And since I realize I didn’t announce beta 3:

What’s new since 0.4.0b2?

  • Properly handle bundle2 errors, preventing git from believing a push happened when it didn’t. (0.3.x is unaffected)

2016-11-29 09:18:14+0900

July 25th, 2016

Announcing git-cinnabar 0.4.0 beta 2

Git-cinnabar is a git remote helper to interact with mercurial repositories. It allows you to clone, pull and push from/to mercurial remote repositories, using git.

Get it on github.

These release notes are also available on the git-cinnabar wiki.

What’s new since 0.4.0b1?

  • Some more bug fixes.
  • Updated git to 2.9.2 for cinnabar-helper.
  • Now supports `git push --dry-run`.
  • Added a new `git cinnabar fetch` command to fetch a specific revision that is not necessarily a head.
  • Some improvements to the experimental native wire protocol support.

2016-07-25 17:38:17+0900

July 8th, 2016

Are all integer overflows equal?

Background: I’ve been relearning Rust (more about that in a separate post, some time later), and in doing so, I chose to implement the low-level parts of git (I’ll touch on the why in that separate post I just promised).

Disclaimer: It’s Friday. This is not entirely(?) a serious post.

So, I was looking at Documentation/technical/index-format.txt, and saw:

32-bit number of index entries.

What? The index/staging area can’t handle more than ~4.3 billion files?

There I was, writing Rust code to write out the index.

try!(out.write_u32::<NetworkOrder>(self.entries.len()));

(For people familiar with the byteorder crate and wondering what NetworkOrder is, I have a use byteorder::BigEndian as NetworkOrder)

And the Rust compiler rightfully barfed:

error: mismatched types:
 expected `u32`,
    found `usize` [E0308]

And there I was, wondering: “mmmm should I just add as u32 and silently truncate or … hey what does git do?”

And it turns out, git uses an unsigned int to track the number of entries in the first place, so there is no truncation happening.

Then I thought “but what happens when cache_nr reaches the max?”

Well, it turns out there’s only one obvious place where the field is incremented.

What? Holy coffin nails, Batman! No overflow check?

Wait a second, look 3 lines above that:

ALLOC_GROW(istate->cache, istate->cache_nr + 1, istate->cache_alloc);

Yeah, obviously, if you’re incrementing cache_nr, you already have that many entries in memory. So, how big would that array be?

        struct cache_entry **cache;

So it’s an array of pointers; assuming 64-bit pointers, that’s … ~34.3 GB. But all those cache_nr entries are in memory too. How big is a cache entry?

struct cache_entry {
        struct hashmap_entry ent;
        struct stat_data ce_stat_data;
        unsigned int ce_mode;
        unsigned int ce_flags;
        unsigned int ce_namelen;
        unsigned int index;     /* for link extension */
        unsigned char sha1[20];
        char name[FLEX_ARRAY]; /* more */
};

So, 4 ints, 20 bytes, and as many bytes as necessary to hold a path. And two inline structs. How big are they?


struct hashmap_entry {
        struct hashmap_entry *next;
        unsigned int hash;
};

struct stat_data {
        struct cache_time sd_ctime;
        struct cache_time sd_mtime;
        unsigned int sd_dev;
        unsigned int sd_ino;
        unsigned int sd_uid;
        unsigned int sd_gid;
        unsigned int sd_size;
};

Woohoo, nested structs.

struct cache_time {
        uint32_t sec;
        uint32_t nsec;
};

So all in all, we’re looking at 1 + 2 + 2 + 5 + 4 32-bit integers, one 64-bit pointer, two 32-bit integers’ worth of padding, and 20 bytes of sha1, for a total of 92 bytes, not counting the variable size for file paths.

The average path length in mozilla-central, which only has slightly over 140 thousand of them, is 59 bytes (including the terminating NUL character).

Let’s conservatively assume our crazy repository would have the same average, making the average cache entry 151 bytes.

But memory allocators usually allocate more than requested. In this particular case, with the default allocator on GNU/Linux, it’s 156 (weirdly enough, it’s 152 on my machine).

156 times 4.3 billion… 670 GB. Plus the 34.3 from the array of pointers: 704.3 GB. Of RAM. Not counting the memory allocator overhead of handling that. Or all the other things git might have in memory as well (which apparently involves a hashmap, too, but I won’t look at that, I promise).
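
If you want to double-check the back-of-the-envelope numbers, including the allocator rounding, which malloc_usable_size can report, a few lines of python on a glibc system are enough. The 92-byte struct size and 59-byte average path are, of course, the assumptions from above.

import ctypes

libc = ctypes.CDLL('libc.so.6')
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.malloc_usable_size.restype = ctypes.c_size_t
libc.malloc_usable_size.argtypes = [ctypes.c_void_p]

entries = 2 ** 32                 # a full 32-bit cache_nr
avg_entry = 92 + 59               # struct cache_entry + average path length
rounded = libc.malloc_usable_size(libc.malloc(avg_entry))
print(rounded)                    # 152 or 156, depending on the system

pointers = 8 * entries            # the array of cache_entry pointers
payload = rounded * entries       # the cache entries themselves
print("%.1f GB" % ((pointers + payload) / 1e9))  # in the ~690-704 GB range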

I think one would have run out of memory before hitting that integer overflow.

Interestingly, looking at Documentation/technical/index-format.txt again, the on-disk format appears smaller, with 62 bytes per file instead of 92, so the corresponding index file would be smaller. (And in version 4, paths are prefix-compressed, so paths would be smaller too).

But having an index that large supposes those files are checked out. So let’s say I have an empty ext4 file system as large as possible (which I’m told is 2^60 bytes (1.15 billion gigabytes)). Creating a small empty ext4 tells me at least 10 inodes are allocated by default. I seem to remember there’s at least one reserved for the journal, there’s the top-level directory, and there’s lost+found; there apparently are more. Obviously, on that very large file system, we’d have a git repository. git init with an empty template creates 9 files and directories, so that’s 19 more inodes taken. But git init doesn’t create an index, and doesn’t have any objects. We’d thus have at least one file for our hundreds-of-gigabytes index, and at least 2 who-knows-how-big files for the objects (a pack and its index). How many inodes does that leave us with?

The Linux kernel source tells us the number of inodes in an ext4 file system is stored in a 32-bit integer.

So all in all, if we had an empty very large file system, we’d only be able to store, at best, 2^32 – 22 files… And we wouldn’t even be able to get cache_nr to overflow.

… while following the rules. Because the index can keep files that have been removed, it is actually possible to fill the index without filling the file system. After hours (days? months? years? decades?*) of running

seq 0 4294967296 | while read i; do touch $i; git update-index --add $i; rm $i; done

One should be able to reach the integer overflow. But that’d still require hundreds of gigabytes of disk space and even more RAM.

  • At the rate it was possible to add files to the index when I tried (yeah, I tried), for a few minutes, and assuming a constant rate, the estimate is close to 2 years. But the time spent reading and writing the index file increases linearly with its size, so the longer it’d run, the longer it’d take.

Ok, it’s actually much faster to do it hundreds of thousands of files at a time, with something like:

seq 0 100000 4294967296 | while read i; do j=$(seq $i $(($i + 99999))); touch $j; git update-index --add $j; rm $j; done

At the rate the first million files were added, still assuming a constant rate, it would take about a month on my machine. Considering that reading/writing a list of a million files is a thousand times faster than reading/writing a list of a billion files, and assuming a linear increase, we’re still talking about decades, and plentiful RAM. Fun fact: after leaving it running for 5 times as long as it had run for the first million files, it hasn’t even done half more…

One could generate the necessary hundreds-of-gigabytes index manually, that wouldn’t be too hard, and assuming it could be done at about 1 GB/s on a good machine with a good SSD, we’d be able to craft a close-to-explosion index within a few minutes. But we’d still lack the RAM to load it.

So, here is the open question: should I report that integer overflow?

Wow, that was some serious procrastination.

Edit: Epilogue: Actually, oops, there is a separate integer overflow on the reading side that can trigger a buffer overflow, that doesn’t actually require a large index, just a crafted header, demonstrating that yes, not all integer overflows are equal.

2016-07-08 13:18:34+0900

July 4th, 2016

Announcing git-cinnabar 0.4.0 beta 1

Git-cinnabar is a git remote helper to interact with mercurial repositories. It allows you to clone, pull and push from/to mercurial remote repositories, using git.

Get it on github.

These release notes are also available on the git-cinnabar wiki.

What’s new since 0.3.2?

  • Various bug fixes.
  • Updated git to 2.9 for cinnabar-helper.
  • Now supports bundle2 for both fetch/clone and push (https://www.mercurial-scm.org/wiki/BundleFormat2).
  • Now supports git credential for HTTP authentication.
  • Removed upgrade path from repositories used with version < 0.3.0.
  • Experimental (and partial) support for using git-cinnabar without having mercurial installed.
  • Use a mercurial subprocess to access local mercurial repositories.
  • Cinnabar-helper now handles fast-import, with workarounds for performance issues on macOS.

2016-07-04 12:46:15+0900

May 10th, 2016

Using git to access mercurial repositories, without mercurial

If you’ve been following this blog, you know I’ve been working on a git remote helper that gives access to mercurial repositories, named git-cinnabar. So far, it has been using libraries from mercurial itself in order to talk to local or remote repositories.

That is, until today. The current master branch now has experimental support for direct access to remote mercurial repositories, without mercurial.

2016-05-10 17:45:35+0900

April 28th, 2016

Announcing git-cinnabar 0.3.2

Git-cinnabar is a git remote helper to interact with mercurial repositories. It allows you to clone, pull and push from/to mercurial remote repositories, using git.

Get it on github.

These release notes are also available on the git-cinnabar wiki.

This is mostly a bug and regression-fixing release.

What’s new since 0.3.1?

  • Fixed a performance regression when cloning big repositories on OSX.
  • git configuration items with line breaks are now supported.
  • Fixed a number of issues with corner cases in mercurial data (such as, but not limited to, nodes with no first parent, malformed .hgtags, etc.).
  • Fixed a stack overflow, a buffer overflow and a use-after-free in cinnabar-helper.
  • Works better with git worktrees, or when called from subdirectories.
  • Updated git to 2.7.4 for cinnabar-helper.
  • Properly remove all refs meant to be removed when using git version lower than 2.1.

2016-04-28 00:42:56+0900
