Archive for March, 2017

Why is the git-cinnabar master branch slower to clone?

Apart from the memory considerations, one thing that the data presented in the "When the memory allocator works against you" post that I haven't touched in the followup posts is that there is a large difference in the time it takes to clone mozilla-central with git-cinnabar 0.4.0 vs. the master branch.

One thing that was mentioned in the first followup is that reducing the amount of realloc and substring copies made the cloning more than 15 minutes faster on master. But the same code exists in 0.4.0, so this isn't part of the difference.

So what's going on? Looking at the CPU usage during the clone is enlightening.

On 0.4.0:

On master:

(Note: the data gathering is flawed in some ways, which explains why the git-remote-hg process goes above 100%, which is not possible for this python process. The data is however good enough for the high level analysis that follows, so I didn't bother to get something more acurate)

On 0.4.0, the git-cinnabar-helper process was saturating one CPU core during the File import phase, and the git-remote-hg process was saturating one CPU core during the Manifest import phase. Overall, the sum of both processes usually used more than one and a half core.

On master, however, the total of both processes barely uses more than one CPU core.

What happened?

This and that happened.

Essentially, before those changes, git-remote-hg would send instructions to git-fast-import (technically, git-cinnabar-helper, but in this case it's only used as a wrapper for git-fast-import), and use marks to track the git objects that git-fast-import created.

After those changes, git-remote-hg asks git-fast-import the git object SHA1 of objects it just asked to be created. In other words, those changes replaced something asynchronous with something synchronous: while it used to be possible for git-remote-hg to work on the next file/manifest/changeset while git-fast-import was working on the previous one, it now waits.

The changes helped simplify the python code, but made the overall clone process much slower.

If I'm not mistaken, the only real use for that information is for the mapping of mercurial to git SHA1s, which is actually rarely used during the clone, except at the end, when storing it. So what I'm planning to do is to move that mapping to the git-cinnabar-helper process, which, incidentally, will kill not 2, but 3 birds with 1 stone:

It will restore the asynchronicity, obviously (at least, that's the expected main outcome).
Storing the mapping in the git-cinnabar-helper process is very likely to take less memory than what it currently takes in the git-remote-hg process. Even if it doesn't (which I doubt), that should still help stay under the 2GB limit of 32-bit processes.
The whole thing that spikes memory usage during the finalization phase, as seen in previous post, will just go away, because the git-cinnabar-helper process will just have prepared the git notes-like tree on its own.

So expect git-cinnabar 0.5 to get moar faster, and to use moar less memory.

2017-03-23 16:38:05+0900

cinnabar, p.m.o | No Comments »

Analyzing git-cinnabar memory use

In previous post, I was looking at the allocations git-cinnabar makes. While I had the data, I figured I'd also look how the memory use correlates with expectations based on repository data, to put things in perspective.

As a reminder, this is what the allocations look like (horizontal axis being the number of allocator function calls):

There are 7 different phases happening during a git clone using git-cinnabar, most of which can easily be identified on the graph above:

Negotiation.
During this phase, git-cinnabar talks to the mercurial server to determine what needs to be pulled. Once that is done, a getbundle request is emitted, which response is read in the next three phases. This phase is essentially invisible on the graph.
Reading changeset data.
The first thing that a mercurial server sends in the response for a getbundle request is changesets. They are sent in the RevChunk format. Translated to git, they become commit objects. But to create commit objects, we need the entire corresponding trees and files (blobs), which we don't have yet. So we keep this data in memory.

In the git clone analyzed here, there are 345643 changesets loaded in memory. Their raw size in RawChunk format is 237MB. I think by the end of this phase, we made 20 million allocator calls, have about 300MB of live data in about 840k allocations. (No certainty because I don't actually have definite data that would allow to correlate between the phases and allocator calls, and the memory usage change between this phase and next is not as clear-cut as with other phases). This puts us at less than 3 live allocations per changeset, with "only" about 60MB overhead over the raw data.
Reading manifest data.
In the stream we receive, manifests follow changesets. Each changeset points to one manifest ; several changesets can point to the same manifest. Manifests describe the content of the entire source code tree in a similar manner as git trees, except they are flat (there's one manifest for the entire tree, where git trees would reference other git trees for sub directories). And like git trees, they only map file paths to file SHA1s. The way they are currently stored by git-cinnabar (which is planned to change) requires knowing the corresponding git SHA1s for those files, and we haven't got those yet, so again, we keep everything in memory.

In the git clone analyzed here, there are 345398 manifests loaded in memory. Their raw size in RawChunk format is 1.18GB. By the end of this phase, we made 23 million more allocator calls, and have about 1.52GB of live data in about 1.86M allocations. We're still at less than 3 live allocations for each object (changeset or manifest) we're keeping in memory, and barely over 100MB of overhead over the raw data, which, on average puts the overhead at 150 bytes per object.

The three phases so far are relatively fast and account for a small part of the overall process, so they don't appear clear-cut to each other, and don't take much space on the graph.
Reading and Importing files.
After the manifests, we finally get files data, grouped by path, such that we get all the file revisions of e.g. .cargo/.gitignore, followed by all the file revisions of .cargo/config.in, .clang-format, and so on. The data here doesn't depend on anything else, so we can finally directly import the data.

This means that for each revision, we actually expand the RawChunk into the full file data (RawChunks contain patches against a previous revision), and don't keep the RawChunk around. We also don't keep the full data after it was sent to the git-cinnabar-helper process (as far as cloning is concerned, it's essentially a wrapper for git-fast-import), except for the previous revision of the file, which is likely the patch base for the next revision.

We however keep in memory one or two things for each file revision: a mapping of its mercurial SHA1 and the corresponding git SHA1 of the imported data, and, when there is one, the file metadata (containing information about file copy/renames) that lives as a header in the file data in mercurial, but can't be stored in the corresponding git blobs, otherwise we'd have irrelevant data in checkouts.

On the graph, this is where there is a steady and rather long increase of both live allocations and memory usage, in stairs for the latter.

In the git clone analyzed here, there are 2.02M file revisions, 78k of which have copy/move metadata for a cumulated size of 8.5MB of metadata. The raw size of the file revisions in RawChunk format is 3.85GB. The expanded data size is 67GB. By the end of this phase, we made 622 million more allocator calls, and peaked at about 2.05GB of live data in about 6.9M allocations. Compared to the beginning of this phase, that added about 530MB in 5 million allocations.

File metadata is stored in memory as python dicts, with 2 entries each, instead of raw form for convenience and future-proofing, so that would be at least 3 allocations each: one for each value, one for the dict, and maybe one for the dict storage ; their keys are all the same and are probably interned by python, so wouldn't count.

As mentioned above, we store a mapping of mercurial to git SHA1s, so for each file that makes 2 allocations, 4.04M total. Plus the 230k or 310k from metadata. Let's say 4.45M total. We're short 550k allocations, but considering the numbers involved, it would take less than one allocation per file on average to go over this count.

As for memory size, per this answer on stackoverflow, python strings have an overhead of 37 bytes, so each SHA1 (kept in hex form) will take 77 bytes (Note, that's partly why I didn't particularly care about storing them as binary form, that would only save 25%, not 50%). That's 311MB just for the SHA1s, to which the size of the mapping dict needs to be added. If it were a plain array of pointers to keys and values, it would take 2 * 8 bytes per file, or about 32MB. But that would be a hash table with no room for more items (By the way, I suspect the stairs that can be seen on the requested and in-use bytes is the hash table being realloc()ed). Plus at least 290 bytes per dict for each of the 78k metadata, which is an additional 22M. All in all, 530MB doesn't seem too much of a stretch.
Importing manifests.
At this point, we're done receiving data from the server, so we begin by
dropping objects related to the bundle we got from the server. On the graph, I assume this is the big dip that can be observed after the initial increase in memory use, bringing us down to 5.6 million allocations and 1.92GB.

Now begins the most time consuming process, as far as mozilla-central is concerned: transforming the manifests into git trees, while also storing enough data to be able to reconstruct manifests later (which is required to be able to pull from the mercurial server after the clone).

So for each manifest, we expand the RawChunk into the full manifest data, and generate new git trees from that. The latter is mostly performed by the git-cinnabar-helper process. Once we're done pushing data about a manifest to that process, we drop the corresponding data, except when we know it will be required later as the delta base for a subsequent RevChunk (which can happen in bundle2).

As with file revisions, for each manifest, we keep track of the mapping of SHA1s between mercurial and git. We also keep a DAG of the manifests history (contrary to git trees, mercurial manifests track their ancestry ; files do too, but git-cinnabar doesn't actually keep track of that separately ; it just relies on the manifests data to infer file ancestry).

On the graph, this is where the number of live allocations increases while both requested and in-use bytes decrease, noisily.

By the end of this phase, we made about 1 billion more allocator calls. Requested allocations went down to 1.02GB, for close to 7 million live allocations. Compared to the end of the dip at the beginning of this phase, that added 1.4 million allocations, and released 900MB. By now, we expect everything from the "Reading manifests" phase to have been released, which means we allocated around 620MB (1.52GB - 900MB), for a total of 3.26M additional allocations (1.4M + 1.86M).

We have a dict for the SHA1s mapping (345k * 77 * 2 for strings, plus the hash table with 345k items, so at least 60MB), and the DAG, which, now that I'm looking at memory usage, I figure has the one of the possibly worst structure, using 2 sets for each node (at least 232 bytes per set, that's at least 160MB, plus 2 hash tables with 345k items). I think 250MB for those data structures would be largely underestimated. It's not hard to imagine them taking 620MB, because really, that DAG implementation is awful. The number of allocations expected from them would be around 1.4M (4 * 345k), but I might be missing something. That's way less than the actual number, so it would be interesting to take a closer look, but not before doing something about the DAG itself.

Fun fact: the amount of data we're dealing with in this phase (the expanded size of all the manifests) is close to 2.9TB (yes, terabytes). With about 4700 seconds spent on this phase on a real clone (less with the release branch), we're still handling more than 615MB per second.
Importing changesets.
This is where we finally create the git commits corresponding to the mercurial changesets. For each changeset, we expand its RawChunk, find the git tree we created in the previous phase that corresponds to the associated manifest, and create a git commit for that tree, with the right date, author, and commit message. For data that appears in the mercurial changeset that can't be stored or doesn't make sense to store in the git commit (e.g. the manifest SHA1, the list of changed files[*], or some extra metadata like the source of rebases), we keep some metadata we'll store in git notes later on.

[*] Fun fact: the list of changed files stored in mercurial changesets does not necessarily match the list of files in a `git diff` between the corresponding git commit and its parents, for essentially two reasons:
- Old buggy versions of mercurial have generated erroneous lists that are now there forever (they are part of what makes the changeset SHA1).
- Mercurial may create new revisions for files even when the file content is not modified, most notably during merges (but that also happened on non-merges due to, presumably, bugs).
... so we keep it verbatim.

On the graph, this is where both requested and in-use bytes are only slightly increasing.

By the end of this phase, we made about half a billion more allocator calls. Requested allocations went up to 1.06GB, for close to 7.7 million live allocations. Compared to the end of the previous phase, that added 700k allocations, and 400MB. By now, we expect everything from the "Reading changesets" phase to have been released (at least the raw data we kept there), which means we may have allocated at most around 700MB (400MB + 300MB), for a total of 1.5M additional allocations (700k + 840k).

All these are extra data we keep for the next and final phase. It's hard to evaluate the exact size we'd expect here in memory, but if we divide by the number of changesets (345k), that's less than 5 allocations per changeset and less than 2KB per changeset, which is low enough not to raise eyebrows, at least for now.
Finalizing the clone.
The final phase is where we actually go ahead storing the mappings between mercurial and git SHA1s (all 2.7M of them), the git notes where we store the data necessary to recreate mercurial changesets from git commits, and a cache for mercurial tags.

On the graph, this is where the requested and in-use bytes, as well as the number of live allocations peak like crazy (up to 21M allocations for 2.27GB requested).

This is very much unwanted, but easily explained with the current state of the code. The way the mappings between mercurial and git SHA1s are stored is via a tree similar to how git notes are stored. So for each mercurial SHA1, we have a file that points to the corresponding git SHA1 through git links for commits or directly for blobs (look at the output of git ls-tree -r refs/cinnabar/metadata^3 if you're curious about the details). If I remember correctly, it's faster if the tree is created with an ordered list of paths, so the code created a list of paths, and then sorted it to send commands to create the tree. The former creates a new str of length 42 and a tuple of 3 elements for each and every one of the 2.7M mappings. With the 37 bytes overhead by str instance and the 56 + 3 * 8 bytes per tuple, we have at least 429MB wasted. Creating the tree itself keeps the corresponding fast-import commands in a buffer, where each command is going to be a tuple of 2 elements: a pointer to a method, and a str of length between 90 and 93. That's at least another 440MB wasted.

I already fixed the first half, but the second half still needs addressing.

Overall, except for the stupid spike during the final phase, the manifest DAG and the glibc allocator runaway memory use described in previous posts, there is nothing terribly bad with the git-cinnabar memory usage, all things considered. Mozilla-central is just big.

The spike is already half addressed, and work is under way for the glibc allocator runaway memory use. The manifest DAG, interestingly, is actually mostly useless. It's only used to track the heads of the DAG, and it's very much possible to track heads of a DAG without actually storing the entire DAG. In fact, that's what git-cinnabar already does for changeset heads... so we would only need to do the same for manifest heads.

One could argue that the 1.4GB of raw RevChunk data we're keeping in memory for later user could be kept on disk instead. I haven't done this so far because I didn't want to have to handle temporary files (and answer questions like "where to put them?", "what if there isn't enough disk space there?", "what if disk access is slow?", etc.). But the majority of this data is from manifests. I'm already planning changes in how git-cinnabar stores manifests data that will actually allow to import them directly, instead of keeping them in memory until files are imported. This would instantly remove 1.18GB of memory usage. The downside, however, is that this would be more CPU intensive: Importing changesets will require creating the corresponding git trees, and getting the stored manifest data. I think it's worth, though.

Finally, one thing that isn't obvious here, but that was found while analyzing why RSS would be going up despite memory usage going down, is that git-cinnabar is doing way too many reallocations and substring allocations.

So let's look at two metrics that hopefully will highlight the problem:

The cumulated amount of requested memory. That is, the sum of all sizes ever given to malloc, realloc, calloc, etc.
The compensated cumulated amount of requested memory (naming is hard). That is, the sum of all sizes ever given to malloc, calloc, etc. except realloc. For realloc, we only count the delta in size between what the size was before and after the realloc.

Assuming all the requested memory is filled at some point, the former gives us an upper bound to the amount of memory that is ever filled or copied (the amount that would be filled if no realloc was ever in-place), while the the latter gives us a lower bound (the amount that would be filled or copied if all reallocs were in-place).

Ideally, we'd want the upper and lower bounds to be close to each other (indicating few realloc calls), and the total amount at the end of the process to be as close as possible to the amount of data we're handling (which we've seen is around 3TB).

... and this is clearly bad. Like, really bad. But we already knew that from the previous post, although it's nice to put numbers on it. The lower bound is about twice the amount of data we're handling, and the upper bound is more than 10 times that amount. Clearly, we can do better.

We'll see how things evolve after the necessary code changes happen. Stay tuned.

2017-03-23 13:30:26+0900

cinnabar, p.m.o | No Comments »

When the memory allocator works against you, part 2

This is a followup to the "When the memory allocator works against you" post from a few days ago. You may want to read that one first if you haven't, and come back. In case you don't or didn't read it, it was all about memory consumption during a git clone of the mozilla-central mercurial repository using git-cinnabar, and how the glibc memory allocator is using more than one would expect. This post is going to explore how/why it's happening.

I happen to have written a basic memory allocation logger for Firefox, so I used it to log all the allocations happening during a git clone exhibiting the runaway memory increase behavior (using a python that doesn't use its own allocator for small allocations).

The result was a 6.5 GB log file (compressed with zstd ; 125 GB uncompressed!) with 2.7 billion calls to malloc, calloc, free, and realloc, recorded across (mostly) 2 processes (the python git-remote-hg process and the native git-cinnabar-helper process ; there are other short-lived processes involved, but they do less than 5000 calls in total).

The vast majority of those 2.7 billion calls is done by the python git-remote-hg process: 2.34 billion calls. We'll only focus on this process.

Replaying those 2.34 billion calls with a program that reads the log allowed to reproduce the runaway memory increase behavior to some extent. I went an extra mile and modified glibc's realloc code in memory so it doesn't call memcpy, to make things faster. I also ran under setarch x86_64 -R to disable ASLR for reproducible results (two consecutive runs return the exact same numbers, which doesn't happen with ASLR enabled).

I also modified the program to report the number of live allocations (allocations that haven't been freed yet), and the cumulated size of the actually requested allocations (that is, the sum of all the sizes given to malloc, calloc, and realloc calls for live allocations, as opposed to what the memory allocator really allocated, which can be more, per malloc_usable_size).

RSS was not tracked because the allocations are never filled to make things faster, such that pages for large allocations are never dirty, and RSS doesn't grow as much because of that.

Full disclosure: it turns out the "system bytes" and "in-use bytes" numbers I had been collecting in the previous post were smaller than what they should have been, and were excluding memory that the glibc memory allocator would have mmap()ed. That however doesn't affect the trends that had been witnessed. The data below is corrected.

(Note that in the graph above and the graphs that follow, the horizontal axis represents the number of allocator function calls performed)

While I was here, I figured I'd check how mozjemalloc performs, and it has a better behavior (although it has more overhead).

What doesn't appear on this graph, though, is that mozjemalloc also tells the OS to drop some pages even if it keeps them mapped (madvise(MADV_DONTNEED)), so in practice, it is possible the actual RSS decreases too.

And jemalloc 4.5:

(It looks like it has better memory usage than mozjemalloc for this use case, but its stats are being thrown off at some point, I'll have to investigate)

Going back to the first graph, let's get a closer look at what the allocations look like when the "system bytes" number is increasing a lot. The highlights in the following graphs indicate the range the next graph will be showing.

So what we have here is a bunch of small allocations (small enough that they don't seem to move the "requested" line ; most under 512 bytes, so under normal circumstances, they would be allocated by python, a few between 512 and 2048 bytes), and a few large allocations, one of which triggers a bump in memory use.

What can appear weird at first glance is that we have a large allocation not requiring more system memory, later followed by a smaller one that does. What the allocations actually look like is the following:


void *ptr0 = malloc(4850928); // #1391340418
void *ptr1 = realloc(some_old_ptr, 8000835); // #1391340419
free(ptr0); // #1391340420
ptr1 = realloc(ptr1, 8000925); // #1391340421
/* ... */
void *ptrn = malloc(879931); // #1391340465
ptr1 = realloc(ptr1, 8880819); // #1391340466
free(ptrn); // #1391340467

As it turns out, inspecting all the live allocations at that point, while there was a hole large enough to do the first two reallocs (the second actually happens in place), at the point of the third one, there wasn't a large enough hole to fit 8.8MB.

What inspecting the live allocations also reveals, is that there is a large number of large holes between all the allocated memory ranges, presumably coming from previous similar patterns. There are, in fact, 91 holes larger than 1MB, 24 of which are larger than 8MB. It's the accumulation of those holes that can't be used to fulfil larger allocations that makes the memory use explode. And there aren't enough small allocations happening to fill those holes. In fact, the global trend is for less and less memory to be allocated, so, smaller holes are also being poked all the time.

Really, it's all a straightforward case of memory fragmentation. The reason it tends not to happen with jemalloc is that jemalloc groups allocations by sizes, which the glibc allocator doesn't seem to be doing. The following is how we got a hole that couldn't fit the 8.8MB allocation in the first place:


ptr1 = realloc(ptr1, 8880467); // #1391324989; ptr1 is 0x5555de145600
/* ... */
void *ptrx = malloc(232); // #1391325001; ptrx is 0x5555de9bd760 ; that is 13 bytes after the end of ptr1.
/* ... */
free(ptr1); // #1391325728; this leaves a hole of 8880480 bytes at 0x5555de145600.

All would go well if ptrx was free()d, but it looks like it's long-lived. At least, it's still allocated by the time we reach the allocator call #1391340466. And since the hole is 339 bytes too short for the requested allocation, the allocator has no other choice than request more memory to the system.

What's bothering, though, is that the allocator chose to allocate ptrx in the space following ptr1, when it allocated similarly sized buffers after allocating ptr1 and before allocating ptrx in completely different places, and while there are plenty of holes in the allocated memory where it could fit.

Interestingly enough, ptrx is a 232 bytes allocation, which means under normal circumstances, python itself would be allocating it. In all likeliness, when the python allocator is enabled, it's allocations larger than 512 bytes that become obstacles to the larger allocations. Another possibility is that the 256KB fragments that the python allocator itself allocates to hold its own allocations become the obstacles (my original hypothesis). I think the former is more likely, though, putting back the blame entirely on glibc's shoulders.

Now, it looks like the allocation pattern we have here is suboptimal, so I re-ran a git clone under a debugger to catch when a realloc() for 8880819 bytes happens (the size is peculiar enough that it only happened once in the allocation log). But doing that with a conditional breakpoint is just too slow, so I injected a realloc wrapper with LD_PRELOAD that sends a SIGTRAP signal to the process, so that an attached debugger can catch it.

Thanks to the support for python in gdb, it was then posible to pinpoint the exact python instructions that made the realloc() call (it didn't come as a surprise ; in fact, that was one of the places I had in mind, but I wanted definite proof):


new = ''
end = 0
# ...
for diff in RevDiff(rev_patch):
    new += data[end:diff.start]
    new += diff.text_data
    end = diff.end
    # ...
new += data[end:]

What happens here is that we're creating a mercurial manifest we got from the server in patch form against a previous manifest. So data contains the original manifest, and rev_patch the patch. The patch essentially contains instructions of the form "replace n bytes at offset o with the content c".

The code here just does that in the most straightforward way, implementation-wise, but also, it turns out, possibly the worst way.

So let's unroll this loop over a couple iterations:

new = ''

This allocates an empty str object. In fact, this doesn't actually allocate anything, and only creates a pointer to an interned representation of an empty string.

new += data[0:diff.start]

This is where things start to get awful. data is a str, so data[0:diff.start] creates a new, separate, str for the substring. One allocation, one copy.

Then appends it to new. Fortunately, CPython is smart enough, and just assigns data[0:diff.start] to new. This can easily be verified with the CPython REPL:

>>> foo = ''
>>> bar = 'bar'
>>> foo += bar
>>> foo is bar
True

(and this is not happening because the example string is small here ; it also happens with larger strings, like 'bar' * 42000)

Back to our loop:

new += diff.text_data

Now, new is realloc()ated to have the right size to fit the appended text in it, and the contents of diff.text_data is copied. One realloc, one copy.

Let's go for a second iteration:

new += data[diff.end:new_diff.start]

Here again, we're doing an allocation for the substring, and one copy. Then new is realloc()ated again to append the substring, which is an additional copy.

new += new_diff.text_data

new is realloc()ated yet again to append the contents of new_diff.text_data.

We now finish with:

new += data[new_diff.end:]

which, again creates a substring from the data, and then proceeds to realloc()ate new one freaking more time.

That's a lot of malloc()s and realloc()s to be doing...

It is possible to limit the number of realloc()s by using new = bytearray() instead of new = ''. I haven't looked in the CPython code what the growth strategy is, but, for example, appending a 4KB string to a 500KB bytearray makes it grow to 600KB instead of 504KB, like what happens when using str.
It is possible to avoid realloc()s completely by preallocating the right size for the bytearray (with bytearray(size)), but that requires looping over the patch once first to know the new size, or using an estimate (the new manifest can't be larger than the size of the previous manifest + the size of the patch) and truncating later (although I'm not sure it's possible to truncate a bytearray without a realloc()). As a downside, this initializes the buffer with null bytes, which is a waste of time.
Another possibility is to reuse bytearrays previously allocated for previous manifests.
Yet another possibility is to accumulate the strings to append and use ''.join(). CPython is smart enough to create a single allocation for the total size in that case. That would be the most convenient solution, but see below.
It is possible to avoid the intermediate allocations and copies for substrings from the original manifest by using memoryview.
Unfortunately, you can't use ''.join() on a list of memoryviews before Python 3.4.

After modifying the code to implement the first and fifth items, memory usage during a git clone of mozilla-central looks like the following (with the python allocator enabled):

(Note this hasn't actually landed on the master branch yet)

Compared to what it looked like before, this is largely better. But that's not the only difference: the clone was also about 1000 seconds faster. That's more than 15 minutes! But that's not all so surprising when you know the volumes of data handled here. More insight about this coming in an upcoming post.

But while the changes outlined above make the glibc allocator behavior less likely to happen, it doesn't totally obliviate it. In fact, it seems it is still happening by the end of the manifest import phase. We're still allocating increasingly large temporary buffers because the size of the imported manifests grows larger and larger, and every one of them is the result of patching a previous one.

The only way to avoid those large allocations creating holes would be to avoid doing them in the first place. My first attempt at doing that, keeping manifests as lists of lines instead of raw buffers, worked, but was terribly slow. So slow, in fact, that I had to stop a clone early and estimated the process would likely have taken a couple days. Iterating over multiple generators at the same time, a lot, kills performance, apparently. I'll have to try with significantly less of that.

2017-03-22 15:57:46+0900

cinnabar, p.m.o | 4 Comments »

When the memory allocator works against you

Cloning mozilla-central with git-cinnabar requires a lot of memory. Actually too much memory to fit in a 32-bits address space.

I hadn't optimized for memory use in the first place. For instance, git-cinnabar keeps sha-1s in memory as hex values (40 bytes) rather than raw values (20 bytes). When I wrote the initial prototype, it didn't matter that much, and while close(ish) to the tipping point, it didn't require more than 2GB of memory at the time.

Time passed, and mozilla-central grew. I suspect the recent addition of several thousands of commits and files has made things worse.

In order to come up with a plan to make things better (short or longer term), I needed data. So I added some basic memory resource tracking, and collected data while cloning mozilla-central.

I must admit, I was not ready for what I witnessed. Follow me for a tale of frustrations (plural).

I was expecting things to have gotten worse on the master branch (which I used for the data collection) because I am in the middle of some refactoring and did many changes that I was suspecting might have affected memory usage. I wasn't, however, expecting to see the clone command using 10GB(!) memory at peak usage across all processes.

(Note, those memory sizes are RSS, minus "shared")

It also was taking an unexpected long time, but then, I hadn't cloned a large repository like mozilla-central from scratch in a while, so I wasn't sure if it was just related to its recent growth in size or otherwise. So I collected data on 0.4.0 as well.

Less time spent, less memory usage... ok. There's definitely something wrong on master. But wait a minute, that slope from ~2GB to ~4GB on the git-remote-hg process doesn't actually make any kind of sense. I mean, I'd understand it if it were starting and finishing with the "Import manifest" phase, but it starts in the middle of it, and ends long before it finishes. WTH?

First things first, since RSS can be a variety of things, I checked /proc/$pid/smaps and confirmed that most of it was, indeed, the heap.

That's the point where you reach for Google, type something like "python memory profile" and find various tools. One from the results that I remembered having used in the past is guppy's heapy.

Armed with pdb, I broke execution in the middle of the slope, and tried to get memory stats with heapy. SIGSEGV. Ouch.

Let's try something else. I reached out to objgraph and pympler. SIGSEGV. Ouch again.

Tried working around the crashes for a while (too long while, retrospectively, hindsight is 20/20), and was somehow successful at avoiding them by peaking at a smaller set of objects. But whatever I did, despite being attached to a process that had 2.6GB RSS, I wasn't able to find more than 1.3GB of data. This wasn't adding up.

It surely didn't help that getting to that point took close to an hour each time. Retrospectively, I wish I had investigated using something like Checkpoint/Restore in Userspace.

Anyways, after a while, I decided that I really wanted to try to see the whole picture, not smaller peaks here and there that might be missing something. So I resolved myself to look at the SIGSEGV I was getting when using pympler, collecting a core dump when it happened.

Guess what? The Debian python-dbg package does not contain the debug symbols for the python package. The core dump was useless.

Since I was expecting I'd have to fix something in python, I just downloaded its source and built it. Ran the command again, waited, and finally got a backtrace. First Google hit for the crashing function? The exact (unfixed) crash reported on the python bug tracker. No patch.

Crashing code is doing:


((f)->f_builtins != (f)->f_tstate->interp->builtins)

And (f)->f_tstate is NULL. Classic NULL deref.

Added a guard (assessing it wouldn't break anything). Ran the command again. Waited. Again. SIGSEGV.

Facedesk. Another crash on the same line. Did I really use the patched python? Yes. But this time (f)->f_tstate->interp is NULL. Sigh.

Same player, shoot again.

Finally, no crash... but still stuck on only 1.3GB accounted for. Ok, I know not all python memory profiling tools are entirely reliable, let's try heapy again. SIGSEGV. Sigh. No debug info on the heapy module, where the crash happens. Sigh. Rebuild the module with debug info, try again. The backtrace looks like heapy is recursing a lot. Look at %rsp, compare with the address space from /proc/$pid/maps. Confirmed. A stack overflow. Let's do ugly things and increase the stack size in brutal ways.

Woohoo! Now heapy tells me there's even less memory used than the 1.3GB I found so far. Like, half less. Yeah, right.

I'm not clear on how I got there, but that's when I found gdb-heap, a tool from Red Hat's David Malcolm, and the associated talk "Dude, where's my RAM?" A deep dive into how Python uses memory (slides).

With a gdb attached, I would finally be able to rip python's guts out and find where all the memory went. Or so I thought. The gdb-heap tool only found about 600MB. About as much as heapy did, for that matter, but it could be coincidental. Oh. Kay.

I don't remember exactly what went through my mind then, but, since I was attached to a running process with gdb, I typed the following on the gdb prompt:


gdb> call malloc_stats()

And that's when the truth was finally unvealed: the memory allocator was just acting up the whole time. The ouput was something like:


Arena 0:
system bytes    =  some number above (but close to) 2GB
in use bytes    =  some number above (but close to) 600MB

Yes, the glibc allocator was just telling it had allocated 600MB of memory, but was holding onto 2GB. I must have found a really bad allocation pattern that causes massive fragmentation.

One thing that David Malcolm's talk taught me, though, is that python uses its own allocator for small sizes, so the glibc allocator doesn't know about them. And, roughly, adding the difference between RSS and what glibc said it was holding to to the use bytes it reported somehow matches the 1.3GB I had found so far.

So it was time to see how those things evolved in time, during the entire clone process. I grabbed some new data, tracking the evolution of "system bytes" and "in use bytes".

There are two things of note on this data:

There is a relatively large gap between what the glibc allocator says it has gotten from the system, and the RSS (minus "shared") size, that I'm expecting corresponds to the small allocations that python handles itself.
Actual memory use is going down during the "Import manifests" phase, contrary to what the evolution of RSS suggests.

In fact, the latter is exactly how git-cinnabar is supposed to work: It reads changesets and manifests chunks, and holds onto them while importing files. Then it throws away those manifests and changesets chunks one by one while it imports them. There is, however, some extra bookkeeping that requires some additional memory, but it's expected to be less memory consuming than keeping all the changesets and manifests chunks in memory.

At this point, I thought a possible explanation is that since both python and glibc are mmap()ing their own arenas, they might be intertwined in a way that makes things not go well with the allocation pattern happening during the "Import manifest" phase (which, in fact, allocates and frees increasingly large buffers for each manifest, as manifests grow in size in the mozilla-central history).

To put the theory at work, I patched the python interpreter again, making it use malloc() instead of mmap() for its arenas.

"Aha!" I thought. That definitely looks much better. Less gap between what glibc says it requested from the system and the RSS size. And, more importantly, no runaway increase of memory usage in the middle of nowhere.

I was preparing myself to write a post about how mixing allocators could have unintended consequences. As a comparison point, I went ahead and ran another test, with the python allocator entirely disabled, this time.

Heh. It turns out glibc was acting up all alone. So much for my (plausible) theory. (I still think mixing allocators can have unintended consequences.)

(Note, however, that the reason why the python allocator exists is valid: without it, the overall clone took almost 10 more minutes)

And since I had been getting all this data with 0.4.0, I gathered new data without the python allocator with the master branch.

This paints a rather different picture than the original data on that branch, with much less memory use regression than one would think. In fact, there isn't much difference, except for the spike at the end, which got worse, and some of the noise during the "Import manifests" phase that got bigger, implying larger amounts of temporary memory used. The latter may contribute to the allocation patterns that throw glibc's memory allocator off.

It turns out tracking memory usage in python 2.7 is rather painful, and not all the tools paint a complete picture of it. I hear python 3.x is somewhat better in that regard, and I hope it's true, but at the moment, I'm stuck with 2.7. The most reliable tool I've used here, it turns out, is pympler. Or rebuilding the python interpreter without its allocator, and asking the system allocator what is allocated.

With all this data, I now have some defined problems to tackle, some easy (the spike at the end of the clone), and some less easy (working around glibc allocator's behavior). I have a few hunches as to what kind of allocations are causing the runaway increase of RSS. Coincidentally, I'm half-way through a refactor of the code dealing with manifests, and it should help dealing with the issue.

But that will be the subject of a subsequent post.

2017-03-12 10:47:12+0900

cinnabar, p.m.o | 5 Comments »