Git now faster than Mercurial to clone Mozilla Mercurial repos
How is that for clickbait?
With the now released git-cinnabar 0.5.2, the cinnabarclone feature is enabled by default, which means it doesn't need to be enabled manually anymore.
Cinnabarclone is to git-cinnabar what clonebundles is to Mercurial (to some extent). Clonebundles allow Mercurial to download a pre-generated bundle of a repository, which reduces work on the server side. Similarly, Cinnabarclone allows git-cinnabar to download a pre-generated bundle of the git form of a Mercurial repository.
Thanks to Connor Sheehan, who deployed the necessary extension and configuration on the server side, cinnabarclone is now enabled for mozilla-central and mozilla-unified, making git-cinnabar clone faster than ever for these repositories. In fact, under some conditions (mostly depending on network bandwidth), cloning with git-cinnabar is now faster than cloning with Mercurial:
$ time git clone hg::https://hg.mozilla.org/mozilla-unified mozilla-unified_git
Cloning into 'mozilla-unified_git'...
Fetching cinnabar metadata from https://index.taskcluster.net/v1/task/github.glandium.git-cinnabar.bundle.mozilla-unified/artifacts/public/bundle.git
Receiving objects: 100% (12153616/12153616), 2.67 GiB | 41.41 MiB/s, done.
Resolving deltas: 100% (8393939/8393939), done.
Reading 172 changesets
Reading and importing 170 manifests
Reading and importing 758 revisions of 570 files
Importing 172 changesets
It is recommended that you set "remote.origin.prune" or "fetch.prune" to "true".
git config remote.origin.prune true
or
git config fetch.prune true
Run the following command to update tags:
git fetch --tags hg::tags: tag "*"
Checking out files: 100% (279688/279688), done.
real 4m57.837s
user 9m57.373s
sys 0m41.106s
$ time hg clone https://hg.mozilla.org/mozilla-unified
destination directory: mozilla-unified
applying clone bundle from https://hg.cdn.mozilla.net/mozilla-unified/5ebb4441aa24eb6cbe8dad58d232004a3ea11b28.zstd-max.hg
adding changesets
adding manifests
adding file changes
added 537259 changesets with 3275908 changes to 523698 files (+13 heads)
finished applying clone bundle
searching for changes
adding changesets
adding manifests
adding file changes
added 172 changesets with 758 changes to 570 files (-1 heads)
new changesets 8b3c35badb46:468e240bf668
537259 local changesets published
updating to branch default
(warning: large working directory being used without fsmonitor enabled; enable fsmonitor to improve performance; see "hg help -e fsmonitor")
279688 files updated, 0 files merged, 0 files removed, 0 files unresolved
real 21m9.662s
user 21m30.851s
sys 1m31.153s
To be fair, the Mozilla Mercurial repos also have faster "streaming" clonebundles, which are currently only prioritized automatically if the client is on AWS, because they are much larger and could take longer to download. But you can opt in with the --stream command line argument:
$ time hg clone --stream https://hg.mozilla.org/mozilla-unified mozilla-unified_hg
destination directory: mozilla-unified_hg
applying clone bundle from https://hg.cdn.mozilla.net/mozilla-unified/5ebb4441aa24eb6cbe8dad58d232004a3ea11b28.packed1.hg
525514 files to transfer, 2.95 GB of data
transferred 2.95 GB in 51.5 seconds (58.7 MB/sec)
finished applying clone bundle
searching for changes
adding changesets
adding manifests
adding file changes
added 172 changesets with 758 changes to 570 files (-1 heads)
new changesets 8b3c35badb46:468e240bf668
updating to branch default
(warning: large working directory being used without fsmonitor enabled; enable fsmonitor to improve performance; see "hg help -e fsmonitor")
279688 files updated, 0 files merged, 0 files removed, 0 files unresolved
real 1m49.388s
user 2m52.943s
sys 0m43.779s
If you're using Mercurial and can download 3GB in less than 20 minutes (in other words, if you can download faster than 2.5MB/s), you're probably better off with the streaming clone.
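The break-even arithmetic behind that rule of thumb can be sketched in a few lines of Python. The function name is made up for illustration, and the inputs are the rounded figures from the paragraph above (a bundle of about 3GB, about 20 minutes for the non-streaming clone), not new measurements. It also assumes, simplistically, that the download is the only thing that matters for the streaming clone.

```python
def break_even_mb_per_s(bundle_gb: float, other_clone_minutes: float) -> float:
    """Bandwidth (MB/s) at which downloading the bundle alone takes as long
    as the other clone method takes in total. Hypothetical helper."""
    bundle_mb = bundle_gb * 1000          # treating 1GB as 1000MB
    other_clone_seconds = other_clone_minutes * 60
    return bundle_mb / other_clone_seconds


# Rounded figures from the text: ~3GB streaming bundle vs. ~20 minute clone.
print(break_even_mb_per_s(3, 20))  # 2.5
```

Anything above that bandwidth tips the balance toward the streaming clone; the real threshold is a little fuzzier, since the streaming clone also spends time on things other than downloading.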
Bonus fact: the Git clone is smaller than the Mercurial clone
The Mercurial streaming clone bundle contains data in a form close to what Mercurial puts on disk in the .hg directory, meaning the size of .hg is close to that of the clone bundle. The Cinnabarclone bundle contains a git pack, meaning the size of .git is close to that of the bundle, plus some more for the pack index file that unbundling creates.
The amazing fact is that, to my own surprise, the git pack, containing the repository contents along with all git-cinnabar needs to recreate Mercurial changesets, manifests and files from the contents, takes less space than the Mercurial streaming clone bundle.
And that translates into local repository size:
$ du -h -s --apparent-size mozilla-unified_hg/.hg
3.3G mozilla-unified_hg/.hg
$ du -h -s --apparent-size mozilla-unified_git/.git
3.1G mozilla-unified_git/.git
And because Mercurial creates so many files (essentially, two per file that ever was in the repository), there is a larger difference in block size used on disk:
$ du -h -s mozilla-unified_hg/.hg
4.7G mozilla-unified_hg/.hg
$ du -h -s mozilla-unified_git/.git
3.1G mozilla-unified_git/.git
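For readers wondering what the two du invocations actually compare, here is a rough Python sketch. It is only an approximation of du (it ignores directory entries themselves and counts hard-linked files more than once), the function name is hypothetical, and st_blocks is only available on Unix-like systems. The apparent size is the sum of file lengths, while the on-disk size is the number of 512-byte blocks the filesystem allocated, which is where lots of small files hurt.

```python
import os
import sys


def apparent_and_disk_size(root):
    """Return (apparent_bytes, allocated_bytes) for all files under root.

    apparent_bytes sums file lengths (roughly `du --apparent-size`);
    allocated_bytes sums the blocks actually allocated (roughly plain `du`).
    """
    apparent = 0
    allocated = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            apparent += st.st_size
            allocated += st.st_blocks * 512  # st_blocks is in 512-byte units
    return apparent, allocated


if __name__ == "__main__":
    # e.g. python sizes.py mozilla-unified_hg/.hg mozilla-unified_git/.git
    for path in sys.argv[1:]:
        apparent, allocated = apparent_and_disk_size(path)
        print(f"{path}: apparent={apparent} allocated={allocated}")
```

On a typical filesystem with 4KiB blocks, a file of a few bytes still occupies a whole block, so a directory with hundreds of thousands of small files shows a much larger gap between the two numbers than a directory holding one big pack.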
It's even more mind blowing when you consider that Mercurial happily creates delta chains of several thousand revisions, when the git pack's longest delta chain is 250 (set arbitrarily at pack creation, by which I mean I didn't pick a larger value because it didn't make a significant difference). For the casual readers, Git and Mercurial try to store object revisions as a diff/delta from a previous object revision because that takes less space. You get a delta chain when that previous object revision itself is stored as a diff/delta from another object revision itself stored as a diff/delta ... etc.
My guess is that the difference is mainly caused by the use of line-based deltas in Mercurial, but some Mercurial developer should probably take a deeper look. The fact that Mercurial cannot delta across file renames is another candidate.
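To make the delta chain idea from above a bit more concrete, here is a toy Python sketch. It is emphatically not the storage format of either Git or Mercurial: the delta here is a naive "common prefix, common suffix, replaced middle" triple, and the class and function names are invented for the example. What it does show is how each revision is stored against the previous one, how reading a revision means walking back to a full snapshot, and how a maximum chain length (like the 250 used for the git pack) forces a fresh full snapshot now and then.

```python
def make_delta(base: bytes, target: bytes):
    """Toy delta: keep the common prefix and suffix, store only the middle."""
    limit = min(len(base), len(target))
    prefix = 0
    while prefix < limit and base[prefix] == target[prefix]:
        prefix += 1
    suffix = 0
    while (suffix < limit - prefix
           and base[len(base) - 1 - suffix] == target[len(target) - 1 - suffix]):
        suffix += 1
    return prefix, suffix, target[prefix:len(target) - suffix]


def apply_delta(base: bytes, delta) -> bytes:
    prefix, suffix, middle = delta
    return base[:prefix] + middle + base[len(base) - suffix:]


class DeltaStore:
    """Stores revisions as deltas against the previous revision, starting a
    new full snapshot whenever the chain would exceed max_chain deltas."""

    def __init__(self, max_chain: int = 250):
        self.max_chain = max_chain
        self.entries = []  # ("full", data) or ("delta", base_index, delta)
        self.depths = []   # number of deltas to apply to rebuild each entry

    def add(self, data: bytes) -> int:
        if self.entries and self.depths[-1] < self.max_chain:
            base_index = len(self.entries) - 1
            delta = make_delta(self.get(base_index), data)
            self.entries.append(("delta", base_index, delta))
            self.depths.append(self.depths[-1] + 1)
        else:
            self.entries.append(("full", data))
            self.depths.append(0)
        return len(self.entries) - 1

    def get(self, index: int) -> bytes:
        chain = []
        while self.entries[index][0] == "delta":
            _, base_index, delta = self.entries[index]
            chain.append(delta)
            index = base_index
        data = self.entries[index][1]
        for delta in reversed(chain):  # replay from the snapshot forward
            data = apply_delta(data, delta)
        return data
```

The trade-off is visible in get(): the longer the chain, the more deltas have to be replayed to read one revision, which is why a cap exists at all. Longer chains save space only as long as each extra delta is smaller than a fresh snapshot.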
2019-07-02 10:06:50+0900
2019-07-02 17:34:20+0900
Is #883 relevant to this? See https://www.mercurial-scm.org/wiki/RenameSpaceSavingPlan
2019-07-02 17:47:10+0900
@Faheem: that’s a possibility, cf. the last sentence in the post. https://www.mercurial-scm.org/wiki/PackedRepoPlan is also relevant, I guess.
2019-07-03 00:26:59+0900
Great work!
I’m curious about the repo size. Back when I last measured this, ignoring per-file overhead, Mercurial was able to produce a much smaller repository. I was never able to get ‘git repack’ to produce anything close – and I once burned several dozen CPU hours with very aggressive window length and depth sizes to try!
I chalked up the difference to the “sorting algorithm” being used for delta compression. Mercurial groups all data by the filename, essentially. Git does something much more complex. I recall git-cinnabar using its own logic for generating packfiles – one based more closely on Mercurial’s object ordering/sorting semantics. So if you are achieving smaller packfile sizes with custom Git object ordering than you do with ‘git repack’, that would be an interesting result indeed and would seemingly provide evidence that Git’s default packfile algorithm is unnecessarily complex in some circumstances!
This is awesome work.
2019-07-03 09:27:53+0900
@Gregory IIRC, the “sorting algorithm” kind of relies on the window size covering up its limitations, but large repositories like Mozilla’s would need very large window sizes to be able to do anything. git-cinnabar just creates a pack with objects in the order they come in from Mercurial, which means files are close together and diffed against similar content in a much more reliable manner than git repack would normally do.
2019-07-03 10:20:52+0900
Git’s “sorting algorithm” is kinda crazy. Read https://github.com/git/git/blob/master/Documentation/technical/pack-heuristics.txt. That file hasn’t been touched in years and it is quite possible the actual implementation has drifted a bit. It is substantially more complex than Mercurial’s approach. And at least in the case of the Mozilla repository, it seems that Git’s “sorting algorithm” yields worse results than Mercurial! So basically git-cinnabar achieves its smaller repository sizes (ignoring inode/block overhead) because it is using Mercurial’s object sorting instead of Git’s! I bet if you run git repack -a -f -d (the key flag being -f to prevent delta reuse), the Git repository will swell in size.
2019-07-04 07:44:33+0900
@Gregory: Surprisingly, git repack -a -f -d does not do as badly as I was expecting from knowing the size of the Mozilla repositories on github. But git gc --aggressive does do wonders:
Original pack from cinnabarclone: 2.88GB
After git repack -a -f -d: 3.56GB
After git gc --aggressive: 2.56GB
For reference:
Original pack from a clone of https://github.com/mozilla/gecko: 4.73GB
After git repack -a -f -d: 2.67GB
After git gc --aggressive: 2.05GB
The latter includes CVS history that the cinnabar clone doesn’t have, but doesn’t include the cinnabar metadata that the cinnabar clone has.
Edit: since I wrote the above, a Github engineer kindly repacked https://github.com/mozilla/gecko and https://github.com/mozilla/gecko-dev on the server side, so they clone to a much smaller size now (about as small as after git gc --aggressive, as a matter of fact). I shall actually do the same on the cinnabarclone packs.