Archive for February, 2008

Testing the crazy idea

MadCoder had doubts about the crazy idea, so I took a little time to test it on a package I maintain, namely xulrunner.

First, take all the deb files in the history of the package (at least, everything that is available on snapshot.debian.net today).

wget -O - -q http://snapshot.debian.net/archive/pool/x/xulrunner/binary-amd64/Packages.gz | gzip -cd > list
awk '/^Filename:/{print $2}' list | xargs -I{} wget http://snapshot.debian.net/archive/{}

Next, commit all of these into a separate repository per package name:

perl -e 'use Dpkg::Version qw(vercmp); sub v { my $f = $_[0]; $f =~ s/.*_(.*)_.*/$1/; $f } print sort { vercmp(v($a), v($b)); } map { s/^Filename: .*\///; $_ } grep { /^Filename:/ } <>;' list | while read f; do
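    pkg=${f%_*_*}    # strip _version_arch.deb to get the package name
    [ ! -d "$pkg" ] && mkdir "$pkg" && ( cd "$pkg" ; git init )
    cd "$pkg"
    ar -x "../$f"    # yields debian-binary, control.tar.gz and data.tar.gz
    mkdir data control
    tar -C data -zxf data.tar.gz
    tar -C control -zxf control.tar.gz
    git add data control
    git commit -q -m "$f"
    # empty the working tree so that only .git remains to be measured
    rm -rf data control data.tar.gz control.tar.gz debian-binary
    cd ..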
done

Finally, for each package, compare three sizes: all the .deb files together, their content imported into git (only the .git directory, including the index; some space could be gained by removing it), and the "optimized" git repository (after git gc, without changing the delta depth or window size, which might improve the result further).

awk '/^Package:/{print $2}' list | sort -u | while read p; do
    du -c --si ${p}_*.deb | tail -1
    du -s --si $p
    cd $p; git gc ; cd ..
    du -s --si $p
    echo
done

17M total
16M libmozjs0d
4.7M libmozjs0d

34M total
31M libmozjs0d-dbg
14M libmozjs0d-dbg

4.7M total
7.1M libnspr4-0d
1.3M libnspr4-0d

9.2M total
12M libnspr4-0d-dbg
2.8M libnspr4-0d-dbg

25M total
28M libnss3-0d
4.6M libnss3-0d

81M total
61M libnss3-0d-dbg
21M libnss3-0d-dbg

15M total
16M libnss3-tools
3.1M libnss3-tools

288M total
265M libxul0d
146M libxul0d

2.0G total
1.8G libxul0d-dbg
1.1G libxul0d-dbg

5.1M total
6.8M python-xpcom
1.1M python-xpcom

2.5M total
5.6M spidermonkey-bin
861k spidermonkey-bin

13M total
14M xulrunner
2.0M xulrunner

3.3M total
5.9M xulrunner-gnome-support
979k xulrunner-gnome-support

So these packages, stored in git, take between 15 and roughly 50 percent of the .deb size, which may be a nice improvement. It would be interesting to know how these numbers evolve over time. Some files, such as the changelog.Debian.gz files, would also benefit from being stored as plain text instead of in gzipped form.
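The percentages can be recomputed from the du figures above. This is only a minimal sketch: to_bytes and ratio are helper names made up here, and since du --si rounds what it prints, the results are approximate.

```shell
# to_bytes: convert a `du --si` style size (861k, 4.7M, 1.1G) to bytes.
# Hypothetical helper, not part of the original test.
to_bytes() {
    awk -v s="$1" 'BEGIN {
        n = s + 0                   # awk keeps the leading numeric part
        u = substr(s, length(s))    # last character is the SI unit, if any
        m = 1
        if (u == "k") m = 1e3
        else if (u == "M") m = 1e6
        else if (u == "G") m = 1e9
        printf "%.0f\n", n * m
    }'
}

# ratio: first size as a rounded percentage of the second one.
ratio() {
    awk -v a="$(to_bytes "$1")" -v b="$(to_bytes "$2")" \
        'BEGIN { printf "%.0f\n", 100 * a / b }'
}

ratio 2.0M 13M     # xulrunner, after git gc vs .deb total: prints 15
ratio 1.1G 2.0G    # libxul0d-dbg: prints 55
```

The two calls use the smallest and largest ratios in the list, which is where the "15 to roughly 50 percent" range comes from.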

Note that git gc took a while and a lot of memory for libxul0d-dbg. Also note that these numbers don't include the delta files that would be necessary to recreate the original .deb files, but this shouldn't make a huge difference.

2008-02-24 22:42:11+0900


Crazy ideas

I often have a bunch of somewhat crazy ideas but no time to test or implement them, which is sad. So, in case these crazy ideas scratch someone's itch, I'm going to throw them into the wild.

I've been using git for a few months now, not only for source code management but for efficient storage, too. *VERY* efficient. I'll have to write about that some day.

Anyway, while installing pristine-tar today, I thought it would be neat to have an equivalent pristine-deb to store deb files efficiently. I'm pretty sure someone else has thought of this possibility, but it's still better that such ideas come to the ears (eyes, actually) of someone who could implement them.

Such a pristine-deb tool could be used to... store packages from snapshot.debian.net. That would reduce the amount of space required for the archive dramatically, IMHO. I'm pretty sure old packages are not requested that much, so they could be generated on the fly by a CGI script answering GET requests, so that URLs wouldn't change.

The same could probably be applied to archive.debian.org. It could even save enough space that archive.debian.org could host snapshot.debian.net. But that depends on the average package content and its average evolution, which I have absolutely no idea about.

Update: It would also be interesting to have the .diff.gz files in there, too; that would obviously make it easy to view contents such as copyright files, changelogs, and the other bits of information available on packages.debian.org.

Update 2: Actually, pristine-deb would be as easy as storing two pristine-tars (one for control.tar.gz and one for data.tar.gz) and a debian-binary file. The .deb can then be reassembled with:

ar -rc file.deb debian-binary control.tar.gz data.tar.gz
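To make the idea a bit more concrete, here is a rough sketch of the split and reassemble steps around that ar command. deb_split and deb_join are names invented for this illustration, not an existing tool, and the member names are assumed to be the three listed above. The rebuilt file will generally not be byte-identical to the original, since ar records timestamps and ownership for each member; a real pristine-deb would have to store that difference as well.

```shell
# deb_split DEB DIR: unpack the ar members of DEB into DIR.
# Hypothetical helper; assumes the three member names used in this post.
deb_split() {
    case "$1" in
        /*) deb=$1 ;;         # already an absolute path
        *)  deb=$PWD/$1 ;;    # make it absolute before changing directory
    esac
    mkdir -p "$2" && ( cd "$2" && ar -x "$deb" )
}

# deb_join DIR DEB: rebuild DEB from the members found in DIR.
# ar stores only the base name of each member, and debian-binary
# has to come first in the archive.
deb_join() {
    ar -rc "$2" "$1/debian-binary" "$1/control.tar.gz" "$1/data.tar.gz"
}
```

With some package file.deb at hand, "deb_split file.deb parts" followed by "deb_join parts rebuilt.deb" gives a .deb with the same three members in the same order; the two tarballs in parts are what would be handed to pristine-tar.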

2008-02-24 12:52:57+0900


Obligatory FOSDEM post

I'm NOT going to FOSDEM

2008-02-23 12:55:26+0900
