Author Archive

Detecting per-process namespaces

Back with per-process namespaces. This time, I will give several methods to more or less reliably detect their use, some requiring root access, some not. We will, though, limit ourselves to "mount namespaces", that is, not PID, UTS or network namespaces.

The first method, which is based on the contents of /proc/pid/mounts should work on any kernel supporting namespaces (since 2.4.19):

$ md5sum /proc/[0-9]*/mounts | awk -F'[ /]*' '{ ns[$1] = ns[$1] ? ns[$1] "," $3 : $3 } END { for (n in ns) { print n, ns[n] } }'

/proc/[0-9]* is used instead of /proc/* to avoid listing self, too. The awk script displays a summary : a md5 sum followed by the PID of all processes which /proc/self/mounts file has this md5 sum.
This method relies on the fact that the /proc/pid/mounts files may be different between namespaces, even when the same mount points are used because the ordering may differ. This method is obviously unreliable, except when all namespaces have different mount points. During my tests, it has been reliable to detect whether a process uses the same namespace as the init process, but different processes with different namespaces were reliably listed together as having the same namespace.

The next method, based on the contents of /proc/pid/mountinfo will work on any kernel implementing this file (since 2.6.26):

$ md5sum /proc/[0-9]*/mountinfo | awk -F'[ /]*' '{ ns[$1] = ns[$1] ? ns[$1] "," $3 : $3 } END { for (n in ns) { print n, ns[n] } }'

The script is essentially the same, only pointing at mountinfo instead of mounts.
This method is quite reliable, as /proc/pid/mountinfo does contain different information even when no mount point was changed within a namespace. Unfortunately, chrooted processes can have a different content while not being in a separate namespace. I've not investigated further, but you may be able to spot chrooted processes by comparing /proc/pid/mountinfo and /proc/pid/mounts.

Both the above methods can be executed as unprivileged user, which alleviates the fact they are not totally reliable.

The last method, on the other hand, is reliable, but requires root privileges, and cgroup filesystem support in the kernel (since 2.6.24).

$ mkdir -p /dev/cgroup/ns $ mount -t cgroup -o ns /dev/cgroup/ns $ find /dev/cgroup/ns -name tasks | xargs -l1 sed ':a;N;s/\n/,/g;ta'

The output here is a group of PID on separate lines for separate namespaces. The sed one-liner replaces all carriage returns in the tasks files with a comma.
You don't actually need root privileges for the last command, if the cgroup filesystem is already mounted. But there has been no normalization as to where the cgroup filesystems should be mounted yet, and they aren't mounted by default on most GNU/Linux distributions, I believe.

Cgroups (Control groups) allow to put sets of processes together to give them special behaviour (cpu provisioning, memory limitations, namespaces, etc.). They also allow to do so in a hierarchical way, such that a cgroup can inherit the properties of its parent cgroup.

As its name and the way it is mounted indicates, our /dev/cgroup/ns is specialized for namespaces cgroups. The directory itself contains a few files, most notably a tasks file containing a list of processes in the current set/cgroup (in our case, namespace). It also contains a subdirectory for each child namespace. Each of these subdirectories is similarly structured, so that it also contains a tasks list and possible subdirectories for its own children.

This means that not only can you spot different namespaces, but you can also spot how they relate to each other.

There are a few glitches, though. As with other cgroups subsystems, you can create new cgroups by creating a directory in the cgroup filesystem, but it happens not to create a new namespace in the case of the namespace subsystem. With all the kind of namespaces that now exist in Linux, this is not a big surprise, but there doesn't seem to be neither a feedback in the cgroup filesystem to know what kind of namespace is involved, nor a control to force creation of a really separate namespace. You also can't move processes from one mount namespace into another by writing a PID in the tasks file in a given namespace ; that just gives an "Operation not permitted" error.

On the other hand, contrary to the second method, chrooted processes are not separated out.

I will investigate further cgroups with PID, UTS and network namespaces and will post my findings.

2009-04-08 21:43:10+0900

p.d.o | Comments Off on Detecting per-process namespaces

How not to provide robust clustered storage with Linux and GFS

(The title is a bit strong, on purpose)

LWN links to an article describing how to provide robust clustered storage with Linux and GFS.

While explaining how to setup GFS can be nice, the incentive made me jump.

The author writes:

Load balancing is difficult; often we need to share file systems via NFS or other mechanisms to provide a central location for the data. While you may be protected against a Web server node failure, you are still sharing fate with the central storage node. Using GFS, the free clustered file system in Linux, you can create a truly robust cluster that does not depend on other servers. In this article, we show you how to properly configure GFS.

In case you don't know, GFS is not exactly a "clustered storage". It is more a "shared storage". You have one storage, and several clients accessing it. You have one storage array, compared to the NFS case, where you have one central server for the data. But what is a storage array except a special (and expensive) kind of server ? You don't depend on other servers, but you depend on other servers ? How is that supposed to be different ?

Conceptually, a clustered file system allows multiple operating systems to mount the same file system, and write to it at the same time. There are many clustered file systems available including Sun's Lustre, OCFS from Oracle, and GFS for Linux.

OCFS and GFS are the same class of file systems, but Lustre is definitely out of league and would, actually, provide a truly robust cluster that does not depend on other servers. Lustre is a truly clustered filesystem, that distributes data on several nodes such that losing some nodes don't make you lose access to the data.

With the incentive given by the author, and considering he lists Lustre as an example, I would actually have preferred an article about setting up Lustre.

2009-04-07 20:29:04+0900

miscellaneous, p.d.o | 13 Comments »

A revolution happening at the W3C ?

It seems undergoing discussion is leading towards a free (as in speech) HTML5 specification.

That would really be great.

2009-04-02 21:31:00+0900

miscellaneous, p.d.o | Comments Off on A revolution happening at the W3C ?

I’m weak

Not only have I not stopped working on WebKit, but I even touched the debian/copyright file :-/

Update: and now I even uploaded a new xulrunner package...

2009-03-27 22:52:22+0900

webkit, xulrunner | 2 Comments »

Orphaning packages

Since I am expected to spend more than half my packaging time updating the debian/copyright file, I am hereby orphaning nspr, nss, iceape and xulrunner. I am also stopping work on webkit and iceweasel, but they don't end up in the orphan state since they are comaintained.

Good luck to my fellow developpers. (And sorry, sincerely)

Update: As I do realize writing while being pissed doesn't help making the motives right, and apparently, some people have seen this as an extortion, let me make things clearer:

I was starting to work on xulrunner 1.9.1 when the discussion about the copyright files came up. It will require a significant amount of time, and while Noah Slater's opinion alone wouldn't really have carried me that far, despite me saying so because I got pissed by his words, 2 messages from JÃ¶rg Jaspert (the only ones he posted so far in the thread, by the way) did make it clear that my work on xulrunner 1.9.1 was going to be a waste of time, which I already lack to properly handle the bugs in my maintained packages, let alone keeping the copyright file up-to-date.

As I am obviously unable to handle the amount of work required to maintain big packages, as drawing new blood in the mozilla team has always failed so far, I just prefer to stop than to over-overflow. Call it extortion to get people in the mozilla team if you want, I'm fine with the notion.

I've been thinking to stop working on big packages for nearly a year already, but never mentionned it but to a few people in a few occasions. I couldn't resolve myself to do it, though I did reduce the amount of time I spend on these packages (I was overflown, a year ago). I just found an excuse to actually do it.

I must say I feel awkward now, and I still don't know if I will be able to keep this resolution.

As for the new copyright file format, with full licensing information and copyright holders list, I *did* try, on a significantly smaller piece of software than the mozilla packages, namely on Webkit, which is not really small either but still 6 times less files than xulrunner. I must say I hate to have to list copyright holders and file names with a passion, and the amount of time it takes. It is the main reason why there wasn't more uploads of WebKit svn snapshots in the archive...

Last but not least, thanks for the nice comments.

2009-03-21 18:54:03+0900

firefox, iceape, webkit, xulrunner | 16 Comments »

WebKit 1.1.3 in experimental

You may have noticed, or not, but WebKit 1.1.3, which has been released a few days ago, is available in experimental. The great news is that we now have a real maintenance team, because I now have a co-maintainer, who actually did most of the work getting the 1.1.x releases in shape for experimental.

Now, some JavaScript performance figures, as I have been doing for most WebKit releases I uploaded to the archive, with a recap of previous episodes:

svn 27674: 3932.0ms
svn 31842: 2612.2ms
release 1.0.1: 1658.4ms
release 1.1.3: 1333.2ms

All there tests have obviously been run on the same machine, under the same conditions. The machine in question is my x86-64 laptop, which means all these test are with a x86-64 binary.

This also means the last release, 1.1.3, doesn't take advantage of the Just In Time JavaScript compiler, which is only available on x86 binaries.

With the x86 binaries under a x86 personality chroot, I get under one second:

release 1.1.3, with JIT, x86: 985.4ms

But, in the last few days, I've been working on getting JIT first to build and then to work on x86-64 linux, and with the help of folks on the webkit-dev list, that just happened. And the result is just... wow.

release 1.1.3, with JIT, x86-64: 623.0ms

(Note that a few tests are actually slower than on x86)

That's so many times faster than what we had a year ago that it's almost unbelievable.

Expect the next upstream release, planned for some time in the near future, to have my patches applied. In the meanwhile, you can get them here and here.

2009-03-20 23:27:39+0900

webkit | Comments Off on WebKit 1.1.3 in experimental

ERROR: This RRD was created on other architecture

I can't believe we still have such things in widespread software in 2009... But it happened to me on the BTS graphs I kind of maintain on people.d.o. It so happens that ravel was upgraded to Lenny, and switched from x86 to x86-64 in the process...

Also, it's nice to see, once again, that the mailing list specially created to announce such changes, namely, debian-infrastructure-announce (which I still believe is useless as such announces should go to d-d-a) got no notification of this change...

2009-02-28 11:41:09+0900

p.d.o | 3 Comments »

The niceties of proprietary software

On a deployment I'm currently working on, I've seen two different cases of proprietary software use leading to both madness and sadness, which are just so typical that I can't resist to tell you. Keep in mind, for the rest of this reading, that the whole platform is running under Solaris 10 x86-64.

The first case is a data quality management software that we will keep anonymous. The editor promised before the deployment began that they had a version of the software for the operating system that we would be using. Well, it turned out they do have a Solaris version... for sparc, and a x86-64 version... for Linux. No Solaris x86-64 version, and no way to get a rebuild in a timely fashion.

The second case is a content management software that we will keep anonymous. It comes in the form of a java web application and a java application container. Part of integrating this software involves a proprietary plugin for Apache HTTPd that acts as a mix of mod_proxy_balancer, mod_disk_cache, and htcacheclean, as well as a cache invalidator.

Originally, the java web application was supposed to be installed within a JBoss Application Server instead of the editor provided container, and Apache HTTPd would reverse proxy requests to the JBoss server. This means we already had an Apache HTTPd in place (latest version ; 2.2.11 at the time), and since we have x86-64 processors, it was built as a 64-bits binary.

Contrary to the first case, this time, we had a Solaris x86 binary. Yes, you read correctly: x86 ; 32-bits only.

After going through the pain of rebuilding Apache HTTPd in 32-bits (there are various reasons why we don't use sunfreeware software), it turned out the module was a 2.0 ABI module not compatible with the 2.2 ABI. It also turned out there was a 2.2 ABI version of the module for Solaris... sparc.

It finally worked after another build of Apache HTTPd, a 2.0.63 release, this time.

The more you get used to free software, the more these kind of things get frustrating.

2009-02-20 23:01:44+0900

p.d.o | 3 Comments »

Testing small upgrades with namespaces and unionfs

If you are following this blog, you probably remember per-process namespaces. Today, I'm going to tell you how I did use them in the process of preparing this server to be upgraded to Lenny.

I must say I was not using the most recent stuff on this server, and this is why I needed such preparation. First, this server is still running with php4. And well, the following line in the apt-get dist-upgrade output got my full attention (emphasis mine):

The following packages will be REMOVED: Â Â cacti libapache2-mod-php4 libarchive-tar-perl libcurl3-openssl-dev Â Â libgssapi2 libpci2 librpm4 libssp0 linux-kernel-headers modutils php4-mysql

While this server is not important enough not to be broken for a couple hours, I do like to test procedures that could help on servers that are.

The first thing I needed to do on this server was obviously to upgrade php. But we all know how php applications are not fully compatible with all versions of php, so I also needed to test the upgrade was not breaking anything.

On a server you don't care much about, you can just upgrade, test if all is okay, and be done with it. Obviously, if all is not okay, your visitors will see it, and you may also have a bad time downgrading.

Another way to perform the upgrade is to have a similarly installed server on the side, test and validate the upgrade there, and replicate the upgrade on the production server if everything is fine. However, that does require additional resources, and possibly to setup them if they don't already exist.

A cheaper way to do the above is to do it on the production server, both in-place and on the side (you'll see what I mean), using unionfs and per-process namespaces. Full containers could be used instead of per-process namespaces (such as openvz, vserver or lxc), but they still require much more setup, especially when you don't use them in the first place. Chroots could work just as well as per-process namespaces, but one of the ideas here is to expose the per-process namespaces feature, and allow for improvement of this procedure with pid and network namespaces, which are not available in Etch (where I'm starting from), but are in Lenny.

Unionfs allows to merge several directories into a single one, accessing some read-only, and others read-write. Installing unionfs is as easy as running the following command:

apt-get install unionfs-modules-`uname -r`

I won't describe all the kinds of setups that are possible with unionfs, but only one typical use case, which is what we will be using here:

# mkdir /tmp/root-cow # mount -t unionfs -o dirs=/tmp/root-cow:/=ro none /mnt

The first thing we do here is to create an empty directory. Next, we merge it with the root filesystem (/), that we will keep read-only (meaning unionfs won't allow itself to write there), and mount the merged filesystem under /mnt. none could just be anything, as there is no device to be mounted.

The result, in /mnt, is just something that looks like the root filesystem:

# ls / bin boot dev etc home initrd lib media mnt opt proc root sbin srv sys tmp usr var # ls /mnt bin boot dev etc home initrd lib media mnt opt proc root sbin srv sys tmp usr var

But creating or modifying a file will do so in /tmp/root-cow:

# echo a > /mnt/a # cat /a cat: /a: No such file or directory # cat /tmp/root-cow/a a # echo foo.com > /mnt/etc/mailname # cat /etc/mailname glandium.org # cat /tmp/root-cow/etc/mailname foo.com # find /tmp/root-cow /tmp/root-cow /tmp/root-cow/etc /tmp/root-cow/etc/mailname /tmp/root-cow/a

Keep in mind that it doesn't include submounts, though:

# ls /var backups cache lib local lock log mail opt run spool tmp www # ls /mnt/var

This means we'll have to also mount /var, /proc, /dev, and /sys.

That's enough testing for now, and we'll first do some cleanup before going after the real job:

# umount /mnt # rm -rf /tmp/root-cow

Using the newns utility from my first post on per-process namespaces, let's create a new namespace to keep our testing private, and populate it with the necessary mount points:

# newns # umount /tmp # mount -t tmpfs tmpfs /tmp # mkdir /tmp/root-cow /tmp/var-cow # mount -t unionfs -o dirs=/tmp/root-cow:/=ro none /mnt # mount -t unionfs -o dirs=/tmp/var-cow:/var=ro none /mnt/var # cd /mnt # pivot_root . mnt # mount --move /mnt/proc /proc # mount --move /mnt/sys /sys # mount --move /mnt/lib/init/rw /lib/init/rw # mount --move /mnt/tmp /tmp # mount --move /mnt/dev /dev

The second and third statements are useful to avoid sharing /tmp with the main namespace, which means the directories we create in the fourth statement won't be visible in /tmp outside this namespace.

The fifth and sixth statements put the union filesystems in place. As my system only has a separate /var filesystem (no /usr, and I don't care about /home here), I only need to setup these two. Feel free to add more union filesystems as necessary.

The pivot_root call allows to switch to the unionfs'ed root: everything under /mnt (/mnt and /mnt/var in our case) will be remounted under /, while what was under / is remounter under /mnt under the new root.
This means, in our case, that we have / and /var as union filesystems, while the old / and /var are respectively in /mnt and /mnt/var.

It also means /dev, /proc, /sys and other filesystems are remounted as /mnt/dev, /mnt/proc, /mnt/sys, etc., which is why we next mount --move all of them to a better place in our namespace.

Once all this setup is done, we are ready to do our php upgrade test. As Etch doesn't have support for neither PID namespaces nor network namespaces, we'll still have some conflicts with the main namespace for TCP port binding and process handling, so we need to be a little careful:

# rm /var/run/apache2.pid # echo Listen 8080 > /etc/apache2/ports.conf

With these changes, we can now safely start apache2 in this namespace, at the same time the main apache2 runs in the main namespace:

# /etc/init.d/apache2 start

Now, there are actually 2 problems showing up in apache in this setup.

The first one is that displaying static files doesn't work. At all. It appears unionfs doesn't support sendfile(). Which is fine. But apache doesn't check for sendfile()'s error cases and doesn't fallback to a working solution when sendfile() doesn't work. So we have to manually disable it:

# echo EnableSendfile off > /etc/apache2/conf.d/sendfile.conf

The second one is that access to the mysql socket doesn't work properly under the unionfs. I didn't want to investigate further, so I only worked around the issue by forcing to use the socket outside the unionfs:

# mount --bind /mnt/var/run/mysqld /var/run/mysqld

Finally, we can test and validate our php upgrade:

# apt-get install php5-mysql php5-cli php5-snmp libapache2-mod-php5 # /etc/init.d/apache2 restart # apt-get remove --purge php4-common

Note that installing libapache2-mod-php5 will remove libapache2-mod-php4, and apache2 gets restarted, taking into account this change. But the php modules (php5-mysql and php5-snmp) install is only going to happen after that, and no apache2 restart is triggered then, which leaves a half working php setup...
Also note that the cacti setup in etch supposes php4 is installed, using a IfModule statement against mod_php4.c and none for mod_php5.c, which means a part of its setup doesn't work out of the box (most notably, the DirectoryIndex).

Once all is validated, all we need to do is to stop apache2 and exit from the namespace. The union filesystems and temporary filesystems are then automagically cleaned-up and the namespace disappears as well as all modifications we just did, since all processes using the namespace ended. We are now free to upgrade on the production server, as we know all the side effects.

I'll now be able to upgrade to Lenny without php getting removed.

2009-02-15 18:25:04+0900

p.d.o | 1 Comment »

SSH using a SOCKS or HTTP proxy

If you follow planet debian, you may already know about the ProxyCommand directive in $HOME/.ssh/config. It allows OpenSSH to connect to a remote host through a given command.

One setup that I use a lot is to have connections be established through a SOCKS proxy. Until today, I was using connect, a small tool written by Shun-ichi GotÃ´. The typical setup I used is:

Host *.mydomain.com ProxyCommand connect -S socksserver:1080 %h %p

I also use jump hosts occasionally, with a setup like this:

Host somehost.mydomain.com ProxyCommand ssh otherhost.mydomain.com nc -w1 %h %p

And today I discovered that netcat-openbsd does support connexions through a proxy, either SOCKS or HTTP. Why keep using two different tools when you can use one? ;) So I changed my setup to:

Host *.mydomain.com ProxyCommand nc -xsocksserver:1080 -w1 %h %p

The default is to use SOCKS 5, add -X4 for SOCKS 4 and -Xconnect for HTTP CONNECT proxies. Note that it doesn't support choosing which end does the name resolutions like connect does with the -R option.

2009-01-27 21:29:53+0900

p.d.o | Comments Off on SSH using a SOCKS or HTTP proxy