Author Archive

Debian becoming loony again

As happens every time a release approaches, Debian is turning into a loony. There must be some combination of stress, accumulated tensions between people, whatever, but the net result is that Debian turns crazy and tries its best not to release.

So, instead of trying to fix some of the symptoms by implementing crazy stuff such as pseudo-moderated mailing lists or codes of conduct, let's just fix the root cause: have Debian never release again.

2008-12-22 21:12:27+0900

debian | 7 Comments »

Useless use of …

(Inaugurating a new category which will be kind of my own daily wtf)

A pretty well known matter is how some tools can be abused. Usually, it is cat or grep. Typical examples include the following two:

cat foo | grep bar

grep bar foo | awk '{print $2}'

Obviously, the first one can be reduced to:

grep bar foo

and the second, to:

awk '/bar/ {print $2}' foo

awk and sed are often part of the reason grep gets abused, because a lot of people don't know the latter form above or its sed equivalent. They don't know they could restrict the match on a specific column, too, which could be important in some cases.
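
As a minimal sketch of that last point, assuming a made-up two-column file (user name, then shell), the difference between matching anywhere on the line and matching in one column only looks like this. The function names and the sample data are purely illustrative:

```shell
#!/bin/sh
# Matches "bar" anywhere on the line, like: grep bar FILE | awk '{print $2}'
any_column() { awk '/bar/ {print $2}' "$1"; }

# Matches "bar" in the first column only.
first_column() { awk '$1 ~ /bar/ {print $2}' "$1"; }

# sed form of "grep bar FILE | sed ...": print the matching lines with
# their first space-separated column stripped.
sed_form() { sed -n '/bar/s/^[^ ]* //p' "$1"; }

# Hypothetical sample data.
sample=$(mktemp)
printf '%s\n' 'foobar /bin/sh' 'alice /usr/bin/bar' 'carol /bin/dash' > "$sample"

any_column "$sample"     # prints /bin/sh and /usr/bin/bar
first_column "$sample"   # prints /bin/sh only
sed_form "$sample"       # prints /bin/sh and /usr/bin/bar
rm -f "$sample"
```

The second line of the sample only matches because "bar" appears in its second column, which is exactly the kind of false positive a column-restricted match avoids.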

Anyway, a few months ago, I saw the most beautiful short piece of combined useless use of something I've ever seen, which also happens to be a pretty unusual type:

more $file | wc | awk '{print $1}'

Not only does it use more where even cat, while still being useless, would be a better choice, but it also manages to use awk where wc could take an option to achieve the same goal.

Yes, the above can be simplified as:

wc -l < $file

We never found who was responsible for this nicety.

2008-12-20 11:17:51+0900

wtf | 8 Comments »

Shared subtrees and per-process namespaces

Now that we have seen what per-process namespaces and shared subtrees are and how to operate them, we can try to combine these two features.

We'll be using our newns tool from this earlier post to create new namespaces. And for practical reasons, let's say you have a terminal with a 1$ prompt and a second terminal with a 2$ prompt (that will allow me to skip "go to the second terminal" phrases).

In the kernel, per-process namespaces are a bit like bind mounts, such that shared subtrees work with namespaces like they do with bind mounts.

As with standard bind mounts, the default shared subtree mode is private, which means mounts done in either namespace will be private to that namespace. Only mount points active at the time the new namespace is created will be in both namespaces:

1$ ./newns
1$ mount /dev/sda1 /mnt
1$ ls /mnt
config-2.6.26-1-amd64 grub initrd.img-2.6.26-1-amd64 System.map-2.6.26-1-amd64 vmlinuz-2.6.26-1-amd64
2$ ls /mnt
2$ mount /dev/sda1 /cdrom
2$ ls /cdrom
config-2.6.26-1-amd64 grub initrd.img-2.6.26-1-amd64 System.map-2.6.26-1-amd64 vmlinuz-2.6.26-1-amd64
1$ ls /cdrom
1$ umount /mnt
1$ exit
# Exit from newns
2$ umount /cdrom
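
When juggling two terminals like this, it helps to check whether two shells really live in different namespaces. The following is a minimal sketch that assumes a kernel exposing /proc/PID/ns/mnt (much newer than the 2.6.26 used in these examples) and a readlink command; the function names are made up for the illustration:

```shell
#!/bin/sh
# mnt_ns PID: print the mount namespace identifier of a process,
# something of the form mnt:[NUMBER].
mnt_ns() { readlink "/proc/$1/ns/mnt"; }

# same_mnt_ns PID1 PID2: succeed if both processes share a mount namespace.
same_mnt_ns() {
  a=$(mnt_ns "$1") && b=$(mnt_ns "$2") && [ "$a" = "$b" ]
}
```

From the second terminal, something like same_mnt_ns $$ PID, with PID being the shell started by newns, would be expected to fail, since that shell has its own namespace. Reading the entry for another user's process generally requires root.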

shared mode allows both namespaces to share subsequent mounts. As with bind mounts, and for obvious reasons, you need to change the subtree mode before creating the new namespace:

1$ mount --make-rshared /
1$ ./newns
1$ mount --bind /usr /mnt
1$ ls /mnt
bin games include lib lib32 lib64 local sbin share src X11R6
2$ ls /mnt
bin games include lib lib32 lib64 local sbin share src X11R6

As with shared bind mounts, the new mount point can be unmounted from either namespace:

2$ umount /mnt
1$ ls /mnt
1$ exit
# Exit from newns

A slave subtree will have mount points under its master (shared) subtrees propagated, while propagation won't happen in the other direction. Again, very much like bind mounts:

1$ mount --make-rshared / # For completeness, we already did that earlier
1$ ./newns
1$ mount --make-rslave /
1$ mount --bind /usr /mnt
1$ ls /mnt
bin games include lib lib32 lib64 local sbin share src X11R6
2$ ls /mnt
2$ mount --bind /usr /cdrom
2$ ls /cdrom
bin games include lib lib32 lib64 local sbin share src X11R6
1$ ls /cdrom
bin games include lib lib32 lib64 local sbin share src X11R6
1$ exit
# Exit from newns
2$ umount /cdrom

Contrary to shared and slave, unbindable doesn't add value when used in two different namespaces. This is only something that will have impact on the current namespace:

1$ mount --make-rshared / # For completeness, we already did that earlier
1$ ./newns
2$ mount --make-runbindable /
2$ mount --bind /usr /mnt
mount: wrong fs type, bad option, bad superblock on /usr,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail or so
1$ mount --bind /usr /mnt
1$ ls /mnt
bin games include lib lib32 lib64 local sbin share src X11R6
1$ exit
# Exit from newns
1$ mount --make-rprivate / # Set back to default

Now that we've seen what can be done with namespaces and shared subtrees, let's see what nice features can be implemented with both.

As Russell revealed, pam-namespace is able to polyinstantiate some directories following rules set in /etc/security/namespace.conf (see namespace.conf(5)). The sad thing is that it apparently can't just create a new namespace without polyinstantiating a directory, which could be useful if you only want separate namespaces, but no polyinstantiated directory.

Russell's recipe goes as follows:

  • At boot time, / is put in shared mode and /tmp in private mode.
  • When opening a session, pam will create a new namespace, and bind mount /tmp-inst/$USER to /tmp.

What this means is that in the newly created namespace, a user reading from /tmp will actually be reading from /tmp-inst/$USER without knowing it. Also, if root mounts something in the parent namespace, it will be propagated (/ being shared) to the user namespace. This means that mounting USB storage, for example, will be propagated to the user namespace. This also means that something mounted from the user namespace will be propagated to the parent namespace. In both cases, this only applies to mounts occurring outside of /tmp; mounts under /tmp don't get propagated.

Note that without setting /tmp as private, when PAM mounts /tmp-inst/$USER, the mount would propagate as well, setting /tmp to /tmp-inst/$USER for everyone. So setting /tmp as private is mandatory.

There is, however, a flaw in Russell's recipe: if someone mounts something under a submount of /, for example under /var when /var is a separate mount, the new mount won't be propagated to a user who already had a session open. For that to work, you have to use --make-rshared instead of --make-shared in Russell's recipe.

A nice setup that can be built with all of this is the following:

Add the following to /etc/security/namespace.conf:

/tmp tmpfs tmpfs root

Add the following to /etc/pam.d/common-session:

session required pam_namespace.so

Up to this point, it is pretty much the same as Russell's, except /tmp is a tmpfs instead of a real directory in /tmp-inst/, which can have some advantages. It doesn't seem to be possible to give pam-namespace a size for the tmpfs, though.

Add the following to /etc/rc.local:

mount --make-rshared /
mount --bind /tmp /tmp
mount --make-private /tmp

Here again, this is the same as Russell's except for the correction for --make-rshared as seen above. If /tmp is already a mount point (on my systems, it is a tmpfs), you can remove mount --bind /tmp /tmp from above.

Add the following to /etc/security/namespace.init:

mount --make-rslave /

Now, this is where it gets interesting: we're setting the whole tree as slave in the user namespace, which means that if a user mounts a file system anywhere, it will only be seen in his session. This means the user can more safely mount encrypted volumes: they won't be available to other users (root can still go wandering in /dev/kmem, though). And you don't even need SELinux for that.

The caveat is that if you open a root session with su from the user session, you are still inside the user's boundaries, and don't have access to the virtual filesystem as init sees it. And if you then mount something as root, while it will appear in the user namespace, it will appear neither in other users' namespaces nor in init's, which can be a good thing in some cases, but a burden in others. It means you may need to set up a special user that won't get a new namespace in /etc/security/namespace.conf.

An alternative model would be to only set the user's home as slave, which means anything mounted by the user in his home directory would stay private in his namespace, while anything mounted outside would be shared with other namespaces.

For that, you may want to replace the lines we added to /etc/security/namespace.init with the following:

HOME=$(getent passwd $3 | cut -d: -f6)
mount --bind "$HOME" "$HOME"
mount --make-rslave "$HOME"

Either way, the remaining problem is that a root session opened with su from the user session won't get access to the original /tmp.

As we saw, there are various use cases for namespaces and shared subtrees. I'll follow up on these features soon, as I'll be using them in yet another way to achieve a very different purpose.

2008-12-13 17:01:12+0900

p.d.o | 1 Comment »

Shared subtrees

In reply to my previous post about per-process namespaces, Russell wrote about pam-namespace and shared subtrees. As he reports, pam-namespace (which I discovered on that occasion) can do what I was suggesting would be nice for pam-tmpdir.

Anyway, I was actually already planning to write about shared subtrees and how they can be useful with namespaces. In this first post, I will introduce shared subtrees, while in a follow-up post I will show how they can be put to good use with namespaces.

As a preliminary note, you should know that while etch's kernel supports shared subtrees, etch's mount doesn't support the necessary options. Lenny's mount is unfortunately uninstallable on etch due to its dependency on a newer libc. But you can find a smount utility's source code in Documentation/filesystems/sharedsubtree.txt in the kernel source, where other explanations about the feature are available. So, you can either backport lenny's util-linux to etch, or build smount to replicate what I will be doing below. If you are using smount, just replace

mount --make-type path

with

smount path type

As seen in my previous post, bind-mounting lets you attach a file hierarchy at some other place in the virtual file system.

mount --bind / /mnt

will make all the contents of / (/bin, /etc, ...) available through /mnt (/mnt/bin, /mnt/etc, ...).

On the other hand, sub-mounts will be ignored in such a case. For instance, /usr is usually a different mount point from /. It means /mnt/usr will be empty (provided it is empty in the underlying physical filesystem), instead of containing the same as /usr:

$ mount --bind / /mnt
$ ls /mnt/usr
$ ls /usr
bin games include lib lib32 lib64 local sbin share src X11R6
$ umount /mnt

If you also want /usr to be bind-mounted, you must use --rbind instead of --bind:

$ mount --rbind / /mnt
$ ls /mnt/usr
bin games include lib lib32 lib64 local sbin share src X11R6
$ ls /usr
bin games include lib lib32 lib64 local sbin share src X11R6

Obviously, sub-sub-mounts will also be propagated.

Since recursively bind-mounting will create a bunch of mount points, unmounting can become a hassle. You can use the following command to unmount everything:

$ awk '$2 ~ /^\/mnt/ {print $2}' /proc/mounts | sort -r | xargs umount
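
Note that the pattern above also matches unrelated mount points such as /mnt2. A slightly stricter variant could look like the sketch below; the mounts_under name and the optional second argument (an alternate mounts file, handy for testing) are assumptions made for the example, and mount points containing whitespace are not handled:

```shell
#!/bin/sh
# mounts_under DIR [MOUNTS_FILE]: print DIR and every mount point below it,
# deepest first, so the result can be fed to umount in that order.
mounts_under() {
  awk -v p="${1%/}" '$2 == p || index($2, p "/") == 1 { print $2 }' \
    "${2:-/proc/mounts}" | LC_ALL=C sort -r
}
```

It would then be used as mounts_under /mnt | xargs umount. Sorting in the C locale keeps a parent directory after its children in the reversed output, since a path always sorts before any longer path it is a prefix of.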

Now, this is where shared subtrees come in. After the bind mount has been done, if you mount something on either side of the bind mount, the new mount is not propagated. This is called private subtrees, and is the default. But before doing anything else, let's set up our testing environment:

$ mkdir -p /a/a /b
$ touch /a/b /a/c
$ ls /a
a b c
$ ls /b

After bind-mounting /a onto /b, let's see what happens when mounting something under /a, then under /b:

$ mount --bind /a /b
$ ls /b
a b c
$ mount --bind /usr /a/a
$ ls /a/a
bin games include lib lib32 lib64 local sbin share src X11R6
$ ls /b/a
$ umount /a/a
$ mount --bind /usr /b/a
$ ls /a/a
$ ls /b/a
bin games include lib lib32 lib64 local sbin share src X11R6
$ umount /b/a

As I said earlier, these new mounts are not propagated. Note that I used bind mounts as sub-mounts, but it would work all the same with other kinds of mounts (device, fuse, etc.).

There are 2 other modes that allow some propagation: shared and slave.

shared allows mounts on both ends to be shared. Note you need to set the mode before bind-mounting:

$ umount /b
$ mount --bind /a /a
# This is needed because /a is not initially a mount point and you can only apply subtree modes to mount points.
$ mount --make-shared /a
$ mount --bind /a /b
$ mount /dev/sda1 /a/a
$ ls /a/a
config-2.6.26-1-amd64 grub initrd.img-2.6.26-1-amd64 System.map-2.6.26-1-amd64 vmlinuz-2.6.26-1-amd64
$ ls /b/a
config-2.6.26-1-amd64 grub initrd.img-2.6.26-1-amd64 System.map-2.6.26-1-amd64 vmlinuz-2.6.26-1-amd64
$ umount /a/a
$ mount /dev/sda1 /b/a
$ ls /a/a
config-2.6.26-1-amd64 grub initrd.img-2.6.26-1-amd64 System.map-2.6.26-1-amd64 vmlinuz-2.6.26-1-amd64
$ ls /b/a
config-2.6.26-1-amd64 grub initrd.img-2.6.26-1-amd64 System.map-2.6.26-1-amd64 vmlinuz-2.6.26-1-amd64
$ umount /b/a

You can even mount on one end and unmount from the other:

$ mount /dev/sda1 /a/a
$ umount /b/a
$ ls /a/a

slave allows mounts on the "master" end (/a in our case) to propagate to the "slave" end (/b), but not the other way around. The "master" end needs to be shared:

$ umount /b
$ mount --make-shared /a
# Only for completeness, we already set /a as shared earlier.
$ mount --bind /a /b
$ mount --make-slave /b
$ mount /dev/sda1 /a/a
$ ls /a/a
config-2.6.26-1-amd64 grub initrd.img-2.6.26-1-amd64 System.map-2.6.26-1-amd64 vmlinuz-2.6.26-1-amd64
$ ls /b/a
config-2.6.26-1-amd64 grub initrd.img-2.6.26-1-amd64 System.map-2.6.26-1-amd64 vmlinuz-2.6.26-1-amd64
$ umount /a/a
$ mount /dev/sda1 /b/a
$ ls /a/a
$ ls /b/a
config-2.6.26-1-amd64 grub initrd.img-2.6.26-1-amd64 System.map-2.6.26-1-amd64 vmlinuz-2.6.26-1-amd64
$ umount /b/a

There is a third mode, unbindable, that does something different:

$ umount /b
$ mount --make-unbindable /a
$ mount --bind /a /b
mount: wrong fs type, bad option, bad superblock on /a,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail or so
$ mount --bind /a/a /b
mount: wrong fs type, bad option, bad superblock on /a,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail or so

As you can see, it disallows bind mounting of /a and its subdirectories to some other place.

Finally, to put back the default mode, you can use:

$ mount --make-private /a

Similarly to --bind, there is also a recursive version of each: rshared, rslave, runbindable and rprivate, to apply these to sub-mounts.
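
To see which mode a mount point currently has, the optional fields of /proc/self/mountinfo can be inspected on kernels that provide that file. This is only a sketch: the propagation function name is made up, and the optional argument (an alternate file to read) exists mainly to make the parsing easy to try out on sample data:

```shell
#!/bin/sh
# propagation [MOUNTINFO_FILE]: print "<mount point> <mode>" for each mount.
propagation() {
  awk '{
    mode = ""
    # Optional fields start at field 7 and run up to a lone "-".
    for (i = 7; i <= NF && $i != "-"; i++) {
      tag = $i
      sub(/:.*/, "", tag)                 # shared:12 -> shared
      if (tag == "master") tag = "slave"  # master:N marks a slave of peer group N
      if (tag == "propagate_from") continue
      mode = mode (mode == "" ? "" : ",") tag
    }
    # No propagation tag at all means the mount is private.
    print $5, (mode == "" ? "private" : mode)
  }' "${1:-/proc/self/mountinfo}"
}
```

Running propagation | grep '^/a ' after each --make-* command in the examples above should show the mode changing. A mount can be both shared and slave at once, in which case both words are printed.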

2008-12-13 12:45:47+0900

p.d.o | 1 Comment »

Per-process namespaces

Linux has had this neat feature for quite some time now: since 2.4.19 according to the docs. Yet, it is neither well known nor widely used. I couldn't even find a program that would create a new namespace for its subprocesses, similar to what chroot does with the root of the file hierarchy.

This neat feature allows each process to have a different set of mount points. While you most of the time want processes to share their mount points, there are some cases where you may want to have some processes have a different set of mount points. Combined with bind mounts, it can allow some useful setups.

In case you are not familiar with bind mounts, they allow you to "attach" a part of the file hierarchy somewhere else. For example:

$ ls /mnt
$ ls /usr
bin games include lib lib32 lib64 local sbin share src X11R6
$ mount --bind /usr /mnt
$ ls /mnt
bin games include lib lib32 lib64 local sbin share src X11R6

Now, take pam-tmpdir, for example. It sets $TMPDIR and $TMP to point to a user-specific temporary directory. Sadly, it is pretty useless for applications that don't follow the standards of using these environment variables.

Without namespaces, if you created a temporary directory and bind-mounted it to /tmp, this new /tmp would be visible to everyone, to every process. But with namespaces, you can make this new /tmp only available to subprocesses. If pam-tmpdir were to do this, applications writing to /tmp without resorting to $TMP or $TMPDIR would also use the per-user temporary space, without impacting external processes, which would still be using the original /tmp.

On x86-64, you can run both 64-bit and 32-bit applications. 64-bit applications take libraries from /usr/lib, and 32-bit applications search for libraries in /usr/lib32. But badly crafted 32-bit applications could try to load libraries from /usr/lib, where only 64-bit versions are available.

With namespaces, the broken 32-bit application could have /usr/lib32 bind-mounted to /usr/lib without the 64-bit applications knowing.

You could certainly get a similar result with the following set of commands:

$ mount --rbind / /chroot
$ mount --rbind /usr/lib32 /chroot/usr/lib
$ chroot /chroot $application

(--rbind also attaches submounts, contrary to --bind)

The downside, here, is that external processes will see all this setup under /chroot. The whole setup would be invisible to external processes if namespaces were used.

Another nice use of namespaces would be to mount encrypted volumes under a different namespace, so that only a limited set of processes would be allowed to read the decrypted data. The sad thing is that you need the admin capability to create a new namespace, so that would need to be done by a setuid root program.

There are, as far as a few hours of fiddling showed me, 2 system calls that will set up a new namespace: clone(2) and unshare(2). The second is easier to use, though only available since 2.6.16. But while etch ships 2.6.18, the glibc coming with it doesn't implement unshare(2), so we need to use syscall(2) instead. The following code will run /bin/sh, or any command given as argument, after creating a new namespace. The new process and its subprocesses will inherit the new namespace.

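#define _GNU_SOURCE /* for CLONE_NEWNS */
#include <sched.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
  char *sh[] = { "/bin/sh", NULL };
  /* Leave the parent's mount namespace; this needs CAP_SYS_ADMIN. */
  if (syscall(SYS_unshare, CLONE_NEWNS) == -1) {
    perror("unshare");
    return 1;
  }
  /* Run the given command, or /bin/sh by default; exec only returns on error. */
  if (argc > 1)
    return execvp(argv[1], &argv[1]);
  return execv(sh[0], sh);
}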

This tool, once built, is called newns in the following example.

$ mkdir /tmp/abc
$ ./newns
$ mount -n --bind /tmp/abc /tmp
$ touch /tmp/a
$ ls /tmp
a
<in another terminal>
$ ls /tmp
abc
$ ls /tmp/abc
a
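
On more recent systems, the same experiment can be sketched without a custom tool or root access. This assumes two things that are not part of the setup described here: a util-linux that ships unshare(1), and a kernel that lets unprivileged users create user namespaces. The with_private_tmp name is made up for the example:

```shell
#!/bin/sh
# with_private_tmp DIR CMD...: run CMD with DIR bind-mounted over /tmp.
# The mount lives in a new mount namespace, so only CMD and its children see it.
with_private_tmp() {
  dir=$1; shift
  # --user --map-root-user makes the caller root inside a new user namespace,
  # which is what allows the bind mount in the new mount namespace.
  unshare --user --map-root-user --mount -- \
    sh -c 'mount --bind "$1" /tmp && shift && exec "$@"' sh "$dir" "$@"
}
```

For instance, after mkdir /tmp/abc, running with_private_tmp /tmp/abc sh -c 'touch /tmp/a; ls /tmp' should list only a, while another terminal still sees the regular /tmp, with the new file under /tmp/abc. Where unprivileged user namespaces are disabled, the unshare call simply fails.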

When using namespaces, it is better not to have mount fill /etc/mtab, using the -n argument. /proc/mounts will contain the proper mount information about the namespace of the process reading it. /proc/PID/mounts will contain the mount information for the given process.
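
A quick way to see what another process has mounted in its own namespace is to compare its mount table with yours. The following is a small sketch along those lines; the mounts_only_in name is an assumption of the example:

```shell
#!/bin/sh
# mounts_only_in PID: print the /proc/PID/mounts lines that the current
# process does not have in its own mount table.
mounts_only_in() {
  awk 'NR == FNR { seen[$0]; next } !($0 in seen)' \
    /proc/self/mounts "/proc/$1/mounts"
}
```

Called from another terminal with the pid of the shell started by newns, it should print the line for the bind mount on /tmp from the example above, and nothing for a process that shares your namespace.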

As bind mounts also work on files, you can override some files. The following will run dash instead of bash (in subprocesses, too, obviously):

$ ./newns sh -c "mount -n --bind /bin/dash /bin/bash; /bin/bash"

Back to the idea of having encrypted volumes only available to some processes, the following should work (unverified):

$ ./newns sh -c "encfs /tmp/crypt-raw /tmp/crypt; /bin/bash"

Only the opened bash, and its subprocesses, would have access to /tmp/crypt.

The newns tool used above could, to allow normal users to be able to fiddle with namespaces, be improved to be a setuid root program that would drop its privileges right after unshare(2) to take the same privileges as the calling process.

As you can see, per-process namespaces have a wide range of possible uses; it's astonishing that they are not used more, considering their age.

In addition to per-process namespaces, there are also a bunch of other (more recent) features that make it possible to implement vserver-like features with a vanilla kernel, such as network namespaces (work in progress, though), PID namespaces and utsname (see uname(2)) namespaces. Actually, these features are designed to be used by vserver and openvz.

I am looking forward to having unprivileged mounts implemented, so that users could fool around with bind mounts. Unprivileged namespaces would be a nice addition.

2008-12-12 23:06:33+0900

p.d.o | 4 Comments »

On open source and mobile phones

No software is perfect, and most users happen to have some itches they would like to scratch. I don't know anyone who hasn't thought some day "why can't that software do <some feature>". But not everybody is able to scratch these itches. For starters, most people aren't developers, or aren't able to modify source code to reach some known goal, be it fixing a bug or implementing a missing feature.

But even when they are able to do so, most applications can't be fixed or improved easily: most widespread applications are still proprietary software where access to the source code is either impossible or prohibitively expensive. That's the good thing about open source and free software: you are able to scratch these itches yourself; and as a free software user and developer, I often do so in the software I use.

I must say it's even frustrating when you are used to this possibility but end up on damn stupid bugs or limitations you can't fix in proprietary software. I have quite a bunch of such examples, the most recent ones being with VMware ESX, but that's another story.

Mobile phones are everywhere nowadays. In some countries, there are even more mobile phones than people. A huge number of people use them daily. Or maybe I should say "endure".

When you go to shop for a laptop, digital camera, or digital portable music player, you can find demonstration models you can fiddle with, see what the user interface looks like, how it behaves, see if whatever feature you like on such devices is present and how easily you can do what you want.

With a mobile phone, you don't have all that (at least in France and Japan, where I experienced this). You're lucky if one of the sellers owns the exact model you'd like to test and lends it to you for a minute. Most of the time, the "thing" to watch is a plastic model, not even remotely related to the real thing in colour, weight... and obviously, it has as much use as a brick (not even so, actually, as you can't, for instance, break windows with it). And most of the time, you end up regretting the choice you made, because the UI is so unusable, or so slow after a while, or so buggy, etc.

More and more phones nowadays have multimedia as their main feature. That can be nice, especially when you see how good some are at it, and when you consider it saves you from carrying both a phone and a multimedia device. Sadly, they usually forget to be good phones, too.

Sometimes, multimedia(-related) features are even crippled by the carriers. As an example, my phone, while able to play mp3 music, has been limited by the carrier to refuse to use mp3 files as ring tones, while, according to the maker and to various forums, the phone is perfectly able to do so. It's even worse than that, because even when the music is converted to 3gp, a format it otherwise allows as ring tones if you download them at a ridiculous price, it still doesn't let you set your music as a ring tone. You got it, features in phones are also limited for the sake of big money-making. Fortunately, there is a hole in my phone where I can set a 3gp as ring tone if I do it from the music player instead of the file browser.

Most articles I read about Android, or the iPhone SDK, relate how they are great opportunities to enable developers to provide nice applications for users, using the provided frameworks. But the thing is: they're only applications. An application is not the core of the phone. An application won't take care of ring tones, or vibration types.

And that's what open source in mobile phones should also allow users to do. That's what it all should be about. As a user and free-software developer, I would like to enhance my phone like I can enhance my laptop.

For instance, my phone will only allow crappy polyphonic tones for SMS or e-mail reception. Digital music is only available when you receive a call. Why? There is obviously nothing preventing digital music when receiving SMS or e-mail, except a stupid software restriction that could be fixed if the firmware were open-source.

Very few phones in France appear to allow Japanese input, let alone proper display of Japanese characters in e-mail or on the web. My previous phone was able to display Japanese correctly. My current phone isn't. Neither of them is able to input Japanese (which is not unexpected in France, actually). I know the iPhone does both. Anyway, this kind of information is very hard to get hold of before buying a model, and this is something that open source would allow me to fix. Even better, it would allow me to take the pieces from the Asian version of the same (or similar) phone, which probably has the functionality.

The phone I had when I was living in Japan had a nice feature that I haven't seen in any phone in France (but again, this kind of information is very hard to get): different vibration types. For example, you had a vibration type making short vibrations, another one long vibrations, yet another one 3 short vibrations, a short pause, then 3 short vibrations again, etc. This is very useful when you disable ring tones during a meeting or some other occasions where you don't want to bother the people around you with your phone ringing: depending on how the phone vibrates, you know whether you have to take the call or can ignore it, or if it's only an e-mail, without taking a look at the phone. Again, an open-source firmware would allow me to implement this.

I'm sad so much of what open-source can bring to mobile phones is so badly publicized.

2008-12-06 15:31:22+0900

p.d.o | 3 Comments »

apt-get remove nspluginwrapper

Finally, it happened: the infamous nspluginwrapper is no longer needed to use the proprietary Flash plugin on amd64. Get your plugin from Adobe themselves. It's supposedly alpha quality, but it really can't be worse than having nswrapper.bin either eating your cpu or crashing, or only seeing a grey area instead of Flash content.

2008-11-17 20:57:19+0900

p.d.o | 5 Comments »

Book meme reloaded

If I'm not mistaken, this is at least the second time I see this meme going on, though I didn't participate the first time...

Here are the rules:

  • Grab the nearest book.
  • Open it to page 56.
  • Find the fifth sentence.
  • Post the text of the sentence in your journal along with these instructions.
  • Don’t dig for your favorite book, the cool book, or the intellectual one: pick the CLOSEST.

And the result:

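それが好きで仕方がなくて、それをすることによって人にも喜んでもらえる仕事のことです。
(Roughly: "It is work that you love so much you can't help it, and through which you also make other people happy.")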

From スピリチュアル生活12ヵ月 by 江原啓之

2008-11-12 21:45:18+0900

p.d.o | 1 Comment »

Another tool to mount virtual disk images

Vboxmount is public, says Christian Kellner.

It is a tool using VirtualBox APIs to expose virtual disks (not sure if it is limited to the VirtualBox format or works for any format VirtualBox knows) as block devices under Linux, using the network block device driver (nbd).

I have to say there are 2 things I don't like about nbd.

First, it adds useless overhead, since it has to go through the network stack, even when the whole thing is local.

Second, I have had bad experience with nbd stability, though I must say I only tested on old stuff. A while ago, VMware ESX 2.5 had a tool, named vmware-mount, that would basically do what Vboxmount does, for ESX vmdk files, on the ESX service console (a 2.4.something kernel). The fact is, the whole thing would bring the whole server down (kernel panic or deadlock, I can't remember) more often than not. Which is why the tool has not been provided since ESX 3.0. This tool was using nbd.

There are IMHO better ways to implement something like this, though the nicest doesn't exist yet.

  • dm-userspace allows for something similar, but requires the image file to be backed by a loopback device, and the data on the image needn't be compressed.
  • Fuse would allow presenting a flattened image as a file, which you could turn into a block device with the loopback device driver. I happen to have written something like that, except its legal status is unsure.
  • As for the nicest solution I can see, which doesn't exist yet, as said above, it would be some kind of "Buse" (Block device in USEr space), or process-backed loopback device, call it what you want, that would allow a process to answer random reads on a (virtual) block device, in a similar way to how a Fuse file system process answers random reads on a (virtual) file. This has been discussed several times on several mailing lists, but has not yet been implemented, as far as I know.

2008-11-06 22:54:44+0900

p.d.o | 2 Comments »

Emptying a deleted file

Yesterday, at work, we had the typical case where df would say there is (almost) no space left on some device, while du doesn't see as much data present as you would expect from this situation. This happens when you delete a file that another process has opened (and, obviously, not yet closed).

In typical UNIX filesystems, files are actually only entries in a directory, pointing (linking) to the real information about the content, the inode.

The inode contains the information about how many such links exist on the filesystem, the link count. When you create a hard link (ln without -s), you create another file entry in some directory, linking to the same inode as the "original" file. You also increase the link count for the inode.

Likewise, when removing a file, the entry in the directory is removed (though most of the time, really only skipped, but that's another story), and the link count decreased. When the link count is zero, usually, the inode is marked as deleted.

Except when the usage count is not zero.

When a process opens a file, the kernel keeps a usage count for the corresponding inode in memory. When some process is reading from a file, it doesn't really expect it to disappear suddenly. So, as long as the usage count is not zero, even when the link count in the inode is zero, the content is kept on the disk and still takes space on the filesystem.

On the other hand, since there is no entry left in any directory linking to the inode, the size for this content can't be added to du's total.
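
Both counters can be observed from a shell. Here is a minimal sketch; the function names and the temporary file are made up for the illustration:

```shell
#!/bin/sh
# link_count FILE: print the inode's link count (second column of ls -l).
link_count() { ls -ld "$1" | awk '{ print $2 }'; }

# unlinked_read_demo: show that content survives as long as a descriptor
# is open, even when no directory entry points to the inode anymore.
unlinked_read_demo() {
  f=$(mktemp) || return 1
  echo "still here" > "$f"
  exec 3< "$f"   # the open descriptor keeps the inode in use
  rm "$f"        # the link count drops to zero, the directory entry is gone
  cat <&3        # yet the content can still be read
  exec 3<&-      # closing the last descriptor lets the space be freed
}
```

Creating a hard link with ln makes link_count go from 1 to 2, and removing either name brings it back to 1, while unlinked_read_demo prints the content of a file that no longer has any name.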

Back to our problem, the origin was that someone had to free some space on a 1GB filesystem, and thought a good idea would be to delete that 860MB log file that nobody cares about. Except that it didn't really remove it, but he didn't really check.

Later, the "filesystem full" problem came back to bite someone else, who came to ask me which files from a small list he could remove. But the files were pretty small, and that wouldn't have freed enough space. That gave me the feeling that we probably were in this typical case I introduced this post with, which du -sk confirmed: 970MB used on the filesystem according to df, but only 110MB worth of data...

Just in case you would need to find the pid of the process having the deleted file still opened, or even better, get access to the file itself, you can use the following command:

find -L /proc/*/fd -type f -links 0

(this works on Linux; remove -L on recent Solaris; on other OSes, you can find the pid with lsof)

Each path this command returns can be opened and its content accessed with a program, such as cat. That will give access to the deleted content.
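
When the culprit process is already suspected, the search can be narrowed down to a single pid. This is a Linux-only sketch, and deleted_open_files is a name made up for the example:

```shell
#!/bin/sh
# deleted_open_files PID: list the /proc/PID/fd entries that point to
# regular files with a link count of zero, i.e. deleted but still open.
deleted_open_files() {
  find -L "/proc/$1/fd" -type f -links 0 2>/dev/null
}
```

Looking at another user's process this way generally requires root, and errors for descriptors that can't be inspected are silently dropped.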

I already addressed how to re-link such a file, which somehow works under Linux, but in my case, all that mattered was to really remove the file, this time. But we didn't know if it was safe to stop the process still holding the file, nor how to properly restart it. We were left without a possible resolution, but still needed to come up with something before the filesystem got really full while waiting to be able to deal with the root of the problem.

The first crazy idea I had was to attach a debugger to the process, and use it to close the file descriptor and open a new file instead (I think you can find some examples with google). But there was no debugger installed.

So, I had this other crazy idea: would truncate() work on these /proc/$pid/fd files?

You know what? It does work. So I bought us some time by running:

perl -e 'truncate($ARGV[0], 0);' "/proc/$pid/fd/$fd"

(somehow, there is no standard executable to do a truncate(), so I always resort to perl)

Afterwards, I also verified the same works under Linux (where you wouldn't really know what it'd do with these files that are symbolic links to somewhere that doesn't exist).

The even simpler following command works, too.

> /proc/$pid/fd/$fd

It doesn't truncate() but open() with O_WRONLY | O_CREAT | O_TRUNC, and close() right after (to simplify), which has the same effect.
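
The whole trick can be tried safely on a scratch file. The sketch below is Linux-only and differs from the real situation in one way: the function holds the deleted file open itself, so it goes through /proc/self/fd instead of another process's pid. The function name and the sample content are made up:

```shell
#!/bin/sh
# truncate_deleted_demo: print the size of a deleted-but-open file before
# and after emptying it through its /proc/self/fd entry.
truncate_deleted_demo() {
  f=$(mktemp) || return 1
  echo "some log data" > "$f"            # 14 bytes, newline included
  exec 3< "$f"                           # keep the file open...
  rm "$f"                                # ...and delete its only name
  before=$(wc -c < /proc/self/fd/3)
  : > /proc/self/fd/3                    # open() with O_TRUNC, then close()
  after=$(wc -c < /proc/self/fd/3)
  exec 3<&-
  echo "$before $after"
}

truncate_deleted_demo   # prints: 14 0
```

The descriptor was only opened for reading, yet the truncation works, because opening the /proc entry creates a fresh descriptor on the same inode with its own access mode.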

Good to know, isn't it?

2008-11-06 22:23:33+0900

miscellaneous, p.d.o | 5 Comments »