Shared subtrees and per-process namespaces

Now that we have seen what per-process namespaces and shared subtrees are and how to use them, we can try to combine these two features.

We'll be using our newns tool from this earlier post to create new namespaces. And for practical reasons, let's say you have a terminal with a 1$ prompt and a second terminal with a 2$ prompt (that will allow me to skip "go to the second terminal" phrases).
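
If the newns tool isn't at hand, unshare(1) from util-linux can probably stand in for it. The following is only a sketch under that assumption, not the tool from the earlier post. It passes --propagation unchanged because recent unshare versions otherwise mark every mount private in the new namespace, which would hide the shared and slave behaviour shown below. The UNSHARE variable is a hypothetical hook that lets you inspect the command line without being root.

```shell
# Hypothetical stand-in for ./newns, assuming util-linux's unshare(1).
# With no argument it starts $SHELL (or /bin/sh) in a new mount namespace.
newns() {
    if [ "$#" -eq 0 ]; then
        set -- "${SHELL:-/bin/sh}"
    fi
    # UNSHARE is an assumed override hook, e.g. UNSHARE=echo for a dry run.
    "${UNSHARE:-unshare}" --mount --propagation unchanged -- "$@"
}
```

Running newns as root should then behave like ./newns in the examples; exiting the shell leaves the namespace.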

In the kernel, per-process namespaces are a bit like bind mounts, so shared subtrees work with namespaces the same way they do with bind mounts.

As with standard bind mounts, the default shared subtree mode is private, which means mounts done in either namespace will be private to that namespace. Only mount points active at the time the new namespace is created will be in both namespaces:

1$ ./newns
1$ mount /dev/sda1 /mnt
1$ ls /mnt
config-2.6.26-1-amd64 grub initrd.img-2.6.26-1-amd64 System.map-2.6.26-1-amd64 vmlinuz-2.6.26-1-amd64
2$ ls /mnt
2$ mount /dev/sda1 /cdrom
2$ ls /cdrom
config-2.6.26-1-amd64 grub initrd.img-2.6.26-1-amd64 System.map-2.6.26-1-amd64 vmlinuz-2.6.26-1-amd64
1$ ls /cdrom
1$ umount /mnt
1$ exit
# Exit from newns
2$ umount /cdrom

Shared mode allows both namespaces to share subsequent mounts. As with bind mounts, and for obvious reasons, you need to change the subtree mode before creating the new namespace:

1$ mount --make-rshared /
1$ ./newns
1$ mount --bind /usr /mnt
1$ ls /mnt
bin games include lib lib32 lib64 local sbin share src X11R6
2$ ls /mnt
bin games include lib lib32 lib64 local sbin share src X11R6

As with shared bind mounts, the new mount point can be unmounted from either namespace:

2$ umount /mnt
1$ ls /mnt
1$ exit
# Exit from newns

A slave subtree receives the mounts made under its master (shared) subtree, while mounts made under the slave are not propagated in the other direction. Again, very much like bind mounts:

1$ mount --make-rshared / # For completeness, we already did that earlier
1$ ./newns
1$ mount --make-rslave /
1$ mount --bind /usr /mnt
1$ ls /mnt
bin games include lib lib32 lib64 local sbin share src X11R6
2$ ls /mnt
2$ mount --bind /usr /cdrom
2$ ls /cdrom
bin games include lib lib32 lib64 local sbin share src X11R6
1$ ls /cdrom
bin games include lib lib32 lib64 local sbin share src X11R6
1$ exit
# Exit from newns
2$ umount /cdrom

Contrary to shared and slave, unbindable doesn't add value when used across two different namespaces. It only has an impact on the namespace where it is set:

1$ mount --make-rshared / # For completeness, we already did that earlier
1$ ./newns
2$ mount --make-runbindable /
2$ mount --bind /usr /mnt
mount: wrong fs type, bad option, bad superblock on /usr,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail or so
1$ mount --bind /usr /mnt
1$ ls /mnt
bin games include lib lib32 lib64 local sbin share src X11R6
1$ exit
# Exit from newns
1$ mount --make-rprivate / # Set back to default
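
To check which of these modes a mount point is actually in, the optional fields of /proc/self/mountinfo can be read. The helper below is a minimal sketch, assuming a kernel that provides that file; the function name mount_propagation is made up for illustration. It maps the shared:N, master:N and unbindable tags to the mode names used above, and reports private when none of them is present.

```shell
# Print the propagation mode(s) of a mount point, read from mountinfo.
# $1: mount point, $2: mountinfo file (defaults to this process's view).
mount_propagation() {
    awk -v target="$1" '
        function add(tag) { tags = tags (tags != "" ? "," : "") tag }
        $5 == target {
            tags = ""
            # Optional fields start at field 7 and end at the lone "-".
            for (i = 7; i <= NF && $i != "-"; i++) {
                if ($i ~ /^shared:/)         add("shared")
                else if ($i ~ /^master:/)    add("slave")
                else if ($i == "unbindable") add("unbindable")
            }
            # The last matching line wins: it is the topmost mount.
            result = (tags != "" ? tags : "private")
        }
        END { if (result != "") print result }
    ' "${2:-/proc/self/mountinfo}"
}
```

For instance, running mount_propagation / in the first terminal right after mount --make-rslave / should report a slave, while the second terminal should still report shared.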

Now that we've seen what can be done with namespaces and shared subtrees, let's see what nice features can be implemented with both.

As Russell revealed, pam_namespace is able to polyinstantiate some directories following rules set in /etc/security/namespace.conf (see namespace.conf(5)). The sad thing is that it apparently can't create a new namespace without polyinstantiating a directory, which would be useful if you only want separate namespaces and no polyinstantiated directory.

Russell's recipe goes as follows:

  • At boot time, / is put in shared mode and /tmp in private mode.
  • When opening a session, pam will create a new namespace, and bind mount /tmp-inst/$USER to /tmp.

What this means is that in the newly created namespace, a user reading in /tmp will actually be reading in /tmp-inst/$USER without knowing it. Also, if root mounts something in the parent namespace, it will be propagated (/ being shared) to the user's namespace. Mounting USB storage, for example, will be visible in the user's namespace. Conversely, something mounted from the user's namespace will be propagated to the parent namespace. In both cases, this only applies to mounts occurring outside /tmp; mounts under /tmp don't get propagated.

Note that if /tmp were not set as private, PAM's bind mount of /tmp-inst/$USER on /tmp would propagate as well, making /tmp point to /tmp-inst/$USER for everyone. So setting /tmp as private is mandatory.

There is, however, a flaw in Russell's recipe: --make-shared only changes / itself, not the mounts below it. If /var is a separate mount, for example, something mounted under /var won't be propagated to a user who already has a session open. For that to work, you have to use --make-rshared instead of --make-shared in Russell's recipe.
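
One way to see the difference on a given system is to list the mounts that carry no shared:N tag in /proc/self/mountinfo. This is only a sketch, assuming that file is available; the function name unshared_mounts is invented for the example. After a plain mount --make-shared /, submounts such as /var would be expected to show up in the list, while after --make-rshared / they should not.

```shell
# List mount points that are not in shared mode.
# $1: mountinfo file (defaults to this process's view).
unshared_mounts() {
    awk '
        {
            shared = 0
            # Optional fields start at field 7 and end at the lone "-".
            for (i = 7; i <= NF && $i != "-"; i++)
                if ($i ~ /^shared:/) shared = 1
            if (!shared) print $5
        }
    ' "${1:-/proc/self/mountinfo}"
}
```

Mounts made under any mount point printed here will not propagate to other namespaces.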

A nice setup that can be built with all of this is the following:

Add the following to /etc/security/namespace.conf:

/tmp tmpfs tmpfs root

Add the following to /etc/pam.d/common-session:

session required pam_namespace.so

Up to this point, it is pretty much the same as Russell's, except /tmp is a tmpfs instead of a real directory in /tmp-inst/, which can have some advantages. It doesn't seem to be possible to give pam_namespace a size for the tmpfs, though.

Add the following to /etc/rc.local:

mount --make-rshared /
mount --bind /tmp /tmp
mount --make-private /tmp

Here again, this is the same as Russell's except for the correction for --make-rshared as seen above. If /tmp is already a mount point (on my systems, it is a tmpfs), you can remove mount --bind /tmp /tmp from above.
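
The "skip the bind mount when /tmp is already a mount point" rule can also be scripted instead of being decided by hand. The sketch below assumes /proc/self/mountinfo is available, and both function names are hypothetical. It only prints the commands, so that they can be reviewed first and then piped to sh as root.

```shell
# Succeed if $1 is a mount point according to the mountinfo file $2
# (defaults to this process's view).
is_mount_point() {
    awk -v target="$1" '$5 == target { found = 1 } END { exit !found }' \
        "${2:-/proc/self/mountinfo}"
}

# Print the rc.local commands, adding the bind mount only when needed.
# $1: optional mountinfo file, passed through to is_mount_point.
setup_tmp_commands() {
    echo "mount --make-rshared /"
    if ! is_mount_point /tmp "$1"; then
        echo "mount --bind /tmp /tmp"
    fi
    echo "mount --make-private /tmp"
}
```

In /etc/rc.local, this would be used as setup_tmp_commands | sh.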

Add the following to /etc/security/namespace.init:

mount --make-rslave /

Now, this is where it gets interesting: we're setting the whole tree as slave in the user's namespace, which means that if a user mounts a file system anywhere, it will only be seen in that user's session. The user can therefore mount encrypted volumes more safely: they won't be available to other users (root can still go wandering in /dev/kmem, though). And you don't even need SELinux for that.

The caveat is that if you open a root session with su from the user session, you are still inside the user's boundaries, and don't see the filesystem tree as init sees it. And if you then mount something as root, it will appear in the user's namespace, but neither in other users' namespaces nor in init's, which can be a good thing in some cases, but a burden in others. It means you may need to set up a special user that is excluded in /etc/security/namespace.conf and so won't get a new namespace.

An alternative model would be to only set the user's home as slave, which means anything mounted by the user in his home directory would stay private in his namespace, while anything mounted outside would be shared with other namespaces.

For that, you may want to replace the lines we added to /etc/security/namespace.init with the following:

HOME=$(getent passwd "$3" | cut -d: -f6)
mount --bind "$HOME" "$HOME"
mount --make-rslave "$HOME"

Either way, the remaining problem is that a root session opened with su from the user session won't get access to the original /tmp.

As we saw, there are various use cases for namespaces and shared subtrees. I'll follow up on these features soon, as I'll be using them in yet another way to achieve a very different purpose.

2008-12-13 17:01:12+0900

One Response to “Shared subtrees and per-process namespaces”

  1. Tincho Says:

    Why not mount $HOME/tmp on /tmp, as Plan 9 does?