Another union filesystem approach

By Jake Edge
September 1, 2010

Creating a union of two (or more) filesystems is a commonly requested feature for Linux that has never made it into the mainline. Various implementations have been tried (see part 1 and part 2 of Valerie Aurora's look at them from early 2009), but none has crossed the threshold for inclusion. Of late, union mounts have been making some progress, but there is still work to do there. A hybrid approach, incorporating both filesystem- and VFS-based techniques, has recently been posted in an RFC patchset by Miklos Szeredi.

The idea behind unioning filesystems is quite simple, but the devil is in the details. In a union, one filesystem is mounted "atop" another, with the contents of both filesystems appearing to be in a single filesystem encompassing both. Changes made to the filesystem are reflected in the "upper" filesystem, and the "lower" filesystem is treated as read-only. One common use case is to have a filesystem on read-only media (e.g. CD) but allow users to make changes by writing to the upper filesystem stored on read-write media (e.g. flash or disk).

There are a number of details that bedevil developers of unions, however, including various problems with namespace handling, dealing with deleted files and directories, the POSIX definition of readdir(), and so on. None of them are insurmountable, but they are difficult, and it is even harder to implement them in a way that doesn't run afoul of the technical complaints of the VFS maintainers.

Szeredi's approach blends the filesystem-based implementations, like unionfs and aufs, with the VFS-based implementation of union mounts. For file objects, an open() is forwarded to whichever of the two underlying filesystems contains it, while directories are handled by the union filesystem layer. Neil Brown's very helpful first cut at documentation for the patches lumped directory handling in with files, but Szeredi called that a bug. Directory access is never forwarded to the other filesystems and directories need to "come from the union itself for various reasons", he said.
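The forwarding of open() for regular files can be pictured with a small user-space model. This is only a sketch under stated assumptions, not the patchset's code: the Layer class and open_forwarded() are hypothetical names, each tree is reduced to a dictionary of paths, and whiteouts are ignored.

```python
from dataclasses import dataclass, field


@dataclass
class Layer:
    """Hypothetical stand-in for one directory tree: maps path -> contents."""
    name: str
    files: dict = field(default_factory=dict)


def open_forwarded(path, upper, lower):
    """Return (layer name, contents) from whichever layer holds the file.

    The upper layer is consulted first, so it wins when both have the path.
    """
    for layer in (upper, lower):
        if path in layer.files:
            return layer.name, layer.files[path]
    raise FileNotFoundError(path)


upper = Layer("upper", {"/etc/hosts": "edited"})
lower = Layer("lower", {"/etc/hosts": "original", "/bin/sh": "shell"})
```

In this model, a file present only in the lower tree is served from there, while one present in both comes from the upper tree.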

As outlined in Brown's document, most of the action for unions takes place in directories. For one thing, it is more accurate to look at the feature as unioning directory trees, rather than filesystems, as there is no requirement that the two trees reside in separate filesystems. In theory, the lower tree could even be another union, but the current implementation precludes that.

The filesystem used by the upper tree needs to support the "trusted" extended attributes (xattrs) and it must also provide valid d_type (file type) for readdir() responses, which precludes NFS. Whiteouts (that is, files that exist in the lower tree but have been removed in the upper) are handled using the "trusted.union.whiteout" xattr. Similarly, opaque directories, which do not allow entries in the lower tree to "show through", are handled with the "trusted.union.opaque" xattr.

Directory entries are merged with fairly straightforward rules: if there are entries in both the upper and lower layers with the same name, the upper always takes precedence unless both are directories. In that case, a directory in the union is created that merges the entries from each. The initial mount creates a merged directory of the roots of the upper and lower directory trees and subsequent lookups follow the rules, creating merged directories that get cached in the union dentry as needed.
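The merge rules, together with whiteouts and opaque directories, can be modeled in a few lines. What follows is a minimal sketch rather than the kernel implementation: Node and merged_names() are hypothetical names, and the two boolean fields merely stand in for the presence of the "trusted.union.whiteout" and "trusted.union.opaque" xattrs.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical directory entry in one layer."""
    is_dir: bool = False
    whiteout: bool = False  # stands in for the "trusted.union.whiteout" xattr
    opaque: bool = False    # stands in for the "trusted.union.opaque" xattr
    children: dict = field(default_factory=dict)  # name -> Node (directories)


def merged_names(upper_dir, lower_dir):
    """Map each name visible in the union to where it comes from."""
    names = {}
    # Lower entries show through only if the upper directory is not opaque.
    if lower_dir is not None and not upper_dir.opaque:
        for name in lower_dir.children:
            names[name] = "lower"
    for name, node in upper_dir.children.items():
        if node.whiteout:
            names.pop(name, None)  # removed in the union
        elif (node.is_dir and not node.opaque
              and name in names and lower_dir.children[name].is_dir):
            names[name] = "merged"  # both are directories: merge them
        else:
            names[name] = "upper"   # upper takes precedence
    return names


lower = Node(is_dir=True, children={
    "a": Node(), "b": Node(),
    "d": Node(is_dir=True), "e": Node(is_dir=True),
})
upper = Node(is_dir=True, children={
    "a": Node(), "b": Node(whiteout=True),
    "d": Node(is_dir=True), "e": Node(is_dir=True, opaque=True),
    "f": Node(),
})
result = merged_names(upper, lower)
```

Here "b" disappears because of the whiteout, "d" becomes a merged directory, and "e" comes from the upper tree alone because it is opaque.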

Write access to lower layer files is handled by the traditional "copy up" approach. So, opening a lower file for write or changing its metadata will cause the file to be copied to the upper tree. That may require creating any intervening directories if the file is several levels down in a directory tree on the lower layer. Once that's done, though, the hybrid union filesystem has little further interaction with the file, at least directly, because operations are handed off to the upper filesystem.
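Copy up can be illustrated in user space with ordinary directories standing in for the two trees. This is a sketch under assumptions, not the patchset's logic: copy_up() and open_for_write() are hypothetical helpers, and the example path is made up for the demonstration.

```python
import os
import shutil
import tempfile


def copy_up(relpath, upper_root, lower_root):
    """Copy a lower-layer file into the upper tree before it is modified.

    Returns the path of the upper copy; does nothing if it already exists.
    """
    src = os.path.join(lower_root, relpath)
    dst = os.path.join(upper_root, relpath)
    if not os.path.exists(dst):
        # Create any intervening directories missing in the upper tree.
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        shutil.copy2(src, dst)  # copies data and basic metadata
    return dst


def open_for_write(relpath, upper_root, lower_root):
    """Open a file for appending; writes always land in the upper tree."""
    return open(copy_up(relpath, upper_root, lower_root), "a")


with tempfile.TemporaryDirectory() as tmp:
    lower_root = os.path.join(tmp, "lower")
    upper_root = os.path.join(tmp, "upper")
    os.makedirs(os.path.join(lower_root, "etc", "conf.d"))
    os.makedirs(upper_root)
    with open(os.path.join(lower_root, "etc", "conf.d", "net"), "w") as f:
        f.write("base\n")
    with open_for_write(os.path.join("etc", "conf.d", "net"),
                        upper_root, lower_root) as f:
        f.write("local change\n")
    with open(os.path.join(lower_root, "etc", "conf.d", "net")) as f:
        lower_text = f.read()
    with open(os.path.join(upper_root, "etc", "conf.d", "net")) as f:
        upper_text = f.read()
```

After the write, the lower file is untouched and the upper tree holds the modified copy, including the directories that had to be created along the way.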

The patchset is relatively small and makes only a few small changes to the VFS, except for a change to struct inode_operations that ripples through the filesystem tree. The permission() member of that structure currently takes a struct inode *, but the hybrid union filesystem needs to be able to access the filesystem-specific data (d_fsdata) that is stored in the dentry, so it was changed to take a struct dentry * instead. David P. Quigley questioned the need for the change, noting that unionfs and aufs did not require it. Aurora pointed out that union mounts would require something similar and that, along with Brown's documentation, seemed to put the matter to rest.

The rest of the patches make minor changes. The first adds a new struct file_operations member called open_other() that is used to forward open() calls to the upper or lower layers as appropriate. Another allows filesystems to set a FS_RENAME_SELF_ALLOW flag so that rename() will still process renames on the identical dummy inodes that the filesystem uses for non-directories. The bulk of the code (modulo the permission() change) is the new fs/union filesystem itself.

While "union" tends to be used for these kinds of filesystems (or mounts), Brown noted that it is confusing and not really accurate, suggesting that "overlay" be used in its place. Szeredi is not opposed to that, saying that "overlayfs" might make more sense. Aurora more or less concurred, saying that union mounts were called "writable overlays" for one release. The confusion stemming from multiple uses of "union" in existing patches (unionfs, union mounts) may provide additional reason to rename the hybrid union filesystem to overlayfs.

The readdir() semantics are a bit different for the hybrid union as well. Changes to merged directories while they are being read will not appear in the entries returned by readdir(), and offsets returned from telldir() may not return to the same location in a merged directory on subsequent directory opens. The lists of directory entries in merged directories are created and cached on the first readdir() call, with offsets assigned sequentially as they are read. For the most part, these changes are "unlikely to be noticed by many programs", as Brown's documentation says.
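The caching behavior can be modeled as follows. This is an illustrative sketch only: MergedDirHandle is a hypothetical class, the two layers are plain lists of names, and offsets are simply positions in the cached list.

```python
class MergedDirHandle:
    """Toy model of one open of a merged directory."""

    def __init__(self, upper_names, lower_names):
        self._upper = upper_names  # live lists; snapshotted on first readdir()
        self._lower = lower_names
        self._cache = None
        self._pos = 0

    def readdir(self):
        """Return the next entry name, or None at the end of the directory."""
        if self._cache is None:
            # Built once: upper names first, then lower names not yet seen.
            seen = set(self._upper)
            self._cache = list(self._upper) + [
                n for n in self._lower if n not in seen
            ]
        if self._pos >= len(self._cache):
            return None
        name = self._cache[self._pos]
        self._pos += 1
        return name

    def telldir(self):
        """Offsets are sequential positions in this handle's cached list."""
        return self._pos

    def seekdir(self, offset):
        self._pos = offset


upper_names = ["a", "b"]
lower_names = ["b", "c"]
handle = MergedDirHandle(upper_names, lower_names)
first = handle.readdir()
upper_names.append("z")  # a change made while the directory is being read
rest = [handle.readdir(), handle.readdir(), handle.readdir()]
```

Because the list is built on the first readdir() call, the entry added afterward never shows up on this handle, and an offset saved from telldir() is meaningful only relative to this handle's cached list.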

A bigger issue is one that all union implementations struggle with: how to handle changes to either layer that are made outside of the context of the union. If users or administrators directly change the underlying filesystems, there are a number of ugly corner cases. Making the lower filesystem read-only is an attractive solution, but it is non-trivial to enforce, especially for filesystems like NFS.

Szeredi would like to define the problem away or find some way to enforce the requirements that unioning imposes:

> The easiest way out of this mess might simply be to enforce exclusive modification to the underlying filesystems on a local level, same as the union mount strategy. For NFS and other remote filesystems we either

> a) add some way to enforce it,
> b) live with the consequences if not enforced on the system level, or
> c) disallow them to be part of the union.

There was some discussion of the problem, without much in the way of conclusions other than a requirement that changing the trees out from under the union filesystem not cause deadlocks or panics.

In some ways, the hybrid union seems a simpler approach than union mounts. Whether it can pass muster with Al Viro and other filesystem maintainers remains to be seen, however. One way or another, though, some kind of solution to the lack of an overlay/union filesystem in the mainline seems to be getting closer.

Another union filesystem approach

Posted Sep 2, 2010 12:20 UTC (Thu) by liljencrantz (guest, #28458)

From my perspective, if you union mount e.g. an NFS file system and then start modifying the underlying filesystem directly, you deserve every bit of pain coming to you. It makes perfect sense to enforce anything that can be reasonably enforced, such as making sure that local file systems must be mounted read only in order to be part of a union mount, but I fail to see why one should artificially exclude e.g. NFS file systems simply because making those sanity checks isn't possible on a remote file system.

How about having an ounce of trust in the universe; competent sysadmins will get it right, and rooting out incompetent sysadmins quickly is actually a good thing?

Posted Sep 2, 2010 13:19 UTC (Thu) by neilbrown (subscriber, #359)

It isn't about trust, it is about providing predictable behaviour in all circumstances, even weird corner cases... So maybe that is about trust - you should be able to trust the unionfs to behave predictably.

In your taxonomy of sys-admins you forgot to include the brilliant/insane ones who *know* exactly how every union-mount is being used and *know* that a particular file that they want to upgrade isn't being used at the moment so if they replace it on the common underlay then everyone will smoothly see the new content.

To serve their interests you want unionfs to perform predictably in that situation, so that if they try something and it works, then it is likely that it will work again next time. So it is important for unionfs to understand and handle any changes in the underlying fs.

Posted Sep 3, 2010 0:35 UTC (Fri) by dlang (guest, #313)

if your underlying filesystem is a default system image and your union is then a specific system, it makes a huge amount of sense to want the ability to update the underlying filesystem and have everything using a union pick up the changes.

you may be able to unmount the overlay, but then when you re-mount it, how do you know what to do to resolve changes? In some cases you may want the new file from the underlying layer, in some cases you want the modified version from the top layer, and in many cases what you really want is the changes that were made between the old underlying file and the old upper layer to be made to the new underlying file into the new upper layer.

Posted Sep 2, 2010 15:01 UTC (Thu) by dpquigl (guest, #52852)

I'd like to clarify my stance on inode_permission a bit. In this implementation what they want to do would be needed. However something that wasn't captured since Val and I had a brief exchange offlist was that I believe that her proposed implementation is superior to pushing the dentry into inode_permission. She had a new function called path_permission. With the inclusion of the path based hooks in the LSM framework I think if you want to add anything that will be checking permissions based on path we've decided that it should be its own check. That's why adding a path_permission check at the appropriate points in the vfs is a superior situation to pushing the dentry down into the inode operation.

Posted Sep 3, 2010 5:34 UTC (Fri) by neilbrown (subscriber, #359)

Here is a question for you - why should 'readlink' take a dentry while 'permission' only gets the inode?

I don't know either, but given the prevalence of dentry being passed around, it seems hard to justify not letting permission get a dentry.

The core reason that the hybrid unionfs needs permission() to take a dentry is that Miklos chose to store the 'struct union_entry' in the dentry rather than in the inode. It would be fairly straightforward to store that structure in the inode instead, thus removing any need to change 'permission'. However that would require allocating an inode for every active file (rather than just for each directory) which might be seen as a waste of memory.

The concept of "permission checking based on path", while seemingly suggested by the change-log entry for the patch which gives dentry to permission(), is actually irrelevant here.

Posted Sep 11, 2010 19:46 UTC (Sat) by Baylink (guest, #755)

> The easiest way out of this mess might simply be to enforce exclusive modification to the underlying filesystems on a local level, same as the union mount strategy. For NFS and other remote filesystems we either

> a) add some way to enforce it,
> b) live with the consequences if not enforced on the system level, or
> c) disallow them to be part of the union.

d) provide the system administrator -- who *knows* the expected semantics of a given mount -- with a *knob* to select which behavior s/he expects from that particular mount, with a reasonable default.

It's the *default* which must be decided on, not the behavior.