Another union filesystem approach
Creating a union of two (or more) filesystems is a commonly requested feature for Linux that has never made it into the mainline. Various implementations have been tried (see part 1 and part 2 of Valerie Aurora's survey from early 2009), but none has crossed the threshold for inclusion. Of late, union mounts have been making some progress, but there is still work to do there. A hybrid approach, incorporating both filesystem- and VFS-based techniques, has recently been posted in an RFC patchset by Miklos Szeredi.
The idea behind unioning filesystems is quite simple, but the devil is in the details. In a union, one filesystem is mounted "atop" another, with the contents of both filesystems appearing to be in a single filesystem encompassing both. Changes made to the filesystem are reflected in the "upper" filesystem, and the "lower" filesystem is treated as read-only. One common use case is to have a filesystem on read-only media (e.g. CD) but allow users to make changes by writing to the upper filesystem stored on read-write media (e.g. flash or disk).
There are a number of details that bedevil developers of unions, however, including various problems with namespace handling, dealing with deleted files and directories, the POSIX definition of readdir(), and so on. None of them are insurmountable, but they are difficult, and it is even harder to implement them in a way that doesn't run afoul of the technical complaints of the VFS maintainers.
Szeredi's approach blends the filesystem-based implementations, like unionfs and aufs, with the VFS-based implementation of union mounts. For file objects, an open() is forwarded to whichever of the two underlying filesystems contains it, while directories are handled by the union filesystem layer. Neil Brown's very helpful first cut at documentation for the patches lumped directory handling in with files, but Szeredi called that a bug. Directory access is never forwarded to the other filesystems and directories need to "come from the union itself for various reasons", he said.
As outlined in Brown's document, most of the action for unions takes place in directories. For one thing, it is more accurate to look at the feature as unioning directory trees, rather than filesystems, as there is no requirement that the two trees reside in separate filesystems. In theory, the lower tree could even be another union, but the current implementation precludes that.
The filesystem used by the upper tree needs to support the "trusted" extended attributes (xattrs) and it must also provide a valid d_type (file type) in readdir() responses, which precludes NFS. Whiteouts, that is, files that exist in the lower tree but have been removed in the upper, are handled using the "trusted.union.whiteout" xattr. Similarly, opaque directories, which do not allow entries in the lower tree to "show through", are handled with the "trusted.union.opaque" xattr.
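As a rough illustration of how whiteouts and opaque directories affect what a user sees, here is a toy Python model of a single directory. Only the two xattr names come from the patchset as described above; the function name `visible_names` and the use of plain sets in place of real on-disk xattrs (and their values) are assumptions made for this sketch.

```python
# Toy model only: xattrs are represented as sets of names, not read from disk.
WHITEOUT = "trusted.union.whiteout"
OPAQUE = "trusted.union.opaque"


def visible_names(upper_entries, lower_names, upper_dir_xattrs):
    """Return the sorted names visible in one unioned directory.

    upper_entries:    dict mapping name -> set of xattr names on that upper entry
    lower_names:      set of names present in the lower directory
    upper_dir_xattrs: set of xattr names on the upper directory itself
    """
    # An opaque upper directory lets nothing from the lower directory show through.
    shown_lower = set() if OPAQUE in upper_dir_xattrs else set(lower_names)
    visible = set()
    for name, xattrs in upper_entries.items():
        if WHITEOUT in xattrs:
            # A whiteout is not listed itself and hides the lower entry of that name.
            shown_lower.discard(name)
        else:
            visible.add(name)
    return sorted(visible | shown_lower)
```

With an upper directory holding a regular file "a" and a whiteout for "b", and a lower directory holding "b" and "c", the model lists "a" and "c"; marking the upper directory opaque leaves only "a".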
Directory entries are merged with fairly straightforward rules: if there are entries in both the upper and lower layers with the same name, the upper always takes precedence unless both are directories. In that case, a directory in the union is created that merges the entries from each. The initial mount creates a merged directory of the roots of the upper and lower directory trees and subsequent lookups follow the rules, creating merged directories that get cached in the union dentry as needed.
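The precedence rules can be sketched with nested dictionaries standing in for directory trees (a dict is a directory, a string is a file's contents). This is a simplified model, not the kernel code: the names `lookup` and `resolve` are hypothetical, whiteouts and opaque directories are ignored, and nothing is cached.

```python
def lookup(upper, lower, name):
    """Resolve one name inside a directory that may exist in either layer."""
    up = upper.get(name) if isinstance(upper, dict) else None
    low = lower.get(name) if isinstance(lower, dict) else None
    if isinstance(up, dict) and isinstance(low, dict):
        return "merged", up, low      # both are directories: merge them
    if up is not None:
        return "upper", up, None      # upper hides any lower entry of that name
    if low is not None:
        return "lower", None, low
    raise FileNotFoundError(name)


def resolve(upper, lower, path):
    """Walk a slash-separated path from the merged root, one lookup per component."""
    layer = "merged"
    for name in path.strip("/").split("/"):
        layer, upper, lower = lookup(upper, lower, name)
    return layer, upper, lower
```

Note that once a path component resolves to the upper layer alone (for example, an upper file shadowing a lower directory), nothing beneath the lower entry of that name is reachable any more.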
Write access to lower layer files is handled by the traditional "copy up" approach. So, opening a lower file for write or changing its metadata will cause the file to be copied to the upper tree. That may require creating any intervening directories if the file is several levels down in a directory tree on the lower layer. Once that's done, though, the hybrid union filesystem has little further interaction with the file, at least directly, because operations are handed off to the upper filesystem.
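A user-space approximation of copy-up, using two ordinary directories as the "lower" and "upper" trees, might look like the following. The helper names `copy_up` and `open_for_write` are invented for this sketch; a real implementation would also have to deal with directory metadata, xattrs, and atomicity, none of which are modeled here.

```python
import os
import shutil


def copy_up(lower_root, upper_root, rel_path):
    """Ensure rel_path exists in the upper tree, copying it from the lower if needed."""
    src = os.path.join(lower_root, rel_path)
    dst = os.path.join(upper_root, rel_path)
    if os.path.lexists(dst):
        return dst  # already copied up; the upper copy takes precedence
    # Create any intervening directories in the upper tree.
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    # copy2 copies the data plus basic metadata (mode bits, timestamps).
    shutil.copy2(src, dst)
    return dst


def open_for_write(lower_root, upper_root, rel_path):
    """Open a file for writing; all writes land in the upper tree."""
    return open(copy_up(lower_root, upper_root, rel_path), "r+b")
```

After the first write-open, the lower file is untouched and every later operation on that path goes straight to the upper copy.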
The patchset is relatively small and makes only a few small changes to the VFS, with one exception: a change to struct inode_operations that ripples through the filesystem tree. The permission() member of that structure currently takes a struct inode *, but the hybrid union filesystem needs to be able to access the filesystem-specific data (d_fsdata) that is stored in the dentry, so it was changed to take a struct dentry * instead. David P. Quigley questioned the need for the change, noting that unionfs and aufs did not require it. Aurora pointed out that union mounts would require something similar and that, along with Brown's documentation, seemed to put the matter to rest.
The rest of the patches make minor changes. The first adds a new struct file_operations member called open_other() that is used to forward open() calls to the upper or lower layers as appropriate. Another allows filesystems to set a FS_RENAME_SELF_ALLOW flag so that rename() will still process renames on the identical dummy inodes that the filesystem uses for non-directories. The bulk of the code (modulo the permission() change) is the new fs/union filesystem itself.
While "union" tends to be used for these kinds of filesystems (or mounts), Brown noted that it is confusing and not really accurate, suggesting that "overlay" be used in its place. Szeredi is not opposed to that, saying that "overlayfs" might make more sense. Aurora more or less concurred, saying that union mounts were called "writable overlays" for one release. The confusion stemming from multiple uses of "union" in existing patches (unionfs, union mounts) may provide additional reason to rename the hybrid union filesystem to overlayfs.
The readdir() semantics are a bit different for the hybrid union as well. Changes to merged directories while they are being read will not appear in the entries returned by readdir(), and offsets returned from telldir() may not return to the same location in a merged directory on subsequent directory opens. The lists of directory entries in merged directories are created and cached on the first readdir() call, with offsets assigned sequentially as they are read. For the most part, these changes are "unlikely to be noticed by many programs", as Brown's documentation says.
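The caching behavior can be modeled with a small Python class. This is only a sketch under stated assumptions: the class name `MergedDirHandle`, the "upper entries first, then unseen lower entries" ordering, and the use of list indexes as offsets are all invented here, and whiteouts are ignored.

```python
class MergedDirHandle:
    """Models one open of a merged directory with a per-open cached listing."""

    def __init__(self, upper_names, lower_names):
        self._upper = upper_names  # live lists; they may change after the open
        self._lower = lower_names
        self._cache = None         # built lazily on the first readdir()
        self._pos = 0

    def readdir(self):
        if self._cache is None:
            # Snapshot the merged listing once; later changes are not seen.
            seen = set()
            self._cache = []
            for name in list(self._upper) + list(self._lower):
                if name not in seen:
                    seen.add(name)
                    self._cache.append(name)
        if self._pos >= len(self._cache):
            return None            # end of directory
        name = self._cache[self._pos]
        self._pos += 1
        return name

    def telldir(self):
        return self._pos           # offsets are just sequential positions

    def seekdir(self, offset):
        self._pos = offset
```

Because each open builds its own list, an offset saved from one handle can land on a different entry when used with a later handle if the directory changed in between, which mirrors the telldir() caveat above.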
A bigger issue is one that all union implementations struggle with: how to handle changes to either layer that are done outside of the context of the union. If users or administrators directly change the underlying filesystems, there are a number of ugly corner cases. Making the lower filesystem be read-only is an attractive solution, but it is non-trivial to enforce, especially for filesystems like NFS.
Szeredi would like to define the problem away or find some way to enforce the requirements that unioning imposes:
a) add some way to enforce it,
b) live with the consequences if not enforced on the system level, or
c) disallow them to be part of the union.
There was some discussion of the problem, without much in the way of conclusions other than a requirement that changing the trees out from under the union filesystem not cause deadlocks or panics.
In some ways, hybrid union seems a simpler approach than union mounts. Whether it can pass muster with Al Viro and other filesystem maintainers remains to be seen, however. One way or another, though, some kind of solution to the lack of an overlay/union filesystem in the mainline seems to be getting closer.
Index entries for this article | |
---|---|
Kernel | Filesystems/Union |
Kernel | Overlayfs |
Posted Sep 2, 2010 12:20 UTC (Thu) by liljencrantz (guest, #28458) (2 responses)
How about having an ounce of trust in the universe; competent sysadmins will get it right, and rooting out incompetent sysadmins quickly is actually a good thing?
Posted Sep 2, 2010 13:19 UTC (Thu) by neilbrown (subscriber, #359)
In your taxonomy of sys-admins you forgot to include the brilliant/insane ones who *know* exactly how every union-mount is being used and *know* that a particular file that they want to upgrade isn't being used at the moment, so if they replace it on the common underlay then everyone will smoothly see the new content.
To serve their interests you want unionfs to perform predictably in that situation, so that if they try something and it works, then it is likely that it will work again next time. So it is important for unionfs to understand and handle any changes in the underlying fs.
Posted Sep 3, 2010 0:35 UTC (Fri) by dlang (guest, #313)
you may be able to unmount the overlay, but then when you re-mount it, how do you know what to do to resolve changes? In some cases you may want the new file from the underlying layer, in some cases you want the modified version from the top layer, and in many cases what you really want is the changes that were made between the old underlying file and the old upper layer to be made to the new underlying file into the new upper layer.
Posted Sep 2, 2010 15:01 UTC (Thu) by dpquigl (guest, #52852) (1 response)
Posted Sep 3, 2010 5:34 UTC (Fri) by neilbrown (subscriber, #359)
I don't know either, but given the prevalence of dentry being passed around, it seems hard to justify not letting permission get a dentry.
The core reason that the hybrid unionfs needs permission() to take a dentry is that Miklos chose to store the 'struct union_entry' in the dentry rather than in the inode. It would be fairly straightforward to store that structure in the inode instead, thus removing any need to change 'permission'. However, that would require allocating an inode for every active file (rather than just for each directory), which might be seen as a waste of memory.
The concept of "permission checking based on path", while seemingly suggested by the change-log entry for the patch which gives dentry to permission(), is actually irrelevant here.
Posted Sep 11, 2010 19:46 UTC (Sat) by Baylink (guest, #755)
> a) add some way to enforce it,
> b) live with the consequences if not enforced on the system level, or
> c) disallow them to be part of the union.
d) provide the system administrator -- who *knows* the expected semantics of a given mount -- with a *knob* to select which behavior s/he expects from that particular mount, with a reasonable default.
It's the *default* which must be decided on, not the behavior.