Introduction
Distributed file systems provide persistent storage of unstructured data, organized in a hierarchical namespace of files that is shared among networked nodes. Files are explicitly created, and they can survive the lifetime of processes and nodes until they are explicitly deleted. As such, they can be seen as the glue of a distributed computing infrastructure. Distributed file systems resemble local file systems in their API: to applications, it should be transparent whether data is stored on a local or on a distributed file system. This data model and the interface to applications distinguish distributed file systems from other types of distributed storage such as databases.

Virtually all physics experiments store their data in distributed file systems. Large experiment collaborations, such as those at the Large Hadron Collider (LHC), store data in a global federation of various cluster file systems rather than in a single, globally distributed file system. For the LHC experiments, such globally federated and accessible storage sums up to more than a billion files and several hundred petabytes. There is a variety of file systems available to choose from [1–14], and often it is not clear what the particular strengths, weaknesses, and implications of using one distributed file system over another are. Several previous studies presented taxonomies, case studies, and performance comparisons of distributed file systems [15–20]. This survey focuses on the underlying building blocks of distributed file systems and on what to expect from them with respect to physics applications.
How are Distributed File Systems Used?
Even though the file system interface is general and fits a broad spectrum of applications, most distributed file system implementations are optimized for a particular class of applications. For instance, the Andrew File System (AFS) is optimized for users' home directories [2], XrootD is optimized for high-throughput access to high-energy physics data sets [7], the Hadoop File System (HDFS) is designed as a storage layer for the MapReduce framework [10, 21], the CernVM File System (CVMFS) is optimized to distribute software binaries [12], and Lustre is optimized as a scratch space for cooperating applications on supercomputers [5]. These use cases differ both quantitatively and qualitatively. Consider a multi-dimensional vector that describes, for a particular class of data, properties and requirements such as data value, data confidentiality, redundancy, volume, median file size, change frequency, and request rate. Every single use case above poses high requirements in only some of the dimensions. All of the use cases combined, however, would require a distributed file system with outstanding performance in every dimension. Moreover, some requirements contradict each other: a high level of redundancy (e.g., for recorded experiment data) inevitably reduces the write throughput in cases where redundancy is not needed (e.g., for a scratch area).

The file system interface provides no standard way to specify quality-of-service properties for particular files or directories. Instead, we have to resort to using a number of distributed file systems, each with implicit quality-of-service guarantees and mounted at a well-known location (/afs, /eos, /cvmfs, /data, /scratch, ...). Quantitative file system studies, which are unfortunately rare, provide precise workload characterizations to guide file system implementers [22–24].

2.1. Application Integration

From the point of view of applications, there are different levels of integration a distributed file system can provide. Most file systems provide a library whose interface closely resembles that of POSIX file systems. The advantage is that the interface can be adapted to the use case at hand. For instance, the Google File System (GFS) extends POSIX semantics by an atomic append [25], a feature particularly useful for the merging phase of MapReduce jobs. A library interface comes at the cost of transparency; applications need to be developed and compiled for a particular distributed file system.

So-called interposition systems introduce a layer of indirection that transparently redirects file system calls of an application to a library or an external process. The Parrot system creates a sandbox around a user-space process and intercepts its system calls [26]. The Fuse kernel-level file system redirects file system calls to a special user-land process ("upcall") [27]. Interposition systems come with a performance penalty for the indirection layer. This penalty is smaller for Fuse than for a pure user-level interposition system, but Fuse requires cooperation from the kernel. Some distributed file systems are implemented as an extension of the operating system kernel (e.g., NFS [1], AFS, Lustre). That can provide better performance compared to interposition systems, but the deployment is difficult and implementation errors typically crash the operating system kernel.
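To make the upcall mechanism concrete, the following sketch implements a minimal read-only pass-through file system in user space, assuming the third-party fusepy bindings; the source directory and mount point are placeholders chosen for the example, and real distributed file system clients are of course far more elaborate.

```python
#!/usr/bin/env python3
# Minimal read-only pass-through file system (sketch), assuming the
# third-party fusepy package. Every call on the mount point is routed
# by the kernel's FUSE module to this user-land process ("upcall").
import os
import sys
from fuse import FUSE, Operations

class Passthrough(Operations):
    def __init__(self, root):
        self.root = root

    def _full(self, path):
        return os.path.join(self.root, path.lstrip("/"))

    def getattr(self, path, fh=None):
        st = os.lstat(self._full(path))
        return {key: getattr(st, key) for key in (
            "st_mode", "st_nlink", "st_size", "st_uid", "st_gid",
            "st_atime", "st_mtime", "st_ctime")}

    def readdir(self, path, fh):
        return [".", ".."] + os.listdir(self._full(path))

    def read(self, path, size, offset, fh):
        with open(self._full(path), "rb") as f:
            f.seek(offset)
            return f.read(size)

if __name__ == "__main__":
    # usage: ./passthrough.py <source directory> <mount point>
    FUSE(Passthrough(sys.argv[1]), sys.argv[2], foreground=True, ro=True)
```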
Distributed file systems do not fully comply with the POSIX file system standard, and each of them therefore needs to be tested with real applications. Functionality that is often poorly supported includes file locking, atomic renaming of files and directories, multiple hard links, and deletion of open files. Sometimes the deviations from the POSIX standard are subtle: HDFS, for instance, writes file sizes asynchronously and thus reports the real size of a file only some time after the file has been written.
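Such deviations can be probed directly on a mounted file system. The sketch below checks two of the behaviors listed above, deletion of open files and atomic replacement by rename(); the mount point /data is a placeholder for whichever file system is under test.

```python
import os
import tempfile

def check_delete_open(dirpath):
    """POSIX: an unlinked file stays readable through an already open descriptor."""
    fd, name = tempfile.mkstemp(dir=dirpath)
    try:
        os.write(fd, b"payload")
        os.unlink(name)
        os.lseek(fd, 0, os.SEEK_SET)
        return os.read(fd, 7) == b"payload"
    except OSError:
        return False
    finally:
        os.close(fd)

def check_atomic_rename(dirpath):
    """POSIX: rename() atomically replaces an existing target file."""
    src = os.path.join(dirpath, "rename_src")
    dst = os.path.join(dirpath, "rename_dst")
    with open(src, "w") as f:
        f.write("new")
    with open(dst, "w") as f:
        f.write("old")
    os.rename(src, dst)
    with open(dst) as f:
        content = f.read()
    os.unlink(dst)
    return content == "new"

if __name__ == "__main__":
    mount = "/data"  # placeholder: mount point of the file system under test
    print("delete while open:", check_delete_open(mount))
    print("atomic rename:    ", check_atomic_rename(mount))
```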
Architecture Evolution
The simplest architecture for a distributed file system is a single server that exports a local directory tree to a number of clients (e.g., NFSv3). This architecture is obviously limited by the capabilities of the exporting server. An approach to overcome some of these limitations is to delegate ownership of and responsibility for certain file system subtrees to different servers, as done by AFS. In order to provide access to remote servers, AFS allows for loose coupling of multiple file system trees ("cells"). Across cells, this architecture is not network-transparent: moving a file from one cell to another requires a change of path. It also involves a copy through the node that triggers the move, i.e., a move is not a namespace-only operation. Furthermore, the partitioning of the file system tree is static, and changing it requires administrative intervention.

In object-based file systems, data management and meta-data management are separated (e.g., GFS). Files are spread over a number of servers that handle read and write operations. A meta-data server maintains the directory tree and takes care of data placement. As long as the meta-data load is much smaller than the data load (i.e., files are large), this architecture allows for incremental scaling: as the load increases, data servers can be added one by one with minimal administrative overhead. The architecture is refined by parallel file systems (e.g., Lustre), which cut every file into small blocks and distribute the blocks over many nodes. Thus read and write operations are executed in parallel on multiple servers for better maximum throughput.

A distributed meta-data architecture (as, for instance, in Ceph [9]) overcomes the bottleneck of the central meta-data server. Distributed meta-data handling is more complex than distributed data handling because all meta-data servers are closely coupled and need to agree on a single state of the file system tree. That involves either distributed consensus algorithms [28, 29] or distributed transaction protocols such as two-phase commit.

Object-based architectures involve two servers (a meta-data server and a data server) for read and write operations. Instead of asking a meta-data server, decentralized or peer-to-peer file systems (e.g., GlusterFS [13]) let clients compute the location of data and meta-data by means of a distributed hash table. Zero-hop routing in a distributed hash table is restricted to a local network, however, in which every node is aware of every other node. On a global scale, tree-based routing, as done by XrootD, is simpler to implement and shows better lookup performance than globally distributed hash tables. Furthermore, high peer churn (servers that frequently join and leave the network) poses a hard challenge for distributed hash tables.
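The placement computation of such decentralized systems can be sketched with a consistent hash ring: every server owns a set of points on the ring, and a file path is stored on the server whose point follows the hash of the path. The server names and the number of virtual points below are invented for the example; real systems, such as GlusterFS's elastic hashing, differ in detail.

```python
import hashlib
from bisect import bisect

def ring(servers, points_per_server=64):
    """Place several virtual points per server on a hash ring."""
    points = []
    for server in servers:
        for i in range(points_per_server):
            h = int(hashlib.sha1(f"{server}#{i}".encode()).hexdigest(), 16)
            points.append((h, server))
    return sorted(points)

def locate(points, path):
    """A client computes the responsible server locally, with no meta-data lookup."""
    h = int(hashlib.sha1(path.encode()).hexdigest(), 16)
    idx = bisect([p[0] for p in points], h) % len(points)
    return points[idx][1]

if __name__ == "__main__":
    servers = ["ds01.example.org", "ds02.example.org", "ds03.example.org"]
    r = ring(servers)
    print(locate(r, "/store/run00042/events.root"))
```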
3.1. Decomposition

There is a tendency towards decomposition and modularization in distributed file systems. Examples are the offloading of authorization to Kerberos in AFS, the offloading of distributed consensus to Chubby [30] in GFS (resp. ZooKeeper [31] in HDFS), or the layered implementation of Ceph with the independent RADOS key-value store as a building block beneath the file system. Another example is the separation of the distributed file system namespace from data access. In the grid, for instance, the namespace is controlled by the experiments' file catalogs, which, in combination with grid middleware, federate globally distributed cluster file systems.

For a globally distributed and administratively independent computing infrastructure such as the one used in high-energy physics, modularization is important because it allows for the deployment of incremental improvements. A completely new file system takes many years to stabilize and roll out throughout the grid.

4. Mechanisms and Techniques

A distributed file system should be fast and it should scale to many files, users, and nodes. At the same time, it should sustain hardware faults, recover gracefully from them, and ensure the integrity of the file system over long storage periods and long-distance network links. This section highlights techniques that are used to achieve these goals and that are particularly relevant for distributed file systems in high-energy physics.
File System Integrity
Global file systems often need to transfer data via untrusted connections and still ensure the integrity and authenticity of the data. Cryptographic hashes of the content of files are often used to ensure data integrity. Cryptographic hashes provide a short, fixed-length, unique identifier for data of any size. Collisions are virtually impossible, whether by chance or by deliberate crafting, which makes cryptographic hashes a means to protect against data tampering. Many globally distributed file systems use cryptographic hashes in the form of content-addressable storage [12, 32–34], where the name of a file is derived from its cryptographic content hash. This allows for verification of the data independently of the meta-data. It also results in immutable data, which eliminates the problem of detecting stale cache entries and keeping caches consistent. Furthermore, redundant data and duplicated files are automatically de-duplicated, which in some use cases (backups, scientific software binaries) reduces the actual storage space utilization by a large factor.
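A minimal sketch of content-addressable storage is shown below: the object name is derived from the SHA-256 hash of the file content, identical content is stored only once, and every object can be verified independently of any meta-data. The store directory and the two-level fan-out layout are placeholders for the example, not the layout of any particular system.

```python
import hashlib
import os
import shutil

STORE = "/var/lib/cas"  # placeholder for the object store root

def content_hash(path):
    """SHA-256 of the file content, computed in 1 MB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def put(path):
    """Store a file under a name derived from its content hash (de-duplicating)."""
    digest = content_hash(path)
    obj = os.path.join(STORE, digest[:2], digest[2:])
    if not os.path.exists(obj):  # identical content is stored only once
        os.makedirs(os.path.dirname(obj), exist_ok=True)
        shutil.copyfile(path, obj)
    return digest

def verify(digest):
    """Check an object against its own name, independently of any meta-data."""
    obj = os.path.join(STORE, digest[:2], digest[2:])
    return content_hash(obj) == digest
```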