Homework-6: A review about technologies of distributed file systems
👤 Author: ahasan4277gmailcom, 2019-05-15 08:31:38
Distributed file systems provide persistent storage of unstructured data, organized in a hierarchical namespace of files that is shared among networked nodes. Files are explicitly created, and they can outlive the processes and nodes that created them until they are explicitly deleted. As such, they can be seen as the glue of a distributed computing infrastructure. Distributed file systems resemble local file systems in their API: to applications, it should be transparent whether data is stored on a local file system or on a distributed file system. This data model and the interface to applications distinguish distributed file systems from other types of distributed storage, such as databases.
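To make the transparency point concrete, the following sketch shows an application reading files through the ordinary Python file API; whether a path resolves to local disk or to a mounted distributed file system is invisible to the code. The mount paths and file names below are illustrative assumptions, not references to real deployments.

```python
import os

# The application code is identical for a local path and a path exposed by a
# distributed file system through the kernel (e.g. via a kernel client or FUSE).
# These paths are hypothetical examples for illustration only.
PATHS = [
    "/tmp/local-example.txt",            # local file system
    "/cvmfs/sw.example.org/setup.sh",    # hypothetical CVMFS mount point
    "/eos/experiment/data/run001.root",  # hypothetical EOS mount point
]

def read_first_bytes(path: str, n: int = 64) -> bytes:
    """Read up to n bytes with the standard file API, regardless of
    where the backing storage actually lives."""
    with open(path, "rb") as f:
        return f.read(n)

if __name__ == "__main__":
    for p in PATHS:
        if os.path.exists(p):
            print(p, "->", read_first_bytes(p)[:16])
        else:
            print(p, "-> not mounted on this node")
```

The point of the sketch is that the application never needs to know which storage system backs a given path; that decision is made by whoever mounts the file systems on the node.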
Virtually all physics experiments store their data in distributed file systems. Large experiment collaborations, such as those at the Large Hadron Collider (LHC), store data in a global federation of various cluster file systems rather than in a single, globally distributed file system. For LHC experiments, this globally federated and accessible storage adds up to more than 1 billion files and several hundred petabytes. A variety of file systems is available to choose from [1–14], and it is often not clear what the particular strengths, weaknesses, and implications of using one distributed file system over another are. Several previous studies have presented taxonomies, case studies, and performance comparisons of distributed file systems. This survey focuses on the underlying building blocks of distributed file systems and on what to expect from them with respect to physics applications.
How are Distributed File Systems Used?
Even though the file system interface is general and fits a broad spectrum of applications, most distributed file system implementations are optimized for a particular class of applications. For instance, the Andrew File System (AFS) is optimized for users' home directories, XrootD is optimized for high-throughput access to high-energy physics data sets, the Hadoop File System (HDFS) is designed as a storage layer for the MapReduce framework [10,21], the CernVM File System (CVMFS) is optimized for distributing software binaries, and Lustre is optimized as a scratch space for cooperating applications on supercomputers [5]. These use cases differ both quantitatively and qualitatively. Consider a multi-dimensional vector that describes the properties or requirements of a particular class of data: data value, data confidentiality, redundancy, volume, median file size, change frequency, and request rate. Each of the use cases above poses high requirements in only some of these dimensions. All of the use cases combined, however, would require a distributed file system with outstanding performance in every dimension. Moreover, some requirements contradict each other: a high level of redundancy (e.g. for recorded experiment data) inevitably reduces write throughput in cases where redundancy is not needed (e.g. for a scratch area). The file system interface provides no standard way to specify quality-of-service properties for particular files or directories. Instead, we have to resort to a number of distributed file systems, each with implicit quality-of-service guarantees and mounted at a well-known location (/afs, /eos, /cvmfs, /data, /scratch, ...). Quantitative file system studies, which are unfortunately rare, provide precise workload characterizations to guide file system implementers.
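One way to picture the multi-dimensional requirement vector described above is as a small data structure, one vector per use case. The sketch below is a hypothetical Python illustration: the dimension names come from the text, but the numeric ratings are invented purely to show why the union of all use cases would demand a system that excels in every dimension at once.

```python
from dataclasses import dataclass

@dataclass
class Requirements:
    """Qualitative rating (1 = low, 5 = high) per dimension named in the text.
    The concrete numbers used below are illustrative assumptions, not measurements."""
    data_value: int
    confidentiality: int
    redundancy: int
    volume: int
    median_file_size: int
    change_frequency: int
    request_rate: int

# Hypothetical ratings for the use cases mentioned above.
use_cases = {
    "home directories (AFS)":        Requirements(3, 4, 3, 2, 1, 4, 2),
    "experiment data (XrootD)":      Requirements(5, 2, 5, 5, 4, 1, 4),
    "software distribution (CVMFS)": Requirements(3, 1, 3, 2, 1, 2, 5),
    "scratch space (Lustre)":        Requirements(1, 1, 1, 4, 3, 5, 4),
}

# A single file system serving *all* use cases would have to meet the maximum
# requirement in every dimension simultaneously.
dims = Requirements.__dataclass_fields__.keys()
combined = {d: max(getattr(r, d) for r in use_cases.values()) for d in dims}
print("combined requirement per dimension:", combined)
```

The combined vector printed at the end illustrates the text's argument: no single system scores highest everywhere, which is why in practice several specialized file systems are mounted side by side at well-known locations, each carrying its own implicit quality-of-service guarantees.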