Distributed File System (DFS)
A distributed file system (DFS) is a file system whose data is stored on one or more servers but accessed and processed as if it were stored on the local client machine. A DFS makes it convenient to share information and files among users on a network in a controlled and authorized way: clients store and share files as if the data were local, while the servers retain full control over the data and regulate client access.
Distributed file systems constitute the primary support for data management. They provide an interface for storing information in the form of files and later accessing it for read and write operations. Among the many file-system implementations, only a few specifically address the management of huge quantities of data on a large number of nodes. These file systems mostly constitute the data storage support for large computing clusters, supercomputers, massively parallel architectures, and, lately, storage/computing clouds.
Lustre. The Lustre file system is a massively parallel distributed file system that covers needs ranging from small workgroup clusters to large-scale computing clusters. It is used by several of the Top 500 supercomputing systems, including the system rated the most powerful supercomputer in the June 2012 list. Lustre is designed to provide access to petabytes (PBs) of storage and to serve thousands of clients with an I/O throughput of hundreds of gigabytes per second (GB/s). The system is composed of a metadata server, which contains the metadata about the file system, and a collection of object storage servers, which are in charge of providing storage. Users access the file system via a POSIX-compliant client, which can be either mounted as a kernel module or used through a library. The file system implements a robust failover strategy and recovery mechanism, making server failures and recoveries transparent to clients.
IBM General Parallel File System (GPFS). GPFS [88] is the high-performance distributed file system developed by IBM that provides support for the RS/6000 supercomputer and Linux computing clusters. GPFS is a multiplatform distributed file system built over several years of academic research that provides advanced recovery mechanisms. GPFS is built on the concept of shared disks, in which a collection of disks is attached to the file-system nodes by means of some switching fabric. The file system makes this infrastructure transparent to users and stripes large files over the disk array, replicating portions of each file to ensure high availability. By means of this infrastructure, the system is able to support petabytes of storage, which is accessed at high throughput and without losing consistency of the data. Compared to other implementations, GPFS distributes the metadata of the entire file system and provides transparent access to it, thus eliminating a single point of failure.
Google File System (GFS). GFS [54] is the storage infrastructure that supports the execution of distributed applications in Google’s computing cloud. The system has been designed to be a fault-tolerant, highly available, distributed file system built on commodity hardware and standard Linux operating systems. Rather than a generic implementation of a distributed file system, GFS specifically addresses Google’s needs in terms of distributed storage for applications, and it has been designed with the following assumptions:
• The system is built on top of commodity hardware that often fails.
• The system stores a modest number of large files; multi-GB files are common and should be treated efficiently, and small files must be supported, but there is no need to optimize for them.
• The workloads primarily consist of two kinds of reads: large streaming reads and small random reads.
• The workloads also have many large, sequential writes that append data to files.
• High sustained bandwidth is more important than low latency.
The architecture of the file system is organized into a single master, which contains the metadata of the entire file system, and a collection of chunk servers, which provide storage space. From a logical point of view the system is composed of a collection of software daemons, which implement either the master server or the chunk server. A file is a collection of chunks whose size can be configured at the file-system level. Chunks are replicated on multiple nodes in order to tolerate failures. Clients look up the master server to identify the specific chunks of a file they want to access; once a chunk is identified, the interaction happens directly between the client and the chunk server. Applications interact with the file system through a specific interface supporting the usual operations for file creation, deletion, read, and write. The interface also supports snapshots and record-append operations, which are frequently performed by applications. GFS has been conceived with the understanding that failures in a large distributed infrastructure are common rather than a rarity; therefore, specific attention has been given to implementing a highly available, lightweight, and fault-tolerant infrastructure. The potential single point of failure of the single-master architecture has been addressed by allowing the master node to be replicated on any other node belonging to the infrastructure. Moreover, a stateless daemon and extensive logging capabilities facilitate the system's recovery from failures.
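The read path just described can be pictured with a small sketch. The code below is only an illustration of a GFS-style read, written against made-up Master and ChunkServer stubs; it is not Google's actual API, and the chunk size is shrunk for clarity (GFS uses 64 MB by default).

    # Illustrative sketch only: the client asks the master where a chunk lives,
    # then reads the data directly from a chunk server.
    CHUNK_SIZE = 4   # tiny chunk size for the example; GFS defaults to 64 MB

    class ChunkServer:
        def __init__(self):
            self.chunks = {}                          # chunk handle -> bytes
        def read_chunk(self, handle, offset, n):
            return self.chunks[handle][offset:offset + n]

    class Master:
        def __init__(self):
            self.metadata = {}                        # (path, chunk index) -> (handle, replicas)
        def lookup_chunk(self, path, index):
            return self.metadata[(path, index)]

    def read(master, path, offset, length):
        data = bytearray()
        while length > 0:
            # 1. Metadata lookup at the single master.
            handle, replicas = master.lookup_chunk(path, offset // CHUNK_SIZE)
            # 2. Data transfer directly with a chunk server; the master is no longer involved.
            chunk_offset = offset % CHUNK_SIZE
            n = min(length, CHUNK_SIZE - chunk_offset)
            data += replicas[0].read_chunk(handle, chunk_offset, n)
            offset += n
            length -= n
        return bytes(data)

    # Example wiring: one chunk server holding the two chunks of a single file.
    cs = ChunkServer(); cs.chunks = {"h0": b"hell", "h1": b"o"}
    m = Master(); m.metadata = {("/f", 0): ("h0", [cs]), ("/f", 1): ("h1", [cs])}
    print(read(m, "/f", 0, 5))    # b'hello'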
Sector. Sector [84] is the storage cloud that supports the execution of data-intensive applications defined according to the Sphere framework. It is a user-space file system that can be deployed on commodity hardware across a wide-area network. Compared to other file systems, Sector does not partition a file into blocks but replicates entire files on multiple nodes, allowing users to customize the replication strategy for better performance. The system's architecture comprises four types of nodes: a security server, one or more master nodes, slave nodes, and client machines. The security server maintains all the information about access control policies for users and files, whereas master servers coordinate and serve the I/O requests of clients, which ultimately interact with slave nodes to access files. The protocol used to exchange data with slave nodes is UDT [89], a lightweight connection-oriented protocol optimized for wide-area networks.
Amazon Simple Storage Service (S3). Amazon S3 is the online storage service provided by Amazon. Even though its internal details are not revealed, the system is claimed to support high availability, reliability, scalability, infinite storage, and low latency at commodity cost. The system offers a flat storage space organized into buckets, which are attached to an Amazon Web Services (AWS) account. Each bucket can store multiple objects, each identified by a unique key. Objects are identified by unique URLs and exposed through HTTP, thus allowing very simple get-put semantics. Because of the use of HTTP, there is no need for any specific library to access the storage system, and objects can also be retrieved through the BitTorrent protocol. Despite its simple semantics, a POSIX-like client library has been developed to mount S3 buckets as part of the local file system. Besides the minimal semantics, security is another limitation of S3. The visibility and accessibility of objects are linked to AWS accounts, and the owner of a bucket can decide to make it visible to other accounts or to the public. It is also possible to define authenticated URLs, which provide public access to anyone for a limited (and configurable) period of time.
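As a concrete illustration of the bucket/key organization, the get-put semantics, and authenticated URLs, the snippet below uses the boto3 Python SDK. The bucket and key names are invented for the example, and valid AWS credentials are assumed to be configured in the environment.

    import boto3

    s3 = boto3.client("s3")    # credentials are taken from the environment/configuration

    # Put: store an object under a key inside a bucket (names below are examples only).
    s3.put_object(Bucket="example-bucket", Key="reports/summary.txt", Body=b"hello from S3")

    # Get: retrieve the same object; the body is exposed as a readable stream.
    obj = s3.get_object(Bucket="example-bucket", Key="reports/summary.txt")
    print(obj["Body"].read())

    # Authenticated URL: grant anyone access to this object for a limited time (one hour here).
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "example-bucket", "Key": "reports/summary.txt"},
        ExpiresIn=3600)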
Besides these examples of storage systems, there exist other implementations of distributed file systems and storage clouds whose architecture is similar to the models discussed here. Except for the S3 service, it is possible to sketch, for all the systems presented, a general reference architecture that identifies two major roles into which all the nodes can be classified. Metadata or master nodes contain information about the location of files or file chunks, whereas slave nodes provide direct access to the storage space. The architecture is completed by client libraries, which provide a simple interface for accessing the file system and are, to some extent or completely, compliant with the POSIX specification. Variations of the reference architecture can include the ability to support multiple masters, to distribute the metadata over multiple nodes, or to easily interchange the roles of nodes. The most important aspect common to all these implementations is the ability to provide fault-tolerant and highly available storage.
Two main purposes of using files:
1. Permanent storage of information on a secondary storage medium.
2. Sharing of information between applications.
A file system is a subsystem of the operating system that performs file management activities such as organization, storing, retrieval, naming, sharing, and protection of files.
A file system frees the programmer from concerns about the details of space allocation and layout of the secondary storage device.
The design and implementation of a distributed file system is more complex than that of a conventional file system because the users and storage devices are physically dispersed.
In addition to the functions of the file system of a single-processor system, a distributed file system supports the following:
1. Remote information sharing
Thus any node, irrespective of the physical location of the file, can access the file.
2. User mobility
Users should be permitted to work on different nodes.
3. Availability
For better fault-tolerance, files should be available for use even in the event of temporary failure of one or more nodes of the system. Thus the system should maintain multiple copies of the files, the existence of which should be transparent to the user.
4. Diskless workstations
A distributed file system, with its transparent remote-file accessing capability, allows the use of diskless workstations in a system.
A distributed file system provides the following types of services:
1. Storage service
Allocation and management of space on a secondary storage device, thus providing a logical view of the storage system.
2. True file service
Includes file-sharing semantics, file-caching mechanisms, file replication mechanisms, concurrency control, multi-copy update protocols, etc.
3. Name/Directory service
Responsible for directory-related activities such as creating and deleting directories, adding a new file to a directory, deleting a file from a directory, changing the name of a file, moving a file from one directory to another, etc. (A sketch of such an interface is given after this list.)
(This chapter deals with the design and implementation issues of the true file service component of distributed file systems).
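To make the division of responsibilities concrete, the sketch below outlines a hypothetical interface for the name/directory service; the class and method names are invented for illustration and do not belong to any particular system.

    # Hypothetical name/directory service interface (illustrative only).
    class DirectoryService:
        def make_directory(self, path): ...                           # create a directory
        def remove_directory(self, path): ...                         # delete a directory
        def add_file(self, directory, name, file_id): ...             # add a new file to a directory
        def remove_file(self, directory, name): ...                   # delete a file from a directory
        def rename_file(self, directory, old_name, new_name): ...     # change the name of a file
        def move_file(self, src_directory, name, dst_directory): ...  # move a file between directories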
Desirable features of a distributed file system:
1. Transparency
- Structure transparency
Clients should not know the number or locations of file servers and storage devices. Note: multiple file servers are provided for performance, scalability, and reliability.
- Access transparency
Both local and remote files should be accessible in the same way. The file system should automatically locate an accessed file and transport it to the client's site.
- Naming transparency
The name of the file should give no hint as to the location of the file. The name of the file must not be changed when moving from one node to another.
- Replication transparency
If a file is replicated on multiple nodes, both the existence of multiple copies and their locations should be hidden from the clients.
2. User mobility
Automatically bring the user's environment (e.g., the user's home directory) to the node where the user logs in.
3. Performance
Performance is measured as the average amount of time needed to satisfy client requests. This time includes CPU time + time for accessing secondary storage + network access time. It is desirable that the performance of a distributed file system be comparable to that of a centralized file system.
4. Simplicity and ease of use
The user interface to the file system should be simple, and the number of commands should be as small as possible.
5. Scalability
Growth of nodes and users should not seriously disrupt service.
6. High availability
A distributed file system should continue to function in the face of partial failures such as a link failure, a node failure, or a storage device crash.
A highly reliable and scalable distributed file system should have multiple and independent file servers controlling multiple and independent storage devices.
7. High reliability
Probability of loss of stored data should be minimized. System should automatically generate backup copies of critical files.
8. Data integrity
Concurrent access requests from multiple users competing to access the same file must be properly synchronized by some form of concurrency control mechanism. Atomic transactions can also be provided. (A minimal locking sketch follows this list.)
9. Security
Users should be confident of the privacy of their data.
10. Heterogeneity
There should be easy access to shared data on diverse platforms (e.g., UNIX workstations, Wintel platforms, etc.).
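As a minimal illustration of the concurrency control mentioned under data integrity (feature 8), the sketch below serializes writers with advisory file locking using Python's standard fcntl module. It works only on POSIX systems, it relies on all clients cooperating by taking the lock, and the file name is just an example.

    import fcntl

    # Advisory locking: cooperating clients that all acquire the lock before
    # writing are serialized, so their updates do not interleave.
    with open("/tmp/shared-record.dat", "a+b") as f:
        fcntl.flock(f, fcntl.LOCK_EX)          # block until an exclusive lock is granted
        try:
            f.write(b"one serialized update\n")
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)      # release so other clients can proceed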
File Models
1. Unstructured and Structured files
In the unstructured model, a file is an unstructured sequence of bytes. The interpretation of the meaning and structure of the data stored in the files is up to the application (e.g. UNIX and MS-DOS). Most modern operating systems use the unstructured file model.
In structured files (rarely used now) a file appears to the file server as an ordered sequence of records. Records of different files of the same file system can be of different sizes.
2. Mutable and immutable files
Based on the modifiability criteria, files are of two types, mutable and immutable. Most existing operating systems use the mutable file model. An update performed on a file overwrites its old contents to produce the new contents.
In the immutable model, rather than updating the same file, a new version of the file is created each time a change is made to the file contents and the old version is retained unchanged. The problems in this model are increased use of disk space and increased disk activity.
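A toy sketch of the immutable model is given below: every write creates a new version, and old versions remain readable, which also shows why disk usage grows. It is an in-memory illustration only, not any particular system's implementation.

    # Toy illustration of the immutable file model: updates never overwrite,
    # they create a new version, and old versions remain readable.
    class ImmutableFile:
        def __init__(self, initial=b""):
            self.versions = [initial]                  # version 0
        def write(self, new_contents):
            self.versions.append(new_contents)         # each change creates a new version
            return len(self.versions) - 1              # the new version number
        def read(self, version=-1):
            return self.versions[version]              # default: the latest version

    f = ImmutableFile(b"draft 1")
    f.write(b"draft 2")
    print(f.read(0), f.read())   # b'draft 1' b'draft 2' -- both versions are kept (more disk space)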
File Accessing Models
This depends on the method used for accessing remote files and the unit of data access.
1. Accessing remote files
A distributed file system may use one of the following models to service a client's file access request when the accessed file is remote:
a. Remote service model
Processing of a client's request is performed at the server's node. Thus, the client's request for file access is delivered across the network as a message to the server; the server machine performs the access, and the result is sent back to the client. There is a need to minimize the number of messages sent and the overhead per message.
b. Data-caching model
This model attempts to reduce the network traffic of the previous model by caching the data obtained from the server node. It takes advantage of the locality found in file accesses. A replacement policy such as LRU is used to keep the cache size bounded.
While this model reduces network traffic, it has to deal with the cache coherency problem during writes: when the locally cached copy of the data is modified, the original file at the server node and the copies in any other caches need to be updated as well.
Advantage of Data-caching model over the Remote service model:
The data-caching model offers the possibility of increased performance and greater system scalability because it reduces network traffic, contention for the network, and contention for the file servers. Hence almost all distributed file systems implement some form of caching.
For example, NFS uses the remote service model but adds caching for better performance. (A minimal sketch of a client-side block cache follows.)
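The sketch below illustrates the data-caching model with an LRU-bounded client cache. The fetch_block callable stands in for a real request to the file server and is an assumption of the sketch; cache coherency during writes is deliberately left out.

    from collections import OrderedDict

    # Toy client-side block cache: blocks fetched from the server are kept locally
    # and evicted in LRU order when the cache is full.
    class BlockCache:
        def __init__(self, fetch_block, capacity=4):
            self.fetch_block = fetch_block            # stand-in for a remote request
            self.capacity = capacity
            self.cache = OrderedDict()                # (path, block_no) -> bytes, LRU order
        def read_block(self, path, block_no):
            key = (path, block_no)
            if key in self.cache:                     # cache hit: no network traffic
                self.cache.move_to_end(key)
                return self.cache[key]
            data = self.fetch_block(path, block_no)   # cache miss: one request to the server
            self.cache[key] = data
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)        # evict the least recently used block
            return data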
2. Unit of Data Transfer
In file systems that use the data-caching model, an important design issue is deciding the unit of data transfer. This refers to the fraction of a file that is transferred to and from clients as a result of a single read or write operation.
File-level transfer model
In this model, when file data is to be transferred, the entire file is moved. Advantages: a file needs to be transferred only once in response to a client request, which is more efficient than transferring it page by page with the attendant network protocol overhead. It reduces server load and network traffic because the server is accessed only once per file, and thus scales better. Once the entire file is cached at the client site, it is immune to server and network failures.
Disadvantage: it requires sufficient storage space on the client machine. The approach fails for very large files, especially when the client runs on a diskless workstation, and if only a small fraction of a file is needed, moving the entire file is wasteful.
Block-level transfer model
File transfer takes place in file blocks. A file block is a contiguous portion of a file and is of fixed length (it can also be chosen equal to the virtual memory page size).
Advantages: Does not require client nodes to have large storage space. It eliminates the need to copy an entire file when only a small portion of the data is needed.
Disadvantages: When an entire file is to be accessed, multiple server requests are needed, resulting in more network traffic and more network protocol overhead. NFS uses the block-level transfer model. (A short sketch of block-by-block reading appears after this list.)
Byte-level transfer model
The unit of transfer is the byte. This model provides maximum flexibility because it allows storage and retrieval of an arbitrary amount of a file, specified by an offset within the file and a length. The drawback is that cache management is harder because different access requests involve data of variable length.
Record-level transfer model
This model is used with structured files and the unit of transfer is the record.
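The following sketch illustrates the block-level transfer model referred to above: a single logical read may translate into several requests to the server, one per block. The fetch_block callable is again a stand-in for one server request, and the block size chosen here is arbitrary.

    BLOCK_SIZE = 8192   # fixed-length file blocks; often chosen equal to the VM page size

    def read_range(fetch_block, path, offset, length):
        """Block-level transfer: one logical read may require several block requests."""
        data = bytearray()
        first = offset // BLOCK_SIZE
        last = (offset + length - 1) // BLOCK_SIZE
        for block_no in range(first, last + 1):       # one server request per block
            data += fetch_block(path, block_no)
        start = offset - first * BLOCK_SIZE           # trim to the requested byte range
        return bytes(data[start:start + length])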
File-Sharing Semantics
Multiple users may access a shared file simultaneously. An important design issue for any file system is to define when modifications of file data made by a user are observable by other users.
UNIX semantics:
This enforces an absolute time ordering on all operations and ensures that every read operation on a file sees the effects of all previous write operations performed on that file.
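A toy sketch of UNIX semantics: if all operations on a file are applied in a single absolute order, every read returns the contents produced by all preceding writes. The class below is an in-memory illustration only, not a real file server.

    # Toy server enforcing UNIX semantics: operations are applied in arrival order,
    # so a read always reflects every write that completed before it.
    class UnixSemanticsFile:
        def __init__(self):
            self.contents = bytearray()
        def write(self, offset, data):                 # applied immediately, in order
            self.contents[offset:offset + len(data)] = data
        def read(self):
            return bytes(self.contents)                # sees all previous writes

    f = UnixSemanticsFile()
    f.write(0, b"hello")
    print(f.read())     # b'hello' -- the read observes the preceding write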