Homework 7: Write an article about GFS and HDFS, and their differences
Author: aeonorbitgmailcom, 2019-04-30 09:14:59
2.2 Snapshot A snapshot is an integral function of the Google File System that supports consistency control. It lets a user create a copy of a file or a directory tree almost instantaneously, without interrupting ongoing mutations. Snapshots are commonly used to checkpoint the current state before experimenting with data that can later be committed or rolled back.
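The reason such snapshots are instantaneous is copy-on-write: only metadata is duplicated at snapshot time, and shared chunks are copied lazily on the first subsequent mutation. The following is a minimal sketch of that idea; the class and method names are illustrative assumptions, not GFS internals.

```python
# Copy-on-write snapshot sketch. A snapshot copies only the file's list of
# chunk handles and bumps reference counts; chunk data is duplicated only
# when a shared chunk is first mutated.

class ChunkStore:
    def __init__(self):
        self.chunks = {}      # chunk handle -> data
        self.refcount = {}    # chunk handle -> number of files referencing it

    def write(self, handle, data):
        self.chunks[handle] = data
        self.refcount[handle] = 1

class FileSystem:
    def __init__(self, store):
        self.store = store
        self.files = {}       # filename -> list of chunk handles

    def snapshot(self, src, dst):
        # Metadata-only copy: this is why snapshots are near-instantaneous.
        handles = list(self.files[src])
        for h in handles:
            self.store.refcount[h] += 1
        self.files[dst] = handles

    def mutate(self, name, index, data):
        # First mutation of a shared chunk triggers the actual copy,
        # so the snapshot's view of the data stays intact.
        h = self.files[name][index]
        if self.store.refcount[h] > 1:
            new_h = h + "_copy"
            self.store.refcount[h] -= 1
            self.store.write(new_h, self.store.chunks[h])
            self.files[name][index] = new_h
            h = new_h
        self.store.chunks[h] = data
```

After a snapshot, mutating the original file leaves the snapshot's chunks untouched, which is exactly the consistency property described above.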
2.3 Data Integrity Given that a typical GFS cluster consists of thousands of machines, disk failures occur time and again, and can corrupt or lose data on both reads and writes. Google considered this the norm rather than the exception. To detect such corruption, each chunkserver maintains its own checksums and verifies its data against them independently, without involving the master. Each chunk is broken into 64 KB blocks, and each block has a corresponding 32-bit checksum. Like other metadata, checksums are persisted with logging, kept separate from user data.
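The per-block checksum scheme can be sketched in a few lines. This is a simplified illustration using CRC-32 as the 32-bit checksum (the paper does not mandate a specific function); the function names are assumptions.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # chunks are verified in 64 KB blocks

def block_checksums(chunk: bytes):
    """Compute a 32-bit checksum (here CRC-32) for each 64 KB block."""
    return [zlib.crc32(chunk[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk), BLOCK_SIZE)]

def verify(chunk: bytes, checksums):
    """Recompute block checksums and compare; a mismatch means the
    chunkserver must report corruption and serve the read elsewhere."""
    return block_checksums(chunk) == checksums
```

Because verification only needs the local data and the locally stored checksums, corruption is caught on the chunkserver itself before bad bytes reach a client.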
2.4 Garbage Collection Instead of reclaiming physical storage immediately after a file or chunk is deleted, GFS applies a lazy garbage-collection policy: deleted files are first hidden, and the space is reclaimed later by regular background scans. This approach makes the system simpler and more reliable.
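A minimal sketch of lazy reclamation follows, assuming the rename-to-hidden-name scheme with a multi-day grace period described in the GFS paper; the class and the exact naming convention are illustrative.

```python
import time

class LazyGC:
    """Lazy deletion: a deleted file is merely renamed to a hidden name
    with a deletion timestamp; a periodic scan reclaims it later. Until
    then, the deletion can be undone, which aids reliability."""
    GRACE_PERIOD = 3 * 24 * 3600  # three days, per the GFS paper

    def __init__(self):
        self.namespace = {}  # name -> (data, deleted_at or None)

    def create(self, name, data):
        self.namespace[name] = (data, None)

    def delete(self, name):
        data, _ = self.namespace.pop(name)
        self.namespace[".deleted." + name] = (data, time.time())

    def undelete(self, name):
        data, _ = self.namespace.pop(".deleted." + name)
        self.namespace[name] = (data, None)

    def collect(self, now=None):
        # The periodic background scan: reclaim only hidden files whose
        # grace period has expired.
        now = now if now is not None else time.time()
        for n in list(self.namespace):
            _, deleted_at = self.namespace[n]
            if deleted_at is not None and now - deleted_at > self.GRACE_PERIOD:
                del self.namespace[n]
```

Deferring reclamation to a scan means deletion itself is a cheap metadata operation, and transient failures during deletion cannot orphan storage.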
2.5 Implication GFS relies on appends rather than overwrites, and on checkpointing to fully exploit its fault-tolerance capabilities. By separating file-system control (the master) from data transfer (chunkservers and clients), it minimizes the master's involvement in common operations, resulting in faster execution.
2.6 Successor of GFS: Colossus Google later shifted its focus to automating data management, which led to an architectural overhaul of its distributed storage and processing. The new system was named "Colossus" and is in use at Google today. Colossus is a next-generation cluster-level file system in which data is written using Reed-Solomon error-correction codes, reducing storage overhead to roughly 1.5x, compared with the 3x of its predecessor's triple replication. It is a client-driven system, with clients given full control over how data is replicated and encoded.
3. HADOOP DISTRIBUTED FILE SYSTEM
Hadoop is an open-source implementation of a large-scale batch processing system. It is a software framework used for distributed storage and distributed processing of data on clusters of commodity hardware. Hadoop is currently used for indexing web searches, email spam detection, recommendation engines, prediction in financial services, genome manipulation in the life sciences, and analysis of unstructured data such as logs, text, and click streams. It implements the MapReduce framework originally introduced by Google. It provides a Java-based environment, but custom code in other languages can also be run. Depending on the complexity of the job and the volume of data, parallel-processing response time can vary from minutes to hours. Besides fast processing, the major advantage Hadoop brings is massive scalability. HDFS is a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster. The Hadoop Distributed File System is designed to handle mammoth amounts of data with streaming data-access patterns across multiple machines. It is a distributed, fault-tolerant file system used for massive clustered data storage. Its job is to split incoming data into blocks, which are duplicated and distributed across several nodes of the Hadoop cluster.
3.1 Architecture HDFS follows a master-slave relationship. It has a single "namenode" and multiple "datanodes". The namenode stores the metadata and thus acts as an index that is consulted to learn the location of data; the datanodes simply hold the data. Later versions of Hadoop add a "secondary namenode", which runs at a location other than the namenode. Its purpose is to perform periodic checkpoints: it downloads the current namenode image and edit log, merges them into a new image, and uploads the new image back to the namenode. Each file in HDFS appears as a large contiguous stream of bytes on which parallel MapReduce processing is done. Figure 2 depicts the Hadoop architecture.
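The division of labor above can be sketched as a tiny metadata index. The class and method names here are illustrative assumptions, not actual HDFS internals.

```python
# Sketch of the namenode's role as a metadata index: it answers "where are
# the blocks of this file?", and clients then read data directly from the
# datanodes, keeping bulk data transfer off the namenode.

class NameNode:
    def __init__(self):
        # filename -> ordered list of (block_id, [datanodes holding it])
        self.block_map = {}

    def add_file(self, name, blocks):
        self.block_map[name] = blocks

    def locate(self, name):
        # Metadata-only lookup; no file data flows through the namenode.
        return self.block_map[name]
```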
3.2 Naming HDFS uses inodes (which record attributes such as permissions, access times, and modification times) to manage its namespace as a hierarchy of files and directories. The namenode manages this metadata (data about the data) and namespace, and is also responsible for mapping file names to file blocks on the datanodes.
3.3 Cache consistency HDFS follows a write-once, read-many model, which simplifies concurrency control: a client cannot modify a file once it has been created, but may read it any number of times. In case of a network or server failure, data integrity is ensured by a checksum generated for each file block. The client compares its computed checksum with the checksum of the file block stored in HDFS to confirm a valid read.
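The client-side verification step can be sketched as follows. This is a simplified illustration using CRC-32 (HDFS in fact uses CRC-based block checksums); the function name is an assumption.

```python
import zlib

def safe_read(block_data: bytes, stored_checksum: int) -> bytes:
    """Client-side read verification: recompute the block's checksum and
    compare with the one stored alongside the block; on a mismatch the
    client would retry from another replica."""
    if zlib.crc32(block_data) != stored_checksum:
        raise IOError("checksum mismatch: block is corrupt")
    return block_data
```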
3.4 Replication of Data HDFS follows a default placement policy to replicate and distribute data blocks across several datanodes. By default, three copies of each block are created; this number can be changed by the user. The policy ensures that no datanode holds more than one copy of a block and no rack holds more than two copies of the same block. The namenode verifies that the intended number of copies is placed at suitable locations, and it acts on over-replication or under-replication, balancing the available disk space across the racks.
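The two placement constraints can be expressed directly in code. This is a minimal sketch assuming a flat list of (rack, datanode) candidates; real HDFS also prefers the writer's own node and rack, which is omitted here.

```python
# Pick replica locations so that no datanode holds more than one copy
# and no rack holds more than two copies of the same block.

def place_replicas(candidates, replication=3):
    chosen = []
    rack_count = {}
    for rack, node in candidates:
        if any(node == n for _, n in chosen):
            continue  # at most one replica per datanode
        if rack_count.get(rack, 0) >= 2:
            continue  # at most two replicas per rack
        chosen.append((rack, node))
        rack_count[rack] = rack_count.get(rack, 0) + 1
        if len(chosen) == replication:
            break
    return chosen
```

Spreading replicas across at least two racks protects against whole-rack failures while keeping most replica traffic within a rack.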
3.5 Load balancing Server utilization is measured as the ratio of the space used to the total capacity of the server. A threshold value in the range (0, 1) is chosen; the cluster is considered balanced if the utilization of each datanode differs from the cluster-wide utilization by no more than the threshold. Otherwise it is unbalanced, and block copies are moved from over-utilized datanodes to other servers to restore balance.
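The balance check amounts to a one-line predicate. This sketch assumes the HDFS-balancer semantics above (per-node utilization compared against the cluster average); the function name and default threshold are illustrative.

```python
def is_balanced(datanodes, threshold=0.1):
    """datanodes: list of (used, capacity) pairs. Balanced iff every
    node's utilization is within `threshold` of the cluster average."""
    cluster_util = sum(u for u, _ in datanodes) / sum(c for _, c in datanodes)
    return all(abs(used / capacity - cluster_util) <= threshold
               for used, capacity in datanodes)
```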
3.6 Fault Detection Each datanode is assigned a namespace ID, which is compared with the one held by the namenode every time the server starts. If a mismatch occurs, the datanode shuts down, ensuring data integrity is maintained. Datanodes also register with the namenode, which allows them to be recognized even if they restart with a different IP address or port. A heartbeat message is sent to the namenode every three seconds to confirm availability; a datanode is considered out of service if no heartbeat is received within a span of ten minutes.
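The heartbeat bookkeeping described above can be sketched as follows, using the three-second interval and ten-minute timeout from the text; the class and method names are illustrative.

```python
HEARTBEAT_INTERVAL = 3   # seconds between heartbeats from each datanode
DEAD_TIMEOUT = 10 * 60   # seconds of silence before a node is written off

class HeartbeatMonitor:
    """Namenode-side view of datanode liveness."""
    def __init__(self):
        self.last_seen = {}  # datanode id -> timestamp of last heartbeat

    def heartbeat(self, node_id, now):
        self.last_seen[node_id] = now

    def dead_nodes(self, now):
        # Nodes silent for longer than the timeout are out of service;
        # the namenode would then re-replicate their blocks.
        return [n for n, t in self.last_seen.items()
                if now - t > DEAD_TIMEOUT]
```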