GFS and HDFS, and their differences
👤 Author: by kaamssabrygmailcom 2019-05-13 17:01:37

HDFS is the storage layer of Hadoop, used to store and process huge volumes of data across multiple DataNodes. It is designed to run on low-cost commodity hardware, distributing data across the nodes of a Hadoop cluster, and it offers high fault tolerance and high throughput.

A large file is broken down into small blocks of data. HDFS has a default block size of 128 MB, which can be increased as required. Multiple copies of each block are stored in the cluster in a distributed manner, on different nodes.
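
To make the numbers concrete, here is a minimal Java sketch (plain arithmetic, not the Hadoop API; the 1 GB file size is a made-up example) of how a large file maps onto 128 MB blocks and how much raw storage its replicas consume:

    // A minimal sketch of HDFS block arithmetic; not part of any Hadoop API.
    public class BlockMath {
        static final long BLOCK_SIZE = 128L * 1024 * 1024; // default 128 MB
        static final int REPLICATION = 3;                   // default replication

        public static void main(String[] args) {
            long fileSize = 1024L * 1024 * 1024;            // hypothetical 1 GB file

            // Ceiling division: the last block may be only partially filled.
            long blocks = (fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE;

            // Raw bytes consumed on the cluster once every block is replicated.
            long rawMB = fileSize * REPLICATION / (1024 * 1024);

            System.out.println(blocks + " blocks");           // 8 blocks
            System.out.println(rawMB + " MB of raw storage"); // 3072 MB
        }
    }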


As the number of internet users grew in the early 2000s, Google faced the problem of storing ever-increasing user data on its traditional data servers. Thousands of search queries were raised per second, so there was a need for a large, distributed, highly fault-tolerant file system to store and process the data. The solution was the Google File System (GFS).


GFS consists of a single master and multiple chunkservers. Files are divided into fixed-size chunks of 64 MB each, and each chunk is replicated on multiple chunkservers (3 by default), so even if a chunkserver crashes, the data is still present on the other chunkservers. This allowed Google to store and process huge volumes of data in a distributed manner.


Node Division:


HDFS has a single NameNode and many DataNodes in its file system. GFS has a single master node and multiple chunkservers, and is accessed by multiple clients.


Block size:


The default block size in GFS is 64 MB, while in HDFS it is 128 MB; in both file systems the block size can be altered by the user.
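
As a sketch of the HDFS side, the Hadoop Java API lets a client pick a per-file block size when the file is created (the path and the 256 MB size below are made-up examples):

    // Sketch: creating an HDFS file with a non-default, per-file block size.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CustomBlockSize {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            long blockSize = 256L * 1024 * 1024; // 256 MB instead of the default
            short replication = 3;
            int bufferSize = 4096;

            // This create() overload takes replication and block size per file.
            try (FSDataOutputStream out = fs.create(
                    new Path("/tmp/example.dat"), true, bufferSize,
                    replication, blockSize)) {
                out.writeBytes("data stored in 256 MB blocks\n");
            }
        }
    }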


Location of Chunk:


In GFS, the master does not keep a persistent record of which chunkservers hold a replica of a given chunk; it simply polls the chunkservers for that information at startup. In HDFS, block location information is maintained by the NameNode, which collects block reports from the DataNodes.
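
On the HDFS side this can be observed from a client: the NameNode answers block-location queries through the Hadoop Java API (the path below is a made-up example):

    // Sketch: asking the NameNode which DataNodes hold each block of a file.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListBlockLocations {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/tmp/example.dat"));

            // One BlockLocation per block, naming the DataNodes with a replica.
            BlockLocation[] locations =
                fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation loc : locations) {
                System.out.println("offset " + loc.getOffset()
                        + " length " + loc.getLength()
                        + " hosts " + String.join(",", loc.getHosts()));
            }
        }
    }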


Atomic Record Appends:


GFS provides an atomic record append operation, in which the client specifies only the data and GFS chooses the offset at which it is written; this allows many clients on different machines to append to the same file concurrently. GFS also supports random writes at a client-specified offset, with multiple writers and multiple readers. In HDFS, only appends are possible, with a single writer at a time.
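
A minimal sketch of the HDFS append path, using the standard Hadoop Java API (the path is a made-up example; note there is no call for writing at an arbitrary offset):

    // Sketch: appending to an existing HDFS file; HDFS enforces a single
    // writer and offers no API to write at an arbitrary offset.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AppendExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // append() fails if another client already has the file open.
            try (FSDataOutputStream out = fs.append(new Path("/tmp/log.txt"))) {
                out.writeBytes("one more record\n");
            }
        }
    }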


Data Integrity:


In GFS, chunkservers use checksums to detect corruption of the stored data; comparing replicas across chunkservers is an alternative. In HDFS, the client software performs checksum verification on the contents of HDFS files; when a block fetched from a DataNode arrives corrupted, the client can retrieve a replica from another DataNode.
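
A small sketch of the HDFS side: checksums are verified transparently on every read, and a client can also request a whole-file checksum through the Hadoop Java API (the path is a made-up example):

    // Sketch: fetching a whole-file checksum; HDFS computes it from the
    // per-block CRC checksums stored alongside the data.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileChecksum;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ChecksumExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            FileChecksum sum = fs.getFileChecksum(new Path("/tmp/example.dat"));
            System.out.println(sum.getAlgorithmName() + ": " + sum);
        }
    }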


Deletion:


GFS has a lazy garbage-collection process: the resources of deleted files are not reclaimed immediately. Instead, the deleted file is renamed to a hidden name, and hidden files are permanently removed during a regular namespace scan if they have existed for more than three days. In HDFS, deleted files are renamed into a trash directory and are later removed by a garbage-collection process.
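
On the HDFS side, the shell command hdfs dfs -rm moves files to the trash rather than deleting them outright; a client can do the same through the Hadoop Java API, as in this sketch (the path and retention value are made-up examples):

    // Sketch: deleting through the trash, as "hdfs dfs -rm" does, so blocks
    // are reclaimed later rather than immediately.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.Trash;

    public class TrashDelete {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.trash.interval", "1440"); // keep trash one day (minutes)
            FileSystem fs = FileSystem.get(conf);

            // Moves the file into the user's .Trash directory instead of
            // removing it outright; fs.delete(path, true) would bypass trash.
            Trash.moveToAppropriateTrash(fs, new Path("/tmp/example.dat"), conf);
        }
    }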


Snapshot:


In GFS, individual files and directories can be snapshotted. In HDFS, snapshots are taken on directories that have been made snapshottable, with up to 65,536 snapshots allowed per directory.
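
A short sketch of taking an HDFS snapshot through the Java API, assuming an administrator has already marked the directory snapshottable with "hdfs dfsadmin -allowSnapshot /data" (the directory name is a made-up example):

    // Sketch: snapshotting an HDFS directory; an admin must first run
    //   hdfs dfsadmin -allowSnapshot /data
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SnapshotExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // The read-only snapshot appears under /data/.snapshot/before-cleanup.
            Path snap = fs.createSnapshot(new Path("/data"), "before-cleanup");
            System.out.println("created " + snap);
        }
    }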


References:




  1. HDFS Architecture Guide

  2. GFS Architecture Guide
