Properties |
GFS |
HDFS |
Design Goals |
- The main goal of GFS is to support large files
- Built based on the assumption that terabyte data sets will be distributed across thousands of disks attached to commodity compute nodes.
- Used for data intensive computing [8].
- Stores data reliably, even in the presence of chunk server failures, master failures, or network partitions.
- GFS is designed more for batch processing than for interactive use by users.
|
- One of the main goals of HDFS is to support large files.
- Built based on the assumption that terabyte data sets will be distributed across thousands of disks attached to commodity compute nodes.
- Used for data intensive computing [8].
- Stores data reliably, even in the presence of name node failures, data node failures, or network partitions.
- HDFS is designed more for batch processing than for interactive use by users.
|
Processes |
- A single master and multiple chunk servers.
|
- A single name node and multiple data nodes.
|
File Management |
- Files are organized hierarchically in directories and identified by path names.
- GFS is used exclusively within Google.
|
- HDFS supports a traditional hierarchical file organization.
- HDFS also supports third-party file systems such as CloudStore and Amazon Simple Storage Service (S3) [9].
|
Scalability |
- Cluster based architecture
- The file system consists of hundreds or even thousands of storage machines built from inexpensive commodity parts.
- The largest clusters have over 1000 storage nodes and over 300 TB of disk storage, and are heavily accessed by hundreds of clients on distinct machines on a continuous basis.
|
- Cluster based architecture
- Hadoop currently runs on clusters with thousands of nodes.
- E.g. Facebook has two major clusters: an 1100-machine cluster with 8800 cores and about 12 PB of raw storage, and a 300-machine cluster with 2400 cores and about 3 PB of raw storage; each (commodity) node has 8 cores and 12 TB of storage.
- eBay uses a 532-node cluster (8*532 cores, 5.3 PB).
- Yahoo uses more than 100,000 CPUs in over 40,000 computers running Hadoop; its biggest cluster has 4500 nodes (2*4-CPU boxes with 4*1 TB disks and 16 GB RAM) [10].
- K. Talattinis et al. concluded in their work that Hadoop is efficient when running in fully distributed mode; however, to achieve optimal results and take advantage of Hadoop's scalability, it is necessary to use large clusters of computers [11].
|
Protection |
- Google has its own file system, GFS. With GFS, files are split up and stored in multiple pieces on multiple machines.
- Filenames are random (they do not match content type or owner). There are hundreds of thousands of files on a single disk, and all the data is obfuscated so that it is not human readable. The algorithms used for obfuscation change all the time [12].
|
- HDFS implements a permission model for files and directories that shares much of the POSIX model.
- Each file or directory has separate permissions for the user that is the owner, for other users that are members of the group, and for all other users [13] (see the sketch after this row).
|
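The owner/group/other split described above maps onto Hadoop's FileSystem API. Below is a minimal, illustrative sketch of setting permissions and ownership on a file; the path, user, and group names are hypothetical, and the ownership change assumes superuser privileges.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

public class HdfsPermissionExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path, used only for illustration.
        Path report = new Path("/user/alice/report.txt");

        // Owner: read+write, group: read, others: none (equivalent to mode 640).
        FsPermission perm = new FsPermission(FsAction.READ_WRITE,
                                             FsAction.READ,
                                             FsAction.NONE);
        fs.setPermission(report, perm);

        // Ownership can be changed separately (requires superuser privileges).
        fs.setOwner(report, "alice", "analysts");
    }
}
```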
Security |
- Google has dozens of datacenters for redundancy. These datacenters are in undisclosed locations and most are unmarked for protection.
- Access is allowed to authorized employees and vendors only. Protections in place include 24/7 guard coverage, electronic key access, access logs, closed-circuit televisions, alarms linked to guard stations, internal and external patrols, dual utility power feeds, and backup power (UPS and generators) [12].
|
- HDFS security is based on the POSIX model of users and groups.
- Currently, security is limited to simple file permissions.
- The identity of a client process is just whatever the host operating system says it is.
- Network authentication protocols like Kerberos for user authentication and encryption of data transfers are not yet supported [14].
|
Database Files |
- Bigtable is the database built on top of GFS. Bigtable is a proprietary distributed database developed by Google Inc.
|
- HBase [15] provides Bigtable-like [16] capabilities on top of Hadoop Core (see the sketch after this row).
|
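To illustrate the Bigtable-like data model (row keys, column families, and qualifiers), here is a minimal sketch using the HBase client Connection/Table API; the table name "webtable" and the "contents" column family are assumptions, and the table is assumed to already exist. Exact client classes vary slightly between HBase versions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("webtable"))) {

            // Store a value under (row key, column family:qualifier), Bigtable-style.
            Put put = new Put(Bytes.toBytes("com.example/index.html"));
            put.addColumn(Bytes.toBytes("contents"), Bytes.toBytes("html"),
                          Bytes.toBytes("<html>...</html>"));
            table.put(put);

            // Read the value back.
            Result result = table.get(new Get(Bytes.toBytes("com.example/index.html")));
            byte[] html = result.getValue(Bytes.toBytes("contents"), Bytes.toBytes("html"));
            System.out.println(Bytes.toString(html));
        }
    }
}
```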
File Serving |
- A file in GFS is composed of fixed-size chunks; the chunk size is 64 MB. Parts of a file can be stored on different nodes in a cluster, satisfying the concepts of load balancing and storage management.
|
- In HDFS, files are divided into large blocks for storage and access, typically 64 MB in size. Portions of a file can be stored on different cluster nodes, balancing storage resources and demand [17] (see the sketch after this row).
|
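A client can also override the default block size per file through the FileSystem API. The sketch below is illustrative only: the path is hypothetical, and the 64 MB figure simply mirrors the typical block size mentioned above (cluster-wide defaults are normally set via the dfs.blocksize property, or dfs.block.size in older releases).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path out = new Path("/user/alice/large-dataset.bin"); // hypothetical path
        long blockSize = 64L * 1024 * 1024;                   // 64 MB, as described above
        short replication = 3;                                 // replicas per block
        int bufferSize = 4096;

        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream stream = fs.create(out, true, bufferSize, replication, blockSize);
        stream.writeBytes("example payload");
        stream.close();
    }
}
```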
Cache Management |
- Clients do cache metadata.
- Neither the server nor the client caches the file data.
- Chunks are stored as local files in a Linux system, so the Linux buffer cache already keeps frequently accessed data in memory; therefore, chunk servers need not cache file data.
|
- HDFS uses the distributed cache (see the sketch after this row).
- It is a facility provided by the MapReduce framework to cache application-specific, large, read-only files (text, archives, jars and so on).
- Distributed cache files can be private (belonging to one user) or public (belonging to all the users of the same node) [18].
|
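A minimal sketch of registering a file with the distributed cache through the newer Job API (older code used the DistributedCache class directly); the job name and HDFS path are hypothetical, and the rest of the job setup (mapper, reducer, input/output paths) is omitted.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class DistributedCacheExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "cache-example"); // hypothetical job name

        // Ship a read-only lookup file to every node that runs a task of this job.
        // The HDFS path is hypothetical and used only for illustration.
        job.addCacheFile(new URI("/user/alice/lookup/countries.txt"));

        // Tasks can then read the locally cached copy, e.g. after retrieving it
        // via context.getCacheFiles() in the mapper or reducer setup() method.
    }
}
```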
Cache Consistency |
- Google adopts an append-once-read-many model, which avoids locking files for writing in a distributed environment.
- Clients can append data to an existing file.
|
- HDFS follows a write-once-read-many model that relaxes concurrency control requirements, simplifies data coherency, and enables high-throughput access [9].
- Clients can only append to existing files (append is not yet supported; see the sketch after this row).
|
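For completeness, the FileSystem API does expose an append call, sketched below; as noted above, early HDFS releases did not support it (it was gated by the dfs.support.append property), so on such releases this call fails with an exception rather than modifying the file. The path is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAppendExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path log = new Path("/user/alice/events.log"); // hypothetical existing file

        // Append to an existing file; on releases without append support
        // this throws an IOException instead of modifying the file.
        FSDataOutputStream out = fs.append(log);
        out.writeBytes("new record\n");
        out.close();
    }
}
```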
Communication |
- TCP connections are used for communication. Pipelining is used for data transfer over TCP connections.
|
- RPC-based protocol on top of TCP/IP.
|
Replication Strategy |
- Chunk replicas are spread across racks. The master automatically replicates the chunks.
- A user can specify the number of replicas to be maintained.
- The master re-replicates a chunk replica as soon as the number of available replicas falls below a user-specified number.
|
- Automatic replication system.
- Rack-based system: by default, two copies of each block are stored by different data nodes in the same rack and a third copy is stored on a data node in a different rack (for greater reliability) [17].
- An application can specify the number of replicas of a file that should be maintained by HDFS [9] (see the sketch after this row).
- Replication pipelining is used for write operations.
|
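A minimal sketch of how an application can control the replication factor, both cluster-wide via configuration and per file via the FileSystem API; the path and the factor of 5 are illustrative only.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default replication factor (per-file settings override it).
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);

        // Raise the replication factor of an existing (hypothetical) file to 5;
        // the name node then schedules re-replication until 5 copies exist.
        Path hot = new Path("/user/alice/hot-data.bin");
        fs.setReplication(hot, (short) 5);
    }
}
```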
Available Implementation |
- GFS is a proprietary distributed file system developed by Google for its own use
|
- Hadoop deployments at Yahoo, Facebook, IBM, and others are based on HDFS.
|