Homework 7: An article about GFS and HDFS, and their differences
Author: mahrubachyoutlookcom, 2019-05-12 23:41:29
COMPARISON OF GFS AND HDFS
Design Goals

GFS:

  • The main goal of GFS is to support large files.

  • Built on the assumption that terabyte data sets will be distributed across thousands of disks attached to commodity compute nodes.

  • Used for data-intensive computing [8].

  • Stores data reliably, even when failures occur within chunk servers, the master, or network partitions.

  • GFS is designed for batch processing rather than interactive use by users.

HDFS:

  • One of the main goals of HDFS is to support large files.

  • Built on the assumption that terabyte data sets will be distributed across thousands of disks attached to commodity compute nodes.

  • Used for data-intensive computing [8].

  • Stores data reliably, even when failures occur within name nodes, data nodes, or network partitions.

  • HDFS is designed for batch processing rather than interactive use by users.


Processes

GFS:

  • Master and chunk server

HDFS:

  • Name node and data node


File Management

GFS:

  • Files are organized hierarchically in directories and identified by path names.

  • GFS is used exclusively by Google.

HDFS:

  • HDFS supports a traditional hierarchical file organization.

  • HDFS also supports third-party file systems such as CloudStore and Amazon Simple Storage Service [9].


Scalability

GFS:

  • Cluster-based architecture.

  • The file system consists of hundreds or even thousands of storage machines built from inexpensive commodity parts.

  • The largest cluster has over 1000 storage nodes, over 300 TB of disk storage, and is heavily accessed by hundreds of clients on distinct machines on a continuous basis.

HDFS:

  • Cluster-based architecture.

  • Hadoop currently runs on clusters with thousands of nodes.

  • E.g. Facebook has two major clusters: an 1100-machine cluster with 8800 cores and about 12 PB of raw storage, and a 300-machine cluster with 2400 cores and about 3 PB of raw storage; each (commodity) node has 8 cores and 12 TB of storage.

  • eBay uses a 532-node cluster (8 × 532 cores, 5.3 PB).

  • Yahoo uses more than 100,000 CPUs in over 40,000 computers running Hadoop; its biggest cluster has 4500 nodes (2 × 4-CPU boxes with 4 × 1 TB disks and 16 GB RAM) [10].

  • K. Talattinis et al. concluded in their work that Hadoop is really efficient when running in a fully distributed mode; however, to achieve optimal results and take advantage of Hadoop's scalability, it is necessary to use large clusters of computers [11].


Protection

GFS:

  • Google has its own file system, called GFS. With GFS, files are split up and stored in multiple pieces on multiple machines.

  • Filenames are random (they do not match content type or owner). There are hundreds of thousands of files on a single disk, and all the data is obfuscated so that it is not human readable. The algorithms used for obfuscation change all the time [12].

HDFS:

  • HDFS implements a permission model for files and directories that shares much of the POSIX model.

  • Each file or directory has separate permissions for the user that owns it, for other users that are members of its group, and for all other users [13].
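The owner/group/other check described above can be sketched in a few lines. This is an illustrative simulation of the POSIX-style model HDFS borrows, not HDFS's actual implementation; all names and modes below are made up for the example.

```python
# Minimal sketch of a POSIX-style permission check, as used by the
# HDFS permission model: pick the owner, group, or "other" bits from
# the mode and test the requested access against them.

R, W, X = 4, 2, 1  # read / write / execute permission bits

def has_access(mode, owner, group, user, user_groups, want):
    """mode is an octal int such as 0o640 (rw-r-----)."""
    if user == owner:
        bits = (mode >> 6) & 7   # owner bits
    elif group in user_groups:
        bits = (mode >> 3) & 7   # group bits
    else:
        bits = mode & 7          # other bits
    return (bits & want) == want

# A file owned by alice:staff with mode 0o640 (rw-r-----):
print(has_access(0o640, "alice", "staff", "alice", {"staff"}, R | W))  # True
print(has_access(0o640, "alice", "staff", "bob",   {"staff"}, R))      # True
print(has_access(0o640, "alice", "staff", "carol", {"users"}, R))      # False
```

In real HDFS the same three-way split is applied to every component of the path being accessed.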


Security

GFS:

  • Google has dozens of datacenters for redundancy. These datacenters are in undisclosed locations, and most are unmarked for protection.

  • Access is allowed to authorized employees and vendors only. Some of the protections in place include 24/7 guard coverage, electronic key access, access logs, closed-circuit television, alarms linked to guard stations, internal and external patrols, dual utility power feeds, and backup power from UPS units and generators [12].

HDFS:

  • HDFS security is based on the POSIX model of users and groups.

  • Currently, security is limited to simple file permissions.

  • The identity of a client process is just whatever the host operating system says it is.

  • Network authentication protocols like Kerberos for user authentication, and encryption of data transfers, are not yet supported [14].


Database Files

GFS:

  • Bigtable is the database built on top of GFS. Bigtable is a proprietary distributed database of Google Inc.

HDFS:

  • HBase [15] provides Bigtable [16]-like capabilities on top of Hadoop Core.


File Serving

GFS:

  • A file in GFS is composed of fixed-size chunks. The chunk size is 64 MB. Parts of a file can be stored on different nodes in a cluster, which supports load balancing and storage management.

HDFS:

  • A file in HDFS is divided into large blocks for storage and access, typically 64 MB in size. Portions of the file can be stored on different cluster nodes, balancing storage resources and demand [17].
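The fixed-size splitting both systems use can be illustrated with a short sketch. This is not GFS or HDFS code; it just shows how a file of a given size maps onto 64 MB pieces, with the tail piece shorter than the rest.

```python
# Sketch of fixed-size file splitting (64 MB chunks in GFS,
# 64 MB default blocks in HDFS). Each piece can then be placed
# on a different node in the cluster.

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the size cited above

def split_file(file_size, block_size=BLOCK_SIZE):
    """Return (block_index, block_length) pairs covering the file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((len(blocks), length))
        offset += length
    return blocks

# A 200 MB file becomes three full 64 MB blocks plus an 8 MB tail.
blocks = split_file(200 * 1024 * 1024)
print(len(blocks))                      # 4
print(blocks[-1][1] // (1024 * 1024))   # 8
```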


Cache Management

GFS:

  • Clients do cache metadata.

  • Neither the server nor the client caches file data.

  • Chunks are stored as local files in a Linux system, so the Linux buffer cache already keeps frequently accessed data in memory. Therefore chunk servers need not cache file data.

HDFS:

  • HDFS uses a distributed cache.

  • It is a facility provided by the MapReduce framework to cache application-specific, large, read-only files (text, archives, jars, and so on).

  • Distributed cache files can be private (belonging to one user) or public (belonging to all users of the same node) [18].


Cache Consistency

GFS:

  • Google adopts an append-once-read-many model. This avoids the need for a file-locking mechanism for writes in a distributed environment.

  • A client can append data to an existing file.

HDFS:

  • HDFS's write-once-read-many model relaxes concurrency control requirements, simplifies data coherency, and enables high-throughput access [9].

  • A client can only append to existing files (not yet supported at the time of writing).
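The difference between the two models can be made concrete with a toy file object. This is purely illustrative (neither system exposes such a class): once the file is closed, overwrites are rejected in both models, appends succeed under the GFS model, and appends raise an error when disabled, mirroring early HDFS.

```python
# Toy model of write-once-read-many semantics: after close(),
# the file body can never be rewritten; appends are allowed in
# the GFS-style model and rejected in the early-HDFS-style model.

class WormFile:
    def __init__(self):
        self.data = bytearray()
        self.closed = False

    def write(self, payload: bytes):
        if self.closed:
            raise PermissionError("file is write-once and already closed")
        self.data += payload

    def append(self, payload: bytes, appends_supported=True):
        # GFS allows record appends; early HDFS did not yet support them.
        if not appends_supported:
            raise NotImplementedError("append not yet supported")
        self.data += payload

    def close(self):
        self.closed = True

f = WormFile()
f.write(b"log line 1\n")
f.close()
try:
    f.write(b"overwrite")        # rejected: contents are immutable
except PermissionError:
    pass
f.append(b"log line 2\n")        # still possible under the GFS model
```

Because existing bytes can never change, readers need no locks, which is exactly the concurrency simplification both bullet lists describe.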


Communication

GFS:

  • TCP connections are used for communication. Pipelining is used for data transfer over TCP connections.

HDFS:

  • An RPC-based protocol on top of TCP/IP.
Replication Strategy

GFS:

  • Chunk replicas are spread across the racks. The master automatically replicates the chunks.

  • A user can specify the number of replicas to be maintained.

  • The master re-replicates a chunk replica as soon as the number of available replicas falls below a user-specified number.

HDFS:

  • Automatic replication system.

  • Rack-based system: by default, two copies of each block are stored by different data nodes in the same rack, and a third copy is stored on a data node in a different rack (for greater reliability) [17].

  • An application can specify the number of replicas of a file that should be maintained by HDFS [9].

  • Replication pipelining is used for write operations.
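The default rack-aware placement described above can be sketched as a small simulation. The topology, node names, and function below are invented for illustration; real HDFS placement lives in the name node and also weighs disk usage and load.

```python
import random

# Sketch of rack-aware 3-way placement: one replica on a node in one
# rack, and the other two replicas on different nodes in a single
# other rack, so a whole-rack failure cannot lose every copy.

def place_replicas(racks, replication=3, rng=random):
    """racks: dict mapping rack name -> list of data node names."""
    first_rack = rng.choice(list(racks))
    first = rng.choice(racks[first_rack])
    other_racks = [r for r in racks if r != first_rack]
    second_rack = rng.choice(other_racks)
    # Remaining replicas share one rack that differs from the first.
    others = rng.sample(racks[second_rack],
                        min(replication - 1, len(racks[second_rack])))
    return [first] + others

topology = {
    "rack-a": ["dn1", "dn2", "dn3"],
    "rack-b": ["dn4", "dn5", "dn6"],
}
nodes = place_replicas(topology)
print(len(nodes))  # 3
```

Writes then flow down a pipeline through these chosen nodes in turn, which is the "replication pipelining" mentioned in the last bullet.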


Available Implementation

GFS:

  • GFS is a proprietary distributed file system developed by Google for its own use.

HDFS:

  • HDFS is open source; deployments at Yahoo, Facebook, IBM, etc. are based on HDFS.


