Properties |
GFS |
HDFS |
Design Goals |
- The main goal of GFS is to support large files
- Built based on the assumption that terabyte data sets will be distributed across thousands of disks attached to commodity compute nodes.
- Used for data intensive computing [8].
- Stores data reliably, even in the presence of chunk server failures, master failures, or network partitions.
- GFS is designed more for batch processing than for interactive use by users.
|
- One of the main goals of HDFS is to support large files.
- Built based on the assumption that terabyte data sets will be distributed across thousands of disks attached to commodity compute nodes.
- Used for data intensive computing [8].
- Stores data reliably, even in the presence of name node failures, data node failures, or network partitions.
- HDFS is designed more for batch processing than for interactive use by users.
|
Processes |
- A single master and multiple chunk servers.
|
- A single name node and multiple data nodes.
|
File Management |
- Files are organized hierarchically in directories and identified by path names.
- GFS is used exclusively within Google.
|
- HDFS supports a traditional hierarchical file organization.
- HDFS also supports third-party file systems such as CloudStore and Amazon Simple Storage Service (S3) [9].
|
Scalability |
- Cluster based architecture
- The file system consists of hundreds or even thousands of storage machines built from inexpensive commodity parts.
- The largest clusters have over 1000 storage nodes and over 300 TB of disk storage, and are heavily accessed by hundreds of clients on distinct machines on a continuous basis.
|
- Cluster based architecture
- Hadoop currently runs on clusters with thousands of nodes.
- E.g. Facebook has two major clusters: an 1100-machine cluster with 8800 cores and about 12 PB of raw storage, and a 300-machine cluster with 2400 cores and about 3 PB of raw storage; each (commodity) node has 8 cores and 12 TB of storage.
- eBay uses a 532-node cluster (8*532 cores, 5.3 PB).
- Yahoo uses more than 100,000 CPUs in over 40,000 computers running Hadoop; its biggest cluster has 4500 nodes (2*4-CPU boxes with 4*1 TB disks and 16 GB RAM) [10].
- K. Talattinis et al. concluded in their work that Hadoop is efficient when running in fully distributed mode; however, to achieve optimal results and take advantage of Hadoop's scalability, it is necessary to use large clusters of computers [11].
|
Protection |
- Google has its own file system, GFS. With GFS, files are split up and stored in multiple pieces on multiple machines.
- Filenames are random (they do not match content type or owner). There are hundreds of thousands of files on a single disk, and all the data is obfuscated so that it is not human readable. The algorithms used for obfuscation change all the time [12].
|
- HDFS implements a permission model for files and directories that shares much of the POSIX model.
- Each file or directory has separate permissions for the user that is the owner, for other users that are members of the group, and for all other users [13] (see the sketch after this row).
|
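The owner/group/other split described above maps onto Hadoop's FileSystem API. Below is a minimal, illustrative sketch of setting permissions and ownership on a file; the path, user, and group names are hypothetical, and the ownership change assumes superuser privileges.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

public class HdfsPermissionExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path, used only for illustration.
        Path report = new Path("/user/alice/report.txt");

        // Owner: read+write, group: read, others: none (equivalent to mode 640).
        FsPermission perm = new FsPermission(FsAction.READ_WRITE,
                                             FsAction.READ,
                                             FsAction.NONE);
        fs.setPermission(report, perm);

        // Ownership can be changed separately (requires superuser privileges).
        fs.setOwner(report, "alice", "analysts");
    }
}
```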
Security |
- Google has dozens of datacenters for redundancy. These datacenters are in undisclosed locations and most are unmarked for protection.
- Access is allowed to authorized employees and vendors only. Protections in place include 24/7 guard coverage, electronic key access, access logs, closed-circuit televisions, alarms linked to guard stations, internal and external patrols, dual utility power feeds, and backup power (UPS and generators) [12].
|
- HDFS security is based on the POSIX model of users and groups.
- Currently, security is limited to simple file permissions.
- The identity of a client process is just whatever the host operating system says it is.
- Network authentication protocols like Kerberos for user authentication and encryption of data transfers are not yet supported [14].
|
Database Files |
- Bigtable is the database built on top of GFS. Bigtable is a proprietary distributed database developed by Google Inc.
|
- HBase [15] provides Bigtable-like [16] capabilities on top of Hadoop Core (see the sketch after this row).
|
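To illustrate the Bigtable-like data model (row keys, column families, and qualifiers), here is a minimal sketch using the HBase client Connection/Table API; the table name "webtable" and the "contents" column family are assumptions, and the table is assumed to already exist. Exact client classes vary slightly between HBase versions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("webtable"))) {

            // Store a value under (row key, column family:qualifier), Bigtable-style.
            Put put = new Put(Bytes.toBytes("com.example/index.html"));
            put.addColumn(Bytes.toBytes("contents"), Bytes.toBytes("html"),
                          Bytes.toBytes("<html>...</html>"));
            table.put(put);

            // Read the value back.
            Result result = table.get(new Get(Bytes.toBytes("com.example/index.html")));
            byte[] html = result.getValue(Bytes.toBytes("contents"), Bytes.toBytes("html"));
            System.out.println(Bytes.toString(html));
        }
    }
}
```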
File Serving |
- A file in GFS is composed of fixed-size chunks; the chunk size is 64 MB. Parts of a file can be stored on different nodes in a cluster, satisfying the concepts of load balancing and storage management.
|
- In HDFS, files are divided into large blocks for storage and access, typically 64 MB in size. Portions of a file can be stored on different cluster nodes, balancing storage resources and demand [17] (see the sketch after this row).
|
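A client can also override the default block size per file through the FileSystem API. The sketch below is illustrative only: the path is hypothetical, and the 64 MB figure simply mirrors the typical block size mentioned above (cluster-wide defaults are normally set via the dfs.blocksize property, or dfs.block.size in older releases).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path out = new Path("/user/alice/large-dataset.bin"); // hypothetical path
        long blockSize = 64L * 1024 * 1024;                   // 64 MB, as described above
        short replication = 3;                                 // replicas per block
        int bufferSize = 4096;

        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream stream = fs.create(out, true, bufferSize, replication, blockSize);
        stream.writeBytes("example payload");
        stream.close();
    }
}
```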
Cache Management |
- Clients do cache metadata.
- Neither the server nor the client caches the file data.
- Chunks are stored as local files in a Linux system, so the Linux buffer cache already keeps frequently accessed data in memory; therefore, chunk servers need not cache file data.
|
- HDFS uses the distributed cache (see the sketch after this row).
- It is a facility provided by the MapReduce framework to cache application-specific, large, read-only files (text, archives, jars and so on).
- Distributed cache files can be private (belonging to one user) or public (belonging to all the users of the same node) [18].
|
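A minimal sketch of registering a file with the distributed cache through the newer Job API (older code used the DistributedCache class directly); the job name and HDFS path are hypothetical, and the rest of the job setup (mapper, reducer, input/output paths) is omitted.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class DistributedCacheExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "cache-example"); // hypothetical job name

        // Ship a read-only lookup file to every node that runs a task of this job.
        // The HDFS path is hypothetical and used only for illustration.
        job.addCacheFile(new URI("/user/alice/lookup/countries.txt"));

        // Tasks can then read the locally cached copy, e.g. after retrieving it
        // via context.getCacheFiles() in the mapper or reducer setup() method.
    }
}
```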
Cache Consistency |
- Google adopts an append-once-read-many model, which avoids locking files for writing in a distributed environment.
- Clients can append data to an existing file.
|
- HDFS follows a write-once-read-many model that relaxes concurrency control requirements, simplifies data coherency, and enables high-throughput access [9].
- Clients can only append to existing files (append is not yet supported; see the sketch after this row).
|
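For completeness, the FileSystem API does expose an append call, sketched below; as noted above, early HDFS releases did not support it (it was gated by the dfs.support.append property), so on such releases this call fails with an exception rather than modifying the file. The path is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAppendExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path log = new Path("/user/alice/events.log"); // hypothetical existing file

        // Append to an existing file; on releases without append support
        // this throws an IOException instead of modifying the file.
        FSDataOutputStream out = fs.append(log);
        out.writeBytes("new record\n");
        out.close();
    }
}
```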
Communication |
- TCP connections are used for communication. Pipelining is used for data transfer over TCP connections.
|
- RPC-based protocol on top of TCP/IP.
|
Replication Strategy |
- Chunk replicas are spread across racks. The master automatically replicates the chunks.
- A user can specify the number of replicas to be maintained.
- The master re-replicates a chunk replica as soon as the number of available replicas falls below a user-specified number.
|
- Automatic replication system.
- Rack-based system: by default, two copies of each block are stored by different data nodes in the same rack and a third copy is stored on a data node in a different rack (for greater reliability) [17].
- An application can specify the number of replicas of a file that should be maintained by HDFS [9] (see the sketch after this row).
- Replication pipelining is used for write operations.
|
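A minimal sketch of how an application can control the replication factor, both cluster-wide via configuration and per file via the FileSystem API; the path and the factor of 5 are illustrative only.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default replication factor (per-file settings override it).
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);

        // Raise the replication factor of an existing (hypothetical) file to 5;
        // the name node then schedules re-replication until 5 copies exist.
        Path hot = new Path("/user/alice/hot-data.bin");
        fs.setReplication(hot, (short) 5);
    }
}
```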
Available Implementation |
- GFS is a proprietary distributed file system developed by Google for its own use
|
- Hadoop deployments at Yahoo, Facebook, IBM, and others are based on HDFS.
|