The Hadoop Distributed File System (HDFS) is a subproject of the Apache Hadoop project. HDFS is a fault-tolerant, highly scalable distributed storage system that gives clients and applications high-throughput access to large data sets. HDFS follows a write-once, read-many model: it expects that a file is written once and read many times, so read access is optimized over write access.1)
HDFS is an advancement of the Google File System (GFS), and most of their characteristics are the same. Google expected the failure rate in clusters to be very high, especially when the architecture is built from commodity rather than high-performance hardware. Fault tolerance against hardware failure was therefore the main goal of GFS. In addition, it was important to have a distributed file system that can handle large data volumes and to which new nodes can easily be added.2)
An HDFS cluster is built up of one name node and many data nodes. Data files are split into blocks that are distributed across the data nodes. HDFS is designed to manage large quantities of data, and the architecture is optimized for expansion.
There is exactly one name node per HDFS cluster. One cluster can contain thousands of data nodes, and there are even more HDFS clients per cluster. The name node knows all metadata of the file system and manages namespace operations. It also manages the data nodes that store the data: it knows which data node stores which blocks and how many copies of each block exist.3)
In the name node, so-called inodes represent data files and directories. An inode records attributes such as modification and access times, as well as namespace and disk space quotas.
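As a rough illustration of the metadata described above, the name node can be thought of as holding inodes plus a block map. The following Python sketch is hypothetical and simplified; the class and field names are invented for this example and do not match the real HDFS Java implementation:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of name-node metadata: inodes for files/directories
# plus a map from each block to the data nodes that hold a replica.
@dataclass
class Inode:
    path: str
    modification_time: float = 0.0
    access_time: float = 0.0
    block_ids: list = field(default_factory=list)

@dataclass
class NameNodeMetadata:
    inodes: dict = field(default_factory=dict)           # path -> Inode
    block_locations: dict = field(default_factory=dict)  # block_id -> set of data-node ids

    def add_file(self, path, block_ids):
        self.inodes[path] = Inode(path, block_ids=list(block_ids))

    def record_replica(self, block_id, datanode_id):
        # called when a data node reports that it stores a copy of a block
        self.block_locations.setdefault(block_id, set()).add(datanode_id)

    def replica_count(self, block_id):
        return len(self.block_locations.get(block_id, set()))
```

With such a structure the name node can answer both namespace questions (which blocks make up a file) and placement questions (which data nodes hold a given block).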
The data nodes are optimized for read and write requests from the file system's clients. Data nodes have no knowledge of the HDFS namespace. Their main functions are to create, delete, and replicate data blocks according to instructions from the responsible name node.4)
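The division of labour can be sketched as a data node that only executes commands. This is a hypothetical, in-memory illustration (the command names and signatures are invented, not the real DataNode protocol):

```python
# Hypothetical sketch: a data node stores raw blocks and blindly applies
# "create", "delete", and "replicate" commands issued by the name node.
class DataNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.blocks = {}  # block_id -> block contents (bytes)

    def apply_command(self, command, block_id, payload=None, peer=None):
        if command == "create":
            self.blocks[block_id] = payload
        elif command == "delete":
            self.blocks.pop(block_id, None)
        elif command == "replicate":
            # push a copy of a local block to another data node
            peer.blocks[block_id] = self.blocks[block_id]
        else:
            raise ValueError(f"unknown command: {command}")
```

Note that the data node never decides on its own which blocks to copy or delete; that decision always comes from the name node.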
The data nodes send a heartbeat to the name node at a fixed time interval. If the name node gets no heartbeat from a data node for a certain time, it declares that data node dead. Clients can no longer access the data on the dead node, just as with a removed file. The name node then instructs the remaining data nodes to copy the affected blocks to other data nodes (replication) in order to save the data. This action prevents data loss, which would otherwise occur with a higher probability, and brings the replication factor back to its configured value.5)
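The heartbeat-based failure detection above can be sketched as follows. This is a minimal illustration with an invented class and timeout parameter, not the actual name-node code:

```python
import time

# Hypothetical sketch of heartbeat-based failure detection: the name node
# remembers the last heartbeat time per data node and declares any node
# dead whose last heartbeat is older than the timeout.
class HeartbeatMonitor:
    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_heartbeat = {}  # datanode_id -> timestamp of last heartbeat

    def heartbeat(self, datanode_id, now=None):
        self.last_heartbeat[datanode_id] = now if now is not None else time.time()

    def dead_nodes(self, now=None):
        now = now if now is not None else time.time()
        return [node for node, t in self.last_heartbeat.items()
                if now - t > self.timeout]
```

Once a node appears in `dead_nodes()`, the name node would compare the actual replica count of that node's blocks against the replication factor and schedule new copies for any block that falls short.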
The HDFS communication protocol is based on TCP/IP. An HDFS client opens a connection to a configurable TCP port on the name node machine and then talks to the name node using a Remote Procedure Call (RPC)-based protocol. The name node acts purely as a server and never initiates a request; it only responds to RPC requests issued by data nodes or clients.6)
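The request/response pattern can be illustrated with a minimal TCP exchange. This is not the real HDFS wire protocol; the JSON framing, the port handling, and the method name used below are invented for the sketch, which only demonstrates that the server side answers requests and never initiates one:

```python
import json
import socket
import threading

# Hypothetical sketch of an RPC-style exchange over TCP: the "name node"
# listens on a port and only ever responds to a client's request.
def serve_once(server_socket):
    conn, _ = server_socket.accept()
    with conn:
        request = json.loads(conn.recv(4096).decode())
        response = {"method": request["method"], "result": "ok"}
        conn.sendall(json.dumps(response).encode())

def call(host, port, method):
    # the client initiates the connection and sends the request
    with socket.create_connection((host, port)) as s:
        s.sendall(json.dumps({"method": method}).encode())
        return json.loads(s.recv(4096).decode())

server = socket.socket()
server.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
server.listen(1)
port = server.getsockname()[1]
threading.Thread(target=serve_once, args=(server,), daemon=True).start()

reply = call("127.0.0.1", port, "getBlockLocations")
```

In real HDFS the port is configurable in the cluster configuration, and the RPC protocol carries typed method calls rather than ad-hoc JSON, but the direction of communication is the same: clients and data nodes call in, the name node answers.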