Saturday, 2 January 2016

HDFS Architecture

HDFS stands for Hadoop Distributed File System; it is designed to store very large files in a Hadoop cluster. It is fault tolerant and can run on commodity hardware. Hadoop is written in Java, so you can run Hadoop on any platform that supports Java.

Distributed File System
A distributed file system is software that manages storage across a network of machines.

Hadoop runs on Commodity Hardware
HDFS doesn't require very expensive hardware; it can run on a cluster of commodity machines (like your laptops, desktops, low-cost servers, etc.).

HDFS Coherence model
HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed.

HDFS Architecture
HDFS follows a master-slave architecture. There are two main components in HDFS:
a.   Namenode
b.   Datanode

The following image is taken from the Hadoop documentation.


Namenode
The Namenode is the master server; it maintains metadata about the data stored in HDFS and records every change to the file system namespace. The default block size is 64MB in Hadoop 1.x and 128MB in Hadoop 2.x. Suppose a client wants to store 270MB of data (assume the block size is configured to 64MB) in HDFS; the following sequence happens.

a.   The 270MB is divided into 5 blocks (64MB, 64MB, 64MB, 64MB, 14MB; see the sketch after this list). The client sends a request to the Namenode (specifying metadata about the file) saying that it wants to store 5 blocks of data. The Namenode responds to the client by specifying the data nodes on which to store the data.

Let's assume the Namenode sends a response like the following.
Store Block1 in Datanode5
Store Block2 in Datanode135
Store Block3 in Datanode1124
Store Block4 in Datanode3
Store Block5 in Datanode87

b.   Now the client sends the data blocks to the specified data nodes. (Replication is not discussed here; it is covered in a subsequent section.)
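
To make the block arithmetic concrete, here is a minimal Java sketch (illustrative only, not the Hadoop client's actual splitting code) that computes how a 270MB file breaks into 64MB blocks:

public class BlockSplitDemo {
    public static void main(String[] args) {
        long fileSize = 270L * 1024 * 1024;      // 270MB file
        long blockSize = 64L * 1024 * 1024;      // configured block size (64MB)
        long fullBlocks = fileSize / blockSize;  // 4 full blocks
        long lastBlockMb = (fileSize % blockSize) / (1024 * 1024); // 14MB remainder
        System.out.println(fullBlocks + " x 64MB blocks + " + lastBlockMb + "MB final block");
    }
}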

In the same way, if the client wants to read a file “A” from HDFS, it sends a request to the Namenode enquiring about the datanodes where file “A” is located. Once the Namenode returns the metadata, the client reads the data blocks from the respective data nodes.
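
This metadata lookup is available through the Hadoop Java client. The following is a minimal sketch, assuming a reachable cluster and a hypothetical file /user/demo/A.txt; getFileBlockLocations asks the Namenode which datanodes hold each block, without transferring any block data:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/A.txt");  // hypothetical file
        FileStatus status = fs.getFileStatus(file);
        // Metadata-only call: the Namenode answers; no block data is transferred.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}

The returned hosts are the candidate datanodes for a read; as discussed later, HDFS prefers the replica closest to the client.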

Datanodes
Datanodes are the actual place where the data is stored. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.

Replication Factor
An application can specify how many replicas of each block are to be stored in HDFS. The default replication factor in Hadoop is 3; that means each data block has 3 copies in the Hadoop cluster. The replicas of a data block are always kept on different machines.
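
As a small illustration (the path is hypothetical), the Hadoop Java API lets a client set the default replication for the files it creates via the dfs.replication property, or change the replication factor of an existing file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");  // default replication for files created by this client
        FileSystem fs = FileSystem.get(conf);
        // Change the replication factor of an existing file (hypothetical path);
        // the Namenode schedules extra copies or removes surplus ones in the background.
        fs.setReplication(new Path("/user/demo/A.txt"), (short) 2);
        fs.close();
    }
}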
 
How replication happens
Let's suppose you want to save a file “A.txt” of size 270MB to HDFS. The following sequence of steps takes place.

1.    The default block size is 64MB in Hadoop 1.x and 128MB in Hadoop 2.x, and it is configurable. To demonstrate the example, assume the block size is 64MB, so the client divides the file 'A.txt' into 5 blocks.
Block1 – 64MB
Block2 – 64MB
Block3 – 64MB
Block4 – 64MB
Block5 – 14MB

The client sends the metadata of the file “A.txt” and the block details to the Namenode.
2.    The Namenode receives the client request, checks its metadata information about HDFS, and responds to the client by specifying where to store each block. For example, the response may look like the following.

If the replication factor is 3:

Store Block1 in Datanodes 153, 21, 39
Store Block2 in Datanodes 13, 543, 9
Store Block3 in Datanodes 1, 243, 567
Store Block4 in Datanodes 97, 321, 6
Store Block5 in Datanodes 64, 765, 98

3. The Namenode sends the block names and datanode locations for storing the data in HDFS. The client then copies each block to the closest data node. That data node is responsible for copying the block to a second datanode, and the second finally copies it to the third. After the block is written successfully to the third data node, the third datanode sends an acknowledgement to the second, the second sends an acknowledgement to the first, and the first sends an acknowledgement to the client.

As you observe the flow, the client copies the data to only one of the data nodes, and the framework takes care of the replication between datanodes.

Note
The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. The replication factor can be specified at file creation time and can be changed later.
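
As a sketch of that note (the path and values are examples, not recommendations), the FileSystem.create overload below takes a per-file replication factor and block size at creation time; the client writes the bytes once and the datanode pipeline performs the replication:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        short replication = 3;               // per-file replication factor
        long blockSize = 64L * 1024 * 1024;  // per-file block size: 64MB
        FSDataOutputStream out = fs.create(
                new Path("/user/demo/A.txt"),  // hypothetical path
                true,                          // overwrite if the file exists
                4096,                          // I/O buffer size in bytes
                replication,
                blockSize);
        out.writeUTF("hello HDFS");  // client writes once; the pipeline replicates
        out.close();
        fs.close();
    }
}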

What is Block report
All datanodes periodically send the Namenode a report containing the list of blocks they store; this report is called a block report.

What is heartbeat
The Namenode periodically receives heartbeats from the datanodes in the cluster. Receipt of a heartbeat implies that the DataNode is functioning properly.

How does HDFS select a data node to satisfy a client read request?
HDFS tries to satisfy a read request from a datanode that is closest to the client.

How do file deletion and restore happen?
Hadoop documentation says…
When a user or an application deletes a file, it is not immediately removed from HDFS. Instead, HDFS first renames it to a file in the /trash directory. The file can be restored quickly as long as it remains in /trash. A file remains in /trash for a configurable amount of time. After the expiry of its life in /trash, the NameNode deletes the file from the HDFS namespace. The deletion of a file causes the blocks associated with the file to be freed. Note that there could be an appreciable time delay between the time a file is deleted by a user and the time of the corresponding increase in free space in HDFS.

A user can undelete a file after deleting it as long as it remains in the /trash directory. To undelete a file, the user can navigate to the /trash directory and retrieve it. The /trash directory contains only the latest copy of each deleted file. The /trash directory is just like any other directory, with one special feature: HDFS applies specified policies to automatically delete files from this directory. The current default trash interval is 0 (files are deleted without being stored in trash). This value is a configurable parameter, fs.trash.interval, stored in core-site.xml.
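
As an illustrative sketch (the path is hypothetical), the Hadoop Java API exposes this behaviour through the org.apache.hadoop.fs.Trash class; note that fs.trash.interval is expressed in minutes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

public class TrashDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.trash.interval", "1440");  // keep trashed files for 1 day (value in minutes)
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/A.txt");  // hypothetical path
        // Move the file into the trash directory instead of deleting it permanently;
        // it can be restored by copying it back before the interval expires.
        boolean moved = Trash.moveToAppropriateTrash(fs, file, conf);
        System.out.println("moved to trash: " + moved);
        fs.close();
    }
}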

How file system meta data persists
The Namenode maintains two components to persist the file system metadata.

1. EditLog: a transaction log that persistently records every change that occurs to the file system metadata.

2. FsImage: a snapshot of the file system metadata (such as which blocks are stored at which locations) at a given point in time. The entire file system namespace, including the mapping of blocks to files and the file system properties, is stored in the FsImage. You can think of the FsImage as a periodic snapshot built from the EditLog. If a Namenode crashes, it can restore its state by first loading the FsImage and then replaying all the operations (also called edits or transactions) in the edit log to catch up to the most recent state of the namesystem.
 
The latest FsImage is created by merging the previous FsImage with the changes in the edit log file.
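
To illustrate the pattern, here is a toy sketch (not Hadoop's actual Namenode code; the operations shown are invented examples): state is restored by loading the last snapshot and then replaying the logged transactions in order.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of the FsImage + EditLog pattern (not Hadoop's actual code).
public class CheckpointDemo {
    static Map<String, String> loadFsImage() {  // last checkpoint of the namespace
        Map<String, String> namespace = new HashMap<>();
        namespace.put("/user/demo/A.txt", "blocks 1-5");
        return namespace;
    }

    static List<String[]> readEditLog() {  // changes recorded since the checkpoint
        List<String[]> edits = new ArrayList<>();
        edits.add(new String[] {"CREATE", "/user/demo/B.txt", "blocks 6-7"});
        edits.add(new String[] {"DELETE", "/user/demo/A.txt", ""});
        return edits;
    }

    public static void main(String[] args) {
        Map<String, String> namespace = loadFsImage();
        for (String[] edit : readEditLog()) {  // replay transactions in order
            if (edit[0].equals("CREATE")) namespace.put(edit[1], edit[2]);
            else if (edit[0].equals("DELETE")) namespace.remove(edit[1]);
        }
        System.out.println(namespace);  // most recent state of the namespace
    }
}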

Does the NameNode perform the merging of the FsImage and edit log file?

No. The Secondary Namenode is responsible for merging the FsImage and edit log files. It periodically fetches the current FsImage and edit log from the Namenode, merges them into a new FsImage, and sends the new FsImage back to the Namenode.

Once the merging process completes successfully,

a.   the old FsImage is replaced with the new FsImage, and

b.   the edit log file is cleared.

This merging (checkpoint) process runs every hour by default, and the interval is configurable (fs.checkpoint.period in Hadoop 1.x, dfs.namenode.checkpoint.period in Hadoop 2.x, both in seconds with a default of 3600).

What will happen when a name node is down?

In Hadoop 1.x, the name node is a single point of failure: when the name node fails, the entire system stops working, and manual intervention is required to make the system up and running again.

Hadoop 2.x addresses this issue with HDFS High Availability: a standby name node keeps its state in sync with the active name node (by reading the same edit log, typically through a shared edits store such as the Quorum Journal Manager) and takes over when the active name node goes down. Note that this standby is different from the Secondary Namenode described above, which only performs checkpointing and cannot take over as the primary.


What will happen when a data node is down?

When a data node is down (that is, when the Namenode stops receiving its heartbeats),

a.   the Namenode re-replicates the blocks that were stored on the failed node to other active data nodes in the cluster, and

b.   the Namenode removes all the entries of the failed data node from its metadata.

