HDFS | Hadoop Developer Self Learning Outline
With reference to my earlier post on the Hadoop Developer Self Learning Outline, I am going to write a short and simple tutorial on HDFS.
In this post I am going to cover the following topics:
- Concepts (distributed storage, horizontal scaling, replication, rack awareness)
- Architecture
- Namenode (function, storage, file system meta-data, and block reports)
- Secondary namenode
- Data node
- Configuration files
- Single node and multi node installation
- Communications / heart-beats
- Block manager / balancer
- Health check / safemode
Concepts (Distributed storage, horizontal scaling, replication, rack awareness)
1. Hadoop Distributed File System (HDFS): Basic Concepts

- HDFS is a filesystem written in Java, based on Google's GFS.
- It sits on top of a native file system such as ext3, ext4, or XFS.
- It provides redundant storage for massive amounts of data using readily available, industry-standard computers.
- HDFS performs best with a 'modest' number of large files:
  - millions, rather than billions, of files;
  - each file typically 100 MB or more.
- Files in HDFS are 'write once'; no random writes to files are allowed.
- HDFS is optimized for large, streaming reads of files rather than random reads.
Rack awareness and HA will be covered in another post.
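As a back-of-envelope illustration of the concepts above, here is a minimal sketch (plain Python, not Hadoop code) of how block count and raw storage work out, assuming the common defaults of a 128 MB block size and a replication factor of 3 (both actually set in hdfs-site.xml):

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed default dfs.blocksize (128 MB)
REPLICATION = 3                 # assumed default dfs.replication

def block_count(file_size, block_size=BLOCK_SIZE):
    """Number of HDFS blocks a file occupies (the last block may be partial)."""
    return math.ceil(file_size / block_size)

def raw_storage(file_size, replication=REPLICATION):
    """Total cluster bytes consumed by one file; HDFS does not pad the last block."""
    return file_size * replication

# A 300 MB file occupies 3 blocks and 900 MB of raw disk across the cluster.
size = 300 * 1024 * 1024
```

This is also why HDFS prefers a modest number of large files: each file and block, however small, costs the NameNode a fixed amount of in-memory metadata.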
2. Architecture
3. Functions of a NameNode

- Manages the file system namespace:
  - maps a file name to a set of blocks;
  - maps a block to the DataNodes where it resides.
- Cluster configuration management.
- Replication engine for blocks.

NameNode Metadata

- Metadata is kept in memory:
  - the entire metadata lives in main memory;
  - there is no demand paging of metadata.
- Types of metadata:
  - list of files;
  - list of blocks for each file;
  - list of DataNodes for each block;
  - file attributes, e.g. creation time, replication factor.
- A transaction log records file creations, file deletions, etc.
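The metadata types above can be sketched as plain in-memory maps. This is an illustrative Python model, not the real Java implementation; all names here are made up:

```python
class NameNodeMetadata:
    """Toy model of the in-memory metadata a NameNode maintains."""

    def __init__(self):
        self.file_blocks = {}      # file path -> ordered list of block ids
        self.block_locations = {}  # block id  -> set of DataNode ids
        self.file_attrs = {}       # file path -> attributes (ctime, replication)
        self.edit_log = []         # transaction log of namespace changes

    def create_file(self, path, blocks, replication=3, ctime=0):
        self.file_blocks[path] = list(blocks)
        self.file_attrs[path] = {"replication": replication, "ctime": ctime}
        self.edit_log.append(("CREATE", path))  # logged, like the real edit log

    def add_block_location(self, block_id, datanode):
        self.block_locations.setdefault(block_id, set()).add(datanode)

    def locate(self, path):
        """Map a file name to its blocks, and each block to its DataNodes."""
        return [(b, sorted(self.block_locations.get(b, set())))
                for b in self.file_blocks[path]]
```

Because everything above sits in main memory with no demand paging, the number of files and blocks directly bounds how much RAM the NameNode needs.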
4. Secondary NameNode

- Copies the FsImage and transaction log from the NameNode to a temporary directory.
- Merges the FsImage and transaction log into a new FsImage in that temporary directory.
- Uploads the new FsImage to the NameNode, after which the transaction log on the NameNode is purged.
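The essence of that checkpoint step is "replay the transaction log on top of the old FsImage to get a new one". A minimal sketch, with Python dicts standing in for the real binary FsImage format:

```python
def merge_checkpoint(fsimage, edit_log):
    """Return a new fsimage with every logged transaction applied.

    fsimage:  path -> file metadata (a dict stands in for the real format)
    edit_log: list of (operation, path) tuples, oldest first
    """
    image = dict(fsimage)  # never mutate the old checkpoint in place
    for op, path in edit_log:
        if op == "CREATE":
            image[path] = {}
        elif op == "DELETE":
            image.pop(path, None)
    return image

# Once the new image is uploaded, the NameNode can safely purge the
# edit log entries that were replayed into it.
new_image = merge_checkpoint({"/a": {}}, [("CREATE", "/b"), ("DELETE", "/a")])
```

This is why the secondary NameNode is a checkpointing helper, not a hot standby: it compacts the log, but it does not serve client requests.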
5. DataNode

- A block server:
  - stores data in the local file system (e.g. ext3);
  - stores metadata of a block (e.g. CRC);
  - serves data and metadata to clients.
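The per-block CRC lets a DataNode detect a corrupt block before serving it. A minimal sketch using Python's `zlib.crc32` (HDFS uses its own CRC32 checksum files on disk; the in-memory dict here is just an illustration):

```python
import zlib

def store_block(storage, block_id, data):
    """Store a block's bytes together with its CRC32 checksum."""
    storage[block_id] = (data, zlib.crc32(data))

def read_block(storage, block_id):
    """Re-verify the checksum on read; raise if the block is corrupt."""
    data, crc = storage[block_id]
    if zlib.crc32(data) != crc:
        raise IOError(f"block {block_id} failed checksum verification")
    return data
```

In real HDFS, a client that hits a corrupt replica reports it to the NameNode and reads another replica instead.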
6. Configuration files and Installation

Single node setup can be found at
7. Configuration files

The files which need to be configured during setup are:

1. hadoop-env.sh
2. core-site.xml
3. yarn-site.xml
4. mapred-site.xml
5. hdfs-site.xml
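As a minimal example of what one of these looks like, here is a sketch of a single-node core-site.xml; the `fs.defaultFS` property is the standard one, while the host and port are placeholders you would adapt to your cluster:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- URI of the default file system: points clients at the NameNode -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```

Block size, replication factor, and NameNode/DataNode storage directories live in hdfs-site.xml, in the same property/name/value format.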
8. Communications / heart-beats

Heartbeats

- DataNodes send a heartbeat to the NameNode once every 3 seconds.
- The NameNode uses heartbeats to detect DataNode failure.
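The failure-detection side of this can be sketched as a last-seen-timestamp check. Note the timeout below is shortened for illustration; the real default before a DataNode is declared dead is on the order of 10 minutes, not 30 seconds:

```python
HEARTBEAT_INTERVAL = 3.0   # seconds; matches the default described above
DEAD_TIMEOUT = 30.0        # illustrative only; HDFS's real default is ~10.5 min

def live_datanodes(last_heartbeat, now, timeout=DEAD_TIMEOUT):
    """Split DataNodes into (live, dead) given last-heartbeat timestamps.

    last_heartbeat: DataNode id -> time of its most recent heartbeat
    """
    live = sorted(dn for dn, t in last_heartbeat.items() if now - t <= timeout)
    dead = sorted(dn for dn, t in last_heartbeat.items() if now - t > timeout)
    return live, dead
```

Once a DataNode lands in the dead list, the NameNode schedules new replicas for the blocks it held, which is the replication engine's job described below.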
9. Block manager / balancer

Block Placement

- Block report: each DataNode periodically sends a report of all its blocks to the NameNode.
- Facilitates pipelining of data: a DataNode forwards data to the other specified DataNodes.
- Current placement strategy:
  - one replica on the local node;
  - second replica on a remote rack;
  - third replica on the same remote rack;
  - additional replicas are placed randomly.
- Clients read from the nearest replica.
- The Hadoop developers would like to make this policy pluggable.
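The placement strategy above can be sketched as a small function. This is an illustrative model, not the real pluggable `BlockPlacementPolicy`; the rack map and node names are assumptions:

```python
import random

def place_replicas(writer, racks, n_replicas=3, rng=random):
    """Pick DataNodes per the default strategy described above.

    writer: node the client is writing from (gets the first replica)
    racks:  node -> rack id for every DataNode in the cluster
    """
    chosen = [writer]                                   # 1st: local node
    remote = [n for n in racks if racks[n] != racks[writer]]
    second = rng.choice(remote)                         # 2nd: a remote rack
    chosen.append(second)
    if n_replicas >= 3:
        same_remote = [n for n in racks
                       if racks[n] == racks[second] and n != second]
        chosen.append(rng.choice(same_remote))          # 3rd: same remote rack
    leftovers = [n for n in racks if n not in chosen]
    while len(chosen) < n_replicas and leftovers:       # extras: random
        chosen.append(leftovers.pop(rng.randrange(len(leftovers))))
    return chosen
```

The design intent is a trade-off: one off-rack copy survives a whole-rack failure, while keeping the second and third replicas on the same remote rack limits cross-rack write traffic.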
Replication Engine

- The NameNode detects DataNode failures and:
  - chooses new DataNodes for new replicas;
  - balances disk usage;
  - balances communication traffic to DataNodes.
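The disk-usage balancing part can be sketched as: flag nodes whose utilization strays from the cluster average by more than a threshold, then move blocks from the over- to the under-utilized side. The 10% threshold mirrors the HDFS balancer's default; everything else here is an illustrative stand-in:

```python
def utilization_outliers(used, capacity, threshold=0.10):
    """Return (over, under): nodes whose disk utilization deviates from
    the cluster-wide average by more than `threshold`.

    used, capacity: node -> bytes used / total bytes on that node
    """
    avg = sum(used.values()) / sum(capacity.values())
    over = sorted(n for n in used if used[n] / capacity[n] > avg + threshold)
    under = sorted(n for n in used if used[n] / capacity[n] < avg - threshold)
    return over, under
```

A balancing pass would then pair nodes from the two lists and schedule block moves until both fall inside the threshold band.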
10. Health check / safemode

During startup the NameNode stays in safemode, a read-only state, until enough DataNodes have reported their blocks. You can inspect or control it with `hdfs dfsadmin -safemode get|enter|leave`, and get an overall cluster health report with `hdfs dfsadmin -report`.
FB Page: Hadoop Quiz
Comment for updates or changes.