HDFS | Hadoop Developer Self Learning Outline
With reference to my earlier post on the Hadoop Developer Self Learning Outline, I am going to write a short and simple tutorial on HDFS.
In this post I am going to cover the following topics:
- Concepts (distributed storage, horizontal scaling, replication, rack awareness)
- Architecture
- Namenode (function, storage, file system meta-data, and block reports)
- Secondary namenode
- Data node
- Configuration files
- Single node and multi node installation
- Communications / heart-beats
- Block manager / balancer
- Health check / safemode
Concepts (Distributed storage, horizontal scaling, replication, rack awareness)
1. Hadoop Distributed File System (HDFS): Basic Concepts

- HDFS is a filesystem written in Java, based on Google's GFS.
- It sits on top of a native file system such as ext3, ext4, or XFS.
- It provides redundant storage for massive amounts of data using readily available, industry-standard computers.
- HDFS performs best with a 'modest' number of large files:
  - millions, rather than billions, of files;
  - each file typically 100 MB or more.
- Files in HDFS are 'write once'; no random writes to files are allowed.
- HDFS is optimized for large, streaming reads of files rather than random reads.
Rack awareness and HA will be covered in another post.
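As a back-of-envelope illustration of the concepts above, here is a minimal sketch (plain Python, not Hadoop code) of how block count and raw storage work out, assuming the common defaults of a 128 MB block size and a replication factor of 3 (both actually set in hdfs-site.xml):

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed default dfs.blocksize (128 MB)
REPLICATION = 3                 # assumed default dfs.replication

def block_count(file_size, block_size=BLOCK_SIZE):
    """Number of HDFS blocks a file occupies (the last block may be partial)."""
    return math.ceil(file_size / block_size)

def raw_storage(file_size, replication=REPLICATION):
    """Total cluster bytes consumed by one file; HDFS does not pad the last block."""
    return file_size * replication

# A 300 MB file occupies 3 blocks and 900 MB of raw disk across the cluster.
size = 300 * 1024 * 1024
```

This is also why HDFS prefers a modest number of large files: each file and block, however small, costs the NameNode a fixed amount of in-memory metadata.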
2. Architecture
3. Functions of a NameNode

- Manages the file system namespace:
  - maps a file name to a set of blocks;
  - maps a block to the DataNodes where it resides.
- Cluster configuration management.
- Replication engine for blocks.

NameNode Metadata

- Metadata is kept in memory:
  - the entire metadata lives in main memory;
  - there is no demand paging of metadata.
- Types of metadata:
  - list of files;
  - list of blocks for each file;
  - list of DataNodes for each block;
  - file attributes, e.g. creation time, replication factor.
- A transaction log records file creations, file deletions, etc.
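The metadata types above can be sketched as plain in-memory maps. This is an illustrative Python model, not the real Java implementation; all names here are made up:

```python
class NameNodeMetadata:
    """Toy model of the in-memory metadata a NameNode maintains."""

    def __init__(self):
        self.file_blocks = {}      # file path -> ordered list of block ids
        self.block_locations = {}  # block id  -> set of DataNode ids
        self.file_attrs = {}       # file path -> attributes (ctime, replication)
        self.edit_log = []         # transaction log of namespace changes

    def create_file(self, path, blocks, replication=3, ctime=0):
        self.file_blocks[path] = list(blocks)
        self.file_attrs[path] = {"replication": replication, "ctime": ctime}
        self.edit_log.append(("CREATE", path))  # logged, like the real edit log

    def add_block_location(self, block_id, datanode):
        self.block_locations.setdefault(block_id, set()).add(datanode)

    def locate(self, path):
        """Map a file name to its blocks, and each block to its DataNodes."""
        return [(b, sorted(self.block_locations.get(b, set())))
                for b in self.file_blocks[path]]
```

Because everything above sits in main memory with no demand paging, the number of files and blocks directly bounds how much RAM the NameNode needs.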
4. Secondary NameNode

- Copies the FsImage and transaction log from the NameNode to a temporary directory.
- Merges the FsImage and transaction log into a new FsImage in that temporary directory.
- Uploads the new FsImage to the NameNode, after which the transaction log on the NameNode is purged.
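The essence of that checkpoint step is "replay the transaction log on top of the old FsImage to get a new one". A minimal sketch, with Python dicts standing in for the real binary FsImage format:

```python
def merge_checkpoint(fsimage, edit_log):
    """Return a new fsimage with every logged transaction applied.

    fsimage:  path -> file metadata (a dict stands in for the real format)
    edit_log: list of (operation, path) tuples, oldest first
    """
    image = dict(fsimage)  # never mutate the old checkpoint in place
    for op, path in edit_log:
        if op == "CREATE":
            image[path] = {}
        elif op == "DELETE":
            image.pop(path, None)
    return image

# Once the new image is uploaded, the NameNode can safely purge the
# edit log entries that were replayed into it.
new_image = merge_checkpoint({"/a": {}}, [("CREATE", "/b"), ("DELETE", "/a")])
```

This is why the secondary NameNode is a checkpointing helper, not a hot standby: it compacts the log, but it does not serve client requests.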
5. DataNode

- A block server:
  - stores data in the local file system (e.g. ext3);
  - stores metadata of a block (e.g. CRC);
  - serves data and metadata to clients.
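The per-block CRC lets a DataNode detect a corrupt block before serving it. A minimal sketch using Python's `zlib.crc32` (HDFS uses its own CRC32 checksum files on disk; the in-memory dict here is just an illustration):

```python
import zlib

def store_block(storage, block_id, data):
    """Store a block's bytes together with its CRC32 checksum."""
    storage[block_id] = (data, zlib.crc32(data))

def read_block(storage, block_id):
    """Re-verify the checksum on read; raise if the block is corrupt."""
    data, crc = storage[block_id]
    if zlib.crc32(data) != crc:
        raise IOError(f"block {block_id} failed checksum verification")
    return data
```

In real HDFS, a client that hits a corrupt replica reports it to the NameNode and reads another replica instead.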
6. Configuration files and Installation

Single node setup can be found at
7. Configuration files

The files which need to be configured during setup are:

1. hadoop-env.sh
2. core-site.xml
3. yarn-site.xml
4. mapred-site.xml
5. hdfs-site.xml
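As a minimal example of what one of these looks like, here is a sketch of a single-node core-site.xml; the `fs.defaultFS` property is the standard one, while the host and port are placeholders you would adapt to your cluster:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- URI of the default file system: points clients at the NameNode -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```

Block size, replication factor, and NameNode/DataNode storage directories live in hdfs-site.xml, in the same property/name/value format.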
8. Communications / heart-beats

Heartbeats

- DataNodes send a heartbeat to the NameNode once every 3 seconds.
- The NameNode uses heartbeats to detect DataNode failure.
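The failure-detection side of this can be sketched as a last-seen-timestamp check. Note the timeout below is shortened for illustration; the real default before a DataNode is declared dead is on the order of 10 minutes, not 30 seconds:

```python
HEARTBEAT_INTERVAL = 3.0   # seconds; matches the default described above
DEAD_TIMEOUT = 30.0        # illustrative only; HDFS's real default is ~10.5 min

def live_datanodes(last_heartbeat, now, timeout=DEAD_TIMEOUT):
    """Split DataNodes into (live, dead) given last-heartbeat timestamps.

    last_heartbeat: DataNode id -> time of its most recent heartbeat
    """
    live = sorted(dn for dn, t in last_heartbeat.items() if now - t <= timeout)
    dead = sorted(dn for dn, t in last_heartbeat.items() if now - t > timeout)
    return live, dead
```

Once a DataNode lands in the dead list, the NameNode schedules new replicas for the blocks it held, which is the replication engine's job described below.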
9. Block manager / balancer

Block Placement

- Block report: each DataNode periodically sends a report of all its blocks to the NameNode.
- Facilitates pipelining of data: a DataNode forwards data to the other specified DataNodes.
- Current placement strategy:
  - one replica on the local node;
  - second replica on a remote rack;
  - third replica on the same remote rack;
  - additional replicas are placed randomly.
- Clients read from the nearest replica.
- The Hadoop developers would like to make this policy pluggable.
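The placement strategy above can be sketched as a small function. This is an illustrative model, not the real pluggable `BlockPlacementPolicy`; the rack map and node names are assumptions:

```python
import random

def place_replicas(writer, racks, n_replicas=3, rng=random):
    """Pick DataNodes per the default strategy described above.

    writer: node the client is writing from (gets the first replica)
    racks:  node -> rack id for every DataNode in the cluster
    """
    chosen = [writer]                                   # 1st: local node
    remote = [n for n in racks if racks[n] != racks[writer]]
    second = rng.choice(remote)                         # 2nd: a remote rack
    chosen.append(second)
    if n_replicas >= 3:
        same_remote = [n for n in racks
                       if racks[n] == racks[second] and n != second]
        chosen.append(rng.choice(same_remote))          # 3rd: same remote rack
    leftovers = [n for n in racks if n not in chosen]
    while len(chosen) < n_replicas and leftovers:       # extras: random
        chosen.append(leftovers.pop(rng.randrange(len(leftovers))))
    return chosen
```

The design intent is a trade-off: one off-rack copy survives a whole-rack failure, while keeping the second and third replicas on the same remote rack limits cross-rack write traffic.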
Replication Engine

- The NameNode detects DataNode failures and:
  - chooses new DataNodes for new replicas;
  - balances disk usage;
  - balances communication traffic to DataNodes.
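The disk-usage balancing part can be sketched as: flag nodes whose utilization strays from the cluster average by more than a threshold, then move blocks from the over- to the under-utilized side. The 10% threshold mirrors the HDFS balancer's default; everything else here is an illustrative stand-in:

```python
def utilization_outliers(used, capacity, threshold=0.10):
    """Return (over, under): nodes whose disk utilization deviates from
    the cluster-wide average by more than `threshold`.

    used, capacity: node -> bytes used / total bytes on that node
    """
    avg = sum(used.values()) / sum(capacity.values())
    over = sorted(n for n in used if used[n] / capacity[n] > avg + threshold)
    under = sorted(n for n in used if used[n] / capacity[n] < avg - threshold)
    return over, under
```

A balancing pass would then pair nodes from the two lists and schedule block moves until both fall inside the threshold band.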
10. Health check / safemode

During startup the NameNode stays in safemode, a read-only state, until enough DataNodes have reported their blocks. You can inspect or control it with `hdfs dfsadmin -safemode get|enter|leave`, and get an overall cluster health report with `hdfs dfsadmin -report`.
FB Page: Hadoop Quiz
Comment for updates or changes.