HDFS2 | Hadoop Developer Self Learning Outline
With reference to my earlier post on the Hadoop Developer Self Learning Outline, I am going to write a short and simple tutorial on HDFS.
In this post I am going to cover the following topics:
- Read / write path
- Navigating HDFS UI
- Command-line interaction with HDFS
- Filesystem abstractions
- Reading / writing files using the Java API
- Latest in HDFS
- Namenode HA and Federation
1. Read Path
[1] The client program starts with the Hadoop library JAR and a copy of the cluster configuration, which specifies the location of the namenode.
[2] The client begins by contacting the namenode, indicating the file it wants to read.
[3] The namenode validates the client's identity, either by simply trusting the client or by using an authentication protocol such as Kerberos.
[4] The client's identity is verified against the owner and permissions of the file.
[5] The namenode responds to the client with the ID of the first block and the list of datanodes on which a copy of the block can be found, sorted by their distance to the client. Distance to the client is measured according to Hadoop's rack topology.
[6] With the block IDs and datanode hostnames, the client can now contact the most appropriate datanode directly and read the block data it needs. This process repeats until all the blocks in the file have been read or the client closes the file stream.
JAVA API:
FileSystem fileSystem = FileSystem.get(conf);
Path path = new Path("/path/to/file.ext");
if (!fileSystem.exists(path)) {
    System.out.println("File does not exist");
    return;
}
FSDataInputStream in = fileSystem.open(path);
byte[] b = new byte[1024];
int numBytes = 0;
while ((numBytes = in.read(b)) > 0) {
    // manipulate the data that was read; here we simply print it
    System.out.println(new String(b, 0, numBytes));
}
in.close();
fileSystem.close();
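Both Java examples in this post are fragments: they assume a Configuration object named conf and the usual Hadoop and java.io imports. A minimal sketch of that assumed setup (the namenode address below is only a placeholder):
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Picks up core-site.xml / hdfs-site.xml from the classpath if present;
// otherwise the namenode can be set explicitly.
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // placeholder namenode address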
Write Path
[1] The client makes a request to open a file for writing using the Hadoop FileSystem APIs.
[2] A request is sent to the namenode to create the file metadata, provided the user has the necessary permission to do so. The file initially has no associated blocks.
[3] The namenode responds to the client indicating that the request was successful and that it should start writing data.
[4] The client library sends a request to the namenode asking for the set of datanodes to which the data should be written, and gets a list back from the namenode.
[5] The client makes a connection to the first datanode, which in turn makes a connection to the second, and the second datanode makes a connection to the third.
[6] The client starts writing data to the first datanode. The first datanode writes the data to disk as well as to the stream pointing to the second datanode. The second datanode writes the data to disk and to the connection pointing to the third datanode, and so on.
[7] Once the client is finished writing, it closes the stream, which flushes the remaining data and writes it to disk.
JAVA API:
FileSystem fileSystem = FileSystem.get(conf);
// Check if the file already exists
Path path = new Path("/path/to/file.ext");
if (fileSystem.exists(path)) {
    System.out.println("File " + path + " already exists");
    return;
}
// Create a new file and write data to it
FSDataOutputStream out = fileSystem.create(path);
// "source" is the path of the local file to copy into HDFS
InputStream in = new BufferedInputStream(new FileInputStream(new File(source)));
byte[] b = new byte[1024];
int numBytes = 0;
while ((numBytes = in.read(b)) > 0) {
    out.write(b, 0, numBytes);
}
// Close all the file descriptors
in.close();
out.close();
fileSystem.close();
2. Navigating HDFS UI
Apart from the CLI, Hadoop also provides web UI access.
Web UIs for the Common User
The default Hadoop ports are as follows:
| Daemon | Default Port | Configuration Parameter |
| HDFS Namenode | 50070 | dfs.http.address |
| HDFS Datanodes | 50075 | dfs.datanode.http.address |
| HDFS Secondarynamenode | 50090 | dfs.secondary.http.address |
| HDFS Backup/Checkpoint node | 50105 | dfs.backup.http.address |
| MR Jobtracker | 50030 | mapred.job.tracker.http.address |
| MR Tasktrackers | 50060 | mapred.task.tracker.http.address |
| MR JobHistory Server | 19888 | mapreduce.jobhistory.webapp.address |
| YARN ResourceManager | 8088 | yarn.resourcemanager.webapp.address |
| YARN Nodemanager | 8042 | yarn.nodemanager.webapp.address |
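These addresses can also be looked up programmatically instead of hard-coding the default ports. A small sketch using the Hadoop Configuration API, assuming a cluster configuration on the classpath (dfs.namenode.http-address is the newer name of the older dfs.http.address property):
import org.apache.hadoop.conf.Configuration;

public class WebUiAddresses {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Fall back to the classic defaults if the properties are not set
        String nn = conf.get("dfs.namenode.http-address",
                conf.get("dfs.http.address", "0.0.0.0:50070"));
        String rm = conf.get("yarn.resourcemanager.webapp.address", "0.0.0.0:8088");
        System.out.println("Namenode web UI:        http://" + nn);
        System.out.println("ResourceManager web UI: http://" + rm);
    }
}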
3. Command-line interaction with HDFS
To start HDFS-CLI, run the following command:
java -jar hdfs-cli-0.0.1-SNAPSHOT.jar
A detailed doc related to basic HDFS commands will be added soon.
Using the CLI, all commands can be executed. Let's take the example of Hadoop namenode commands.
Hadoop Namenode Commands (older versions)
| Command | Description |
| hadoop namenode -format | Format HDFS filesystem from Namenode |
| hadoop namenode -upgrade | Upgrade the NameNode |
| start-dfs.sh | Start HDFS daemons |
| stop-dfs.sh | Stop HDFS daemons |
| start-mapred.sh | Start MapReduce daemons |
| stop-mapred.sh | Stop MapReduce daemons |
| hadoop namenode -recover -force | Recover namenode metadata after a cluster failure (may lose data) |
All other Hadoop commands can be found at the Command Centre.
Reference material for the Hadoop command-line interface
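Besides running them from the shell, the same file system commands (the hadoop fs family) can be invoked from Java via Hadoop's FsShell class. A small sketch, assuming a cluster configuration on the classpath; the -ls / argument is just an example:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FsShell;
import org.apache.hadoop.util.ToolRunner;

public class ListRoot {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Equivalent to running "hadoop fs -ls /" from the command line
        int exitCode = ToolRunner.run(conf, new FsShell(), new String[] {"-ls", "/"});
        System.exit(exitCode);
    }
}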
Areas Where HDFS Is Not a Good Fit Today
- Low-latency data access
- Lots of small files
- Multiple writers, arbitrary file modifications
4. Abstracting the Block in HDFS
A block is the minimum amount of data that can be read or written. The default block size is 64 MB (128 MB in Hadoop 2.x and later). Files in HDFS are broken into block-sized chunks, which are stored as independent units. HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks. By making a block large enough, the time to transfer the data from the disk can be made significantly larger than the time to seek to the start of the block. Thus the time to transfer a large file made of multiple blocks operates at the disk transfer rate.
5. Benefits of Block Abstraction
A file can be larger than any single disk in the network. There’s nothing that requires the blocks from a file to be stored on the same disk, so they can take advantage of any of the disks in the cluster. Making the unit of abstraction a block rather than a file simplifies the storage subsystem. Blocks provide fault tolerance and availability. To insure against corrupted blocks and disk and machine failure, each block is replicated to a small number of physically separate machines (typically three). If a block becomes unavailable, a copy can be read from another location in a way that is transparent to the client.
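The block abstraction is visible from client code as well: the FileSystem API can report a file's block size and the datanodes holding each block replica. A small sketch, assuming a cluster configuration on the classpath and a hypothetical existing file /path/to/file.ext:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/path/to/file.ext"); // hypothetical file
        FileStatus status = fs.getFileStatus(path);
        System.out.println("Block size: " + status.getBlockSize() + " bytes");
        // One BlockLocation per block; getHosts() lists the datanodes holding replicas
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + ", length " + block.getLength()
                    + ", hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}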
6. Reading / Writing Files Using the Java API
This topic is covered in the read and write sections above; the Java API examples there show how the calls are made.
7. Latest in HDFS
Apache Hadoop 3.0.0-alpha2 has been released.
8. Namenode HA and Federation
In a typical HA cluster, two separate machines are configured as NameNodes. At any point in time, exactly one of the NameNodes is in an Active state, and the other is in a Standby state. The Active NameNode is responsible for all client operations in the cluster, while the Standby simply acts as a slave, maintaining enough state to provide a fast failover if necessary.
In order for the Standby node to keep its state synchronized with the Active node, the current implementation requires that the two nodes both have access to a directory on a shared storage device (e.g. an NFS mount from a NAS). This restriction will likely be relaxed in future versions.
In short, HA means:
- Two NameNodes: Active and Standby
Classic mode:
- One NameNode
- One "helper" node called the Secondary NameNode
- Bookkeeping, not a backup
A detailed doc can be found here.
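On the client side, HA shows up as a logical nameservice rather than a single namenode host. A minimal sketch of the relevant client-side configuration, assuming a hypothetical nameservice mycluster with namenodes on nn1-host and nn2-host (in practice these properties normally live in hdfs-site.xml rather than in code):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HaClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://mycluster"); // logical nameservice, not a host
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1-host:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2-host:8020");
        // Client-side proxy that fails over between the active and standby namenodes
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        FileSystem fs = FileSystem.get(conf);
        System.out.println(fs.exists(new Path("/"))); // works whichever namenode is active
        fs.close();
    }
}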
HDFS federation
HDFS federation, introduced in the 2.x release series, allows a cluster to scale by adding namenodes, each of which manages a portion of the filesystem namespace. For example, one namenode might manage all the files rooted under /user, say, and a second namenode might handle files under /share.
Under federation, each namenode manages a namespace volume, which is made up of the metadata for the namespace, and a block pool containing all the blocks for the files in the namespace.
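To keep that split transparent to clients, federation is often paired with a client-side mount table (viewfs). A rough sketch, assuming hypothetical namenodes nn1-host and nn2-host serving /user and /share respectively:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ViewFsClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Client-side mount table: each namenode owns part of the namespace
        conf.set("fs.defaultFS", "viewfs://clusterX");
        conf.set("fs.viewfs.mounttable.clusterX.link./user", "hdfs://nn1-host:8020/user");
        conf.set("fs.viewfs.mounttable.clusterX.link./share", "hdfs://nn2-host:8020/share");
        FileSystem fs = FileSystem.get(conf);
        // The client sees a single namespace; viewfs routes each path to the right namenode
        System.out.println(fs.exists(new Path("/user")));
        System.out.println(fs.exists(new Path("/share")));
        fs.close();
    }
}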
Comment for updates or changes.