HDFS2 | Hadoop Developer Self Learning Outline
With reference to my earlier post on the Hadoop Developer Self Learning Outline, I am going to write a short and simple tutorial on HDFS.
In this post I am going to cover the following topics:
- Read / write path
- Navigating HDFS UI
- Command-line interaction with HDFS
- Filesystem abstractions
- Reading / writing files using the Java API
- Latest in HDFS
- Namenode HA and Federation
1. Read Path
[1] The client program starts with the Hadoop library JAR and a copy of the cluster configuration, which specifies the location of the namenode.
[2] The client begins by contacting the namenode, indicating the file it wants to read.
[3] The namenode validates the client's identity, either by simply trusting the client or by using an authentication protocol such as Kerberos.
[4] The client's identity is verified against the owner and permissions of the file.
[5] The namenode responds to the client with the ID of the first block and the list of datanodes on which a copy of the block can be found, sorted by their distance to the client. Distance to the client is measured according to Hadoop's rack topology.
[6] With the block IDs and datanode hostnames, the client can now contact the most appropriate datanode directly and read the block data it needs. This process repeats until all the blocks in the file have been read or the client closes the file stream.
JAVA API:
FileSystem fileSystem = FileSystem.get(conf);
Path path = new Path("/path/to/file.ext");
if (!fileSystem.exists(path)) {
    System.out.println("File does not exist");
    return;
}
FSDataInputStream in = fileSystem.open(path);
byte[] b = new byte[1024];
int numBytes = 0;
while ((numBytes = in.read(b)) > 0) {
    // manipulate the data that was read; here we simply print it
    System.out.println(new String(b, 0, numBytes));
}
in.close();
fileSystem.close();
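Both Java examples in this post are fragments: they assume a Configuration object named conf and the usual Hadoop and java.io imports. A minimal sketch of that assumed setup (the namenode address below is only a placeholder):
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Picks up core-site.xml / hdfs-site.xml from the classpath if present;
// otherwise the namenode can be set explicitly.
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // placeholder namenode address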
Write Path
[1] The client makes a request to open a file for writing using the Hadoop FileSystem APIs.
[2] A request is sent to the namenode to create the file metadata, provided the user has the necessary permission to do so. The file initially has no associated blocks.
[3] The namenode responds to the client indicating that the request was successful and that it should start writing data.
[4] The client library sends a request to the namenode asking for the set of datanodes to which the data should be written, and gets a list back from the namenode.
[5] The client makes a connection to the first datanode, which in turn makes a connection to the second, and the second datanode makes a connection to the third.
[6] The client starts writing data to the first datanode. The first datanode writes the data to disk as well as to the stream pointing to the second datanode. The second datanode writes the data to disk and to the connection pointing to the third datanode, and so on.
[7] Once the client is finished writing, it closes the stream, which flushes the remaining data and writes it to disk.
JAVA API:
FileSystem fileSystem = FileSystem.get(conf);
// Check if the file already exists
Path path = new Path("/path/to/file.ext");
if (fileSystem.exists(path)) {
    System.out.println("File " + path + " already exists");
    return;
}
// Create a new file and write data to it
FSDataOutputStream out = fileSystem.create(path);
// "source" is the path of the local file to copy into HDFS
InputStream in = new BufferedInputStream(new FileInputStream(new File(source)));
byte[] b = new byte[1024];
int numBytes = 0;
while ((numBytes = in.read(b)) > 0) {
    out.write(b, 0, numBytes);
}
// Close all the file descriptors
in.close();
out.close();
fileSystem.close();
2. Navigating HDFS UI
Apart from the CLI, Hadoop also provides web UI access.
Web UIs for the Common User
The default Hadoop ports are as follows:
| Daemon | Default Port | Configuration Parameter |
| HDFS Namenode | 50070 | dfs.http.address |
| HDFS Datanodes | 50075 | dfs.datanode.http.address |
| HDFS Secondarynamenode | 50090 | dfs.secondary.http.address |
| HDFS Backup/Checkpoint node | 50105 | dfs.backup.http.address |
| MR Jobtracker | 50030 | mapred.job.tracker.http.address |
| MR Tasktrackers | 50060 | mapred.task.tracker.http.address |
| MR JobHistory Server | 19888 | mapreduce.jobhistory.webapp.address |
| YARN ResourceManager | 8088 | yarn.resourcemanager.webapp.address |
| YARN Nodemanager | 8042 | yarn.nodemanager.webapp.address |
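These addresses can also be looked up programmatically instead of hard-coding the default ports. A small sketch using the Hadoop Configuration API, assuming a cluster configuration on the classpath (dfs.namenode.http-address is the newer name of the older dfs.http.address property):
import org.apache.hadoop.conf.Configuration;

public class WebUiAddresses {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Fall back to the classic defaults if the properties are not set
        String nn = conf.get("dfs.namenode.http-address",
                conf.get("dfs.http.address", "0.0.0.0:50070"));
        String rm = conf.get("yarn.resourcemanager.webapp.address", "0.0.0.0:8088");
        System.out.println("Namenode web UI:        http://" + nn);
        System.out.println("ResourceManager web UI: http://" + rm);
    }
}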
3. Command-line interaction with HDFS
To start HDFS-CLI, run the following command:
java -jar hdfs-cli-0.0.1-SNAPSHOT.jar
A detailed doc related to basic HDFS commands will be added soon.
Using the CLI, all commands can be executed. Let's take the example of Hadoop namenode commands.
Hadoop Namenode Commands (older versions)
| Command | Description |
| hadoop namenode -format | Format HDFS filesystem from Namenode |
| hadoop namenode -upgrade | Upgrade the NameNode |
| start-dfs.sh | Start HDFS daemons |
| stop-dfs.sh | Stop HDFS daemons |
| start-mapred.sh | Start MapReduce daemons |
| stop-mapred.sh | Stop MapReduce daemons |
| hadoop namenode -recover -force | Recover namenode metadata after a cluster failure (may lose data) |
All other Hadoop commands can be found at the Command Centre.
Reference material for the Hadoop command-line interface
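Besides running them from the shell, the same file system commands (the hadoop fs family) can be invoked from Java via Hadoop's FsShell class. A small sketch, assuming a cluster configuration on the classpath; the -ls / argument is just an example:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FsShell;
import org.apache.hadoop.util.ToolRunner;

public class ListRoot {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Equivalent to running "hadoop fs -ls /" from the command line
        int exitCode = ToolRunner.run(conf, new FsShell(), new String[] {"-ls", "/"});
        System.exit(exitCode);
    }
}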
Areas Where HDFS Is Not a Good Fit Today
- Low-latency data access
- Lots of small files
- Multiple writers, arbitrary file modifications
4. Abstracting the Block in HDFS
A block is the minimum amount of data that can be read or written. The default block size is 64 MB (128 MB in Hadoop 2.x and later). Files in HDFS are broken into block-sized chunks, which are stored as independent units. HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks. By making a block large enough, the time to transfer the data from the disk can be made significantly larger than the time to seek to the start of the block. Thus the time to transfer a large file made of multiple blocks operates at the disk transfer rate.
5. Benefits of Block Abstraction
A file can be larger than any single disk in the network. There’s nothing that requires the blocks from a file to be stored on the same disk, so they can take advantage of any of the disks in the cluster. Making the unit of abstraction a block rather than a file simplifies the storage subsystem. Blocks provide fault tolerance and availability. To insure against corrupted blocks and disk and machine failure, each block is replicated to a small number of physically separate machines (typically three). If a block becomes unavailable, a copy can be read from another location in a way that is transparent to the client.
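The block abstraction is visible from client code as well: the FileSystem API can report a file's block size and the datanodes holding each block replica. A small sketch, assuming a cluster configuration on the classpath and a hypothetical existing file /path/to/file.ext:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/path/to/file.ext"); // hypothetical file
        FileStatus status = fs.getFileStatus(path);
        System.out.println("Block size: " + status.getBlockSize() + " bytes");
        // One BlockLocation per block; getHosts() lists the datanodes holding replicas
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + ", length " + block.getLength()
                    + ", hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}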
6. Reading / Writing Files Using the Java API
This topic is covered in the read and write sections above; the Java API examples there show how the calls are made.
7. Latest in HDFS
Apache Hadoop 3.0.0-alpha2 has been released.
8. Namenode HA and Federation
In a typical HA cluster, two separate machines are configured as NameNodes. At any point in time, exactly one of the NameNodes is in an Active state, and the other is in a Standby state. The Active NameNode is responsible for all client operations in the cluster, while the Standby simply acts as a slave, maintaining enough state to provide a fast failover if necessary.
In order for the Standby node to keep its state synchronized with the Active node, the current implementation requires that the two nodes both have access to a directory on a shared storage device (e.g. an NFS mount from a NAS). This restriction will likely be relaxed in future versions.
In short, HA means:
- Two NameNodes: Active and Standby
Classic mode:
- One NameNode
- One "helper" node called the Secondary NameNode
- Bookkeeping, not a backup
A detailed doc can be found here.
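On the client side, HA shows up as a logical nameservice rather than a single namenode host. A minimal sketch of the relevant client-side configuration, assuming a hypothetical nameservice mycluster with namenodes on nn1-host and nn2-host (in practice these properties normally live in hdfs-site.xml rather than in code):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HaClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://mycluster"); // logical nameservice, not a host
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1-host:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2-host:8020");
        // Client-side proxy that fails over between the active and standby namenodes
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        FileSystem fs = FileSystem.get(conf);
        System.out.println(fs.exists(new Path("/"))); // works whichever namenode is active
        fs.close();
    }
}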
HDFS federation
HDFS federation, introduced in the 2.x release series, allows a cluster to scale by adding namenodes, each of which manages a portion of the filesystem namespace. For example, one namenode might manage all the files rooted under /user, say, and a second namenode might handle files under /share.
Under federation, each namenode manages a namespace volume, which is made up of the metadata for the namespace, and a block pool containing all the blocks for the files in the namespace.
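To keep that split transparent to clients, federation is often paired with a client-side mount table (viewfs). A rough sketch, assuming hypothetical namenodes nn1-host and nn2-host serving /user and /share respectively:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ViewFsClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Client-side mount table: each namenode owns part of the namespace
        conf.set("fs.defaultFS", "viewfs://clusterX");
        conf.set("fs.viewfs.mounttable.clusterX.link./user", "hdfs://nn1-host:8020/user");
        conf.set("fs.viewfs.mounttable.clusterX.link./share", "hdfs://nn2-host:8020/share");
        FileSystem fs = FileSystem.get(conf);
        // The client sees a single namespace; viewfs routes each path to the right namenode
        System.out.println(fs.exists(new Path("/user")));
        System.out.println(fs.exists(new Path("/share")));
        fs.close();
    }
}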
Comment for updates or changes.