Hadoop Introduction 2 | Hadoop Developer Self Learning

This post follows on from my earlier post, Hadoop Developer Self Learning Outline.
I am going to write a short and simple tutorial on it.
In this post I am going to cover the following topics:

Pre knowledge: Understanding Big Data

Hadoop Introduction

Hadoop history and concepts
Ecosystem

The two topics above are covered in part one of Hadoop Introduction. In this post we are going to look at the Hadoop distributions, the factors that need to be considered when choosing one, and the Hadoop high level architecture.

Distributions
High level architecture

Distributions

Hadoop is an Apache top-level project.

Different vendors have worked on Hadoop and developed their own distributions. One should choose a distribution carefully.

You can refer to the considerations below when selecting a vendor.

4 Considerations when choosing a Hadoop Distribution

The top Hadoop distributions are described here. Some of them are free, some are premium, and some offer both free and premium editions (for example, Cloudera).

Amazon Elastic MapReduce
Cloudera CDH Hadoop Distribution
Hortonworks Data Platform (HDP)
MapR Hadoop Distribution
IBM Open Platform
Microsoft Azure's HDInsight - Cloud based Hadoop Distribution
Pivotal Big Data Suite
Datameer Professional
Datastax Enterprise Analytics
Dell-Cloudera Apache Hadoop Solution

A few popular vendors and their recent releases:

Vendor	Product evaluated	Product version evaluated
Cloudera	Cloudera Enterprise	5.5
Hortonworks	Hortonworks Data Platform	2.3
IBM	IBM BigInsights for Apache Hadoop	4.1
MapR Technologies	The MapR Distribution including Apache Hadoop	5
Pivotal Software	Pivotal HD	3.x

High level Architecture

Hadoop 1.0 architecture is shown below

Core Components

HDFS (Hadoop Distributed File System)

- Distributed Storage

MR framework (MapReduce)

- Parallel Processing/Computing

Hadoop 1.0 and 2.0:

YARN was introduced in Hadoop 2.0.

Description of the components

Apache HDFS

The Hadoop Distributed File System (HDFS) offers a way to store large files across multiple machines. Hadoop and HDFS were derived from the Google File System (GFS) paper. Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. With ZooKeeper, the HDFS High Availability feature addresses this problem by providing the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby.
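To make "storing large files across multiple machines" a bit more concrete, here is a small toy sketch in Python. It is not HDFS code. It only assumes the usual Hadoop 2.x defaults (128 MB blocks, 3 replicas, both configurable) and uses a simple round-robin placement, whereas real HDFS placement is rack-aware. The function names and the DataNode host names are made up for illustration.

```python
# Toy model of how HDFS stores a large file: split it into fixed-size blocks
# and keep several copies of each block on different machines.
# Simplified round-robin placement; real HDFS placement is rack-aware.

BLOCK_SIZE_MB = 128   # assumed default block size in Hadoop 2.x (configurable)
REPLICATION = 3       # assumed default replication factor (configurable)


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size in MB of each block for a file of the given size."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        # The last block may be smaller than the block size.
        blocks.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return blocks


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes, round-robin."""
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            datanodes[(block_id + i) % len(datanodes)]
            for i in range(replication)
        ]
    return placement


if __name__ == "__main__":
    datanodes = ["dn1", "dn2", "dn3", "dn4"]   # hypothetical host names
    blocks = split_into_blocks(300)            # a 300 MB file
    placement = place_replicas(len(blocks), datanodes)
    for block_id, size in enumerate(blocks):
        print(f"block {block_id} ({size} MB) -> {placement[block_id]}")
```

In this sketch the dictionary returned by place_replicas plays the role of the metadata that the NameNode keeps, which is why losing the only NameNode was such a problem before the High Availability feature.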

Apache MapReduce

MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. Apache MapReduce was derived from the Google paper "MapReduce: Simplified Data Processing on Large Clusters". The current Apache MapReduce version is built on top of the Apache YARN framework. YARN stands for “Yet Another Resource Negotiator”. It is a newer framework that makes it easier to write arbitrary distributed processing frameworks and applications. YARN's execution model is more generic than the earlier MapReduce implementation: unlike the original Apache Hadoop MapReduce (also called MR1), YARN can run applications that do not follow the MapReduce model. Hadoop YARN is an attempt to take Apache Hadoop beyond MapReduce for data processing.
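The MapReduce programming model is easier to follow with the classic word count example. The sketch below is plain Python running on one machine. It does not use the Hadoop API, and the function names (map_phase, shuffle_phase, reduce_phase, word_count) are my own. It only shows the three steps that the framework would otherwise spread across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    lines = ["Hadoop stores data", "Hadoop processes data"]
    print(word_count(lines))
```

On a real cluster each input line (or block of lines) could be mapped on a different machine, and each key could be reduced on a different machine, because no map call depends on another map call and no key depends on another key.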

(Source: Github)

Leave a comment for updates or changes.
