Hadoop Introduction 2 | Hadoop Developer Self Learning
This is a short and simple tutorial on Hadoop.
This post covers the following topics:
Prerequisite: Understanding Big Data
Hadoop Introduction
- Hadoop history and concepts
- Ecosystem
The two topics above are covered in part one of Hadoop Introduction. In this post we look at the Hadoop distributions, the factors to consider when choosing one, and the Hadoop high-level architecture:
- Distributions
- High level architecture
Hadoop is a top-level Apache project.
Several vendors have built their own distributions on top of Hadoop, so choose a distribution carefully.
You can refer to the considerations below when selecting a vendor.
The top Hadoop distributions are listed here. Some are free, some are premium, and some offer both free and premium editions (for example, Cloudera):
- Amazon Elastic MapReduce
- Cloudera CDH Hadoop Distribution
- Hortonworks Data Platform (HDP)
- MapR Hadoop Distribution
- IBM Open Platform
- Microsoft Azure's HDInsight (cloud-based Hadoop distribution)
- Pivotal Big Data Suite
- Datameer Professional
- Datastax Enterprise Analytics
- Dell - Cloudera Apache Hadoop Solution
A few popular vendors and their recent releases:
Vendor | Product evaluated | Product version evaluated |
Cloudera | Cloudera Enterprise | 5.5 |
Hortonworks | Hortonworks Data Platform | 2.3 |
IBM | IBM BigInsights for Apache Hadoop | 4.1 |
MapR Technologies | The MapR Distribution including Apache Hadoop | 5 |
Pivotal Software | Pivotal HD | 3.x |
High level Architecture
Hadoop 1.0 architecture is shown below
Core components
- HDFS (Hadoop Distributed File System): distributed storage
- MR framework (MapReduce): parallel processing/computing
Hadoop 1.0 and 2.0:
YARN was introduced in Hadoop 2.0.
Component descriptions
Apache HDFS
The Hadoop Distributed File System (HDFS) offers a way to
store large files across multiple machines. Hadoop and HDFS were derived from
the Google File System (GFS) paper. Prior to Hadoop 2.0.0, the NameNode was a
single point of failure (SPOF) in an HDFS cluster. With ZooKeeper, the HDFS
High Availability feature addresses this problem by providing the option of
running two redundant NameNodes in the same cluster in an Active/Passive
configuration with a hot standby.
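To make "storing large files across multiple machines" concrete, here is a minimal Python sketch of how a file could be cut into blocks and how each block could be copied to several DataNodes. It is only an illustration, not how HDFS is implemented: the 128 MB block size and the replication factor of 3 are assumed values, the DataNode names are hypothetical, and the round-robin placement is a simplification (real HDFS placement is rack-aware).

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes (128 MB)
REPLICATION = 3                 # assumed replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block a file of file_size would occupy."""
    if file_size <= 0:
        return []
    full, rest = divmod(file_size, block_size)
    # All blocks are full-sized except possibly the last one.
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplified for illustration; real HDFS placement is rack-aware.
    """
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            datanodes[(block_id + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


if __name__ == "__main__":
    blocks = split_into_blocks(300 * 1024 * 1024)  # a hypothetical 300 MB file
    nodes = ["dn1", "dn2", "dn3", "dn4"]           # hypothetical DataNode names
    for block_id, replicas in place_replicas(len(blocks), nodes).items():
        print(block_id, blocks[block_id], replicas)
```

In this sketch a 300 MB file becomes two full blocks and one smaller block, and every block lives on three different nodes, so losing a single DataNode does not lose any data.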
Apache MapReduce
MapReduce is a programming model for processing large data
sets with a parallel, distributed algorithm on a cluster. Apache MapReduce
was derived from the Google paper "MapReduce: Simplified Data Processing on
Large Clusters". The current Apache MapReduce version is built on the Apache
YARN framework. YARN stands for “Yet Another Resource Negotiator”. It is a
framework that facilitates writing arbitrary distributed processing
frameworks and applications. YARN’s execution model is more generic than the
earlier MapReduce implementation. Unlike the original Apache Hadoop MapReduce
(also called MR1), YARN can run applications that do not follow the MapReduce
model. Hadoop YARN is an attempt to take Apache Hadoop beyond MapReduce
for data processing.
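The MapReduce programming model itself can be sketched in a few lines of plain Python. The example below is a single-process word count, not the Hadoop API: the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for illustration. On a real cluster the framework runs many mappers and reducers in parallel on different machines and does the shuffle between them for you.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over a list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data node"]))
```

The point of the split is that each mapper only needs its own line and each reducer only needs the values for its own key, which is what lets the work be spread over many machines.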
(Source: Github)