Hadoop Introduction | Hadoop Developer Self Learning
With reference to my earlier post, I am going to write a short and simple tutorial on Hadoop.
In this post I am going to cover the following topics:
- Hadoop history and concepts
- Ecosystem
- Distributions
- High level architecture
Prerequisite: Understanding Big Data
Hadoop history and concepts
About Hadoop
- Apache top-level project: an open-source implementation of frameworks for reliable, scalable, distributed computing and data storage.
- A flexible and highly available architecture for large-scale computation and data processing on a network of commodity hardware.
Why Hadoop?
- Designed to answer the question: “How can we process big data with reasonable cost and time?”
Who developed Hadoop?
- Doug Cutting
- Michael J. Cafarella
Short timeline of Hadoop development (2003-2017)
The history of releases and the development process is summarized below.
Year | Month | Event
---|---|---
2003 | October | Google File System paper released
2004 | December | "MapReduce: Simplified Data Processing on Large Clusters" paper released
2006 | January | Hadoop subproject created with mailing lists, jira, and wiki
2006 | January | Hadoop is born from Nutch (issue NUTCH-197)
2006 | February | NDFS + MapReduce moved out of Apache Nutch to create Hadoop
2006 | April | Hadoop 0.1.0 released
2006 | April | Hadoop sorts 1.8 TB on 188 nodes in 47.9 hours
2007 | October | First release of Hadoop that includes HBase
2007 | October | Yahoo Labs creates Pig and donates it to the ASF
2008 | May | Hadoop wins the TeraByte Sort benchmark (world record, sortbenchmark.org)
2008 | October | Cloudera, a Hadoop distributor, is founded
2011 | June | Rob Bearden and Eric Baldeschwieler spin Hortonworks out of Yahoo
2012 | January | Hadoop community moves to separate MapReduce out and replace it with YARN
2012 | November | Apache Hadoop 1.0 available
2014 | February | Apache Spark becomes a top-level Apache project
2014 | June | Apache Hadoop 2.4 available
2014 | August | Apache Hadoop 2.5 available
2014 | November | Apache Hadoop 2.6 available
2015 | June | Apache Hadoop 2.7 available
2017 | March | Apache Hadoop 2.8 available
Source: Wiki
Hadoop versioning explained
With each new release, Hadoop updates its version number, which follows the pattern X.Y.Z. For example, in Hadoop 2.7.3, 2 is the major version, 7 the minor version, and 3 the maintenance release.
A detailed document about Hadoop versioning is available in Hadoop Version Explanation.
What is Hadoop?
A brief note about it:
- Apache Hadoop is an open-source software framework for storing and processing large data sets.
- Hadoop provides a fault-tolerant distributed filesystem for storage and parallel computing for processing.
- No high-end or expensive systems are required; it is built on commodity hardware and can even run on your own machine.
- It runs on Linux, Mac OS X, Windows, and Solaris.
- It is fault tolerant: job execution continues even when nodes fail.
- It offers a highly reliable and efficient storage system.
Setup of the Hadoop ecosystem on Ubuntu is covered in the Installation part.
Main features of Hadoop
- Distributed storage
- Fault tolerance
- Horizontal scalability
- Open source
- Commodity hardware
- Parallel processing
Ecosystem
- Distributions
- High-level architecture (this topic will be covered in my next post)
- Get your hands dirty with the Hadoop framework
Comment for updates or changes.
A detailed overview of the Hadoop ecosystem components is given below. (Source: GitHub)
Apache HDFS
The Hadoop Distributed File System (HDFS) offers a way to store large files across multiple machines. Hadoop and HDFS were derived from the Google File System (GFS) paper. Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. With ZooKeeper, the HDFS High Availability feature addresses this problem by providing the option of running two redundant NameNodes in the same cluster in an active/passive configuration with a hot standby.
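As a quick illustration, here is a minimal sketch of writing and reading a file through the HDFS Java API. The path and file name are hypothetical, and it assumes a running HDFS plus the hadoop-client library on the classpath:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt"); // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello, HDFS");
        }
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}
```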
Apache MapReduce
MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. Apache MapReduce was derived from Google's "MapReduce: Simplified Data Processing on Large Clusters" paper. The current Apache MapReduce version is built over the Apache YARN framework. YARN stands for "Yet Another Resource Negotiator". It is a newer framework that facilitates writing arbitrary distributed processing frameworks and applications. YARN's execution model is more generic than the earlier MapReduce implementation: YARN can run applications that do not follow the MapReduce model, unlike the original Apache Hadoop MapReduce (also called MR1). Hadoop YARN is an attempt to take Apache Hadoop beyond MapReduce for data processing.
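To make the programming model concrete, here is the classic word-count job written against the org.apache.hadoop.mapreduce Java API (input and output directories are passed as arguments):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: emit (word, 1) for every token in the input split
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, it runs with something like `hadoop jar wordcount.jar WordCount <input dir> <output dir>`.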
Apache Pig
Pig provides an engine for executing data flows in parallel on Hadoop. It includes a language, Pig Latin, for expressing these data flows. Pig Latin includes operators for many of the traditional data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data. Pig runs on Hadoop, making use of both the Hadoop Distributed File System (HDFS) and Hadoop's processing system, MapReduce. Pig uses MapReduce to execute all of its data processing: it compiles the Pig Latin scripts that users write into a series of one or more MapReduce jobs that it then executes. Pig Latin looks different from many of the programming languages you have seen; there are no if statements or for loops in Pig Latin. This is because traditional procedural and object-oriented programming languages describe control flow, with data flow as a side effect of the program. Pig Latin instead focuses on data flow.
Apache Spark
A data analytics cluster computing framework originally developed in the AMPLab at UC Berkeley. Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS). However, Spark provides an easier-to-use alternative to Hadoop MapReduce and offers performance up to 10 times faster than previous-generation systems like Hadoop MapReduce for certain applications. Spark is a framework for writing fast, distributed programs. It solves similar problems to Hadoop MapReduce but with a fast in-memory approach and a clean functional-style API. With its ability to integrate with Hadoop and its built-in tools for interactive query analysis (Shark), large-scale graph processing and analysis (Bagel), and real-time analysis (Spark Streaming), it can be used interactively to quickly process and query big data sets. To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big data sets. Spark is also the engine behind Shark, a fully Apache Hive-compatible data warehousing system that can run 100x faster than Hive.
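For a feel of that functional style, here is a minimal word count using Spark's Java RDD API (written against the Spark 2.x API; the input path is hypothetical, and `local[*]` is used only so the sketch runs without a cluster):

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // local[*] runs Spark in-process; on a cluster you would use spark-submit
        SparkConf conf = new SparkConf().setAppName("SparkWordCount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            sc.textFile("hdfs:///user/demo/input.txt")                 // hypothetical path
              .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
              .mapToPair(word -> new Tuple2<>(word, 1))
              .reduceByKey(Integer::sum)                               // in-memory aggregation
              .collect()
              .forEach(pair -> System.out.println(pair._1() + ": " + pair._2()));
        }
    }
}
```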
Apache HBase
Inspired by Google BigTable. A non-relational distributed database supporting random, real-time read/write operations on very large column-oriented tables (BDDB: Big Data DataBase). It is the Hadoop database, commonly used as the backing store for MapReduce job outputs: Hadoop MapReduce jobs can read from and write to Apache HBase tables.
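A minimal sketch of those random read/write operations with the HBase Java client, assuming a table named "users" with a column family "info" already exists (both names are hypothetical):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseHello {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Write one cell: row "row1", column info:name
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // Random, real-time read of the same cell
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```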
Apache Hive
A data warehouse infrastructure developed by Facebook for data summarization, query, and analysis. It provides an SQL-like language (not SQL92 compliant): HiveQL.
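Because HiveQL is SQL-like, Hive can be queried over JDBC like any database. A sketch, assuming HiveServer2 on localhost:10000 and the hive-jdbc driver on the classpath (the table and columns are hypothetical):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:hive2://localhost:10000/default"; // hypothetical HiveServer2
        try (Connection conn = DriverManager.getConnection(url, "hiveuser", "");
             Statement stmt = conn.createStatement();
             // HiveQL aggregation, compiled by Hive into a distributed job
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM access_logs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + ": " + rs.getLong("hits"));
            }
        }
    }
}
```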
Apache Flume
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple, extensible data model that allows for online analytic applications.
Apache Sqoop
A system for bulk data transfer between HDFS and structured datastores such as relational databases (RDBMS). It complements Flume: where Flume ingests streaming log data, Sqoop moves bulk data between Hadoop and an RDBMS.
Apache Kafka
A distributed publish-subscribe system for processing large amounts of streaming data. Kafka is a message queue developed by LinkedIn that persists messages to disk in a very performant manner. Because messages are persisted, clients have the interesting ability to rewind a stream and consume the messages again. Another upside of the disk persistence is that bulk importing the data into HDFS for offline analysis can be done very quickly and efficiently. Storm, developed by BackType (later acquired by Twitter), is by contrast more about transforming a stream of messages into new streams.
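A minimal sketch of publishing messages with Kafka's Java producer API (the broker address and topic name are hypothetical; assumes the kafka-clients library):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Messages are appended to a partitioned log persisted on disk,
        // which is what lets consumers rewind and re-read the stream.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("web-logs", "host-1", "GET /index.html 200"));
        }
    }
}
```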
Apache Oozie
A workflow scheduler system for MapReduce jobs using DAGs (Directed Acyclic Graphs). The Oozie Coordinator can trigger jobs by time (frequency) and by data availability.
Apache ZooKeeper
ZooKeeper is a coordination service that gives you the tools you need to write correct distributed applications. It was developed at Yahoo! Research. Several Hadoop projects already use ZooKeeper to coordinate their clusters and provide highly available distributed services; perhaps the most famous of these are Apache HBase, Storm, and Kafka. ZooKeeper is an application library with two principal implementations of the APIs (Java and C) and a service component implemented in Java that runs on an ensemble of dedicated servers. ZooKeeper is for building distributed systems: it simplifies the development process, making it more agile and enabling more robust implementations. Back in 2006, Google published a paper on "Chubby", a distributed lock service that gained wide adoption within their data centers. ZooKeeper, not surprisingly, is a close clone of Chubby, designed to fulfill many of the same roles for HDFS and other Hadoop infrastructure.
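As a sketch of the coordination primitive ZooKeeper provides, here is a client creating an ephemeral znode, which disappears automatically if the client session dies (the connection string and znode path are hypothetical; assumes the org.apache.zookeeper client library):

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class EphemeralNode {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await(); // wait until the session is established

        // Ephemeral znodes vanish when the session ends -- the building block
        // for leader election, locks, and service discovery.
        String path = zk.create("/demo-lock", "node-1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        System.out.println("Created " + path);

        zk.close();
    }
}
```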