Hadoop Introduction | Hadoop Developer Self Learning

With reference to my earlier post on understanding Big Data, I am going to write a short and simple tutorial on Hadoop. In this post I am going to cover the following topics:

  • Hadoop history and concepts
  • Ecosystem
  • Distributions
  • High level architecture
Prerequisite: Understanding Big Data


Hadoop history and concepts

About Hadoop
Hadoop is an Apache top-level project: an open-source implementation of frameworks for reliable, scalable, distributed computing and data storage. It is a flexible and highly available architecture for large-scale computation and data processing on a network of commodity hardware.

Why Hadoop?
Hadoop was designed to answer the question: "How do we process big data at reasonable cost and in reasonable time?"

Who developed Hadoop?
Hadoop was created by Doug Cutting and Michael J. Cafarella.

A short timeline of Hadoop's development

The release and development history is summarized below:

Year   Month      Event
2003   October    Google File System paper released
2004   December   Google's "MapReduce: Simplified Data Processing on Large Clusters" paper released
2006   January    Hadoop subproject created with mailing lists, jira, and wiki
2006   January    Hadoop is born from Nutch (NUTCH-197)
2006   February   NDFS + MapReduce moved out of Apache Nutch to create Hadoop
2006   April      Hadoop 0.1.0 released
2006   April      Hadoop sorts 1.8 TB on 188 nodes in 47.9 hours
2007   October    First release of Hadoop that includes HBase
2007   October    Yahoo Labs creates Pig and donates it to the ASF
2008   May        Hadoop wins the terabyte sort benchmark (world record, sortbenchmark.org)
2008   October    Cloudera, a Hadoop distributor, is founded
2011   June       Rob Bearden and Eric Baldeschwieler spin Hortonworks out of Yahoo
2012   January    Hadoop community moves to separate MapReduce out and replace it with YARN
2012   November   Apache Hadoop 1.0 available
2014   February   Apache Spark becomes a top-level Apache project
2014   June       Apache Hadoop 2.4 available
2014   August     Apache Hadoop 2.5 available
2014   November   Apache Hadoop 2.6 available
2015   June       Apache Hadoop 2.7 available
2017   March      Apache Hadoop 2.8 available

Source: Wiki

Hadoop Versioning Explained

With each new release, Hadoop updates its version number, which follows the HADOOP X.Y.Z scheme: X is the major version, Y the minor version, and Z the point release. For example, release 2.7.3 is point release 3 of the 2.7 minor line. A detailed document about Hadoop versioning can be found in Hadoop Version Explanation.

What is Hadoop?
A brief note about it:
  • Apache Hadoop is an open-source software framework for storing and processing large data sets.
  • Hadoop provides a fault-tolerant distributed filesystem for storage and parallel computing for processing.
  • No high-end or expensive systems are required; it is built on commodity hardware and can even run on your own machine.
  • It can run on Linux, Mac OS X, Windows, and Solaris.
  • It is a fault-tolerant system: job execution continues even if nodes fail.
  • It is a highly reliable and efficient storage system.
Setup of the Hadoop ecosystem on Ubuntu is covered in the Installation part.

Main features of Hadoop
  • Distributed storage
  • Fault tolerance
  • Horizontal scalability
  • Open source
  • Commodity hardware
  • Parallel processing

Ecosystem

Distributions and the high-level architecture will be covered in my next post, and then we will get our hands dirty with the Hadoop framework.


A detailed description of the main Hadoop ecosystem projects is given below (source: GitHub).

Apache HDFS
The Hadoop Distributed File System (HDFS) offers a way to store large files across multiple machines. Hadoop and HDFS were derived from the Google File System (GFS) paper. Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. With ZooKeeper, the HDFS High Availability feature addresses this problem by providing the option of running two redundant NameNodes in the same cluster in an active/passive configuration with a hot standby.
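
To make this concrete, here is a minimal sketch of writing and reading a file through the HDFS Java API. The cluster URI (hdfs://namenode:8020) and the file path are placeholders I have invented for illustration; in a real deployment fs.defaultFS comes from core-site.xml.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder cluster URI
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/user/demo/hello.txt");
            try (FSDataOutputStream out = fs.create(file)) { // write a file into the cluster
                out.writeUTF("hello hdfs");
            }
            try (FSDataInputStream in = fs.open(file)) {     // read it back
                System.out.println(in.readUTF());
            }
            fs.close();
        }
    }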
Apache MapReduce
MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. Apache MapReduce was derived from Google's MapReduce: Simplified Data Processing on Large Clusters paper. The current Apache MapReduce version is built over the Apache YARN framework. YARN stands for "Yet-Another-Resource-Negotiator". It is a new framework that facilitates writing arbitrary distributed processing frameworks and applications. YARN's execution model is more generic than the earlier MapReduce implementation. YARN can run applications that do not follow the MapReduce model, unlike the original Apache Hadoop MapReduce (also called MR1). Hadoop YARN is an attempt to take Apache Hadoop beyond MapReduce for data processing.
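
The canonical illustration of the model is word count: the map phase emits a (word, 1) pair per token, and the reduce phase sums the counts for each word. A condensed sketch against the standard org.apache.hadoop.mapreduce API, with input and output paths taken from the command line:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            public void map(Object key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    word.set(token);
                    ctx.write(word, ONE);                 // emit (word, 1)
                }
            }
        }
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));     // emit (word, total)
            }
        }
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }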
Apache Pig
Pig provides an engine for executing data flows in parallel on Hadoop. It includes a language, Pig Latin, for expressing these data flows. Pig Latin includes operators for many of the traditional data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data. Pig runs on Hadoop: it makes use of both the Hadoop Distributed File System, HDFS, and Hadoop's processing system, MapReduce. Pig uses MapReduce to execute all of its data processing; it compiles the Pig Latin scripts that users write into a series of one or more MapReduce jobs that it then executes. Pig Latin looks different from many of the programming languages you have seen; there are no if statements or for loops in Pig Latin. This is because traditional procedural and object-oriented programming languages describe control flow, and data flow is a side effect of the program. Pig Latin instead focuses on data flow.
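
As a sketch of what such a data flow looks like, the snippet below drives a three-step Pig Latin script from Java through PigServer in local mode. The input file and its field layout are invented for illustration; the final store() is what triggers compilation into MapReduce job(s).

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigExample {
        public static void main(String[] args) throws Exception {
            PigServer pig = new PigServer(ExecType.LOCAL); // run Pig locally, no cluster needed
            // Each statement is one step in the data flow: load, group, count.
            pig.registerQuery("logs = LOAD 'access_log.txt' AS (user:chararray, url:chararray);");
            pig.registerQuery("by_url = GROUP logs BY url;");
            pig.registerQuery("counts = FOREACH by_url GENERATE group AS url, COUNT(logs) AS hits;");
            pig.store("counts", "url_counts"); // compiles the flow into MapReduce job(s) and runs it
        }
    }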
Apache Spark
Spark is a data analytics cluster computing framework originally developed in the AMPLab at UC Berkeley. It fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS). However, Spark provides an easier-to-use alternative to Hadoop MapReduce and offers performance up to 10 times faster than previous-generation systems like Hadoop MapReduce for certain applications. Spark is a framework for writing fast, distributed programs. It solves similar problems to Hadoop MapReduce but with a fast in-memory approach and a clean functional-style API. With its ability to integrate with Hadoop and its built-in tools for interactive query analysis (Shark), large-scale graph processing and analysis (Bagel), and real-time analysis (Spark Streaming), it can be used interactively to quickly process and query big data sets. To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets. Spark is also the engine behind Shark, a fully Apache Hive-compatible data warehousing system that can run 100x faster than Hive.
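
For comparison with MapReduce, here is a minimal word count written against Spark's Java API; the local[*] master setting and the paths are placeholders. Note how the whole job is a chain of in-memory transformations rather than separate mapper and reducer classes.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkWordCount {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("word count").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaRDD<String> lines = sc.textFile("input.txt");
                JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator()) // split into words
                    .mapToPair(word -> new Tuple2<>(word, 1))                      // (word, 1)
                    .reduceByKey(Integer::sum);                                    // sum per word
                counts.saveAsTextFile("counts_out");
            }
        }
    }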
Apache HBase
Inspired by Google BigTable, HBase is a non-relational distributed database offering random, real-time read/write operations on column-oriented, very large tables (BDDB: Big Data Data Base). It is the Hadoop database, commonly used to back the output of Hadoop MapReduce jobs with Apache HBase tables.
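
A minimal sketch of those random, real-time reads and writes through the HBase Java client; the users table, the info column family, and the row key are invented for illustration (the table and family must already exist).

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("users"))) {
                // Write one cell: row "user42", family "info", qualifier "name".
                Put put = new Put(Bytes.toBytes("user42"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
                table.put(put);
                // Random real-time read of the same row.
                Result row = table.get(new Get(Bytes.toBytes("user42")));
                System.out.println(Bytes.toString(
                    row.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
            }
        }
    }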
Apache Hive
Hive is a data warehouse infrastructure developed by Facebook for data summarization, query, and analysis. It provides an SQL-like language (not SQL-92 compliant): HiveQL.
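
One common way to run HiveQL from an application is through Hive's JDBC driver against HiveServer2. A minimal sketch, assuming a hypothetical pageviews table; the connection URL and credentials are placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveExample {
        public static void main(String[] args) throws Exception {
            // Requires the hive-jdbc driver on the classpath.
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "hive", "");
                 Statement stmt = conn.createStatement();
                 // HiveQL reads like SQL but is compiled into distributed jobs.
                 ResultSet rs = stmt.executeQuery(
                     "SELECT url, COUNT(*) AS hits FROM pageviews GROUP BY url")) {
                while (rs.next()) {
                    System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
                }
            }
        }
    }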
Apache Flume
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple, extensible data model that allows for online analytic applications.
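
A Flume agent is wired together declaratively in a properties file that names its sources, channels, and sinks. A minimal sketch of an agent that tails a log file into HDFS; the agent and component names, the file path, and the HDFS URL are all invented for illustration.

    # Name the components of agent "agent1"
    agent1.sources = src1
    agent1.channels = ch1
    agent1.sinks = sink1

    # Source: tail a log file (placeholder path)
    agent1.sources.src1.type = exec
    agent1.sources.src1.command = tail -F /var/log/app.log
    agent1.sources.src1.channels = ch1

    # Channel: buffer events in memory
    agent1.channels.ch1.type = memory
    agent1.channels.ch1.capacity = 1000

    # Sink: write events into HDFS (placeholder path)
    agent1.sinks.sink1.type = hdfs
    agent1.sinks.sink1.hdfs.path = hdfs://namenode:8020/flume/logs
    agent1.sinks.sink1.channel = ch1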
Apache Sqoop
Sqoop is a system for bulk data transfer between HDFS and structured datastores such as relational databases (RDBMS). It is similar in spirit to Flume, but moves data between an RDBMS and HDFS rather than streaming log data.
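
A typical Sqoop import, sketched with placeholder connection details: the JDBC URL, username, table, and target directory below are all hypothetical. Sqoop turns this command into parallel map tasks (here four) that copy the table into HDFS.

    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username etl_user \
      --table orders \
      --target-dir /user/hadoop/orders \
      --num-mappers 4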
Apache Kafka
Kafka is a distributed publish-subscribe system for processing large amounts of streaming data. Developed at LinkedIn, Kafka is a message queue that persists messages to disk in a very performant manner. Because messages are persisted, clients have the interesting ability to rewind a stream and consume the messages again. Another upside of the disk persistence is that bulk importing the data into HDFS for offline analysis can be done very quickly and efficiently. (Storm, developed by BackType and later acquired by Twitter, is more about transforming a stream of messages into new streams.)
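
A minimal sketch of publishing messages with the Kafka Java producer client; the broker address and the clickstream topic are placeholders. Every message sent this way is appended to a partitioned, disk-persisted log that consumers can replay later.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class KafkaExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // placeholder broker
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                // Append one (key, value) message to the topic's persistent log.
                producer.send(new ProducerRecord<>("clickstream", "user42", "/index.html"));
            }
        }
    }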
Apache Oozie
Oozie is a workflow scheduler system for MapReduce jobs using DAGs (Directed Acyclic Graphs). The Oozie Coordinator can trigger jobs by time (frequency) and by data availability.
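
Oozie workflows are described in XML. Below is a bare-bones sketch of a workflow.xml with a single MapReduce action; the workflow name, the ${jobTracker}/${nameNode} parameters, and the one illustrative property are placeholders rather than a complete job definition.

    <workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.5">
        <start to="wordcount"/>
        <action name="wordcount">
            <map-reduce>
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <configuration>
                    <!-- illustrative placeholder; a real job also configures
                         the reducer, input, and output here -->
                    <property>
                        <name>mapred.mapper.class</name>
                        <value>com.example.MyMapper</value>
                    </property>
                </configuration>
            </map-reduce>
            <ok to="end"/>
            <error to="fail"/>
        </action>
        <kill name="fail">
            <message>Workflow failed at ${wf:lastErrorNode()}</message>
        </kill>
        <end name="end"/>
    </workflow-app>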
Apache Zookeeper
It’s a coordination service that gives you the tools you need to write correct distributed applications. ZooKeeper was developed at Yahoo! Research. Several Hadoop projects are already using ZooKeeper to coordinate the cluster and provide highly-available distributed services. Perhaps most famous of those are Apache HBase, Storm, Kafka. ZooKeeper is an application library with two principal implementations of the APIs—Java and C—and a service component implemented in Java that runs on an ensemble of dedicated servers. Zookeeper is for building distributed systems, simplifies the development process, making it more agile and enabling more robust implementations. Back in 2006, Google published a paper on "Chubby", a distributed lock service which gained wide adoption within their data centers. Zookeeper, not surprisingly, is a close clone of Chubby designed to fulfill many of the same roles for HDFS and other Hadoop infrastructure.