Hadoop Introduction | Hadoop Developer Self Learning
With reference to my earlier post, I am going to write a short and simple tutorial on Hadoop.
In this post I am going to cover the following topics:
- Hadoop history and concepts
- Ecosystem
- Distributions
- High level architecture
Prerequisite: Understanding Big Data
Hadoop history and concepts
About Hadoop
- Apache top-level project: an open-source implementation of frameworks for reliable, scalable, distributed computing and data storage.
- A flexible and highly available architecture for large-scale computation and data processing on a network of commodity hardware.
Why Hadoop?
- Designed to answer the question: “How can we process big data with reasonable cost and time?”
Who developed Hadoop?
- Doug Cutting
- Michael J. Cafarella
Short timeline of Hadoop development (2003-2017)
The history of releases and the development process is summarized below.
Year | Month | Event
---|---|---
2003 | October | Google File System paper released
2004 | December | "MapReduce: Simplified Data Processing on Large Clusters" paper released
2006 | January | Hadoop subproject created with mailing lists, jira, and wiki
2006 | January | Hadoop is born from Nutch (issue NUTCH-197)
2006 | February | NDFS + MapReduce moved out of Apache Nutch to create Hadoop
2006 | April | Hadoop 0.1.0 released
2006 | April | Hadoop sorts 1.8 TB on 188 nodes in 47.9 hours
2007 | October | First release of Hadoop that includes HBase
2007 | October | Yahoo Labs creates Pig and donates it to the ASF
2008 | May | Hadoop wins the TeraByte Sort benchmark (world record, sortbenchmark.org)
2008 | October | Cloudera, a Hadoop distributor, is founded
2011 | June | Rob Bearden and Eric Baldeschwieler spin Hortonworks out of Yahoo
2012 | January | Hadoop community moves to separate MapReduce out and replace it with YARN
2012 | November | Apache Hadoop 1.0 available
2014 | February | Apache Spark becomes a top-level Apache project
2014 | June | Apache Hadoop 2.4 available
2014 | August | Apache Hadoop 2.5 available
2014 | November | Apache Hadoop 2.6 available
2015 | June | Apache Hadoop 2.7 available
2017 | March | Apache Hadoop 2.8 available
Source: Wiki
Hadoop versioning explained
With each new release, Hadoop updates its version number, which follows the pattern X.Y.Z. For example, in Hadoop 2.7.3, 2 is the major version, 7 the minor version, and 3 the maintenance release.
A detailed document about Hadoop versioning is available in Hadoop Version Explanation.
What is Hadoop?
A brief note about it:
- Apache Hadoop is an open-source software framework for storing and processing large data sets.
- Hadoop provides a fault-tolerant distributed filesystem for storage and parallel computing for processing.
- No high-end or expensive systems are required; it is built on commodity hardware and can even run on your own machine.
- It runs on Linux, Mac OS X, Windows, and Solaris.
- It is fault tolerant: job execution continues even when nodes fail.
- It offers a highly reliable and efficient storage system.
Setup of the Hadoop ecosystem on Ubuntu is covered in the Installation part.
Main features of Hadoop
- Distributed storage
- Fault tolerance
- Horizontal scalability
- Open source
- Commodity hardware
- Parallel processing
Ecosystem
- Distributions
- High-level architecture (this topic will be covered in my next post)
- Get your hands dirty with the Hadoop framework
Comment for updates or changes.
A detailed overview of the Hadoop ecosystem components is given below. (Source: GitHub)
Apache HDFS
The Hadoop Distributed File System (HDFS) offers a way to store large files across multiple machines. Hadoop and HDFS were derived from the Google File System (GFS) paper. Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. With ZooKeeper, the HDFS High Availability feature addresses this problem by providing the option of running two redundant NameNodes in the same cluster in an active/passive configuration with a hot standby.
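As a quick illustration, here is a minimal sketch of writing and reading a file through the HDFS Java API. The path and file name are hypothetical, and it assumes a running HDFS plus the hadoop-client library on the classpath:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt"); // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello, HDFS");
        }
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}
```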
Apache MapReduce
MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. Apache MapReduce was derived from Google's "MapReduce: Simplified Data Processing on Large Clusters" paper. The current Apache MapReduce version is built over the Apache YARN framework. YARN stands for "Yet Another Resource Negotiator". It is a newer framework that facilitates writing arbitrary distributed processing frameworks and applications. YARN's execution model is more generic than the earlier MapReduce implementation: YARN can run applications that do not follow the MapReduce model, unlike the original Apache Hadoop MapReduce (also called MR1). Hadoop YARN is an attempt to take Apache Hadoop beyond MapReduce for data processing.
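To make the programming model concrete, here is the classic word-count job written against the org.apache.hadoop.mapreduce Java API (input and output directories are passed as arguments):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: emit (word, 1) for every token in the input split
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, it runs with something like `hadoop jar wordcount.jar WordCount <input dir> <output dir>`.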
Apache Pig
Pig provides an engine for executing data flows in parallel on Hadoop. It includes a language, Pig Latin, for expressing these data flows. Pig Latin includes operators for many of the traditional data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data. Pig runs on Hadoop, making use of both the Hadoop Distributed File System (HDFS) and Hadoop's processing system, MapReduce. Pig uses MapReduce to execute all of its data processing: it compiles the Pig Latin scripts that users write into a series of one or more MapReduce jobs that it then executes. Pig Latin looks different from many of the programming languages you have seen; there are no if statements or for loops in Pig Latin. This is because traditional procedural and object-oriented programming languages describe control flow, with data flow as a side effect of the program. Pig Latin instead focuses on data flow.
Apache Spark
A data analytics cluster computing framework originally developed in the AMPLab at UC Berkeley. Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS). However, Spark provides an easier-to-use alternative to Hadoop MapReduce and offers performance up to 10 times faster than previous-generation systems like Hadoop MapReduce for certain applications. Spark is a framework for writing fast, distributed programs. It solves similar problems to Hadoop MapReduce but with a fast in-memory approach and a clean functional-style API. With its ability to integrate with Hadoop and its built-in tools for interactive query analysis (Shark), large-scale graph processing and analysis (Bagel), and real-time analysis (Spark Streaming), it can be used interactively to quickly process and query big data sets. To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big data sets. Spark is also the engine behind Shark, a fully Apache Hive-compatible data warehousing system that can run 100x faster than Hive.
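For a feel of that functional style, here is a minimal word count using Spark's Java RDD API (written against the Spark 2.x API; the input path is hypothetical, and `local[*]` is used only so the sketch runs without a cluster):

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // local[*] runs Spark in-process; on a cluster you would use spark-submit
        SparkConf conf = new SparkConf().setAppName("SparkWordCount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            sc.textFile("hdfs:///user/demo/input.txt")                 // hypothetical path
              .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
              .mapToPair(word -> new Tuple2<>(word, 1))
              .reduceByKey(Integer::sum)                               // in-memory aggregation
              .collect()
              .forEach(pair -> System.out.println(pair._1() + ": " + pair._2()));
        }
    }
}
```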
Apache HBase
Inspired by Google BigTable. A non-relational distributed database supporting random, real-time read/write operations on very large column-oriented tables (BDDB: Big Data DataBase). It is the Hadoop database, commonly used as the backing store for MapReduce job outputs: Hadoop MapReduce jobs can read from and write to Apache HBase tables.
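A minimal sketch of those random read/write operations with the HBase Java client, assuming a table named "users" with a column family "info" already exists (both names are hypothetical):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseHello {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Write one cell: row "row1", column info:name
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // Random, real-time read of the same cell
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```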
Apache Hive
A data warehouse infrastructure developed by Facebook for data summarization, query, and analysis. It provides an SQL-like language (not SQL92 compliant): HiveQL.
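Because HiveQL is SQL-like, Hive can be queried over JDBC like any database. A sketch, assuming HiveServer2 on localhost:10000 and the hive-jdbc driver on the classpath (the table and columns are hypothetical):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:hive2://localhost:10000/default"; // hypothetical HiveServer2
        try (Connection conn = DriverManager.getConnection(url, "hiveuser", "");
             Statement stmt = conn.createStatement();
             // HiveQL aggregation, compiled by Hive into a distributed job
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM access_logs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + ": " + rs.getLong("hits"));
            }
        }
    }
}
```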
Apache Flume
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple, extensible data model that allows for online analytic applications.
Apache Sqoop
A system for bulk data transfer between HDFS and structured datastores such as relational databases (RDBMS). It complements Flume: where Flume ingests streaming log data, Sqoop moves bulk data between Hadoop and an RDBMS.
Apache Kafka
A distributed publish-subscribe system for processing large amounts of streaming data. Kafka is a message queue developed by LinkedIn that persists messages to disk in a very performant manner. Because messages are persisted, clients have the interesting ability to rewind a stream and consume the messages again. Another upside of the disk persistence is that bulk importing the data into HDFS for offline analysis can be done very quickly and efficiently. Storm, developed by BackType (later acquired by Twitter), is by contrast more about transforming a stream of messages into new streams.
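A minimal sketch of publishing messages with Kafka's Java producer API (the broker address and topic name are hypothetical; assumes the kafka-clients library):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Messages are appended to a partitioned log persisted on disk,
        // which is what lets consumers rewind and re-read the stream.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("web-logs", "host-1", "GET /index.html 200"));
        }
    }
}
```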
Apache Oozie
A workflow scheduler system for MapReduce jobs using DAGs (Directed Acyclic Graphs). The Oozie Coordinator can trigger jobs by time (frequency) and by data availability.
Apache ZooKeeper
ZooKeeper is a coordination service that gives you the tools you need to write correct distributed applications. It was developed at Yahoo! Research. Several Hadoop projects already use ZooKeeper to coordinate their clusters and provide highly available distributed services; perhaps the most famous of these are Apache HBase, Storm, and Kafka. ZooKeeper is an application library with two principal implementations of the APIs (Java and C) and a service component implemented in Java that runs on an ensemble of dedicated servers. ZooKeeper is for building distributed systems: it simplifies the development process, making it more agile and enabling more robust implementations. Back in 2006, Google published a paper on "Chubby", a distributed lock service that gained wide adoption within their data centers. ZooKeeper, not surprisingly, is a close clone of Chubby, designed to fulfill many of the same roles for HDFS and other Hadoop infrastructure.
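As a sketch of the coordination primitive ZooKeeper provides, here is a client creating an ephemeral znode, which disappears automatically if the client session dies (the connection string and znode path are hypothetical; assumes the org.apache.zookeeper client library):

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class EphemeralNode {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await(); // wait until the session is established

        // Ephemeral znodes vanish when the session ends -- the building block
        // for leader election, locks, and service discovery.
        String path = zk.create("/demo-lock", "node-1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        System.out.println("Created " + path);

        zk.close();
    }
}
```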