Understanding Big Data | Hadoop Developer Self Learning


This post follows up on my earlier post, Hadoop Developer Self Learning Outline.
I am going to write a short and simple tutorial on it.

Free Hadoop Tutorial for you

Explore big data systems and get your hands dirty.
This post covers the topics below.

Understanding Big Data
  • 3V (Volume-Variety-Velocity) characteristics
  • Structured and Unstructured Data
  • Application and use cases of Big Data
  • Limitations of traditional large Scale systems
A. 3V (Volume-Variety-Velocity) characteristics

These three Vs are known as the characteristics of 'Big Data'.

1. Volume – The name 'Big Data' itself refers to data of enormous size. The size of the data plays a crucial role in determining how much value can be extracted from it.
  • Data volume is increasing exponentially

2. Variety – The next aspect of 'Big Data' is its variety.
Variety refers to the heterogeneous sources and nature of data, both structured and unstructured.
Data comes in various formats, types, and structures: text, numerical, images, audio, video, sequences, time series, social media data, multi-dimensional arrays, etc.
  • A single application can generate or collect many types of data
3. Velocity – The term 'velocity' refers to the speed at which data is generated. How fast the data is generated and processed to meet demand determines the real potential of the data.
  • Data is being generated fast and needs to be processed fast
A fourth 'V' is often added to the list. IBM's version uses Veracity (how trustworthy the data is); another common choice is Variability:

4. Variability – This refers to the inconsistency that the data can show at times, which hampers the ability to handle and manage it effectively.
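
To make the Volume point concrete, here is a small sketch of what exponential growth means in practice. The starting size (100 TB) and the growth rate (doubling every year) are made-up numbers for illustration only, not measurements.

```python
def projected_volume(initial_tb: float, annual_growth: float, years: int) -> float:
    """Return the data volume (in TB) after `years` of compound growth.

    annual_growth = 1.0 means the volume doubles every year.
    """
    return initial_tb * (1 + annual_growth) ** years


# Hypothetical numbers: 100 TB today, doubling every year.
for year in range(6):
    print(f"year {year}: {projected_volume(100, 1.0, year):.0f} TB")
```

With these assumed numbers, storage that is comfortable today is 32 times too small after five years, which is why volume alone forces a different kind of system.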

[Figure: the 3 Vs of Big Data]

B. Structured and Unstructured Data

Structured data
  • Information stored in a database
  • Strict format
Limitation
  • Not all data collected is structured
Semi-structured data
  • Data may have some structure, but not all of the information collected has an identical structure
  • Some attributes may exist in some of the entities of a particular type but not in others
Unstructured data
  • “Unstructured data refers to information that either does not have a pre-defined data model and/or is not organized in a predefined manner.”
It is estimated that only about 20% of data is structured; the rest is unstructured or semi-structured.
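
The three categories above can be sketched with tiny samples. The records, field names, and the helper `has_uniform_schema` are all hypothetical and exist only for this illustration.

```python
import csv
import io
import json

# Structured: every row follows the same strict format (id, name, city).
structured_csv = "id,name,city\n1,Asha,Pune\n2,Ravi,Delhi\n"
rows = list(csv.DictReader(io.StringIO(structured_csv)))

# Semi-structured: records share some attributes, but not all of them.
semi_structured_json = """
[
  {"id": 1, "name": "Asha", "email": "asha@example.com"},
  {"id": 2, "name": "Ravi", "phones": ["555-0100"]}
]
"""
records = json.loads(semi_structured_json)

# Unstructured: free text with no predefined data model.
unstructured_text = "Customer called to say the delivery was late but the product works fine."


def has_uniform_schema(items):
    """Return True when every record has exactly the same set of attributes."""
    key_sets = [set(item.keys()) for item in items]
    return all(keys == key_sets[0] for keys in key_sets)


print(has_uniform_schema(rows))     # structured sample
print(has_uniform_schema(records))  # semi-structured sample
print(len(unstructured_text.split()), "words, but no fields to query")
```

The structured rows all have the same columns, the JSON records do not, and the free text has no fields at all, so it has to be parsed or analysed before it can be queried.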

[Figure: Structured vs Unstructured data]


C. Application and use cases of Big Data
Below are the major sectors where big data is widely used:
  • Public sector services
  • Healthcare
  • Education services
  • Insurance services
  • Manufacturing and natural resources
  • Transportation services
  • Banking and fraud detection
D. Limitations of traditional large Scale systems

  • Traditional large-scale computing involved complex processing on relatively small amounts of data
  • Exponential growth in data drove development of distributed computing 
  • Distributed computing is difficult! 
  • Hadoop addresses distributed computing challenges 
  1. Bring the computation to the data 
  2. Fault tolerance 
  3. Scalability 
  4. Hadoop hides the ‘plumbing’ so developers can focus on the data
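
The idea of bringing the computation to the data can be sketched in plain Python. This is not the Hadoop API: the node names, the sample lines, and the functions `map_on_node` and `reduce_counts` are hypothetical, and a real cluster would run the map step on separate machines.

```python
from collections import Counter

# Hypothetical cluster: each "node" holds its own block of the data set.
node_blocks = {
    "node1": ["big data is big", "hadoop stores data"],
    "node2": ["hadoop processes data", "data is everywhere"],
}


def map_on_node(lines):
    """Runs where the data lives: count the words in this node's block only."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def reduce_counts(partials):
    """Merge the small per-node results into one final answer."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Only the compact partial counts travel over the network, not the raw lines.
partials = [map_on_node(lines) for lines in node_blocks.values()]
word_counts = reduce_counts(partials)
print(word_counts.most_common(3))
```

Because each node only reports a summary, adding more nodes spreads the work (scalability), and if one node fails its block can be recounted on a node holding another copy (fault tolerance), assuming the blocks are replicated.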

Please comment if you have any doubts or if any correction is required.