What is Hadoop? Top Hadoop distributions

What is Hadoop?

  • Apache project for storing and processing large data sets
  • Open-source implementation of the designs described in Google's GFS and MapReduce papers
  • Components:
    • HDFS (Hadoop Distributed File System)
    • YARN (Yet Another Resource Negotiator)
    • Data processing models (MapReduce, Impala, Tez, etc.)
    • Ecosystem tools (Pig, Hive, Sqoop, HBase, etc.)
  • Written in Java
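
To make the MapReduce processing model listed above more concrete, here is a minimal in-memory sketch of its three phases (map, shuffle, reduce) applied to a word count. It is written in plain Python for brevity and does not use any Hadoop API. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration. A real Hadoop job would run the same logic distributed across a cluster, with input read from HDFS and resources allocated by YARN.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key (done by the framework in a real job)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce sequentially over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = [
        "Hadoop stores data in HDFS",
        "YARN schedules work and HDFS stores data",
    ]
    print(word_count(sample))
```

The mapper and reducer see only one line or one key at a time, so each call is independent of the others; that independence is what allows a framework to spread the work over many machines.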


Data storage evolution

  • 1956 - HDD (Hard Disk Drive), now up to 6 TB
  • 1983 - SSD (Solid State Drive), now up to 16 TB
  • 1984 - NFS (Network File System), first NAS (Network Attached Storage) implementation
  • 1987 - RAID (Redundant Array of Independent Disks), now up to ~100 disks
  • 1993 - Disk Arrays, now up to ~200 disks
  • 1994 - Fibre-channel, first SAN (Storage Area Network) implementation
  • 2003 - GFS (Google File System), first Big Data implementation

Top Hadoop distributions

  • Apache Hadoop
  • CDH (Cloudera Distribution including Apache Hadoop)
  • HDP (Hortonworks Data Platform)
  • MapR M3, M5 and M7
  • Amazon Elastic MapReduce
  • BigInsights Enterprise Edition
  • Intel Distribution for Apache Hadoop