Answer to Hadoop Real time Questions


The first 10 answers to these questions are provided in this post.

1. What kind of issues are you facing while using the cluster?


Answer:
When we are not using a managed distribution from a vendor such as Cloudera, MapR, or Hortonworks,
we may face the following issues (which can also lead to performance degradation); read about performance tuning here


  1. Lack of configuration management
  2. Poor allocation of resources
  3. Lack of a dedicated network
  4. Lack of monitoring and metrics
  5. Ignorance of what log files contain what information
  6. Drastic measures to address simple problems
  7. Inadvertent introduction of single points of failure
  8. Over reliance on defaults
Cluster issues are mostly the Admin team's responsibility. Other tasks that need to be managed daily are:
  1. Managing space between application users
  2. DistCp - data backups and migration
  3. Managing services and adding nodes using Ambari
  4. Changing cluster capacity
  5. User/group permission management
  6. Alerts and notifications
  7. Script configuration
  8. Interface setup
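A few of the daily tasks above can be sketched with standard HDFS commands. This is a hedged sketch: the paths, quota size, user/group names, and NameNode hostnames are hypothetical examples, not from the original post.

```shell
# 1. Manage space between application users with an HDFS space quota
#    (path and quota size are hypothetical examples)
hdfs dfsadmin -setSpaceQuota 10t /user/app1

# 2. DistCp: back up / migrate data between clusters
#    (NameNode hostnames are hypothetical)
hadoop distcp hdfs://prod-nn:8020/data hdfs://backup-nn:8020/data

# 5. User/group permission management
#    (user, group, and path are hypothetical)
hdfs dfs -chown -R app1:analytics /user/app1
hdfs dfs -chmod -R 750 /user/app1
```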

2. Please mention the recommended hard disk and RAM size.

This is related to cluster management.
Development-side PC requirement: commodity hardware

Definition: Commodity hardware is inexpensive hardware that is not high-end or high-availability. Hadoop can be installed on any average commodity hardware; we don’t need supercomputers or high-end hardware to work with Hadoop.

We need to consider how much data you are processing daily and how long you are storing it on the local system.

Normally a PC with 8 GB of RAM and a 1 TB hard disk is preferred.

Cluster Management: Production Environment
The NameNode and Secondary NameNode should have high capacity.
We have to plan for storage, CPU, memory, and network bandwidth.
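The "daily data × retention" point above can be turned into a rough storage estimate. A minimal sketch in shell arithmetic, assuming example numbers (100 GB/day ingest, replication factor 3, one year of retention, 25% headroom); these numbers are illustrative assumptions, not recommendations:

```shell
# Rough HDFS capacity estimate. All input numbers below are
# hypothetical examples, not sizing recommendations.
daily_gb=100        # raw data ingested per day (GB)
replication=3       # HDFS default replication factor
retention_days=365  # how long data is kept

raw_gb=$(( daily_gb * replication * retention_days ))
total_gb=$(( raw_gb + raw_gb / 4 ))   # add ~25% headroom for temp data

echo "Raw storage needed: ${raw_gb} GB"
echo "With 25% headroom:  ${total_gb} GB (~$(( total_gb / 1024 )) TB)"
```

The same arithmetic scales directly if you change the assumptions (e.g. compression, a shorter retention window, or a different replication factor).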

3. Hadoop 1 or Hadoop 2 - which one are you using?

Answer with the version you are using.

Hadoop 2.x is preferred because of YARN, better resource allocation, High Availability, and HDFS Federation.
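A couple of standard commands for checking this on a cluster node (the NameNode service ID `nn1` is a hypothetical example):

```shell
# Print the installed Hadoop release
hadoop version

# On Hadoop 2.x with NameNode High Availability enabled, check
# which NameNode is currently active (service ID is hypothetical)
hdfs haadmin -getServiceState nn1
```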

4. Are you using any Hadoop distribution?
  • Cloudera offers CDH (Cloudera's Distribution including Apache Hadoop) and Cloudera Enterprise. 
  • Hortonworks (formed by Yahoo and Benchmark Capital) Hortonworks provides Hortonworks Data Platform (HDP). 
  • MapR – the MapR Distribution for Apache Hadoop
More detail can be fetched from here.
Considerations to take into account when choosing a Hadoop distribution can be checked here

5. Have you used Oozie and ZooKeeper in the cluster?

ZooKeeper: ZooKeeper is used for managing coordination-related data. It has a simple client-server architecture.

Oozie:

  • Used as a Hadoop job scheduler.
  • Also used for monitoring and alerting on Hadoop jobs.

Types of Oozie Jobs
  1. Periodical/Coordinator Job
  2. Oozie Hadoop Workflow
  3. Oozie Bundle
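The job types above are driven from the standard Oozie CLI. A hedged sketch, where the Oozie server URL and the job ID are hypothetical placeholders:

```shell
# Point the CLI at the Oozie server (hostname is hypothetical)
export OOZIE_URL=http://oozie-host:11000/oozie

# Submit and start a workflow described by a job.properties file
oozie job -run -config job.properties

# Check the status of a workflow (job ID is a hypothetical example)
oozie job -info 0000001-200101000000001-oozie-W
```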

6. If you have used it, what kind of jobs? Can you explain?

Mostly we use it to schedule jobs on the cluster instead of running manual scripts each time.
Alert mails are triggered when a threshold value is reached.

7. What troubleshooting issues have you faced?
Issues can be related to the cluster or to the logs, such as:
  1. IOException errors
  2. Cluster stuck in safe mode
  3. Host unreachable
  4. Change in host identification
  5. stderr messages in logs
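For issue 2 above (cluster stuck in safe mode), the usual first steps use the standard `hdfs dfsadmin` commands; a hedged sketch:

```shell
# Check whether the NameNode is in safe mode
hdfs dfsadmin -safemode get

# Inspect live datanodes and missing/under-replicated blocks first
hdfs dfsadmin -report

# Only after confirming it is safe, force the NameNode out of safe mode
hdfs dfsadmin -safemode leave
```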

Here is a simple example of an issue:
            Error while running an example more than once: the output directory already exists.
Solution:
Either run the job again with a different output directory:
bin/hadoop jar hadoop-*-examples.jar \
grep input output2 'dfs[a-z.]+'
                       or
remove the old output directory first:
bin/hadoop fs -rm -r output   (on Hadoop 1.x: bin/hadoop dfs -rmr output)

Improper syntax can also cause troubleshooting issues.


8. Cluster maintenance and backup

  1. Filesystem health check (recursive):
     sudo -u hdfs hadoop fsck /
  2. HDFS balancer utility:
     sudo -u hdfs hdfs balancer -threshold <threshold-value>
  3. Adding or decommissioning nodes in the cluster
  4. Handling node failures
  5. Database and metadata backups (individual database dumps)
  6. Purging older log files
  7. Planning for unplanned downtime
  8. Network issues (host unreachable)
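A minimal sketch of how the health check (item 1 above) might be scripted nightly: parse the summary line of `hdfs fsck /` and alert when the filesystem is not HEALTHY. The sketch runs against canned fsck output so it works without a cluster; on a real cluster the `report=` line would be `report=$(sudo -u hdfs hdfs fsck /)`:

```shell
# Canned fsck output so this sketch runs without a cluster; on a
# real cluster use:  report=$(sudo -u hdfs hdfs fsck /)
report="The filesystem under path '/' is HEALTHY"

# fsck ends its report with "... is HEALTHY" or "... is CORRUPT"
if echo "$report" | grep -q "is HEALTHY"; then
  status="OK"
else
  status="ALERT"    # hook a mail/paging command in here
fi
echo "fsck status: $status"
```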


9. Have you used any monitoring tools, like Ganglia?
Ganglia and Nagios.

Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and grids.
Ganglia is more concerned with gathering metrics and tracking them over time, while Nagios focuses on being an alerting mechanism.

10. What are the roles and responsibilities in your project?
Hadoop Profiles

  1. Developer
  2. Architect
  3. Admin
  4. Tester
  5. Hadoop Support
  6. Data Scientist


Job Responsibilities of a Hadoop Developer:
A Hadoop Developer has many responsibilities. 

  1. Loading data from an RDBMS into HDFS or HBase using Sqoop
  2. Bulk or batch data processing
  3. Writing components to integrate/automate the system in MapReduce
  4. Data analysis
  5. Hive and Pig queries
  6. Using the Streaming API
In short, we have to explain where you get the data, how it is loaded, how it is analyzed, and what the final business use case of the project is.
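The first responsibility above (loading data from an RDBMS into HDFS with Sqoop) typically looks like this; a hedged sketch where the JDBC URL, username, table, and target directory are all hypothetical:

```shell
# Import one table from MySQL into HDFS (all names are hypothetical)
sqoop import \
  --connect jdbc:mysql://db-host:3306/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /user/etl/orders \
  --num-mappers 4
```

`-P` prompts for the password instead of putting it on the command line, and `--num-mappers` controls how many parallel map tasks split the import.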