Answer to Hadoop Real time Questions


The first 10 answers to these questions are provided in this post.

1. What kind of issues are you facing while using the cluster?


Answer:
When we are not using a managed distribution from a vendor such as Cloudera, MapR, or Hortonworks,
we may face the following issues (which can also lead to performance degradation); read about performance tuning here


  1. Lack of configuration management
  2. Poor allocation of resources
  3. Lack of a dedicated network
  4. Lack of monitoring and metrics
  5. Ignorance of what log files contain what information
  6. Drastic measures to address simple problems
  7. Inadvertent introduction of single points of failure
  8. Over reliance on defaults
Cluster issues are mostly the Admin team's responsibility. Other tasks that need to be managed daily are:
  1. Managing space between application users
  2. DistCp - data backups and migration
  3. Managing services and adding nodes using Ambari
  4. Changing cluster capacity
  5. User/group permission management
  6. Alerts and notifications
  7. Script configuration
  8. Interface setup
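A few of the daily tasks above can be sketched with standard HDFS commands. This is a hedged sketch: the paths, quota size, user/group names, and NameNode hostnames are hypothetical examples, not from the original post.

```shell
# 1. Manage space between application users with an HDFS space quota
#    (path and quota size are hypothetical examples)
hdfs dfsadmin -setSpaceQuota 10t /user/app1

# 2. DistCp: back up / migrate data between clusters
#    (NameNode hostnames are hypothetical)
hadoop distcp hdfs://prod-nn:8020/data hdfs://backup-nn:8020/data

# 5. User/group permission management
#    (user, group, and path are hypothetical)
hdfs dfs -chown -R app1:analytics /user/app1
hdfs dfs -chmod -R 750 /user/app1
```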

2. Please mention the recommended hard disk and RAM size.

This is related to cluster management.
Development-side PC requirement: commodity hardware

Definition: Commodity hardware is inexpensive hardware that is not high-end or high-availability. Hadoop can be installed on any average commodity hardware; we don’t need supercomputers or high-end hardware to work with Hadoop.

We need to consider how much data you are processing daily and how long you are storing it on the local system.

Normally a PC with 8 GB of RAM and a 1 TB hard disk is preferred.

Cluster Management: Production Environment
The NameNode and Secondary NameNode should have high capacity.
We have to plan for storage, CPU, memory, and network bandwidth.
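The "daily data × retention" point above can be turned into a rough storage estimate. A minimal sketch in shell arithmetic, assuming example numbers (100 GB/day ingest, replication factor 3, one year of retention, 25% headroom); these numbers are illustrative assumptions, not recommendations:

```shell
# Rough HDFS capacity estimate. All input numbers below are
# hypothetical examples, not sizing recommendations.
daily_gb=100        # raw data ingested per day (GB)
replication=3       # HDFS default replication factor
retention_days=365  # how long data is kept

raw_gb=$(( daily_gb * replication * retention_days ))
total_gb=$(( raw_gb + raw_gb / 4 ))   # add ~25% headroom for temp data

echo "Raw storage needed: ${raw_gb} GB"
echo "With 25% headroom:  ${total_gb} GB (~$(( total_gb / 1024 )) TB)"
```

The same arithmetic scales directly if you change the assumptions (e.g. compression, a shorter retention window, or a different replication factor).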

3. Hadoop 1 or Hadoop 2 - which one are you using?

Answer with the version you are using.

Hadoop 2.x is preferred because of YARN, better resource allocation, High Availability, and HDFS Federation.
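A couple of standard commands for checking this on a cluster node (the NameNode service ID `nn1` is a hypothetical example):

```shell
# Print the installed Hadoop release
hadoop version

# On Hadoop 2.x with NameNode High Availability enabled, check
# which NameNode is currently active (service ID is hypothetical)
hdfs haadmin -getServiceState nn1
```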

4. Are you using any Hadoop distribution?
  • Cloudera offers CDH (Cloudera's Distribution including Apache Hadoop) and Cloudera Enterprise. 
  • Hortonworks (formed by Yahoo and Benchmark Capital) Hortonworks provides Hortonworks Data Platform (HDP). 
  • MapR – the MapR Distribution for Apache Hadoop
More detail can be fetched from here.
Considerations to take into account when choosing a Hadoop distribution can be checked here

5. Have you used Oozie and ZooKeeper in the cluster?

ZooKeeper: ZooKeeper is used for managing coordination-related data. It has a simple client-server architecture.

Oozie:

  • Used as a Hadoop job scheduler.
  • Also used for monitoring and alerting on Hadoop jobs.

Types of Oozie Jobs
  1. Periodical/Coordinator Job
  2. Oozie Hadoop Workflow
  3. Oozie Bundle
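The job types above are driven from the standard Oozie CLI. A hedged sketch, where the Oozie server URL and the job ID are hypothetical placeholders:

```shell
# Point the CLI at the Oozie server (hostname is hypothetical)
export OOZIE_URL=http://oozie-host:11000/oozie

# Submit and start a workflow described by a job.properties file
oozie job -run -config job.properties

# Check the status of a workflow (job ID is a hypothetical example)
oozie job -info 0000001-200101000000001-oozie-W
```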

6. If you have used it, what kind of jobs? Can you explain?

Mostly we use it to schedule jobs on the cluster instead of running manual scripts each time.
Alert mails are triggered when a threshold value is reached.

7. What troubleshooting issues have you faced?
Issues can be related to the cluster or to the logs, such as:
  1. IOException errors
  2. Cluster stuck in safe mode
  3. Host unreachable
  4. Change in host identification
  5. stderr messages in logs
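For issue 2 above (cluster stuck in safe mode), the usual first steps use the standard `hdfs dfsadmin` commands; a hedged sketch:

```shell
# Check whether the NameNode is in safe mode
hdfs dfsadmin -safemode get

# Inspect live datanodes and missing/under-replicated blocks first
hdfs dfsadmin -report

# Only after confirming it is safe, force the NameNode out of safe mode
hdfs dfsadmin -safemode leave
```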

Here is a simple example of an issue:
            Error while running an example more than once: the output directory already exists.
Solution:
Either run the job again with a different output directory:
bin/hadoop jar hadoop-*-examples.jar \
grep input output2 'dfs[a-z.]+'
                       or
remove the old output directory first:
bin/hadoop fs -rm -r output   (on Hadoop 1.x: bin/hadoop dfs -rmr output)

Improper syntax can also cause troubleshooting issues.


8. Cluster maintenance and backup

  1. Filesystem health check (recursive):
     sudo -u hdfs hadoop fsck /
  2. HDFS balancer utility:
     sudo -u hdfs hdfs balancer -threshold <threshold-value>
  3. Adding or decommissioning nodes in the cluster
  4. Handling node failures
  5. Database and metadata backups (individual database dumps)
  6. Purging older log files
  7. Planning for unplanned downtime
  8. Network issues (host unreachable)
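A minimal sketch of how the health check (item 1 above) might be scripted nightly: parse the summary line of `hdfs fsck /` and alert when the filesystem is not HEALTHY. The sketch runs against canned fsck output so it works without a cluster; on a real cluster the `report=` line would be `report=$(sudo -u hdfs hdfs fsck /)`:

```shell
# Canned fsck output so this sketch runs without a cluster; on a
# real cluster use:  report=$(sudo -u hdfs hdfs fsck /)
report="The filesystem under path '/' is HEALTHY"

# fsck ends its report with "... is HEALTHY" or "... is CORRUPT"
if echo "$report" | grep -q "is HEALTHY"; then
  status="OK"
else
  status="ALERT"    # hook a mail/paging command in here
fi
echo "fsck status: $status"
```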


9. Have you used any monitoring tools, like Ganglia?
Ganglia and Nagios.

Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and grids.
Ganglia is more concerned with gathering metrics and tracking them over time, while Nagios focuses on being an alerting mechanism.

10. What are the roles and responsibilities in your project?
Hadoop Profiles

  1. Developer
  2. Architect
  3. Admin
  4. Tester
  5. Hadoop Support
  6. Data Scientist


Job Responsibilities of a Hadoop Developer:
A Hadoop Developer has many responsibilities. 

  1. Loading data from an RDBMS into HDFS or HBase using Sqoop
  2. Bulk or batch data processing
  3. Writing components to integrate/automate the system in MapReduce
  4. Data analysis
  5. Hive and Pig queries
  6. Using the Streaming API
In short, we have to explain where you get the data, how it is loaded, how it is analyzed, and what the final business use case of the project is.
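The first responsibility above (loading data from an RDBMS into HDFS with Sqoop) typically looks like this; a hedged sketch where the JDBC URL, username, table, and target directory are all hypothetical:

```shell
# Import one table from MySQL into HDFS (all names are hypothetical)
sqoop import \
  --connect jdbc:mysql://db-host:3306/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /user/etl/orders \
  --num-mappers 4
```

`-P` prompts for the password instead of putting it on the command line, and `--num-mappers` controls how many parallel map tasks split the import.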