Answers to Hadoop Real-time Questions, Part 2


The second set of answers, covering questions 11-24 of the question set, is provided in this post.
Hadoop Real time question


11. Hadoop performance tuning
Please go through Performance Tuning.

12. Planning a Hadoop cluster
Two main aspects need to be considered:

  1. The number of machines
  2. Specification of the machines (RAM, storage, and processor)
Information about the developer-machine specification can be found here (see question no. 2).

Cluster Specification:
Production cluster size:
Base your answer on the daily incoming data size, the duration the project will run, and the hard-disk size of each machine.
Important aspects to consider while planning:

  • Hardware requirements for the NameNode
  • Hardware requirements for the JobTracker/ResourceManager
  • Memory sizing: depends on the size of the data
  • Processors: number of cores
  • Hardware requirements for the slave nodes
  • Capacity planning:
Say we have 70 TB of raw data to store on a yearly basis (i.e. a moving window of 1 year). After compression (say, with Gzip at roughly 60% savings) we get 70 - (70 * 60%) = 28 TB. Replicating that 3x gives 84 TB, but we only want disks about 70% full, so 84 TB = x * 70%, thus x = 84 / 70% = 120 TB is the capacity we need to plan for (see the sketch after this list).
  • Number of nodes: required capacity divided by the disk capacity per node, e.g. 120 TB / (12 * 1 TB) = 10 nodes.
More input from readers on this question is welcome.
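As a quick illustration of the arithmetic above, here is a small Java sketch that reproduces the worked example. All of the figures (70 TB of raw data per year, roughly 60% Gzip savings, a replication factor of 3, 70% target disk utilisation, and 12 x 1 TB disks per node) are the assumptions from this question, not fixed rules.

/**
 * Rough Hadoop capacity-planning sketch for the worked example in question 12.
 * The inputs below are assumptions taken from that example.
 */
public class CapacityPlanner {

    public static void main(String[] args) {
        double rawTb = 70.0;               // raw data retained per year
        double compressionSavings = 0.60;  // fraction removed by Gzip (assumed)
        double replicationFactor = 3.0;    // default HDFS replication
        double targetUtilisation = 0.70;   // keep ~30% headroom on the disks
        double diskTbPerNode = 12 * 1.0;   // 12 disks of 1 TB each per slave node

        double compressedTb = rawTb * (1 - compressionSavings);        // 28 TB
        double replicatedTb = compressedTb * replicationFactor;        // 84 TB
        double requiredCapacityTb = replicatedTb / targetUtilisation;  // 120 TB
        double nodes = Math.ceil(requiredCapacityTb / diskTbPerNode);  // 10 nodes

        System.out.printf("Required raw HDFS capacity: %.0f TB%n", requiredCapacityTb);
        System.out.printf("Slave nodes needed: %.0f%n", nodes);
    }
}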

13. What is Ranger?

Ranger is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform. The vision with Ranger is to provide comprehensive security across the Apache Hadoop ecosystem.
  • A Ranger tutorial can be found here.
14. Have you used UDFs for Pig or Hive?
Answer yes or no.
Pig and Hive ship with built-in functions that can be used in a program without writing any extra code, but sometimes the required logic is not available among them. In that case the user has to write a custom user-defined function (UDF); a minimal example is sketched below.
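For illustration, here is a minimal Hive UDF sketch in Java, using the classic org.apache.hadoop.hive.ql.exec.UDF API. The package and class names are hypothetical, and it assumes the hive-exec and hadoop-common jars are on the classpath.

package com.example.hive.udf;   // hypothetical package name

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

/**
 * Minimal Hive UDF sketch: converts a string column to upper case,
 * the kind of logic you would only hand-write when no built-in fits.
 */
public final class ToUpper extends UDF {

    public Text evaluate(Text input) {
        if (input == null) {
            return null;                    // pass NULLs through unchanged
        }
        return new Text(input.toString().toUpperCase());
    }
}

You would package this into a jar, then register it in Hive with ADD JAR and CREATE TEMPORARY FUNCTION before calling it in a query; Pig UDFs follow a similar pattern by extending org.apache.pig.EvalFunc.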

15. Have you written any scripts to automate things on the cluster?
I've seen this done very nicely using Foreman, Chef, and Ambari Blueprints: Foreman was used to provision the VMs, and Chef scripts were used to install Ambari, configure the Ambari blueprint, and create the cluster from that blueprint.
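As a hedged sketch of the "create the cluster from the Blueprint" step, the snippet below registers a prepared blueprint JSON document against Ambari's REST API. The host, credentials, blueprint name, and file name are placeholder assumptions.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

/**
 * Sketch: register an Ambari Blueprint over the REST API, assuming an
 * Ambari server at ambari-host:8080 with default admin credentials and
 * a blueprint prepared in blueprint.json (all placeholders).
 */
public class RegisterBlueprint {

    public static void main(String[] args) throws Exception {
        byte[] blueprintJson = Files.readAllBytes(Paths.get("blueprint.json"));

        URL url = new URL("http://ambari-host:8080/api/v1/blueprints/my-blueprint");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("X-Requested-By", "ambari");  // required by Ambari on write calls
        String auth = Base64.getEncoder()
                .encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8));
        conn.setRequestProperty("Authorization", "Basic " + auth);

        try (OutputStream out = conn.getOutputStream()) {
            out.write(blueprintJson);
        }
        System.out.println("Ambari responded with HTTP " + conn.getResponseCode());
    }
}

Creating the cluster itself is then a second POST of a cluster-creation template to /api/v1/clusters/<cluster-name>, referencing the registered blueprint.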

16. Kerberos installation and configuration
On the terminal, run this command:
>>user@ubuntu:~$ sudo apt-get install krb5-user
Press Y when asked, then press Enter when prompted for package configuration.
Done.
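Once the cluster is Kerberized, a Hadoop client typically authenticates with a keytab rather than an interactive kinit. A minimal Java sketch, assuming a hypothetical principal and keytab path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

/**
 * Sketch of a Hadoop client logging in against a Kerberized cluster.
 * The principal and keytab path are placeholder values.
 */
public class KerberosLoginExample {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Tell the Hadoop client libraries that the cluster uses Kerberos.
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // Log in with a service principal and its keytab (placeholders).
        UserGroupInformation.loginUserFromKeytab(
                "hdfs-user@EXAMPLE.COM", "/etc/security/keytabs/hdfs-user.keytab");

        System.out.println("Logged in as: " + UserGroupInformation.getLoginUser());
    }
}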

17. How well does Hadoop scale?
Hadoop scales by adding or removing nodes (machines) in the cluster.
Types: scale-up vs. scale-out; in Hadoop, scaling out is done by commissioning and decommissioning nodes.

18. Name the operations for upgrading or resizing the cluster, such as commissioning and decommissioning
Commissioning: adding nodes
Decommissioning: removing nodes

19. Have you used metrics?
No.
Metrics are statistical information exposed by the Hadoop daemons, used for monitoring, performance tuning, and debugging. Many metrics are available by default, and they are very useful for troubleshooting.
You can read more about metrics here; a small sketch of reading them over HTTP follows.
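As an example of consuming them, the sketch below pulls NameNode metrics from the /jmx HTTP endpoint that Hadoop daemons expose on their web port. The host, port (50070, the Hadoop 2.x NameNode default), and MBean query are assumptions to adjust for your cluster.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

/**
 * Minimal sketch: read NameNode metrics as JSON from the /jmx endpoint.
 * Host, port, and MBean name are placeholder assumptions.
 */
public class JmxMetricsReader {

    public static void main(String[] args) throws Exception {
        URL url = new URL(
            "http://namenode:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState");
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(url.openStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);   // raw JSON: capacity, live/dead DataNodes, etc.
            }
        }
    }
}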

20. How do you decide the cluster size based on data size? Which formula do you use?
Check question number 12.

21. Can you explain the complete Hadoop ecosystem and how it works?
Describe everything you know about the Hadoop stack.


22. In a 15-node cluster, how many DataNodes are there?
23. How much data did you process in the 15-node cluster?
24. How much data do you process every day?


Please look at the example specified in question 12 and here. The answers to all of the above questions are covered there.

- See the question list: Here

- See answer set 1: Here