Hadoop Performance Tuning
There is no one-size-fits-all technique for tuning Hadoop jobs. Because of Hadoop's architecture, achieving balance among resources is often more effective than addressing a single problem in isolation.
Depending on the type of job you are running and the amount of data you are moving, the right solution can differ considerably. We encourage you to experiment with these techniques and to report your results.
Bottlenecks
Hadoop resources can be classified into computation, memory, network bandwidth, and input/output (I/O). A job can run slowly if any one of these resources becomes a bottleneck. Below are the common resource bottlenecks in Hadoop jobs.
- CPU – The key compute resource for both Map and Reduce task computation.
- RAM – Main memory available on the slave (NodeManager) nodes.
- Network Bandwidth – When large data sets are being processed, network utilization among nodes is high. This typically occurs when Reduce tasks pull large amounts of data from Map tasks in the Shuffle phase, and also when the job writes its final results to HDFS.
- Storage I/O – File read/write throughput to HDFS. Storage I/O utilization depends heavily on the volume of input, intermediate, and final output data.
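As an illustrative sketch of how these bottlenecks map onto job configuration, the fragment below sets per-task memory and CPU and compresses intermediate map output to reduce shuffle traffic. The property names are standard YARN-era MapReduce settings; the values shown are assumptions you would tune for your own cluster.

```xml
<!-- mapred-site.xml (or per-job -D overrides); values are illustrative only -->
<configuration>
  <!-- RAM: container memory requested for each map and reduce task -->
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>2048</value>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>4096</value>
  </property>
  <!-- CPU: virtual cores requested per map task -->
  <property>
    <name>mapreduce.map.cpu.vcores</name>
    <value>1</value>
  </property>
  <!-- Network/Storage I/O: compress intermediate map output to shrink
       the data pulled by reducers during the Shuffle phase -->
  <property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
</configuration>
```

Raising container memory reduces spills to disk but lowers the number of concurrent containers per node, so memory, CPU, and I/O settings should be balanced against each other rather than maximized individually.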
Below are the common issues that may arise in the MapReduce job execution flow, for example massive I/O caused by large input data in the Map Input stage.
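When the Map Input stage suffers from heavy I/O, one common mitigation is to pack more input into each map task so that fewer tasks each read larger, sequential chunks from HDFS. A minimal sketch using the standard split-size properties is below; the 256 MB and 512 MB values are assumptions, not recommendations.

```xml
<!-- Illustrative per-job overrides: fewer, larger input splits per map task -->
<property>
  <name>mapreduce.input.fileinputformat.split.minsize</name>
  <value>268435456</value> <!-- 256 MB, in bytes -->
</property>
<property>
  <name>mapreduce.input.fileinputformat.split.maxsize</name>
  <value>536870912</value> <!-- 512 MB, in bytes -->
</property>
```

Compressing the input data with a splittable codec, or using CombineTextInputFormat for many small files, reduces the bytes read from HDFS in the same stage.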