Hadoop Performance Tuning
There is no one-size-fits-all technique for tuning Hadoop jobs. Because of Hadoop's architecture, achieving balance among resources is often more effective than addressing a single problem in isolation.
Depending on the type of job you are running and the amount of data you are moving, the right solution can differ considerably. We encourage you to experiment with these techniques and to report your results.
Bottlenecks
Hadoop resources can be classified into computation, memory, network bandwidth, and input/output (I/O). A job can run slowly if any one of these resources becomes a bottleneck. Below are the common resource bottlenecks in Hadoop jobs.
- CPU – The key compute resource for both Map and Reduce task computation.
- RAM – Main memory available on the slave (NodeManager) nodes.
- Network Bandwidth – When large data sets are being processed, network utilization among nodes is high. This typically occurs when Reduce tasks pull large amounts of data from Map tasks in the Shuffle phase, and also when the job writes its final results to HDFS.
- Storage I/O – File read/write throughput to HDFS. Storage I/O utilization depends heavily on the volume of input, intermediate, and final output data.
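As an illustrative sketch of how these bottlenecks map onto job configuration, the fragment below sets per-task memory and CPU and compresses intermediate map output to reduce shuffle traffic. The property names are standard YARN-era MapReduce settings; the values shown are assumptions you would tune for your own cluster.

```xml
<!-- mapred-site.xml (or per-job -D overrides); values are illustrative only -->
<configuration>
  <!-- RAM: container memory requested for each map and reduce task -->
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>2048</value>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>4096</value>
  </property>
  <!-- CPU: virtual cores requested per map task -->
  <property>
    <name>mapreduce.map.cpu.vcores</name>
    <value>1</value>
  </property>
  <!-- Network/Storage I/O: compress intermediate map output to shrink
       the data pulled by reducers during the Shuffle phase -->
  <property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
</configuration>
```

Raising container memory reduces spills to disk but lowers the number of concurrent containers per node, so memory, CPU, and I/O settings should be balanced against each other rather than maximized individually.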
Below are the common issues that may arise in the MapReduce job execution flow, for example massive I/O caused by large input data in the Map Input stage.
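When the Map Input stage suffers from heavy I/O, one common mitigation is to pack more input into each map task so that fewer tasks each read larger, sequential chunks from HDFS. A minimal sketch using the standard split-size properties is below; the 256 MB and 512 MB values are assumptions, not recommendations.

```xml
<!-- Illustrative per-job overrides: fewer, larger input splits per map task -->
<property>
  <name>mapreduce.input.fileinputformat.split.minsize</name>
  <value>268435456</value> <!-- 256 MB, in bytes -->
</property>
<property>
  <name>mapreduce.input.fileinputformat.split.maxsize</name>
  <value>536870912</value> <!-- 512 MB, in bytes -->
</property>
```

Compressing the input data with a splittable codec, or using CombineTextInputFormat for many small files, reduces the bytes read from HDFS in the same stage.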