Massive Network Traffic Caused by Large Map Output
Problem 3 – Massive Network Traffic Caused by Large Map Output
Large output from the Map phase increases disk I/O and data transfer time. In the worst case it can raise exceptions, when all the I/O channels are saturated or the network bandwidth is exhausted.
This issue shows up as high values in the following job counters, sometimes accompanied by the exception listed below.
• Job counters: FILE_BYTES_WRITTEN, FILE_BYTES_READ, Combine Input Records
• Possible exceptions: java.io.IOException
Solution 3.1: Compress Map Output
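If the Map output is very large, compressing the intermediate data is recommended to reduce the amount written to disk and sent over the network. By default, Map output is not compressed. To enable compression, set mapreduce.map.output.compress to true and, optionally, choose a codec with mapreduce.map.output.compress.codec. The properties and their default values are: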
Property | Default value
mapreduce.map.output.compress | false
mapreduce.map.output.compress.codec | org.apache.hadoop.io.compress.DefaultCodec
The following snippet enables gzip compression of the Map output in a job:
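import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
// Compress the intermediate map output before it is spilled and shuffled.
conf.setBoolean("mapreduce.map.output.compress", true);
// Use gzip as the map output codec.
conf.setClass("mapreduce.map.output.compress.codec", GzipCodec.class, CompressionCodec.class);
// Job.getInstance replaces the deprecated Job(Configuration) constructor.
Job job = Job.getInstance(conf);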
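To get a rough feel for how much repetitive intermediate data can shrink, the sketch below gzips a synthetic block of key/value records in memory. It is only an illustration: it uses the JDK's java.util.zip.GZIPOutputStream rather than Hadoop's GzipCodec, and the class name, keys, and record layout are made up. Real Map output is serialized differently, so the ratio on an actual job may differ.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

class MapOutputCompressionSketch {

    // Builds a fake block of intermediate (key, value) records, similar in
    // spirit to word-count map output. The keys are hypothetical.
    static byte[] sampleMapOutput(int records) {
        String[] keys = {"error", "warn", "info", "debug"};
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < records; i++) {
            sb.append(keys[i % keys.length]).append('\t').append(1).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Gzips a byte array in memory and returns the compressed bytes.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(raw);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = sampleMapOutput(100_000);
        byte[] packed = gzip(raw);
        System.out.printf("raw=%d bytes, gzip=%d bytes%n", raw.length, packed.length);
    }
}
```

Compression trades CPU time for less disk and network I/O, so it is worth measuring on a representative job before enabling it everywhere.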
Solution 3.2: Implement a Combiner
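We can also reduce the network I/O caused by Map output by implementing a Combiner, which pre-aggregates records on the map side before they are shuffled. This is only safe if the aggregate operation is commutative and associative: sum, count, min, and max qualify, whereas a plain average does not.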
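The effect of a Combiner can be sketched in plain Java, without a cluster. The example simulates two mappers emitting (word, 1) pairs and compares reducing the raw records with reducing locally combined partial sums. The class and method names are hypothetical and the "shuffle" is just a list. In a real job the combiner is registered with job.setCombinerClass(...), and for a sum the reducer class can usually be reused as the combiner.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CombinerSketch {

    // Sums values per key. Because addition is commutative and associative,
    // the same function can act as both combiner and reducer.
    static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> records) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> r : records) {
            totals.merge(r.getKey(), r.getValue(), Integer::sum);
        }
        return totals;
    }

    // Records that cross the "network" when no combiner is used:
    // every mapper's raw output is sent as-is.
    static List<Map.Entry<String, Integer>> shuffleRaw(
            List<List<Map.Entry<String, Integer>>> mapOutputs) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> oneMapper : mapOutputs) {
            shuffled.addAll(oneMapper);
        }
        return shuffled;
    }

    // Records that cross the "network" when each mapper's output is first
    // combined locally into one partial sum per key.
    static List<Map.Entry<String, Integer>> shuffleCombined(
            List<List<Map.Entry<String, Integer>>> mapOutputs) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> oneMapper : mapOutputs) {
            shuffled.addAll(sumByKey(oneMapper).entrySet());
        }
        return shuffled;
    }

    public static void main(String[] args) {
        List<List<Map.Entry<String, Integer>>> mapOutputs = List.of(
                List.of(Map.entry("a", 1), Map.entry("b", 1), Map.entry("a", 1)),
                List.of(Map.entry("a", 1), Map.entry("b", 1)));

        List<Map.Entry<String, Integer>> raw = shuffleRaw(mapOutputs);
        List<Map.Entry<String, Integer>> combined = shuffleCombined(mapOutputs);

        System.out.println("records shuffled without combiner: " + raw.size());
        System.out.println("records shuffled with combiner:    " + combined.size());
        System.out.println("same totals: " + sumByKey(raw).equals(sumByKey(combined)));
    }
}
```

Hadoop treats the combiner as an optimization and may run it zero, one, or several times per map task, which is why the final result must not depend on whether or how often it is applied.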