Problem 3 – Massive Network Traffic Caused by Large Map Output

Large output from the Map phase increases I/O and data-transfer time, and in the worst case can raise exceptions if the I/O throughput channels are saturated or the network bandwidth is exhausted.

We can identify this issue by high values in the job counters below; a sketch for reading these counters programmatically follows the list.

• Job counters: FILE_BYTES_WRITTEN, FILE_BYTES_READ, Combine Input Records

• Possible exceptions: java.io.IOException
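
As an illustrative sketch only, assuming the Hadoop 2.x counter API and a completed org.apache.hadoop.mapreduce.Job handle named job, these counters can also be inspected from the driver:

import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.FileSystemCounter;
import org.apache.hadoop.mapreduce.TaskCounter;

// 'job' is assumed to be a completed org.apache.hadoop.mapreduce.Job instance
Counters counters = job.getCounters();
long fileBytesWritten = counters.findCounter("FILE", FileSystemCounter.BYTES_WRITTEN).getValue();
long fileBytesRead = counters.findCounter("FILE", FileSystemCounter.BYTES_READ).getValue();
long combineInputRecords = counters.findCounter(TaskCounter.COMBINE_INPUT_RECORDS).getValue();
System.out.println("FILE_BYTES_WRITTEN    = " + fileBytesWritten);
System.out.println("FILE_BYTES_READ       = " + fileBytesRead);
System.out.println("Combine input records = " + combineInputRecords);

Unusually large FILE_BYTES_WRITTEN/FILE_BYTES_READ values relative to the input size, together with zero combine input records, suggest that the intermediate Map output is a good candidate for compression or a Combiner.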

Solution 3.1: Compress Map Output

If the Map output is very large, it is recommended to compress it to reduce the size of the intermediate data. By default, Map output is not compressed, but we can enable compression by setting mapreduce.map.output.compress to true and choosing a codec with mapreduce.map.output.compress.codec. The relevant properties and their defaults are:

• mapreduce.map.output.compress (default: false)

• mapreduce.map.output.compress.codec (default: org.apache.hadoop.io.compress.DefaultCodec)

Below is the code snippet to enable gzip map output compression in our job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
conf.setBoolean("mapreduce.map.output.compress", true);   // compress intermediate map output
conf.setClass("mapreduce.map.output.compress.codec",
    GzipCodec.class, CompressionCodec.class);             // use the gzip codec
Job job = Job.getInstance(conf);
Solution 3.2: Implement a Combiner

We can also reduce the network I/O caused by the Map output by implementing a Combiner, provided the aggregation operation is commutative and associative (for example, summing counts); a minimal sketch follows.
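
The snippet below is an illustrative sketch, not the only possible implementation: a sum-style Combiner for a hypothetical job whose map output is (Text, IntWritable) counts. The class name SumCombiner is an assumed example.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Summing is commutative and associative, so combining partial sums on the
// map side does not change the final reduce result.
public class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

In the driver, register it with job.setCombinerClass(SumCombiner.class). Note that the Combiner's input and output key/value types must match the map output types, because the framework may apply it zero, one, or multiple times per map task.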