Problem 4 – Massive Network Traffic Caused by Large Reduce Output
Large output from reducers causes a lot of I/O write operations to HDFS.
We can identify this issue by high values in the job counters below.
         Job counters: Bytes Written, HDFS_BYTES_WRITTEN
         Possible exceptions: java.io.IOException
The above two counters denote the volume of data written by the Reduce phase, but they do not account for the replication factor. If the replication factor is greater than one, blocks of the output data are replicated to other nodes, which requires additional I/O for reads and writes and also consumes network bandwidth.
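For a concrete (hypothetical) illustration: if the reducers write 10 GB of output and the replication factor is 3, HDFS stores 3 × 10 GB = 30 GB across the cluster, and roughly 20 GB of that travels over the network to the nodes holding the second and third replicas.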
Solution 4.1: Compress Reducer/Final Output
We can enable compression of a MapReduce job's output by setting the properties below, either per job or at the site level for all jobs.

mapreduce.output.fileoutputformat.compress (default: false) - whether to compress the job's final output.
mapreduce.output.fileoutputformat.compress.type (default: RECORD) - for SequenceFile output, one of NONE, RECORD or BLOCK.
mapreduce.output.fileoutputformat.compress.codec (default: org.apache.hadoop.io.compress.DefaultCodec) - the compression codec class to use.


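The same properties can also be set per job on the Configuration object in the driver. Below is a minimal sketch, assuming Snappy as the codec and an illustrative job name:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
// Same properties as in the table above, applied only to this job
conf.setBoolean("mapreduce.output.fileoutputformat.compress", true);
conf.set("mapreduce.output.fileoutputformat.compress.type", "BLOCK");
conf.set("mapreduce.output.fileoutputformat.compress.codec",
        "org.apache.hadoop.io.compress.SnappyCodec");
Job job = Job.getInstance(conf, "compressed-output-job"); // hypothetical job name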
         If Output Files are Not Sequence Files: set the above properties either in the job driver, using a code snippet like the one below (Gzip compression in this example), or in the mapred-site.xml file.

// In the job driver: compress the job's final output with Gzip
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
         If Output Files are Sequence Files: set the output format and enable Snappy compression with block mode in the job driver.

// In the job driver: write SequenceFile output compressed with Snappy in block mode
job.setOutputFormatClass(SequenceFileOutputFormat.class);
SequenceFileOutputFormat.setCompressOutput(job, true);
SequenceFileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
         To Make Global Changes to the Cluster: add the below properties to mapred-site.xml so that all jobs compress their final output by default.

<property>
     <name>mapreduce.output.fileoutputformat.compress</name>
     <value>true</value>
</property>
<property>
     <name>mapreduce.output.fileoutputformat.compress.codec</name>
     <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
     <name>mapreduce.output.fileoutputformat.compress.type</name>
     <value>BLOCK</value>
</property>
Solution 4.2: Adjust Replication Factor
By reducing the replication factor of the job's output to 1 when extra replicas are not needed, we can improve job performance, since the output blocks no longer have to be copied to multiple nodes. Set the dfs.replication property to 1 using the conf object in the job driver program, as sketched below.
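A minimal sketch of this in the driver, assuming an illustrative job name; note that a replication factor of 1 gives up HDFS fault tolerance for this output, so it is only suitable for data that can be regenerated:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
// Keep only one replica of this job's output blocks; nothing is copied
// to other DataNodes, at the cost of losing redundancy for this output.
conf.set("dfs.replication", "1");
Job job = Job.getInstance(conf, "reduce-heavy-job"); // hypothetical job name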