Problem 4 – Massive Network Traffic Caused by Large Reduce Output
Large output from reducers causes a lot of I/O write operations to HDFS.
We can identify this issue by high values in the job counters below.
         Job counters: Bytes Written, HDFS_BYTES_WRITTEN
         Possible exceptions: java.io.IOException
The above two counters denote the volume of data written by the Reduce phase, but they do not account for the replication factor. If the replication factor is greater than one, blocks of the output data are replicated to other nodes, which requires additional I/O for reads and writes and also consumes network bandwidth.
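For a concrete (hypothetical) illustration: if the reducers write 10 GB of output and the replication factor is 3, HDFS stores 3 × 10 GB = 30 GB across the cluster, and roughly 20 GB of that travels over the network to the nodes holding the second and third replicas.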
Solution 4.1: Compress Reducer/Final Output
We can enable compression of a MapReduce job's output by setting the properties below, either per job or at the site level for all jobs.

mapreduce.output.fileoutputformat.compress (default: false) - whether to compress the job's final output.
mapreduce.output.fileoutputformat.compress.type (default: RECORD) - for SequenceFile output, one of NONE, RECORD or BLOCK.
mapreduce.output.fileoutputformat.compress.codec (default: org.apache.hadoop.io.compress.DefaultCodec) - the compression codec class to use.


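The same properties can also be set per job on the Configuration object in the driver. Below is a minimal sketch, assuming Snappy as the codec and an illustrative job name:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
// Same properties as in the table above, applied only to this job
conf.setBoolean("mapreduce.output.fileoutputformat.compress", true);
conf.set("mapreduce.output.fileoutputformat.compress.type", "BLOCK");
conf.set("mapreduce.output.fileoutputformat.compress.codec",
        "org.apache.hadoop.io.compress.SnappyCodec");
Job job = Job.getInstance(conf, "compressed-output-job"); // hypothetical job name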
         If Output Files are Not Sequence Files: set the above properties either in the job driver, using a code snippet like the one below (Gzip compression in this example), or in the mapred-site.xml file.

// In the job driver: compress the job's final output with Gzip
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
         If Output Files are Sequence Files: set the output format and enable Snappy compression with block mode in the job driver.

// In the job driver: write SequenceFile output compressed with Snappy in block mode
job.setOutputFormatClass(SequenceFileOutputFormat.class);
SequenceFileOutputFormat.setCompressOutput(job, true);
SequenceFileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
         To Make Global Changes to the Cluster: add the below properties to mapred-site.xml so that all jobs compress their final output by default.

<property>
     <name>mapreduce.output.fileoutputformat.compress</name>
     <value>true</value>
</property>
<property>
     <name>mapreduce.output.fileoutputformat.compress.codec</name>
     <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
     <name>mapreduce.output.fileoutputformat.compress.type</name>
     <value>BLOCK</value>
</property>
Solution 4.2: Adjust Replication Factor
By reducing the replication factor of the job's output to 1 when extra replicas are not needed, we can improve job performance, since the output blocks no longer have to be copied to multiple nodes. Set the dfs.replication property to 1 using the conf object in the job driver program, as sketched below.
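A minimal sketch of this in the driver, assuming an illustrative job name; note that a replication factor of 1 gives up HDFS fault tolerance for this output, so it is only suitable for data that can be regenerated:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
// Keep only one replica of this job's output blocks; nothing is copied
// to other DataNodes, at the cost of losing redundancy for this output.
conf.set("dfs.replication", "1");
Job job = Job.getInstance(conf, "reduce-heavy-job"); // hypothetical job name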