Problem 4 – Massive Network Traffic Caused by Large Reduce Output
Large output from the Reducers causes a large number of write operations to HDFS. We can identify this issue from high values in the job counters below.
• Job counters: Bytes Written, HDFS_BYTES_WRITTEN
• Possible exceptions: java.io.IOException
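As a quick way to inspect these values programmatically, the driver can read the counters after the job finishes. The sketch below assumes the Hadoop 2.x counter group and name strings ("org.apache.hadoop.mapreduce.FileSystemCounter", "HDFS_BYTES_WRITTEN"); they may differ in other versions.

import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;

// ... in the driver, after job.waitForCompletion(true) ...
Counters counters = job.getCounters();

// Total number of bytes the job wrote to HDFS (includes the reduce output)
long hdfsBytesWritten = counters
        .findCounter("org.apache.hadoop.mapreduce.FileSystemCounter",
                     "HDFS_BYTES_WRITTEN")
        .getValue();
System.out.println("HDFS_BYTES_WRITTEN = " + hdfsBytesWritten);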
The above two counters denote the volume of data written by the Reduce phase, but they do not include the replication factor. If the replication factor is greater than one, each block of output data is replicated to other nodes, which requires additional read and write I/O and also consumes network bandwidth. For example, with a replication factor of 3, 100 GB of reduce output results in roughly 300 GB of HDFS writes, most of which travels over the network.
Solution 4.1: Compress Reducer/Final Output
We can enable compression of a MapReduce job's output by setting the properties below, either per job or at the site level for all jobs.
Property | Default | Description
mapreduce.output.fileoutputformat.compress | false | Compress the job output?
mapreduce.output.fileoutputformat.compress.type | RECORD | For SequenceFile output, should be one of NONE, RECORD or BLOCK.
mapreduce.output.fileoutputformat.compress.codec | org.apache.hadoop.io.compress.DefaultCodec | The compression codec to use.
• If the output files are not SequenceFiles: set the above properties either in the job driver, using a code snippet like the one below (Gzip compression in this example; another installed codec such as SnappyCodec can be substituted), or in the mapred-site.xml file.
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Compress the final (reduce) output of the job
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
• If the output files are SequenceFiles:
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Write the output as block-compressed SequenceFiles using Snappy
job.setOutputFormatClass(SequenceFileOutputFormat.class);
SequenceFileOutputFormat.setCompressOutput(job, true);
SequenceFileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
• To make global changes to the cluster, add the following to mapred-site.xml:
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.type</name>
  <value>BLOCK</value>
</property>
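Site-level settings apply to every job on the cluster. If a particular job should not compress its output, the setting can be overridden in that job's driver; a minimal sketch, using the same property name as above:

// Override the site-wide default for this job only
job.getConfiguration().setBoolean(
        "mapreduce.output.fileoutputformat.compress", false);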
Solution 4.2: Adjust Replication Factor
When the extra replicas of the job output are not needed, we can improve job performance by reducing the replication factor to 1, since the output data is then not copied to additional nodes.
Set the dfs.replication property to 1 on the conf object in the job driver program, as shown in the sketch below.
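A minimal sketch of a driver that lowers the replication factor for this job's output files; the job name used here is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
// Files written by this job will be created with a single HDFS replica
conf.set("dfs.replication", "1");
Job job = Job.getInstance(conf, "reduce-heavy-job");  // hypothetical job name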