Massive I/O Caused by Large Input Data in Map Input Stage
Problem
1 – Massive I/O Caused by Large Input Data in Map Input Stage
This problem occurs most often in jobs with light computation and a large volume of source data. If disk I/O is not fast enough, computation resources sit idle for most of the job waiting for incoming data, so overall performance is constrained by disk I/O.
We can identify this issue by high values in the job counters below; a small sketch for reading them from the driver follows the list.
- Job counters: Bytes Read, HDFS_BYTES_READ
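As a quick check, these counters can also be read programmatically once the job finishes. Below is a minimal sketch, assuming `job` is the driver's already-configured `Job` instance; the counter group and name shown are the ones Hadoop 2.x/3.x reports as "HDFS: Number of bytes read".

```java
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;

// Minimal sketch: print how many bytes the job read from HDFS once it completes.
// "job" is assumed to be the driver's fully configured Job instance.
public static void printHdfsBytesRead(Job job) throws Exception {
    job.waitForCompletion(true);
    Counter hdfsBytesRead = job.getCounters().findCounter(
            "org.apache.hadoop.mapreduce.FileSystemCounter", "HDFS_BYTES_READ");
    System.out.println("HDFS_BYTES_READ = " + hdfsBytesRead.getValue());
}
```

If this value is close to the total size of the input data set and map time dominates the job, the map input stage is the place to optimize.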
Solution
1: Compress Input Data
Compress input data – compressing files saves storage space on HDFS and also improves transfer speed.
We can use any of the compression formats listed below on input data sets; a short example of compressing a file with one of them follows the table.
| Format  | Codec                                       | Extension | Splittable | Hadoop support |
|---------|---------------------------------------------|-----------|------------|----------------|
| DEFLATE | org.apache.hadoop.io.compress.DefaultCodec  | .deflate  | N          | Y              |
| Gzip    | org.apache.hadoop.io.compress.GzipCodec     | .gz       | N          | Y              |
| Bzip2   | org.apache.hadoop.io.compress.BZip2Codec    | .bz2      | Y          | Y              |
| LZO     | com.hadoop.compression.lzo.LzopCodec        | .lzo      | N          | Y              |
| LZ4     | org.apache.hadoop.io.compress.Lz4Codec      | .lz4      | N          | Y              |
| Snappy  | org.apache.hadoop.io.compress.SnappyCodec   | .snappy   | N          | Y              |

Splittable formats such as Bzip2 let Hadoop process one large file with multiple parallel map tasks, whereas a non-splittable file is handled by a single mapper; LZO becomes splittable only after the file is indexed with the hadoop-lzo indexer.
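As a concrete example, raw text files can be rewritten into a splittable compressed format before the job runs. The sketch below uses Hadoop's BZip2Codec; the HDFS paths are hypothetical placeholders.

```java
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CompressToBzip2 {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        CompressionCodec codec = ReflectionUtils.newInstance(BZip2Codec.class, conf);

        // Hypothetical paths: a plain text file is rewritten as a .bz2 file on HDFS.
        Path src = new Path("/data/raw/events.log");
        Path dst = new Path("/data/input/events.log.bz2");

        try (InputStream in = fs.open(src);
             OutputStream out = codec.createOutputStream(fs.create(dst))) {
            IOUtils.copyBytes(in, out, conf);  // stream-copy while compressing on the fly
        }
    }
}
```

Because Bzip2 is splittable, the resulting .bz2 file can still be read by several map tasks in parallel.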
When we submit a MapReduce job against compressed data in HDFS, Hadoop determines whether the source file is compressed by checking its file name extension; if the extension matches a registered codec, Hadoop decompresses the file automatically with that codec. Users therefore do not need to specify a codec explicitly in the MapReduce job.
However, if the file name extension does not follow these naming conventions, Hadoop will not recognize the format and will not decompress the file automatically. To enable this self-detection and decompression, we must make sure the file name extension matches the extension registered for the codec (see the table above).
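The extension check described above can be reproduced directly with CompressionCodecFactory, the class Hadoop's input formats rely on for codec detection. In this sketch the file names are made up for illustration; getCodec() returns null when the extension is not registered for any codec, in which case the file is read as plain, uncompressed data.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecDetection {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);

        // Recognized extension: resolves to GzipCodec.
        CompressionCodec gz = factory.getCodec(new Path("/data/input/events.log.gz"));
        System.out.println(".gz   -> " + (gz == null ? "not recognized" : gz.getClass().getSimpleName()));

        // Unrecognized extension: getCodec() returns null, so the file would be
        // read as-is (no automatic decompression) by the job's input format.
        CompressionCodec unknown = factory.getCodec(new Path("/data/input/events.log.gzip"));
        System.out.println(".gzip -> " + (unknown == null ? "not recognized" : unknown.getClass().getSimpleName()));
    }
}
```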
Visit : here (for more info)