InputSplit describes
the data to be processed by a Mapper instance. FileSplit is the default
InputSplit. Each split is divided into records, and one mapper processes one
split.
[Figure: Hadoop InputSplit and RecordReader]
Don't confuse block size with split size. Block size is the physical representation of the data, whereas an input split is a logical representation. A map task reads data from blocks through splits.
If your resources are limited and you want to limit the number of map tasks, you can increase the split size. For example, suppose a 640MB file is stored as 10 blocks (each block is 64MB). If you set the split size to 128MB, then 2 blocks (64 + 64 = 128) are logically grouped together and 5 map tasks will run in parallel.
How to set split size
You can configure the split size by setting the following properties.
Property: mapreduce.input.fileinputformat.split.minsize
Description: The minimum size chunk that map input should be split into.

Property: mapreduce.input.fileinputformat.split.maxsize
Description: The maximum size chunk that map input should be split into.
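These properties combine with the HDFS block size to determine the split size Hadoop actually uses: FileInputFormat computes it as max(minSize, min(maxSize, blockSize)). Below is a minimal pure-Java sketch of that rule applied to the 640MB example above; the class and method names are illustrative, not part of the Hadoop API.

```java
public class SplitSizeDemo {
    // Sketch of the split-size rule used by FileInputFormat:
    // splitSize = max(minSize, min(maxSize, blockSize))
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 64 * mb;       // HDFS block size
        long minSize = 128 * mb;        // mapreduce.input.fileinputformat.split.minsize
        long maxSize = Long.MAX_VALUE;  // default: no maximum

        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        long fileSize = 640 * mb;
        long numSplits = (fileSize + splitSize - 1) / splitSize; // ceiling division

        System.out.println(splitSize / mb); // 128
        System.out.println(numSplits);      // 5 map tasks for the 640MB file
    }
}
```

Raising minsize above the block size is what merges blocks into larger splits, as in the 640MB example.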
You can pass these properties on the command line using the -D option, or set them in your code using the setter methods of the Configuration object.
Job job = Job.getInstance(getConf());
job.setJarByClass(MyJob.class);
job.getConfiguration().set("<property-name>", "<value>");
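For the command-line route, the same properties can be passed with -D when launching the job. The jar name, driver class, and input/output paths below are placeholders; this also assumes the driver runs through ToolRunner, which is what makes generic -D options work.

```shell
# Set a 128MB max split size for this run only (134217728 = 128 * 1024 * 1024).
# Jar, driver class, and paths are placeholders for your own job.
hadoop jar myjob.jar MyJob \
  -D mapreduce.input.fileinputformat.split.maxsize=134217728 \
  input_dir output_dir
```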
How to prevent splitting?
There are some scenarios where you want to process an entire file using a single mapper instance. You can achieve this by extending the TextInputFormat class and overriding the isSplitable method.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitPreventor extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}
If the isSplitable method returns false, the entire file is processed by a single mapper. Register the format on the job with job.setInputFormatClass(SplitPreventor.class).
You can also prevent splitting another way, by setting mapreduce.input.fileinputformat.split.minsize to a value larger than your file size.
Referred Articles
https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml