Saturday, 2 January 2016

Hadoop: MapReduce: InputSplit

InputSplit describes the data to be processed by a Mapper instance. FileSplit is the default InputSplit. Each split is divided into records, and one mapper processes one split.

[Figure: InputSplit and RecordReader in Hadoop MapReduce]

Don’t confuse block size with split size. A block is the physical representation of the data, whereas an input split is a logical representation of it. A map task reads data from blocks through splits.

If your resources are limited and you want to limit the number of maps, you can increase the split size. For example, suppose a 640 MB file is stored as 10 blocks, each 64 MB in size. If you set the split size to 128 MB, then two blocks (64 + 64 = 128) are logically grouped together and 5 maps will run in parallel.
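The arithmetic above can be sketched in plain Java. This is a minimal, self-contained sketch with no Hadoop dependency; the split-size formula mirrors the one FileInputFormat uses, max(minSize, min(maxSize, blockSize)), and the sizes match the 640 MB example:

```java
// Sketch of how the split size and map count are derived.
// Mirrors FileInputFormat's computeSplitSize formula; the numbers
// below reproduce the 640 MB / 10-block example from the text.
public class SplitMath {

    // splitSize = max(minSize, min(maxSize, blockSize))
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Number of splits = ceiling(fileSize / splitSize)
    static long splitCount(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 64 * mb;       // HDFS block size
        long minSize = 128 * mb;        // mapreduce.input.fileinputformat.split.minsize
        long maxSize = Long.MAX_VALUE;  // default split.maxsize (effectively unlimited)

        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        long maps = splitCount(640 * mb, splitSize);
        System.out.println((splitSize / mb) + " MB splits -> " + maps + " mappers");
        // -> 128 MB splits -> 5 mappers
    }
}
```

Raising minsize above the block size is what forces blocks to be grouped: with the defaults, the formula simply returns the block size and you get one map per block.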

How to set split size
You can configure the split size by setting the following properties.

Property: mapreduce.input.fileinputformat.split.minsize
Description: The minimum size chunk that map input should be split into.

Property: mapreduce.input.fileinputformat.split.maxsize
Description: The maximum size chunk that map input should be split into.

You can pass these properties on the command line using the -D option, or set them in your code using the setter methods of the Configuration object.

Job job = Job.getInstance(getConf());
job.setJarByClass(MyJob.class);
job.getConfiguration().set("<property-name>", "<value>");
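For the command-line route, a hypothetical invocation might look like the following. The jar name, driver class, and input/output paths are placeholders; the -D options work only if the driver is run through ToolRunner (or otherwise uses GenericOptionsParser), and they must appear before the program arguments:

```shell
# 134217728 bytes = 128 MB, 268435456 bytes = 256 MB
hadoop jar myjob.jar MyJob \
  -D mapreduce.input.fileinputformat.split.minsize=134217728 \
  -D mapreduce.input.fileinputformat.split.maxsize=268435456 \
  /input /output
```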

How to prevent splitting?

There are some scenarios where you want to process an entire file with a single mapper instance. You can achieve this by extending the TextInputFormat class and overriding its isSplitable method.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitPreventor extends TextInputFormat {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Never split: each input file becomes a single InputSplit.
        return false;
    }
}

If the isSplitable method returns false, the entire file is processed by a single mapper. Register the input format on the job with job.setInputFormatClass(SplitPreventor.class).
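The effect of isSplitable can be illustrated with a simplified sketch. This is plain Java, not Hadoop's actual getSplits code: it only shows that a splitable file yields one split per chunk, while a non-splitable file yields exactly one split covering the whole file:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model (not Hadoop's real implementation) of how
// split generation honors the isSplitable flag.
public class SplitSimulator {

    // Each split is modeled as {offset, length}.
    static List<long[]> getSplits(long fileSize, long splitSize, boolean splitable) {
        List<long[]> splits = new ArrayList<>();
        if (!splitable) {
            // Whole file as one split -> one mapper.
            splits.add(new long[] {0, fileSize});
            return splits;
        }
        for (long offset = 0; offset < fileSize; offset += splitSize) {
            splits.add(new long[] {offset, Math.min(splitSize, fileSize - offset)});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println(getSplits(640 * mb, 128 * mb, true).size());  // 5 splits
        System.out.println(getSplits(640 * mb, 128 * mb, false).size()); // 1 split
    }
}
```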

You can also prevent splitting by setting the property mapreduce.input.fileinputformat.split.minsize to a value larger than your file size.

Referred Articles
https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml



