InputFormat specifies how to read data from a file into Mapper instances. Hadoop provides several implementations of the InputFormat class to work with text and binary files, and it also lets you define custom input formats; I will explain custom input formats in later posts.
TextInputFormat is the default InputFormat. You can choose the InputFormat for your job by using the setInputFormatClass() method of the Job class:

public void setInputFormatClass(Class<? extends InputFormat> cls)

Sets the InputFormat for the job.
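For example, the call below is where that choice is made. This is a minimal sketch (the class name InputFormatDemo and the job name "demo" are placeholders), shown with the default TextInputFormat set explicitly:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "demo"); // placeholder job name
        // TextInputFormat is already the default; setting it explicitly
        // just shows where the choice is made.
        job.setInputFormatClass(TextInputFormat.class);
    }
}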
The following are the standard input formats used in a MapReduce job.
InputFormat | Description | Key | Value
TextInputFormat | Input format for plain text files. | The position (byte offset) of the line within the file. | The contents of the line.
KeyValueTextInputFormat | Input format for plain text files. Each line is divided into key and value parts by a separator byte; the default separator is the tab character. The separator can be changed via the configuration property mapreduce.input.keyvaluelinerecordreader.key.value.separator. | Everything up to the first separator byte. | The remainder of the line.
SequenceFileInputFormat | Input format for sequence files. SequenceFiles are flat files consisting of binary key/value pairs. | User defined. | User defined.
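As the table notes, the KeyValueTextInputFormat separator is configurable. Here is a minimal sketch of overriding it, assuming comma-separated input (the class name SeparatorDemo and the job name are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class SeparatorDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Split each line at the first ',' instead of the default tab.
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
        Job job = Job.getInstance(conf, "SeparatorDemo"); // placeholder job name
        job.setInputFormatClass(KeyValueTextInputFormat.class);
    }
}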
In the previous examples I didn’t specify any InputFormat for the MapReduce job, so it defaulted to TextInputFormat. Here I am going to use the KeyValueTextInputFormat class to process the following data.
E12345 HariKrishna Gurram
E23456 KiranKumar Darsi
E34567 Sandesh Nair
E45678 Preethi Nair
E56789 Sunil Kumar
E67891 SrinathVenkata ramani
E78910 Arpan Debroy
E89101 Phalgun garimella
“emp.txt” contains records of the form (employeeId firstName lastName); I want to extract (employeeId firstName) from the emp.txt file.
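To make the split concrete: KeyValueTextInputFormat hands the mapper everything up to the first tab as the key and the rest of the line as the value. A quick plain-Java sketch (no Hadoop needed; SplitDemo is just an illustration):

public class SplitDemo {
    public static void main(String[] args) {
        String line = "E12345\tHariKrishna\tGurram"; // one record from emp.txt
        int sep = line.indexOf('\t');
        String key = line.substring(0, sep);     // "E12345" -> map key
        String value = line.substring(sep + 1);  // "HariKrishna\tGurram" -> map value
        System.out.println(key + " | " + value);
    }
}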
Step 1: The following application extracts (employeeId firstName) from the emp.txt file.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FirstNameExtractor {

    // KeyValueTextInputFormat has already split each line at the first tab,
    // so the key is the employeeId and the value is "firstName<TAB>lastName".
    public static class TokenizerMapper extends Mapper<Text, Text, Text, Text> {
        public void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
            // Keep only the first name (everything before the next tab).
            Text firstName = new Text(value.toString().split("\t")[0]);
            context.write(key, firstName);
        }
    }

    // Each employeeId occurs only once in emp.txt, so the reducer simply
    // emits the single first name it receives for a key.
    public static class NameReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            Text name = null;
            for (Text firstName : values) {
                name = firstName;
            }
            context.write(key, name);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "FirstName Extractor");
        job.setJarByClass(FirstNameExtractor.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(NameReducer.class);
        job.setReducerClass(NameReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Step 2: Compile the above Java file.
$ hadoop com.sun.tools.javac.Main FirstNameExtractor.java
Step 3: Create a jar file.
$ jar cf extract.jar FirstNameExtractor*.class
Step 4: Run the jar file.
$ hadoop jar extract.jar FirstNameExtractor /user/harikrishna_gurram/emp.txt /user/harikrishna_gurram/names_extractor
The output file is located at “/user/harikrishna_gurram/names_extractor”.
$ hadoop fs -ls /user/harikrishna_gurram/names_extractor
Found 2 items
-rw-r--r--   3 harikrishna_gurram supergroup          0 2015-06-24 10:37 /user/harikrishna_gurram/names_extractor/_SUCCESS
-rw-r--r--   3 harikrishna_gurram supergroup        131 2015-06-24 10:37 /user/harikrishna_gurram/names_extractor/part-r-00000

$ hadoop fs -cat /user/harikrishna_gurram/names_extractor/part-r-00000
E12345 HariKrishna
E23456 KiranKumar
E34567 Sandesh
E45678 Preethi
E56789 Sunil
E67891 SrinathVenkata
E78910 Arpan
E89101 Phalgun