Saturday, 2 January 2016

Hadoop Input Formats

InputFormat specifies how data is read from a file and presented to Mapper instances. Hadoop provides several implementations of the InputFormat class for working with text and binary files. Hadoop also lets you define custom input formats; I will explain this in later posts.

TextInputFormat is the default InputFormat. You can choose the InputFormat for your job using the setInputFormatClass() method of the Job class.

public void setInputFormatClass(Class<? extends InputFormat> cls)
Set the InputFormat for the job.
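
For example, switching a job to KeyValueTextInputFormat looks like this (a minimal sketch; the job name is a placeholder):

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "kv demo");
// Use KeyValueTextInputFormat instead of the default TextInputFormat
job.setInputFormatClass(KeyValueTextInputFormat.class);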

Following are the standard input formats used in MapReduce jobs.

InputFormat: TextInputFormat
Description: Input format for plain text files.
Key: The position (byte offset) of the line within the file.
Value: The contents of the line.

InputFormat: KeyValueTextInputFormat
Description: Input format for plain text files. Each line is divided into key and value parts by a separator byte; the default separator is the tab character.
Key: Everything up to the first separator byte. The separator can be changed in the configuration under the attribute name mapreduce.input.keyvaluelinerecordreader.key.value.separator.
Value: The remainder of the line.

InputFormat: SequenceFileInputFormat
Description: Input format for sequence files. SequenceFiles are flat files consisting of binary key/value pairs.
Key: User defined.
Value: User defined.
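
For instance, to make KeyValueTextInputFormat split each line on a comma instead of a tab, set the separator property on the configuration before creating the job (a minimal sketch; the job name is a placeholder):

Configuration conf = new Configuration();
// Split each line at the first ',' instead of the default tab
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
Job job = Job.getInstance(conf, "csv key-value job");
job.setInputFormatClass(KeyValueTextInputFormat.class);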

In previous examples, I didn’t specify any InputFormat for the MapReduce job, so the default TextInputFormat was used. Here I am going to use the KeyValueTextInputFormat class to process the following data.

E12345      HariKrishna      Gurram
E23456      KiranKumar       Darsi
E34567      Sandesh          Nair
E45678      Preethi          Nair
E56789      Sunil            Kumar
E67891      SrinathVenkata   ramani
E78910      Arpan            Debroy
E89101      Phalgun          garimella


“emp.txt” contains tab-separated records of the form (employeeId   firstName   lastName); I want to extract (employeeId, firstName) pairs from the emp.txt file.


Step 1: The following application extracts (employeeId, firstName) from the emp.txt file.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FirstNameExtractor {
 public static class TokenizerMapper extends Mapper<Text, Text, Text, Text> {

  // KeyValueTextInputFormat hands us key = employeeId and
  // value = "firstName<TAB>lastName"; keep only the first name.
  public void map(Text key, Text value, Context context)
    throws IOException, InterruptedException {
   Text firstName = new Text(value.toString().split("\t")[0]);
   context.write(key, firstName);
  }
 }

 public static class NameReducer extends Reducer<Text, Text, Text, Text> {

  // Each employeeId appears only once in the input, so the loop
  // simply picks up the single first name emitted by the mapper.
  public void reduce(Text key, Iterable<Text> values, Context context)
    throws IOException, InterruptedException {
   Text name = null;

   for (Text firstName : values) {
    name = firstName;
   }

   context.write(key, name);
  }
 }

 public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = Job.getInstance(conf, "FirstName Extractor");
  job.setJarByClass(FirstNameExtractor.class);
  job.setMapperClass(TokenizerMapper.class);
  job.setCombinerClass(NameReducer.class);
  job.setReducerClass(NameReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(Text.class);
  // Read each line as a (key, value) pair split at the first tab
  job.setInputFormatClass(KeyValueTextInputFormat.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);
 }
}

Step 2: Compile the above Java file.
$ hadoop com.sun.tools.javac.Main FirstNameExtractor.java
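
If com.sun.tools.javac.Main is not available in your environment, an equivalent is to compile with javac against the Hadoop classpath (assuming the hadoop command is on your PATH):
$ javac -classpath "$(hadoop classpath)" FirstNameExtractor.java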

Step 3: Create a jar file.
$ jar cf extract.jar FirstNameExtractor*.class

Step 4: Run the jar file.
$ hadoop jar extract.jar FirstNameExtractor /user/harikrishna_gurram/emp.txt /user/harikrishna_gurram/names_extractor


The output is located at “/user/harikrishna_gurram/names_extractor”.

$ hadoop fs -ls /user/harikrishna_gurram/names_extractor
Found 2 items
-rw-r--r--   3 harikrishna_gurram supergroup          0 2015-06-24 10:37 /user/harikrishna_gurram/names_extractor/_SUCCESS
-rw-r--r--   3 harikrishna_gurram supergroup        131 2015-06-24 10:37 /user/harikrishna_gurram/names_extractor/part-r-00000
$ 
$ hadoop fs -cat /user/harikrishna_gurram/names_extractor/part-r-00000
E12345 HariKrishna
E23456 KiranKumar
E34567 Sandesh
E45678 Preethi
E56789 Sunil 
E67891 SrinathVenkata
E78910 Arpan
E89101 Phalgun





