Saturday, 2 January 2016

Hadoop MapReduce counters

Counters are a useful feature in Hadoop for tracking job statistics across the map and reduce phases. When you run a job, you can see many counter statistics in the log, like the following.

         File System Counters
                  FILE: Number of bytes read=20181
                  FILE: Number of bytes written=1095667
                  FILE: Number of read operations=0
                  FILE: Number of large read operations=0
                  FILE: Number of write operations=0
                  HDFS: Number of bytes read=1532
                  HDFS: Number of bytes written=167
                  HDFS: Number of read operations=38
                  HDFS: Number of large read operations=0
                  HDFS: Number of write operations=16
         File Input Format Counters
                  Bytes Read=383
         File Output Format Counters
                  Bytes Written=86

The statistics above come from Hadoop's built-in counters. You can also create custom counters for your own needs. Each counter is named by an enum constant and holds a long value.
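To make the "enum name, long value" idea concrete, here is a minimal plain-Java sketch that does not use Hadoop at all. It only models the idea: each task keeps its own tally per enum constant, and the per-task tallies are summed into one job-level total, which is how counter values end up as a single number in the job log. The class `CounterModel`, the enum `DemoCounter`, and both helper methods are hypothetical names made up for this illustration, not part of the Hadoop API.

```java
import java.util.Arrays;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

public class CounterModel {

    // Hypothetical counter names; in a real job this would be your own enum.
    enum DemoCounter { STOPWORDS, RECORDS }

    // Adds delta to the running long value stored under the given name.
    static void increment(Map<DemoCounter, Long> counters, DemoCounter name, long delta) {
        counters.merge(name, delta, Long::sum);
    }

    // Sums the per-task tallies into one total per counter name.
    static Map<DemoCounter, Long> aggregate(List<Map<DemoCounter, Long>> perTask) {
        Map<DemoCounter, Long> total = new EnumMap<>(DemoCounter.class);
        for (Map<DemoCounter, Long> task : perTask) {
            task.forEach((name, value) -> total.merge(name, value, Long::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        // Two simulated tasks, each with its own local tally.
        Map<DemoCounter, Long> task1 = new EnumMap<>(DemoCounter.class);
        Map<DemoCounter, Long> task2 = new EnumMap<>(DemoCounter.class);
        increment(task1, DemoCounter.STOPWORDS, 3);
        increment(task1, DemoCounter.RECORDS, 1);
        increment(task2, DemoCounter.STOPWORDS, 2);
        increment(task2, DemoCounter.RECORDS, 1);

        System.out.println(aggregate(Arrays.asList(task1, task2)));
    }
}
```

In a real job you do not write the aggregation yourself; the framework collects the values from every task, and the driver reads the total from the finished job.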

The following application reads a file and counts the stop words in it. Stop words are the most common words in a language (for example a, an, the, but).


Step 1: The following application counts the number of stop words in a given file.
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class StopWords {

 private static final String[] stopWords = { "a", "about", "above",
   "across", "after", "afterwards", "again", "against", "all",
   "almost", "alone", "along", "already", "also", "although",
   "always", "am", "among", "amongst", "amoungst", "amount", "an",
   "and", "another", "any", "anyhow", "anyone", "anything", "anyway",
   "anywhere", "are", "around", "as", "at", "back", "be", "became",
   "because", "become", "becomes", "becoming", "been", "before",
   "beforehand", "behind", "being", "below", "beside", "besides",
   "between", "beyond", "bill", "both", "bottom", "but", "by", "call",
   "can", "cannot", "cant", "co", "con", "could", "couldnt", "cry",
   "de", "describe", "detail", "do", "done", "down", "due", "during",
   "each", "eg", "eight", "either", "eleven", "else", "elsewhere",
   "empty", "enough", "etc", "even", "ever", "every", "everyone",
   "everything", "everywhere", "except", "few", "fifteen", "fifty",
   "fill", "find", "fire", "first", "five", "for", "former",
   "formerly", "forty", "found", "four", "from", "front", "full",
   "further", "get", "give", "go", "had", "has", "hasnt", "have",
   "he", "hence", "her", "here", "hereafter", "hereby", "herein",
   "hereupon", "hers", "herself", "him", "himself", "his", "how",
   "however", "hundred", "ie", "if", "in", "inc", "indeed",
   "interest", "into", "is", "it", "its", "itself", "keep", "last",
   "latter", "latterly", "least", "less", "ltd", "made", "many",
   "may", "me", "meanwhile", "might", "mill", "mine", "more",
   "moreover", "most", "mostly", "move", "much", "must", "my",
   "myself", "name", "namely", "neither", "never", "nevertheless",
   "next", "nine", "no", "nobody", "none", "noone", "nor", "not",
   "nothing", "now", "nowhere", "of", "off", "often", "on", "once",
   "one", "only", "onto", "or", "other", "others", "otherwise", "our",
   "ours", "ourselves", "out", "over", "own", "part", "per",
   "perhaps", "please", "put", "rather", "re", "same", "see", "seem",
   "seemed", "seeming", "seems", "serious", "several", "she",
   "should", "show", "side", "since", "sincere", "six", "sixty", "so",
   "some", "somehow", "someone", "something", "sometime", "sometimes",
   "somewhere", "still", "such", "system", "take", "ten", "than",
   "that", "the", "their", "them", "themselves", "then", "thence",
   "there", "thereafter", "thereby", "therefore", "therein",
   "thereupon", "these", "they", "thick", "thin", "third", "this",
   "those", "though", "three", "through", "throughout", "thru",
   "thus", "to", "together", "too", "top", "toward", "towards",
   "twelve", "twenty", "two", "un", "under", "until", "up", "upon",
   "us", "very", "via", "was", "we", "well", "were", "what",
   "whatever", "when", "whence", "whenever", "where", "whereafter",
   "whereas", "whereby", "wherein", "whereupon", "wherever",
   "whether", "which", "while", "whither", "who", "whoever", "whole",
   "whom", "whose", "why", "will", "with", "within", "without",
   "would", "yet", "you", "your", "yours", "yourself", "yourselves" };

 static Set<String> stopWordsSet = new HashSet<>();

 static {
  for (String s : stopWords) {
   stopWordsSet.add(s);
  }
 }

 public static enum COUNTERS {
  STOPWORDS;
 }

 public static class StopWordsMapper extends
   Mapper<Object, Text, Text, IntWritable> {

  public boolean isStopWord(String s) {
   return stopWordsSet.contains(s);
  }

  @Override
  public void map(Object key, Text value, Context context)
    throws IOException, InterruptedException {

   String str = value.toString();
   StringTokenizer tokens = new StringTokenizer(str);

   while (tokens.hasMoreTokens()) {
    String word = tokens.nextToken();
    if (isStopWord(word)) {
     context.getCounter(COUNTERS.STOPWORDS).increment(1);
    }
   }

  }
 }

 public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = Job.getInstance(conf, "Stop words count");
  job.setJarByClass(StopWords.class);
  job.setMapperClass(StopWordsMapper.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  boolean success = job.waitForCompletion(true);

  // Read the aggregated custom counter once the job has finished.
  Counters counters = job.getCounters();
  System.out.printf("Number of stop words: %d%n",
    counters.findCounter(COUNTERS.STOPWORDS).getValue());
  System.exit(success ? 0 : 1);
 }
}
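Note that the mapper above looks up each token exactly as StringTokenizer splits it on whitespace, so a token such as "The" or "but..." does not match the lowercase entries in the list. The following is a minimal plain-Java sketch (no Hadoop needed) that compares the exact lookup with a normalized one. The class `StopWordMatch`, its two methods, and the five-word set are hypothetical and only for illustration; stripping every non-letter character is one possible normalization, not the only reasonable one.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.StringTokenizer;

public class StopWordMatch {

    // A tiny illustrative subset of the stop-word list, not the full array.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "but", "and"));

    // Same logic as the mapper: split on whitespace, look up each token as-is.
    static long countExact(String line) {
        long count = 0;
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            if (STOP_WORDS.contains(tokens.nextToken())) {
                count++;
            }
        }
        return count;
    }

    // Variant: lowercase each token and drop non-letters before the lookup.
    static long countNormalized(String line) {
        long count = 0;
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            String word = tokens.nextToken()
                    .toLowerCase(Locale.ROOT)
                    .replaceAll("[^a-z]", "");
            if (STOP_WORDS.contains(word)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String line = "The cat and the dog saw a bird, but...";
        System.out.println("exact=" + countExact(line));           // exact=3
        System.out.println("normalized=" + countNormalized(line)); // normalized=5
    }
}
```

If you want the job's counter to include capitalized or punctuated stop words, the same normalization could be applied to `word` in the mapper before calling `isStopWord`.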

Step 2: Compile the above Java file (this assumes HADOOP_CLASSPATH includes ${JAVA_HOME}/lib/tools.jar).
$ hadoop com.sun.tools.javac.Main StopWords.java

Step 3: Create the jar file.
$ jar cf stopwords.jar StopWords*class

Step 4: Run the jar file.
$ hadoop jar stopwords.jar StopWords /user/harikrishna_gurram/data.txt /user/harikrishna_gurram/stopwords3



