Counters are an interesting feature in Hadoop, used to track the progress of a job across its map and reduce phases. When you run a job, you can observe many counter statistics in the log, like the following.
File System Counters
    FILE: Number of bytes read=20181
    FILE: Number of bytes written=1095667
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=1532
    HDFS: Number of bytes written=167
    HDFS: Number of read operations=38
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=16
File Input Format Counters
    Bytes Read=383
File Output Format Counters
    Bytes Written=86
The statistics above come from the default built-in counters. You can also create custom counters as per your needs. Each counter is named by a Java enum and holds a long value.
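By convention, Hadoop groups enum-based counters by the enum's class name and names each counter after the enum constant. The following is a minimal plain-Java sketch (no Hadoop dependency; the class and method names are illustrative) showing where those two names come from:

```java
public class CounterNaming {

    // A custom counter is just an enum constant; Hadoop keys the counter
    // group on the enum's class name and the counter on the constant's name.
    public enum COUNTERS {
        STOPWORDS
    }

    // The group name that would be reported for this counter
    // (the enum's class name).
    static String groupName() {
        return COUNTERS.STOPWORDS.getDeclaringClass().getName();
    }

    // The counter name that would be reported (the enum constant's name).
    static String counterName() {
        return COUNTERS.STOPWORDS.name();
    }

    public static void main(String[] args) {
        System.out.println(groupName() + " -> " + counterName());
    }
}
```

This is why, in the job output, a custom counter appears under a group whose name matches the enum's fully qualified class name.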
The following application reads a file and counts the number of stop words in it. Stop words are the most frequently occurring words in a language (for example: a, an, the, but, etc.).
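The core check the mapper will perform can be sketched without any Hadoop dependency. A minimal sketch (illustrative class name; a tiny sample stop-word set standing in for the full list, and tokens lowercased here for simplicity, which the mapper below does not do):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

public class StopWordCheck {

    // A tiny sample stop-word set; the real application uses a full list.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "but"));

    // Tokenizes a line and counts how many tokens are stop words --
    // the same membership test the mapper runs before incrementing its counter.
    static long countStopWords(String line) {
        long count = 0;
        StringTokenizer tokens = new StringTokenizer(line.toLowerCase());
        while (tokens.hasMoreTokens()) {
            if (STOP_WORDS.contains(tokens.nextToken())) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // "The", "a", "but", "an" are stop words here.
        System.out.println(countStopWords("The cat sat on a mat but an owl watched"));
    }
}
```

Using a HashSet makes each membership test O(1), which matters when the check runs once per token across a large input.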
Step 1: Write the application that counts the number of stop words in a given file.
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class StopWords {

    // Common English stop words.
    private static final String[] stopWords = {
        "a", "about", "above", "above", "across", "after", "afterwards", "again",
        "against", "all", "almost", "alone", "along", "already", "also", "although",
        "always", "am", "among", "amongst", "amoungst", "amount", "an", "and",
        "another", "any", "anyhow", "anyone", "anything", "anyway", "anywhere",
        "are", "around", "as", "at", "back", "be", "became", "because", "become",
        "becomes", "becoming", "been", "before", "beforehand", "behind", "being",
        "below", "beside", "besides", "between", "beyond", "bill", "both", "bottom",
        "but", "by", "call", "can", "cannot", "cant", "co", "con", "could",
        "couldnt", "cry", "de", "describe", "detail", "do", "done", "down", "due",
        "during", "each", "eg", "eight", "either", "eleven", "else", "elsewhere",
        "empty", "enough", "etc", "even", "ever", "every", "everyone", "everything",
        "everywhere", "except", "few", "fifteen", "fify", "fill", "find", "fire",
        "first", "five", "for", "former", "formerly", "forty", "found", "four",
        "from", "front", "full", "further", "get", "give", "go", "had", "has",
        "hasnt", "have", "he", "hence", "her", "here", "hereafter", "hereby",
        "herein", "hereupon", "hers", "herself", "him", "himself", "his", "how",
        "however", "hundred", "ie", "if", "in", "inc", "indeed", "interest",
        "into", "is", "it", "its", "itself", "keep", "last", "latter", "latterly",
        "least", "less", "ltd", "made", "many", "may", "me", "meanwhile", "might",
        "mill", "mine", "more", "moreover", "most", "mostly", "move", "much",
        "must", "my", "myself", "name", "namely", "neither", "never",
        "nevertheless", "next", "nine", "no", "nobody", "none", "noone", "nor",
        "not", "nothing", "now", "nowhere", "of", "off", "often", "on", "once",
        "one", "only", "onto", "or", "other", "others", "otherwise", "our",
        "ours", "ourselves", "out", "over", "own", "part", "per", "perhaps",
        "please", "put", "rather", "re", "same", "see", "seem", "seemed",
        "seeming", "seems", "serious", "several", "she", "should", "show", "side",
        "since", "sincere", "six", "sixty", "so", "some", "somehow", "someone",
        "something", "sometime", "sometimes", "somewhere", "still", "such",
        "system", "take", "ten", "than", "that", "the", "their", "them",
        "themselves", "then", "thence", "there", "thereafter", "thereby",
        "therefore", "therein", "thereupon", "these", "they", "thickv", "thin",
        "third", "this", "those", "though", "three", "through", "throughout",
        "thru", "thus", "to", "together", "too", "top", "toward", "towards",
        "twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us",
        "very", "via", "was", "we", "well", "were", "what", "whatever", "when",
        "whence", "whenever", "where", "whereafter", "whereas", "whereby",
        "wherein", "whereupon", "wherever", "whether", "which", "while",
        "whither", "who", "whoever", "whole", "whom", "whose", "why", "will",
        "with", "within", "without", "would", "yet", "you", "your", "yours",
        "yourself", "yourselves" };

    // HashSet gives O(1) membership tests in the mapper.
    static Set<String> stopWordsSet = new HashSet<>();
    static {
        for (String s : stopWords) {
            stopWordsSet.add(s);
        }
    }

    // Custom counter: Hadoop uses the enum class name as the counter group
    // and the constant name as the counter name.
    public static enum COUNTERS {
        STOPWORDS;
    }

    public static class StopWordsMapper extends Mapper<Object, Text, Text, IntWritable> {

        public boolean isStopWord(String s) {
            return stopWordsSet.contains(s);
        }

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String str = value.toString();
            StringTokenizer tokens = new StringTokenizer(str);
            while (tokens.hasMoreTokens()) {
                String word = tokens.nextToken();
                if (isStopWord(word)) {
                    // Increment the custom counter for every stop word seen.
                    context.getCounter(COUNTERS.STOPWORDS).increment(1);
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Stop word count");
        job.setJarByClass(StopWords.class);
        job.setMapperClass(StopWordsMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);

        // After the job finishes, read the custom counter back from the driver.
        Counters counters = job.getCounters();
        System.out.printf("Number of stop words: %d%n",
                counters.findCounter(COUNTERS.STOPWORDS).getValue());
    }
}
Step 2: Compile the above Java file.
$ hadoop com.sun.tools.javac.Main StopWords.java
Step 3: Create the jar file.
$ jar cf stopwords.jar StopWords*.class
Step 4: Run the jar file.
$ hadoop jar stopwords.jar StopWords /user/harikrishna_gurram/data.txt /user/harikrishna_gurram/stopwords3