Counters are an interesting feature in Hadoop, used to track the progress of a job across its map and reduce phases. When you run a job, you can observe many counter statistics in the log, like the following.
File System Counters
    FILE: Number of bytes read=20181
    FILE: Number of bytes written=1095667
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=1532
    HDFS: Number of bytes written=167
    HDFS: Number of read operations=38
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=16
File Input Format Counters
    Bytes Read=383
File Output Format Counters
    Bytes Written=86
The statistics above come from the default built-in counters. You can also create custom counters as per your needs. Each counter is named by a Java enum and holds a long value.
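By convention, Hadoop groups enum-based counters by the enum's class name and names each counter after the enum constant. The following is a minimal plain-Java sketch (no Hadoop dependency; the class and method names are illustrative) showing where those two names come from:

```java
public class CounterNaming {

    // A custom counter is just an enum constant; Hadoop keys the counter
    // group on the enum's class name and the counter on the constant's name.
    public enum COUNTERS {
        STOPWORDS
    }

    // The group name that would be reported for this counter
    // (the enum's class name).
    static String groupName() {
        return COUNTERS.STOPWORDS.getDeclaringClass().getName();
    }

    // The counter name that would be reported (the enum constant's name).
    static String counterName() {
        return COUNTERS.STOPWORDS.name();
    }

    public static void main(String[] args) {
        System.out.println(groupName() + " -> " + counterName());
    }
}
```

This is why, in the job output, a custom counter appears under a group whose name matches the enum's fully qualified class name.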
The following application reads a file and counts the number of stop words in it. Stop words are the most frequently occurring words in a language (for example: a, an, the, but, etc.).
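The core check the mapper will perform can be sketched without any Hadoop dependency. A minimal sketch (illustrative class name; a tiny sample stop-word set standing in for the full list, and tokens lowercased here for simplicity, which the mapper below does not do):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

public class StopWordCheck {

    // A tiny sample stop-word set; the real application uses a full list.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "but"));

    // Tokenizes a line and counts how many tokens are stop words --
    // the same membership test the mapper runs before incrementing its counter.
    static long countStopWords(String line) {
        long count = 0;
        StringTokenizer tokens = new StringTokenizer(line.toLowerCase());
        while (tokens.hasMoreTokens()) {
            if (STOP_WORDS.contains(tokens.nextToken())) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // "The", "a", "but", "an" are stop words here.
        System.out.println(countStopWords("The cat sat on a mat but an owl watched"));
    }
}
```

Using a HashSet makes each membership test O(1), which matters when the check runs once per token across a large input.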
Step 1: Write the application that counts the number of stop words in a given file.
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class StopWords {

    // Common English stop words.
    private static final String[] stopWords = {
        "a", "about", "above", "above", "across", "after", "afterwards", "again",
        "against", "all", "almost", "alone", "along", "already", "also", "although",
        "always", "am", "among", "amongst", "amoungst", "amount", "an", "and",
        "another", "any", "anyhow", "anyone", "anything", "anyway", "anywhere",
        "are", "around", "as", "at", "back", "be", "became", "because", "become",
        "becomes", "becoming", "been", "before", "beforehand", "behind", "being",
        "below", "beside", "besides", "between", "beyond", "bill", "both", "bottom",
        "but", "by", "call", "can", "cannot", "cant", "co", "con", "could",
        "couldnt", "cry", "de", "describe", "detail", "do", "done", "down", "due",
        "during", "each", "eg", "eight", "either", "eleven", "else", "elsewhere",
        "empty", "enough", "etc", "even", "ever", "every", "everyone", "everything",
        "everywhere", "except", "few", "fifteen", "fify", "fill", "find", "fire",
        "first", "five", "for", "former", "formerly", "forty", "found", "four",
        "from", "front", "full", "further", "get", "give", "go", "had", "has",
        "hasnt", "have", "he", "hence", "her", "here", "hereafter", "hereby",
        "herein", "hereupon", "hers", "herself", "him", "himself", "his", "how",
        "however", "hundred", "ie", "if", "in", "inc", "indeed", "interest",
        "into", "is", "it", "its", "itself", "keep", "last", "latter", "latterly",
        "least", "less", "ltd", "made", "many", "may", "me", "meanwhile", "might",
        "mill", "mine", "more", "moreover", "most", "mostly", "move", "much",
        "must", "my", "myself", "name", "namely", "neither", "never",
        "nevertheless", "next", "nine", "no", "nobody", "none", "noone", "nor",
        "not", "nothing", "now", "nowhere", "of", "off", "often", "on", "once",
        "one", "only", "onto", "or", "other", "others", "otherwise", "our",
        "ours", "ourselves", "out", "over", "own", "part", "per", "perhaps",
        "please", "put", "rather", "re", "same", "see", "seem", "seemed",
        "seeming", "seems", "serious", "several", "she", "should", "show", "side",
        "since", "sincere", "six", "sixty", "so", "some", "somehow", "someone",
        "something", "sometime", "sometimes", "somewhere", "still", "such",
        "system", "take", "ten", "than", "that", "the", "their", "them",
        "themselves", "then", "thence", "there", "thereafter", "thereby",
        "therefore", "therein", "thereupon", "these", "they", "thickv", "thin",
        "third", "this", "those", "though", "three", "through", "throughout",
        "thru", "thus", "to", "together", "too", "top", "toward", "towards",
        "twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us",
        "very", "via", "was", "we", "well", "were", "what", "whatever", "when",
        "whence", "whenever", "where", "whereafter", "whereas", "whereby",
        "wherein", "whereupon", "wherever", "whether", "which", "while",
        "whither", "who", "whoever", "whole", "whom", "whose", "why", "will",
        "with", "within", "without", "would", "yet", "you", "your", "yours",
        "yourself", "yourselves" };

    // HashSet gives O(1) membership tests in the mapper.
    static Set<String> stopWordsSet = new HashSet<>();
    static {
        for (String s : stopWords) {
            stopWordsSet.add(s);
        }
    }

    // Custom counter: Hadoop uses the enum class name as the counter group
    // and the constant name as the counter name.
    public static enum COUNTERS {
        STOPWORDS;
    }

    public static class StopWordsMapper extends Mapper<Object, Text, Text, IntWritable> {

        public boolean isStopWord(String s) {
            return stopWordsSet.contains(s);
        }

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String str = value.toString();
            StringTokenizer tokens = new StringTokenizer(str);
            while (tokens.hasMoreTokens()) {
                String word = tokens.nextToken();
                if (isStopWord(word)) {
                    // Increment the custom counter for every stop word seen.
                    context.getCounter(COUNTERS.STOPWORDS).increment(1);
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Stop word count");
        job.setJarByClass(StopWords.class);
        job.setMapperClass(StopWordsMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);

        // After the job finishes, read the custom counter back from the driver.
        Counters counters = job.getCounters();
        System.out.printf("Number of stop words: %d%n",
                counters.findCounter(COUNTERS.STOPWORDS).getValue());
    }
}
Step 2: Compile the above Java file.
$ hadoop com.sun.tools.javac.Main StopWords.java
Step 3: Create the jar file.
$ jar cf stopwords.jar StopWords*.class
Step 4: Run the jar file.
$ hadoop jar stopwords.jar StopWords /user/harikrishna_gurram/data.txt /user/harikrishna_gurram/stopwords3