In this post, I am going to explain how Hadoop MapReduce works.
What is the problem with traditional computing?
In traditional computing, you need to have all the data in the system upfront to perform any processing on it. But consider a scenario where you have petabytes of data: is it possible to store and process this information on one single system? The answer is no. This is where distributed computing comes into the picture.
How does MapReduce address this problem?
In Hadoop, data is divided into multiple blocks and persisted on many data nodes (computers). Hadoop does most of the processing on the data nodes and, at a later point, merges the results from those data nodes.
What is MapReduce?
MapReduce is a computing technique to process the data in a distributed manner. There are two key phases in MapReduce.
a. Map phase
b. Reducer phase
One key point to note here is that both the map and reduce phases can only understand <key, value> pairs.
Let us try to understand this better with a practical example.
Example: Let’s take a word count program that counts how many times each word is repeated in the file ‘input.txt’.
input.txt
Hello How are you are you going to canada Hello are you there ya i am doing work Hi ptr How are you Hi buddy good day good day too very very good morning good night
Assume the file ‘input.txt’ is 500MB in size and is persisted in HDFS. With a block size of 128MB, this file is divided into 4 blocks and persisted on different data nodes.
Block B1
Hello How are you are you going to canada
Block B2
Hello are you there ya i am doing work
Block B3
Hi ptr How are you Hi buddy good day good day too
Block B4
very very good morning good night
When a client writes MapReduce code and submits it to Hadoop, the code is shipped to the data nodes where the file's blocks are persisted, and the map phase gets executed there.
Map Phase in detail
As I said previously, the map and reduce phases can only understand data that is in <key, value> pairs. But the data here is in text format. This is where the RecordReader comes into the picture to help the mapper program: the RecordReader reads the data and creates <key, value> pairs.
For example, LineRecordReader (the default record reader in Hadoop) reads the input data line by line and converts it into <key, value> pairs, where the key is the byte offset and the value is the actual line data.
LineRecordReader output for the Block B1 data looks like below.
<12324, Hello How are you>
<961245, are you going to canada>
12324 and 961245 are the byte offsets (addresses) of the content.
Once the data is converted to <key, value> format, the mapper can start processing it. The mapper transforms the input into an intermediate form, and the output of the mapper goes to the reducer. A short mapper code sketch is given after the example below.
The mapper program splits the Block B1 data like below.
<Hello, 1> <How, 1> <are, 1> <you, 1> <are, 1> <you, 1> <going, 1> <to, 1> <canada, 1>
In the above snippet,
a. the key is the actual word (string)
b. the value is the count, which is 1 for each occurrence the mapper emits.
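To make this concrete, here is a minimal mapper sketch using the org.apache.hadoop.mapreduce API. It follows the standard Hadoop word count pattern; the class and field names (WordCountMapper, ONE, word) are my own illustrative choices and are not part of the original example. The key passed in by the framework is the byte offset produced by LineRecordReader, and the value is the line text.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key   : byte offset of the line (from LineRecordReader)
        // value : the actual line data
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit <word, 1> for every word seen
        }
    }
}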
In the mapper output shown earlier, words like ‘are’ and ‘you’ are repeated twice. We can combine these duplicate keys using a combiner step. The combiner step is optional, and it gets executed after the map phase on the same data node. Most of the time it does a similar job to what a reducer does, so it is also called a local reducer.
After combining these duplicated entries, the intermediate results look like below.
<Hello, 1> <How, 1> <are, 2> <you, 2> <going, 1> <to, 1> <Canada, 1>
The Mapper outputs are sorted and then partitioned per Reducer. The total number of partitions is the same as the number of reduce tasks for the job. Users can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner.
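The default partitioner is HashPartitioner, which chooses a partition from the key's hash code. As a rough sketch of a custom partitioner (the class name and the route-by-first-letter rule are illustrative assumptions, not part of this example), you could write something like the following and register it with job.setPartitionerClass(...) in the driver.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: words beginning with a-m go to the first
// reducer, everything else to the second (when two reducers are used).
public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks < 2) {
            return 0;   // only one reducer: everything goes to it
        }
        // words emitted by the mapper are assumed to be non-empty
        char first = Character.toLowerCase(key.toString().charAt(0));
        return (first >= 'a' && first <= 'm') ? 0 : 1;
    }
}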
Map phase output for different data blocks is given below.
Block1
(Hello, 1) (How, 1) (are, 2) (canada, 1) (going, 1) (to, 1) (you, 2)
Block2
(Hello, 1) (am, 1) (are, 1) (doing, 1) (i, 1) (there, 1) (work, 1) (ya, 1) (you, 1)
Block3
(Hi, 2) (How, 1) (are, 1) (buddy, 1) (day, 2) (good, 2) (ptr, 1) (too, 1) (you, 1)
Block4
(good, 2) (morning, 1) (night, 1) (very, 2)
Reducer Phase
The reducer phase works on the <key, value> pairs produced by the map phase.
Reducers can run on one or more machines; these machines can be either the data nodes that hold this file's blocks or different data nodes. Values with the same key always land at the same reducer.
The reducer takes the output of the map phase and gives you the final result.
Reducer has three primary phases.
a. Shuffle
b. Sorting
c. Reduce
a. Shuffle
Shuffling is the process of transferring the map phase output from the mapper data nodes to the reducer data nodes.
b. Sorting
The sorting phase groups the reducer inputs by key (since different mappers may have output the same key). The output of this stage looks like below.
(Hello, [1, 1]) (Hi, [2]) (How, [1, 1]) (am, [1]) (are, [2, 1, 1]) (buddy, [1]) (canada, [1]) (day, [2]) (doing, [1]) (going, [1]) (good, [2, 2]) (i, [1]) (morning, [1]) (night, [1]) (ptr, [1]) (there, [1]) (to, [1]) (too, [1]) (very, [2]) (work, [1]) (ya, [1]) (you, [2, 1, 1])
c. Reduce
The reduce method is called for each <key, (collection of values)> in the sorted input and produces the final result. A reducer code sketch is given below.
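Here is a minimal reducer sketch for this word count, again using the org.apache.hadoop.mapreduce API. The class name WordCountReducer is an illustrative choice of mine; the logic simply sums the values grouped under each key, e.g. (are, [2, 1, 1]) becomes (are, 4).

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // 'values' is the collection of counts grouped under this key
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        total.set(sum);
        context.write(key, total);   // emit <word, total count>
    }
}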
After the reduce step, the final output looks like below.
(Hello, 2) (Hi, 2) (How, 2) (am, 1) (are, 4) (buddy, 1) (canada, 1) (day, 2) (doing, 1) (going, 1) (good, 4) (i, 1) (morning, 1) (night, 1) (ptr, 1) (there, 1) (to, 1) (too, 1) (very, 2) (work, 1) (ya, 1) (you, 4)
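For completeness, here is a minimal driver sketch that wires together the hypothetical WordCountMapper and WordCountReducer classes shown above, reuses the reducer as the optional combiner, and submits the job to Hadoop. It follows the standard WordCount driver pattern; the class name and command-line argument layout are my own assumptions, not code from this post.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);   // optional local reducer
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. input.txt in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}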