Hadoop MapReduce Parallel Data Flow Model: Big Data Analytics
This article discusses the Hadoop MapReduce Parallel Data Flow Model.
The MapReduce data flow model is a powerful computational method for big data applications, and the underlying idea is simple.
There are two stages in the Hadoop MapReduce model: the first is the mapping stage, and the second is the reducing stage.
In the mapping stage, a mapping procedure is applied to the given input data. For example, assume that one wants to count how many times each word appears in a novel. One solution is to divide the novel into 20 sections and assign the task of counting the frequency of each word to 20 people. This step is known as the mapping stage.
Suppose the content of a text file consists of two lines:
Line 1: Hello World Hadoop World
Line 2: Hello Hadoop Goodbye Hadoop
In this case, each line of the text file is assigned to one person, who counts the frequency of each word in that line. Hence, the output of the mapping stage is:
<Hello, 1>
<World, 1>
<Hadoop, 1>
<World, 1>
<Hello, 1>
<Hadoop, 1>
<Goodbye, 1>
<Hadoop, 1>
The reducing stage starts when everyone has finished counting their assigned lines. The reducer computes the total for each word as each person reports their counts. The output of the reducing stage is:
<Goodbye, 1>
<Hadoop, 3>
<Hello, 2>
<World, 2>
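The two-stage flow just described can be sketched as a small Python simulation. This is illustrative only; it mimics the model, not Hadoop's actual API:

```python
from collections import defaultdict

lines = ["Hello World Hadoop World", "Hello Hadoop Goodbye Hadoop"]

# Mapping stage: each line is processed independently, emitting a
# <word, 1> pair for every word it contains.
mapped = [(word, 1) for line in lines for word in line.split()]

# Reducing stage: sum up the counts reported for each word.
counts = defaultdict(int)
for word, n in mapped:
    counts[word] += n

print(dict(counts))  # {'Hello': 2, 'World': 2, 'Hadoop': 3, 'Goodbye': 1}
```

In real Hadoop, the mapping loop and the reducing loop run as separate processes on different machines; the simulation only shows the data flow between them.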
Following are the steps in the Hadoop MapReduce Parallel Data Flow Model:
1. Input Splits
The Hadoop Distributed File System (HDFS) divides the data into multiple blocks. These data blocks are distributed and replicated over multiple storage nodes called DataNodes. The default size of a data block is 64MB. Thus, a file of 150MB would be divided into three data blocks, which are written to different machines in the Hadoop cluster. They are also replicated within the cluster according to the replication factor.
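The block arithmetic above can be checked directly, using the 64MB default block size stated in the text:

```python
import math

block_size_mb = 64   # default HDFS block size, as cited above
file_size_mb = 150

# A file is split into ceil(size / block_size) blocks: here two full
# 64MB blocks plus one final 22MB block.
num_blocks = math.ceil(file_size_mb / block_size_mb)
print(num_blocks)  # 3
```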
2. Map Step
The mapping stage is where the parallel nature of Hadoop comes into the picture. When a large amount of data must be processed, many mappers can operate at the same time in parallel. The user provides the specific task to the mapper. The MapReduce model executes each mapper process on a node where the data resides. Since a data block is replicated on multiple DataNodes, the least busy of those nodes is selected for the mapper process. If all DataNodes holding the data block are too busy, the MapReduce model tries to select a node on the same rack as one of them, so the data travels only within the rack (a characteristic called rack awareness). As a last resort, any available node in the cluster is chosen and the data block is transferred to it over the network.
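The placement preference described above (a data-local node first, then a rack-local node, then any node) can be sketched as a simple selection function. The node and rack names below are hypothetical, and the sketch ignores many details of Hadoop's real scheduler:

```python
def pick_node(replica_nodes, all_nodes, busy, rack_of):
    """Sketch of mapper placement: data-local, then rack-local, then any."""
    idle = [n for n in all_nodes if n not in busy]
    # 1. Data-local: an idle node that holds a replica of the block.
    for n in idle:
        if n in replica_nodes:
            return n
    # 2. Rack-aware: an idle node on the same rack as some replica.
    replica_racks = {rack_of[n] for n in replica_nodes}
    for n in idle:
        if rack_of[n] in replica_racks:
            return n
    # 3. Last resort: fall back to a node holding the block.
    return replica_nodes[0]

# Hypothetical four-node cluster with two racks; the block is
# replicated on dn1 (rack r1) and dn3 (rack r2).
rack_of = {"dn1": "r1", "dn2": "r1", "dn3": "r2", "dn4": "r2"}
print(pick_node(["dn1", "dn3"], list(rack_of), {"dn1", "dn3"}, rack_of))
# dn2: both replica nodes are busy, so a rack-local node is chosen
```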
3. Combiner Step
It is possible to perform an optimization, or pre-reduction, as part of the mapping stage, in which key-value pairs emitted by a mapper are combined before being passed to the next stage. The combiner stage is optional.
Suppose the text at a mapper is: Hello World Hadoop World
Output of the mapper:
<Hello, 1>
<World, 1>
<Hadoop, 1>
<World, 1>
Output of the combiner:
<Hello, 1>
<World, 2>
<Hadoop, 1>
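In plain Python, the combiner is just a local reduce applied to a single mapper's output. This is a simulation of the idea, not Hadoop's Combiner interface:

```python
from collections import defaultdict

# Output of one mapper for the line "Hello World Hadoop World".
mapper_output = [("Hello", 1), ("World", 1), ("Hadoop", 1), ("World", 1)]

# The combiner pre-sums the pairs locally, so fewer pairs have to
# cross the network to the reducers.
combined = defaultdict(int)
for word, n in mapper_output:
    combined[word] += n

print(dict(combined))  # {'Hello': 1, 'World': 2, 'Hadoop': 1}
```

Because the combiner sees only one mapper's output, it can shrink the data but never replaces the reducer, which still merges results across all mappers.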
4. Shuffle Step
Each reducer process must receive all key-value pairs that share the same key. Hence, the results of the mapping stage are grouped by key, shuffled, and sent to the corresponding reducing process. If only a single reducer process is used, the shuffle stage is not needed.
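Shuffling can be simulated by partitioning the mapped pairs on a hash of the key, so that all pairs with the same key reach the same reducer (hash partitioning is the usual default behavior; the sketch below is illustrative only):

```python
import zlib

def partition(word, num_reducers):
    # Deterministic hash partitioner; Python's built-in hash() is
    # randomized per process, so crc32 is used here for repeatability.
    return zlib.crc32(word.encode()) % num_reducers

# Combined output of all mappers (after the optional combiner step).
mapped = [("Hello", 1), ("World", 2), ("Hadoop", 1),
          ("Hello", 1), ("Hadoop", 2), ("Goodbye", 1)]

num_reducers = 2
partitions = [[] for _ in range(num_reducers)]
for word, n in mapped:
    # Every pair with the same key lands in the same partition and
    # therefore reaches the same reducer process.
    partitions[partition(word, num_reducers)].append((word, n))
```

The key property is not which partition a word lands in, but that identical keys always land together; that is what lets each reducer compute a complete total for its keys.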
5. Reduce Step
The final step is the actual reduction stage. In this stage, data reduction is performed according to the programmer's design. The reducing stage is also optional. The results of the reducer stage are written to HDFS, with each reducer process writing one output file. For instance, a MapReduce job with two reducers creates two output files named part-00000 and part-00001.
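Continuing the simulation, each reducer sums the pairs in its partition and writes one output file into the job's output directory. A local temporary directory stands in for HDFS here, and the zero-padded part-file names follow Hadoop's convention (the newer API prefixes them as part-r-00000):

```python
import os
import tempfile
from collections import defaultdict

# Two reducer partitions, e.g. as produced by the shuffle step.
partitions = [
    [("Hadoop", 1), ("Hadoop", 2), ("World", 2)],   # sent to reducer 0
    [("Hello", 1), ("Hello", 1), ("Goodbye", 1)],   # sent to reducer 1
]

outdir = tempfile.mkdtemp()  # stands in for the job's HDFS output directory
for i, part in enumerate(partitions):
    totals = defaultdict(int)
    for word, n in part:
        totals[word] += n
    # Each reducer writes exactly one part file into the output directory.
    with open(os.path.join(outdir, f"part-{i:05d}"), "w") as f:
        for word, total in sorted(totals.items()):
            f.write(f"{word}\t{total}\n")

print(sorted(os.listdir(outdir)))  # ['part-00000', 'part-00001']
```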
Sample Example: Hadoop MapReduce Parallel Data Flow Model
Combiner Stage in MapReduce Model