In the last two blogs, we discussed the Big Data technology stack and HDFS. If you want to revisit those topics, please follow these links:
Big Data Technology Stack
HDFS
In this blog, we will discuss MapReduce.
MapReduce is the most difficult concept in big data. Why? Because it involves low-level programming. In this blog, we will discuss MapReduce at a high level; there is very little coding here, just a few small sketches to make the ideas concrete.
MapReduce
Architecture
The main elements in the MapReduce architecture are:
1) Job Client: submits jobs
2) Job Tracker: coordinates jobs
3) Task Tracker: executes tasks
The Job Tracker is the master to the Task Trackers in this architecture, similar to the Name Node being the master to the Data Nodes in HDFS.
So the rules of the game are:
1) The Job Client submits jobs to the Job Tracker and also copies job resources, such as the job binaries, to HDFS (see the driver sketch after this list).
2) The Job Tracker talks to the Name Node to find out where the input data lives.
3) The Job Tracker creates an execution plan.
4) The Job Tracker submits work to the Task Trackers. The Task Trackers do the heavy lifting: they execute the map and reduce functions.
5) The Task Trackers report progress via heartbeats: while executing the map and reduce functions, they send progress and status updates to the Job Tracker.
6) The Job Tracker manages the phases.
7) The Job Tracker updates the job status.
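To make step 1 concrete, here is a minimal job-client sketch using Hadoop's Job API. The class names (WordCountDriver, WordCountMapper, WordCountReducer) and the input/output paths are hypothetical; the mapper and reducer themselves are sketched later in this post.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// The "job client" side: package the job and submit it to the cluster.
// WordCountMapper and WordCountReducer are the hypothetical classes sketched below.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);               // the binary that gets shipped to HDFS
        job.setMapperClass(WordCountMapper.class);               // user-defined map function
        job.setReducerClass(WordCountReducer.class);              // user-defined reduce function
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory in HDFS
        // Submits the job and then polls the cluster for progress and status updates.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}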
So this covers the high-level architecture of MapReduce.
Now let's zoom in on the MapReduce internals.
The first phase is the Split phase. The Split phase uses the input format to bring data off the disk, out of HDFS, and split it up so that it can be sent to the mappers. The default input format is the text input format, which breaks the data up line by line, so each line is sent to a mapper. If you have a large file with many lines, you could have thousands of mappers running simultaneously.
Let's look at input formats. There are a variety of input formats: binary input formats, database input formats, record-based input formats, and so on.
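As a rough sketch, picking an input format is just another setting on the hypothetical word-count job from the driver above:

// Text input is the default; the alternatives are shown commented out.
job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.TextInputFormat.class);
// job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.class); // binary key/value files
// job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.db.DBInputFormat.class);              // rows from a database table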
Mappers transform the input splits into key/value pairs based on user-defined code. The output then goes to an intermediate phase called Shuffle and Sort. Shuffle and Sort moves the map outputs to the reducers and sorts them by key.
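Here is a minimal mapper sketch for that hypothetical word-count job: it turns every line of its input split into (word, 1) pairs, which then flow into Shuffle and Sort.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// User-defined map function: the input key is the byte offset of the line, the input value is the line itself.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);  // emit (word, 1); Shuffle and Sort will group these by key
        }
    }
}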
Reducers aggregate the key/value pairs based on user-defined code and then hand the results to the output format. The output format determines how the results are written to the output directory; it puts the data back into HDFS.
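And the matching reducer sketch: for each word it receives all the 1s emitted by the mappers, sums them, and emits one (word, count) pair that the output format writes into the output directory in HDFS.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// User-defined reduce function: aggregate all the values that arrive for one key.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        total.set(sum);
        context.write(key, total);  // (word, total) is handed to the output format
    }
}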
So let's send some data through.
Here we have an input, say a file with a bunch of lines. By default the split phase will break it up into lines, each line will be sent to a mapper, the mapper will emit key/value pairs, and so on.
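Tracing a tiny, made-up word-count input through the phases, the flow looks roughly like this:

Input file:            "the cat sat" / "the dog sat"
Split (line by line):  line 1 goes to mapper 1, line 2 goes to mapper 2
Map output:            (the,1) (cat,1) (sat,1)  and  (the,1) (dog,1) (sat,1)
Shuffle and Sort:      (cat,[1]) (dog,[1]) (sat,[1,1]) (the,[1,1])
Reduce output:         (cat,1) (dog,1) (sat,2) (the,2)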
What we have seen above is the functional programming paradigm: a programming paradigm that treats computation as the evaluation of mathematical functions.
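The same map-then-reduce shape shows up in ordinary functional-style code. Here is a tiny Java streams example (nothing Hadoop-specific, purely illustrative) that counts words in the same spirit:

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class FunctionalWordCount {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("the", "cat", "sat", "the", "dog", "sat");
        // "map" each word to itself, then "reduce" by counting the occurrences per key
        Map<String, Long> counts = words.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        System.out.println(counts);  // e.g. {cat=1, dog=1, sat=2, the=2} (order may vary)
    }
}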
So this is how MapReduce works.