MapReduce is a programming model and an associated implementation for processing and generating large data sets. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication.
The project dependencies can be installed using:
```bash
make build
```

A specific node can be started using:

```bash
make master   # master node
make worker   # worker node
make backup   # backup node
```

If you do not have `make`, you can install it with:

```bash
sudo apt-get install make
```

A MapReduce system has two important functions:
- **Map** -> maps and filters a set of data represented by key-value pairs
- **Reduce** -> processes the set of values associated with a single output key
An example of the `map` and `reduce` functions used in a word count problem:
```python
def map(doc_line: int, doc_line_text: str):
    # Emit a (word, 1) pair for every word in the line.
    res = []
    for word in doc_line_text.split(' '):
        res.append((word, 1))
    return res

def reduce(word: str, vals: list):
    # Sum all the counts emitted for this word.
    count = 0
    for v in vals:
        count += int(v)
    return count
```

Other functions (`parse` and `groupby`) can also be defined for the map operation, to parse the input data and to do local grouping on the workers (see the sketch below).
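The README does not show the signatures of `parse` and `groupby` at this point, so the following is only an illustrative sketch; the parameter names and return shapes are assumptions, not the framework's actual API:

```python
# Illustrative only -- the real signatures expected by the framework may differ.
def parse(doc_name: str, doc_content: str):
    # Turn a raw document into (line_number, line_text) pairs for map().
    return list(enumerate(doc_content.splitlines()))

def groupby(pairs: list):
    # Group intermediate (key, value) pairs locally on the worker, so each
    # key reaches the reducer as a single (key, [values]) entry.
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return list(grouped.items())
```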
The MapReduce system is composed of two modules: `mapreduce` and `framework`. The `mapreduce` module provides the `MapReduce` configuration class for a MapReduce job; the `Mapper` and `Reducer` classes are base classes, so you must inherit from them and override the `map` and `reduce` functions in each subclass. Optionally, you can define new `parse` and `groupby` functions with the same signatures. The `mapreduce` function in the `mapreduce` module is used by the client to submit a new job to the JobsTracker. The next example shows a correct use of the framework.
Example (program.py):
```python
from mapreduce.config import MapReduce, Mapper, Reducer
from mapreduce import mapreduce
import sys

class WC_Mapper(Mapper):
    def map(self, key, value):
        # Emit a (word, 1) pair for every word in the line.
        res = []
        for word in value.split(' '):
            res.append((word, 1))
        return res

class WC_Reducer(Reducer):
    def reduce(self, key, value):
        # Sum the counts emitted for this word.
        res = 0
        for elem in value:
            res += int(elem)
        return res

if __name__ == "__main__":
    out = sys.argv[1]   # output folder on the master
    wc_m = WC_Mapper()  # Mapper
    wc_r = WC_Reducer() # Reducer
    # configuration class
    config = MapReduce('./input', wc_m, wc_r, out)
    # submit the job using the mapreduce function
    mapreduce(config)
```
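With a master and at least one worker already running (see the `make` targets above), the job would then be submitted from the client with something like `python program.py ./output`, where `./output` is only an example name for the output folder on the master.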
The `framework` module contains the MasterNode, BackupNode, and WorkerNode classes, which make up the cluster structure. Only one MasterNode can be created on the network; the BackupNode nodes are used to elect a new master when the master node cannot be found, and a WorkerNode computes the tasks and sends the resulting values to the master.
A MasterNode contains a JobsTracker, which is used to organize the tasks and distribute them to the workers. The scripts `master.py`, `slave.py`, and `b_master.py` provide a simple way to initialize and call these nodes; starting a new node must be done through these scripts.
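For orientation only, a minimal sketch of what a startup script such as `master.py` might contain is shown below; `MasterNode` comes from the `framework` module as described above, but the constructor arguments and the `start()` method are assumptions, so the repository's actual scripts remain the authoritative way to start a node:

```python
# Hypothetical sketch of a startup script like master.py.
# The MasterNode class is real (framework module), but its constructor
# arguments and the start() method are assumptions for illustration.
from framework import MasterNode

if __name__ == "__main__":
    master = MasterNode()  # the master owns the JobsTracker
    master.start()         # listen for workers, backups, and job submissions
```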