A distributed filesystem (GFS) manages the data stored on the disks of all machines in the
MapReduce cluster. The filesystem uses replication to mitigate the unreliability of commodity
hardware.
Map invocations are distributed across multiple machines by partitioning the input data into a
set of splits
Splits can be processed in parallel on different machines
Reduce invocations are distributed by partitioning intermediate key space into R pieces:
R is supplied by the user
Partitioning function ~~> hash(key) mod R
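A minimal sketch of the partitioning function above in Python (crc32 stands in for a stable
hash; the paper only specifies "hash(key) mod R"):

    import zlib

    # Route an intermediate key to one of R reduce partitions.
    # crc32 is an illustrative stand-in for a stable hash function.
    def default_partition(key, R):
        return zlib.crc32(key.encode("utf-8")) % R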
MapReduce steps:
Client library partitions the input data into M pieces (typically 16-64 MB each) and starts up
many copies of the program on a cluster
One copy becomes the master; the others are workers. The master assigns each idle worker one
of the M map tasks or R reduce tasks
Workers assigned a map task read their corresponding input split, parse it into key-value
pairs and run the user-defined map function on each pair. Intermediate pairs are buffered in
memory.
Periodically the buffered pairs are written to local disk, partitioned into R regions by the
partitioning function. The locations of the partitions are passed back to the master
The master notifies the reduce workers of the partition locations. Each reduce worker uses
remote procedure calls to read the buffered data from the map workers' local disks. Once all
intermediate data is read, the reducer sorts it by intermediate key (to group occurrences of
the same key together)
The reducer iterates over the sorted keys and passes each key with its corresponding set of
intermediate values to the user-defined reduce function. The output of that function is
appended to a final output file for this reduce partition
When all map and reduce tasks have completed, the master wakes up the client by returning from
the MapReduce call
After successful execution, the output is available in the R output files, one per reducer
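The whole flow can be illustrated with a single-process Python sketch (no master, workers, or
RPC; run_mapreduce, wc_map, wc_reduce and partition are illustrative names, not the library's
API; word count is the paper's canonical example):

    import zlib
    from collections import defaultdict

    def partition(key, R):
        return zlib.crc32(key.encode("utf-8")) % R

    # User-defined map/reduce for word count.
    def wc_map(split):
        for word in split.split():
            yield word, 1

    def wc_reduce(key, values):
        yield key, sum(values)

    def run_mapreduce(splits, map_fn, reduce_fn, R):
        # Map phase: each split yields intermediate pairs, buffered and
        # partitioned into R regions (a real worker spills these to local disk).
        regions = [defaultdict(list) for _ in range(R)]
        for split in splits:
            for k, v in map_fn(split):
                regions[partition(k, R)][k].append(v)
        # Reduce phase: sort each region by key to group equal keys, then run
        # the user-defined reduce once per key; one output per partition.
        outputs = []
        for region in regions:
            out = []
            for k in sorted(region):
                out.extend(reduce_fn(k, region[k]))
            outputs.append(out)
        return outputs  # stands in for the R output files

    print(run_mapreduce(["the quick brown fox", "the lazy fox"], wc_map, wc_reduce, R=2))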
Fault Tolerance:
Worker Failure:
The master pings every worker periodically, marking workers that have been quiet for a while
as failed
Any map task of a failed worker, whether completed or in progress, is reset to its initial
idle state and can be rescheduled on other workers. Completed map tasks must be re-executed
because their output is stored on the local disk of the failed machine and is therefore
inaccessible after the failure. Completed reduce tasks need not be re-executed, as their
output is stored in the global filesystem
When a map worker fails and its tasks are re-assigned to another worker, all reduce workers
are notified of the change (a sketch of this bookkeeping follows below)
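A hypothetical sketch of the master-side bookkeeping (the paper specifies the behaviour, not
the data structures; Task, State and handle_worker_failure are invented for illustration):

    from dataclasses import dataclass
    from enum import Enum, auto
    from typing import List, Optional

    class State(Enum):
        IDLE = auto()
        IN_PROGRESS = auto()
        COMPLETED = auto()

    @dataclass
    class Task:
        id: int
        state: State = State.IDLE
        worker: Optional[str] = None

    def handle_worker_failure(failed, map_tasks, reduce_tasks):
        rescheduled = []
        for t in map_tasks:
            # Completed map tasks are reset too: their output lives on the
            # failed machine's local disk and is now unreachable.
            if t.worker == failed and t.state is not State.IDLE:
                t.state, t.worker = State.IDLE, None
                rescheduled.append(t)  # reducers must learn the new location
        for t in reduce_tasks:
            # Only in-progress reduce tasks are reset; completed reduce
            # output already lives in the global filesystem.
            if t.worker == failed and t.state is State.IN_PROGRESS:
                t.state, t.worker = State.IDLE, None
                rescheduled.append(t)
        return rescheduled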
Master Failure:
When a master fails, the entire computation is aborted and the client is notified
Semantics in the Presence of Failures:
MapReduce produces the same results as sequential execution if both functions are
deterministic functions of their respective inputs
Atomic commits of map and reduce task outputs help in achieving this property (each task
writes its results to private temporary files and commits them on completion)
The underlying filesystem must implement atomic rename, otherwise concurrent executions of
the same reduce task could conflict when renaming their temporary files into the final output
file (see the sketch below)
The semantics are weaker when map and/or reduce are non-deterministic
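A sketch of the commit-by-rename idea for a reduce task (paths and names are illustrative):

    import os
    import tempfile

    # Write the reduce output to a private temporary file, then rename it
    # into place. If two executions of the same reduce task race, atomic
    # rename guarantees the final file holds exactly one complete output.
    def commit_reduce_output(pairs, final_path):
        out_dir = os.path.dirname(final_path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=out_dir)
        with os.fdopen(fd, "w") as f:
            for key, value in pairs:
                f.write(f"{key}\t{value}\n")
        os.replace(tmp_path, final_path)  # atomic on POSIX filesystems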
Locality:
Network bandwidth is conserved because the input data already lives on the cluster machines:
the filesystem stores files as blocks, each replicated on (typically three) different machines
The master takes these replica locations into account, scheduling a map task on or near a
machine that holds a copy of its input to minimise data movement at runtime
Task Granularity:
Ideally M and R are much larger than the number of worker machines in the cluster. This
improves dynamic load balancing and speeds up recovery when a worker fails
The practical bounds on M and R come from the master, which must make O(M+R) scheduling
decisions and keep O(M*R) state in memory (the paper reports runs with M = 200,000 and
R = 5,000 on 2,000 machines)
Backup Tasks:
"Stragglers" - machines that are taking a very long time to execute a step in map or
reduce, slowing down the entire computation
When a MapReduce is close to completion the master schedules backup tasks for all
in-progress tasks to be executed on other machines
Refinements:
Partitioning Function:
Users specify the number of output partitions R; data gets partitioned on the intermediate
keys with the default (hash(key) mod R) function
It is possible to provide a custom partitioning function, e.g. hash(hostname(url)) mod R, so
that all URLs from the same host end up in the same output file (see the sketch below)
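A sketch of the paper's custom-partitioner example (crc32 again stands in for an unspecified
stable hash):

    import zlib
    from urllib.parse import urlparse

    # Keys are URLs; hashing only the hostname puts all URLs from the same
    # host into the same output file.
    def host_partition(url, R):
        host = urlparse(url).hostname or ""
        return zlib.crc32(host.encode("utf-8")) % R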
Ordering Guarantees:
Within a given partition, the intermediate key-value pairs are guaranteed to be processed in
increasing key order
This makes it easy to produce sorted output per partition, which is useful e.g. for efficient
random-access lookups by key
Combiner Function:
Performs partial merging of the intermediate data before it is sent over the network to the
reducer (illustrated below)
Executed on each machine that performs a map task; typically the same code implements both
the combiner and the reduce function
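An illustrative word-count combiner: the map task emits ("word", 1) many times, so partial
sums shrink what crosses the network; the same code could also serve as the reduce function:

    from collections import Counter

    # Partially merge ("word", 1) pairs on the map machine before they are
    # shipped to the reducers.
    def combine(pairs):
        counts = Counter()
        for word, n in pairs:
            counts[word] += n
        return counts.items()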
IO Types:
The library provides predefined input file formats; users can also define their own readers
to integrate with other sources (such as databases, etc.)
Skipping Bad Records:
Bugs in user code can cause map or reduce to crash deterministically on certain records
The library detects and skips such records at runtime: when the master sees more than one
failure on the same record, it tells workers to skip it (a simplified sketch follows)
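A simplified, hypothetical stand-in for the mechanism (the real library uses signal handlers
and a "last gasp" UDP packet carrying the record's sequence number; guarded_map and its
arguments are invented for illustration):

    # Wrap the user map function: skip records the master has blacklisted,
    # and report the sequence number of any record that crashes user code.
    def guarded_map(map_fn, records, skip, report_failure):
        for seq, record in enumerate(records):
            if seq in skip:  # master marked this record as bad
                continue
            try:
                yield from map_fn(record)
            except Exception:
                report_failure(seq)  # notify the master, then fail as usual
                raise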