By Google
  1. Programming model and associated implementation for processing and generating large datasets
  2. Map functions process a key/value pair and generate a set of intermediate key/value pairs
  3. Reduce merges intermediate pairs into a final value for a key
  4. Programs written in this way are automatically parallelised and run on large clusters
  1. Large amounts of data made it infeasible to run queries on a single machine.
  2. Parallelisation, fault-tolerance, data distribution and load balancing --> MapReduce
  1. Computation is expressed in two functions (map and reduce, obvs)
  2. Types:
    • map (k1, v1) -> list(k2, v2)
    • reduce (k2, list(v2)) -> list(v2)
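A minimal Python sketch of these two signatures, using the paper's word-count example (function names and integer counts are my own choices; the paper's pseudocode emits string counts):
```python
from typing import Iterator, List, Tuple

# map: (k1, v1) -> list(k2, v2)
# Word count: k1 is a document name, v1 is its contents; emit (word, 1) per word.
def map_fn(doc_name: str, contents: str) -> Iterator[Tuple[str, int]]:
    for word in contents.split():
        yield (word, 1)

# reduce: (k2, list(v2)) -> list(v2)
# Sum all counts emitted for one word.
def reduce_fn(word: str, counts: List[int]) -> List[int]:
    return [sum(counts)]
```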
  1. A distributed filesystem (GFS) is used to manage the data stored on the local disks of the machines in the MapReduce cluster. The filesystem uses replication to provide availability and reliability on top of unreliable commodity hardware.
  2. Map invocations are distributed across multiple machines by partitioning input data into a set of splits
  3. Splits can be processed in parallel on different machines
  4. Reduce invocations are distributed by partitioning intermediate key space into R pieces:
    • R is supplied by the user
    • Partitioning function ~~> hash(key) mod R
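A sketch of that default partitioning function, assuming string keys; MD5 is my choice here, the paper only specifies hash(key) mod R (Python's built-in hash() is randomised per process, so a stable hash is needed for the assignment to agree across workers and re-executions):
```python
import hashlib

def default_partition(key: str, R: int) -> int:
    # hash(key) mod R picks which of the R reduce partitions this key belongs to.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % R
```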
  5. MapReduce steps:
    1. The client library splits the input data into M pieces (typically 16-64 MB each) and starts up many copies of the program on a cluster
    2. One copy becomes the master; the rest are workers. The master picks idle workers and assigns each one a map task (out of M) or a reduce task (out of R)
    3. Workers assigned a map task read their corresponding input split, parse it into key/value pairs and run the user-defined map function on each pair. Intermediate pairs are buffered in memory.
    4. Periodically the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these partitions are passed back to the master
    5. The master notifies the reduce workers of the partition locations. A reduce worker uses remote procedure calls to read the buffered data from the map workers' local disks. Once all intermediate data has been read, the reducer sorts it by key so that all occurrences of the same key are grouped together
    6. The reducer iterates over the sorted keys and passes each key with its list of intermediate values to the user-defined reduce function. The output of that function is appended to the final output file for this reduce partition
    7. When all map and reduce tasks have completed, the master wakes up the client: the MapReduce call returns to user code
  6. After successful execution, the output is available in the R output files, one per reducer
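A single-process simulation of the flow above (my own illustration, not the paper's implementation): M splits are mapped and bucketed into R in-memory regions standing in for the local-disk files, then each region is sorted by key and reduced into one output "file". The map_fn/reduce_fn arguments can be the word-count sketches from earlier.
```python
import hashlib
from collections import defaultdict
from typing import Callable, Iterator, List, Tuple

def _partition(key: str, R: int) -> int:
    # Same default partitioning as above: hash(key) mod R.
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big") % R

def run_mapreduce(
    splits: List[str],                                  # the M input splits
    map_fn: Callable[[str, str], Iterator[Tuple[str, int]]],
    reduce_fn: Callable[[str, List[int]], List[int]],
    R: int,
) -> List[List[Tuple[str, List[int]]]]:
    # Map phase: one map task per split; intermediate pairs are bucketed
    # into R regions (stand-ins for the per-worker local-disk files).
    regions = [defaultdict(list) for _ in range(R)]
    for i, split in enumerate(splits):
        for k2, v2 in map_fn(f"split-{i}", split):
            regions[_partition(k2, R)][k2].append(v2)

    # Reduce phase: one reduce task per region; sorting the keys groups
    # all occurrences of the same key together.
    outputs = []
    for region in regions:
        outputs.append([(k2, reduce_fn(k2, region[k2])) for k2 in sorted(region)])
    return outputs                                      # one output "file" per reduce task

# e.g. run_mapreduce(["to be or not to be"], map_fn, reduce_fn, R=2)
```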
  7. Fault Tolerance:
    • Worker Failure:
      • The master pings every worker periodically, marking any worker that has not responded for a while as failed
      • Any map task of a failed worker, whether in progress or completed, is reset to its initial idle state and can be rescheduled on another worker. Completed map tasks must be re-executed because their output is stored on the local disk of the failed machine and is therefore inaccessible. Completed reduce tasks need not be re-executed, as their output is stored in the global filesystem
      • When a map task is re-executed on a different worker, all workers executing reduce tasks are notified of the change so they read the intermediate data from the new location
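A sketch of the master-side bookkeeping implied by these rules; the Task/State types and handle_worker_failure name are hypothetical, not from the paper:
```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional

class State(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"

@dataclass
class Task:
    kind: str                    # "map" or "reduce"
    state: State = State.IDLE
    worker: Optional[str] = None

def handle_worker_failure(tasks: List[Task], failed_worker: str) -> None:
    for t in tasks:
        if t.worker != failed_worker:
            continue
        if t.kind == "map" or t.state == State.IN_PROGRESS:
            # Map output lives on the failed machine's local disk, so even
            # completed map tasks go back to idle; completed reduce tasks
            # stay completed because their output is in the global filesystem.
            t.state = State.IDLE
            t.worker = None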
    • Master Failure:
      • When a master fails, the entire computation is aborted and the client is notified
    • Semantics in the Presence of Failures:
      • MapReduce produces the same results as sequential execution if both functions are deterministic functions of their respective inputs
      • Atomic commits of map and reduce task outputs achieve this property: each in-progress task writes to private temporary files, a completed map task commits by reporting its file locations to the master, and a completed reduce task commits by atomically renaming its temporary file to the final output file
      • The underlying filesystem must provide atomic rename, otherwise multiple executions of the same reduce task could race to rename their temporary files to the final output file; atomicity guarantees the final file contains the data of exactly one execution
      • The semantics are weaker for the non-deterministic functions of map and/or reduce
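A sketch of the atomic-rename commit for a reduce task's output, using the local filesystem as a stand-in for GFS (os.replace is atomic on POSIX): even if a duplicate execution of the same reduce task also commits, the final file holds the output of exactly one complete execution.
```python
import os
import tempfile
from typing import List

def commit_reduce_output(final_path: str, lines: List[str]) -> None:
    # Write to a private temporary file first...
    out_dir = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=out_dir)
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(lines) + "\n")
    # ...then atomically rename it into place to commit the reduce partition.
    os.replace(tmp_path, final_path)
```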
    • Locality:
      • Network bandwidth is conserved by exploiting how GFS stores the input: files are split into 64 MB blocks, each replicated (typically 3 copies) on different machines
      • The master takes these replica locations into account when scheduling map tasks, preferring a machine that holds a copy of the split (or one close to it on the network), so most input is read locally rather than redistributed at runtime
    • Task Granularity:
      • Ideally M and R are much larger than the number of worker machines in the cluster. This would improve dynamic load balancing and speed up recovery when a worker fails
      • Practical bounds on M and R come from the master, which must make O(M + R) scheduling decisions and keep O(M × R) state in memory (roughly one byte per map/reduce task pair). The paper's typical configuration is M = 200,000 and R = 5,000 with 2,000 worker machines
    • Backup Tasks:
      • "Stragglers" - machines that are taking a very long time to execute a step in map or reduce, slowing down the entire computation
      • When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks on other machines; a task is marked completed whenever either the primary or the backup execution finishes
  8. Refinements:
    • Partitioning Function:
      • Users specify the number of output partitions, data gets partitioned on intermediate keys with the (hash(key) mod R) function
      • It is possible to provide a custom partitioning function, e.g. hash(Hostname(urlkey)) mod R so that all URLs from the same host end up in the same output file
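A sketch of such a custom partitioner in Python (urlparse and MD5 are my choices; the paper only gives hash(Hostname(urlkey)) mod R):
```python
import hashlib
from urllib.parse import urlparse

def hostname_partition(url_key: str, R: int) -> int:
    # All URLs from the same host hash to the same reduce partition,
    # so they end up in the same output file.
    host = urlparse(url_key).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % R
```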
    • Ordering Guarantees:
      • Guarantee that within a given partition, the intermediate key/value pairs are processed in increasing key order
      • This makes it easy to produce a sorted output file per partition, which is useful for efficient random-access lookups by key
    • Combiner Function:
      • Perform partial merging of the data before it is sent over the network to the reducer
      • Executed on each machine that performs a map task; typically the same code implements both the combiner and the reduce function
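A sketch of a combiner pass on the map worker, reusing the word-count reduce_fn from earlier as the combiner (an assumption of this sketch, following the note above that combiner and reduce typically share code):
```python
from collections import defaultdict
from typing import Callable, Iterable, List, Tuple

def combine(pairs: Iterable[Tuple[str, int]],
            combiner: Callable[[str, List[int]], List[int]]) -> List[Tuple[str, int]]:
    # Partial merge of the buffered intermediate pairs before they are
    # written to local disk and shipped to the reducers.
    grouped = defaultdict(list)
    for k, v in pairs:
        grouped[k].append(v)
    return [(k, v) for k, values in grouped.items() for v in combiner(k, values)]

# e.g. combine([("the", 1), ("the", 1), ("a", 1)], reduce_fn) -> [("the", 2), ("a", 1)]
```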
    • IO Types:
      • The library ships with predefined input formats, and users can implement their own reader interface to pull input from other sources (e.g. a database or data structures mapped in memory)
    • Skipping Bad-Records:
      • Bugs in user code cause map or reduce to deterministically crash on certain inputs
      • An optional mode detects and skips such records at runtime (once more than one failure has been seen on the same record)
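A sketch of the master-side skip bookkeeping (class and method names are hypothetical); the paper's mechanism is that a worker's signal handler reports the failing record's sequence number to the master in a "last gasp" UDP packet:
```python
from collections import defaultdict
from typing import Dict, Tuple

class SkipTracker:
    def __init__(self) -> None:
        # (task_id, record_seq) -> number of crashes observed on that record
        self._failures: Dict[Tuple[str, int], int] = defaultdict(int)

    def report_crash(self, task_id: str, record_seq: int) -> None:
        self._failures[(task_id, record_seq)] += 1

    def should_skip(self, task_id: str, record_seq: int) -> bool:
        # Skip the record once more than one failure has been seen on it.
        return self._failures[(task_id, record_seq)] > 1
```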
  1. Do we need this section?