What

A framework (abstraction) for processing and generating large datasets on a Distributed System

divide and conquer + Functional Programming + Pipelining

How

Map

  • Input: kv pair
  • Output (emit): intermediate kv pair

Reduce

After all maps have finished

  • Input: grouped intermediate kv pair (need to first collect and group before starting)
  • Output (emit): desired result

Optimization

  • (In the old time) The bottle neck is in network, since GFS is involved, we can find who hold the input file then have the worker working on it.
  • Data streaming
  • Use Hashtable to group intermedia file
  • Load balance: Many more tasks than workers

Abstraction

  • MapReduce hides many details:
    • sending app code to servers
    • tracking which tasks are done
    • moving data from Maps to Reduces
    • balancing load over servers
    • recovering from failures