What
A programming framework (abstraction) for processing and generating large datasets on a distributed system
divide and conquer + Functional Programming + Pipelining
How
Map
- Input: a kv pair
- Output (emit): intermediate kv pairs
Reduce (runs only after all map tasks have finished)
- Input: grouped intermediate kv pairs (all intermediate output must first be collected and grouped by key before reduce starts)
- Output (emit): the desired result
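The Map/Reduce contract above can be sketched with the classic word-count example. This is a minimal illustration in Python; the function names `map_fn` and `reduce_fn` are made up, not part of any framework API:

```python
# Word count: map emits (word, 1) for every word in a document;
# reduce sums all the counts emitted for one word.
def map_fn(key, value):
    # key: document name, value: document contents
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    # key: a single word, values: every count emitted for that word
    return (key, sum(values))
```

Note that `map_fn` knows nothing about grouping: it just emits pairs, and the framework routes all pairs with the same key to one `reduce_fn` call.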
Optimization
- (Historically) the bottleneck was the network. Since the input lives in GFS, the master can find which machine holds a copy of the input file and schedule the map task on that worker (data locality), avoiding a network transfer.
- Data streaming
- Use a hashtable to group intermediate kv pairs by key
- Load balancing: create many more tasks than workers, so faster workers naturally pick up more tasks
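The hashtable grouping step can be sketched as follows (a minimal in-memory illustration; the name `group` is made up, and a real implementation works over intermediate files on disk):

```python
from collections import defaultdict

def group(intermediate_pairs):
    # Hashtable keyed by intermediate key; collects every value
    # emitted for that key so a single reduce call sees all of them.
    groups = defaultdict(list)
    for k, v in intermediate_pairs:
        groups[k].append(v)
    return dict(groups)
```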
Abstraction
- MapReduce hides many details:
- sending app code to servers
- tracking which tasks are done
- moving data from Maps to Reduces
- balancing load over servers
- recovering from failures
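Ignoring all of those hidden details, the core pipeline (map, then group, then reduce) fits in a single-process sketch. This is only an illustration of the abstraction, not a real distributed implementation; all names are made up:

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: apply map_fn to every input kv pair.
    intermediate = []
    for key, value in inputs.items():
        intermediate.extend(map_fn(key, value))
    # Shuffle/group: collect all values for the same intermediate key.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Reduce phase: runs only after every map task has finished.
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}

# Word count expressed as a map/reduce pair:
counts = run_mapreduce(
    {"doc1": "a b a", "doc2": "b c"},
    lambda k, v: [(w, 1) for w in v.split()],
    lambda k, vs: sum(vs),
)
```

In a real deployment the same three phases run across many machines, which is exactly why the framework must handle shipping code, tracking task completion, moving data, and recovering from failures.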