What is MapReduce and when would you use it?

Answer

MapReduce is a programming model for processing large datasets in parallel across a cluster of machines, introduced by Google in 2004 and inspired by the map and reduce functions of functional programming.

It has two user-defined phases, connected by a shuffle step:
(1) Map phase: the input is split into chunks, and each chunk is processed independently on a worker node by the map function, which emits key-value pairs.
(2) Reduce phase: all values with the same key are grouped and sent to a reducer, and the reduce function aggregates them into the final output.

Example (word count): Map(document) → (word, 1) for each word; Shuffle → group by word; Reduce(word, [1, 1, 1]) → (word, 3).

Execution: the master node splits the input and assigns map tasks to workers; workers write map output to local disk; the master then assigns reduce tasks, which pull the map outputs over the network; reducers write the final output.

Fault tolerance: if a worker fails, the master re-runs its tasks on another worker. Completed map tasks on the failed worker are re-run as well, because their output lived on that worker's local disk. Re-execution is safe because the map and reduce functions are expected to be deterministic, so a re-run produces the same output.

When to use: batch processing of large datasets (logs, user events), ETL pipelines, building inverted indexes, and machine learning on large datasets.

Modern alternatives: Apache Spark keeps intermediate data in memory, which is often cited as 10-100x faster for iterative algorithms, and it is much more expressive (SQL, ML, streaming). Spark has largely replaced MapReduce in practice, although Apache Hadoop YARN still provides the resource management layer in many deployments.

Conceptual value: MapReduce popularized the "move computation to data" principle: process data where it is stored rather than moving the data to the computation.
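
As a rough illustration of the word-count flow described above, here is a minimal single-process Python sketch. The function names (map_fn, shuffle, reduce_fn, word_count) are hypothetical and chosen only for this example; a real framework would run the map and reduce calls on separate workers and perform the shuffle over the network.

```python
from collections import defaultdict


def map_fn(document):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    for word in document.lower().split():
        yield (word, 1)


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(word, counts):
    """Reduce: aggregate all values for one key into a single result."""
    return (word, sum(counts))


def word_count(documents):
    # Each document stands in for one input split. On a cluster these
    # map calls would run independently on different workers.
    emitted = [pair for doc in documents for pair in map_fn(doc)]
    grouped = shuffle(emitted)
    # One reduce call per distinct key.
    return dict(reduce_fn(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    docs = ["the quick brown fox", "the lazy dog", "the fox"]
    print(word_count(docs))
```

Because map_fn looks only at its own chunk and reduce_fn looks only at one key's values, either call can be repeated on another machine without changing the result, which is the property the fault-tolerance scheme relies on.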
