What is MapReduce
- Programming paradigm
- Designed to solve one problem: processing large data sets in parallel across a cluster
- Created by Google
- Two parts: Map and Reduce
Map part
- Execute the Map() function on data
- Execute on each node
- Output <key, value> pairs on each node
Reduce part
- Execute the Reduce() function on data
- Execute on some nodes
- Aggregate sets of <key, value> pairs on some nodes
- Output a combined list
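The Map and Reduce parts above can be sketched as a word count that runs in a single process. This is a minimal illustration, not Hadoop code: the names map_fn, shuffle, reduce_fn and node_data are hypothetical, and the "nodes" are just dictionary keys standing in for machines.

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, the step a framework performs between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    """Reduce: aggregate all values seen for one key."""
    return key, sum(values)

# Hypothetical input: each "node" holds its own lines of text.
node_data = {
    "node1": ["the quick brown fox"],
    "node2": ["the lazy dog", "the end"],
}

# Map runs against the data held by each node.
mapped = []
for node, lines in node_data.items():
    for line in lines:
        mapped.extend(map_fn(line))

# Reduce aggregates the grouped pairs into a combined result.
result = dict(reduce_fn(key, values) for key, values in shuffle(mapped).items())
print(result["the"])  # prints 3
```

In a real cluster the map calls would run in parallel on separate machines; the loop here only imitates that ordering.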
MapReduce 1.0
- Distributed, scalable, cheap
- Storage
– HDFS, triple-replicated
– Commodity hardware
- Processing
– Parallel via Map (local) and Reduce (aggregated)
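To make the cost and benefit of triple replication concrete, here is a toy sketch. It assumes a simple round-robin placement policy, which is a simplification chosen for illustration (real HDFS placement also considers racks), and the names place_replicas, usable_capacity and the node names are hypothetical.

```python
REPLICATION_FACTOR = 3  # the triple replication mentioned above

def place_replicas(block_id, nodes, replication=REPLICATION_FACTOR):
    """Pick `replication` distinct nodes for a block (toy round-robin policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    start = block_id % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication)]

def usable_capacity(raw_tb, replication=REPLICATION_FACTOR):
    """Every block is stored `replication` times, so usable space shrinks by that factor."""
    return raw_tb / replication

nodes = ["node1", "node2", "node3", "node4"]
placement = {block: place_replicas(block, nodes) for block in range(4)}

print(placement[0])           # ['node1', 'node2', 'node3']
print(usable_capacity(12.0))  # 4.0

# Losing any single node still leaves two copies of every block.
for failed in nodes:
    for block, holders in placement.items():
        assert len([n for n in holders if n != failed]) >= 2
```

The trade-off is visible in the numbers: 12 TB of raw commodity disk yields about 4 TB of usable space, in exchange for surviving the loss of a node.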
Key Aspects of MapReduce
- It’s an API, or set of libraries
- Job – unit of MapReduce work/instance
- Map task – runs on each node
- Reduce task – runs on some nodes
- Source data – HDFS or other location
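Because reduce tasks run on only some nodes, each key emitted by a map task has to be routed to exactly one reduce task. A common approach is to hash the key and take it modulo the number of reduce tasks. The sketch below assumes that approach; the partition function is hypothetical, and CRC32 is used only as a stand-in hash that gives the same value on every run.

```python
import zlib

def partition(key, num_reduce_tasks):
    """Choose a reduce task for a key: stable hash modulo the number of reduce tasks."""
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks

# Hypothetical map output collected from several map tasks.
pairs = [("apple", 1), ("banana", 1), ("apple", 1), ("cherry", 1)]
num_reduce_tasks = 2

# One bucket per reduce task.
buckets = {r: [] for r in range(num_reduce_tasks)}
for key, value in pairs:
    buckets[partition(key, num_reduce_tasks)].append((key, value))

for reduce_task, assigned in buckets.items():
    print(reduce_task, assigned)
```

All pairs with the same key land in the same bucket, which is what lets a single reduce task see every value for that key.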
MapReduce Daemons and Services
- JVMs or services – isolated processes
- Job tracker – one per cluster (controller and scheduler)
- Task trackers – one per worker node (monitor tasks)
- Job configurations
– Specify input/output locations for job instances
– Job clients submit jobs for execution
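The relationship between a job client, a job configuration, the job tracker and the task trackers can be sketched as a toy model. Nothing here is the Hadoop API: JobConfig, ToyJobTracker, the paths and the tracker names are all hypothetical, and tasks are handed out round-robin, whereas a real scheduler would also weigh where the data lives.

```python
from dataclasses import dataclass

@dataclass
class JobConfig:
    """Hypothetical job configuration: where to read input and write output."""
    name: str
    input_path: str
    output_path: str
    num_reduce_tasks: int = 1

class ToyJobTracker:
    """Single controller that accepts jobs and assigns their tasks to task trackers."""

    def __init__(self, task_trackers):
        self.task_trackers = list(task_trackers)
        self.assignments = {}
        self._next_job_id = 1

    def submit(self, config, input_splits):
        if not config.input_path or not config.output_path:
            raise ValueError("job needs both an input and an output location")
        job_id = f"job_{self._next_job_id:04d}"
        self._next_job_id += 1
        # One map task per input split, plus the configured number of reduce tasks.
        tasks = [f"map_{i}" for i in range(len(input_splits))]
        tasks += [f"reduce_{i}" for i in range(config.num_reduce_tasks)]
        # Round-robin assignment across task trackers (a simplification).
        plan = {tracker: [] for tracker in self.task_trackers}
        for i, task in enumerate(tasks):
            plan[self.task_trackers[i % len(self.task_trackers)]].append(task)
        self.assignments[job_id] = plan
        return job_id

# The job client builds a configuration and submits it for execution.
tracker = ToyJobTracker(["tasktracker1", "tasktracker2"])
config = JobConfig("wordcount", "/example/input", "/example/output")
job_id = tracker.submit(config, input_splits=["split0", "split1", "split2"])
print(job_id)                       # job_0001
print(tracker.assignments[job_id])
```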