Hadoop Note 4: MapReduce

What is MapReduce

  • Programming paradigm
  • Designed to solve one problem: processing very large data sets in parallel across a cluster
  • Created by Google
  • Two parts: Map and Reduce

Map part

  • Execute the Map() function on data
  • Execute on each node
  • Output key-value pairs on each node
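
As a rough illustration of the Map part, here is a minimal word-count sketch in plain Python. It does not use Hadoop; the function name `map_words`, the node names, and the sample lines are all assumptions made up for the example. Each "node" maps only its own slice of the data and emits key-value pairs.

```python
from typing import Iterator, Tuple

def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield (word, 1)

# Hypothetical nodes, each holding its own slice of the input.
node_splits = {
    "node-a": ["the quick brown fox"],
    "node-b": ["the lazy dog", "the end"],
}

# Every node runs the same map function on its local data only.
map_output = {
    node: [pair for line in lines for pair in map_words(line)]
    for node, lines in node_splits.items()
}
```

Note that no node needs to see another node's data at this stage, which is what lets the Map part run everywhere at once.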

Reduce part

  • Execute the Reduce() function on data
  • Execute on some nodes
  • Aggregate sets of pairs on some nodes
  • Output a combined list
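
The Reduce part can be sketched the same way. This is plain Python, not Hadoop code; `shuffle`, `reduce_counts`, and the sample pairs are hypothetical names and data chosen for the example. Pairs with the same key are grouped together, then each group is aggregated into one entry of the combined list.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key so each key's values end up in one place."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Aggregate all values for one key into a single total."""
    return (key, sum(values))

# Hypothetical pairs, as they might arrive from several map tasks.
mapped_pairs = [("the", 1), ("quick", 1), ("the", 1), ("dog", 1), ("the", 1)]

# One combined, sorted list of (word, total) pairs.
combined = sorted(reduce_counts(k, v) for k, v in shuffle(mapped_pairs).items())
```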

MapReduce 1.0

  1. Distributed, scalable, cheap
  2. Storage
  • HDFS, triple replicated
  • Commodity hardware
  3. Processing
  • Parallel via Map (local) and Reduce (aggregated)
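
To make "Map (local), Reduce (aggregated)" concrete, here is a small end-to-end sketch under stated assumptions: threads stand in for cluster nodes, the splits are invented sample data, and `map_split` and `reduce_counts` are hypothetical helper names. Each worker counts words in its own split, and a single reduce step merges the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List

def map_split(lines: List[str]) -> Counter:
    """Map step: count words within one worker's local split."""
    local = Counter()
    for line in lines:
        local.update(line.lower().split())
    return local

def reduce_counts(partials: Iterable[Counter]) -> Counter:
    """Reduce step: merge the per-split counts into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Hypothetical input, already divided into one split per worker.
splits = [
    ["the quick brown fox"],
    ["the lazy dog"],
    ["the end"],
]

# Threads play the role of nodes; each one maps its split independently.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(map_split, splits))

totals = reduce_counts(partial_counts)
```

A real cluster moves the computation to the nodes that already store the data blocks; the thread pool here only imitates the parallelism.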

Key Aspects of MapReduce

  • It’s an API, or set of libraries
  1. Job – unit of MapReduce work/instance
  2. Map task – runs on each node
  3. Reduce task – runs on some nodes
  4. Source data – HDFS or other location

MapReduce Daemons and Services

  1. JVMs or services – isolated processes
  • Job tracker – one per cluster (controller and scheduler)
  • Task trackers – one per worker node (run and monitor tasks)
  2. Job configurations
  • Specify input/output locations for job instances
  • Job clients submit jobs for execution
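
The relationship between job clients, the job tracker, and task trackers can be modelled with a toy sketch. None of this is Hadoop's actual API: `JobConfig`, `ToyJobTracker`, the paths, and the round-robin assignment are all assumptions for illustration (a real scheduler also considers where the data lives).

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class JobConfig:
    """Hypothetical job configuration: where to read input and write output."""
    name: str
    input_path: str
    output_path: str

@dataclass
class ToyJobTracker:
    """Single controller that accepts jobs and assigns tasks to task trackers."""
    task_trackers: List[str]
    jobs: List[JobConfig] = field(default_factory=list)

    def submit(self, job: JobConfig) -> int:
        """Called by a job client; returns a simple numeric job id."""
        self.jobs.append(job)
        return len(self.jobs) - 1

    def schedule(self, job_id: int, num_map_tasks: int) -> Dict[str, List[int]]:
        """Assign map task numbers to task trackers round-robin (a simplification)."""
        _ = self.jobs[job_id]  # raises IndexError for an unknown job id
        assignment: Dict[str, List[int]] = {tt: [] for tt in self.task_trackers}
        for task in range(num_map_tasks):
            tracker_name = self.task_trackers[task % len(self.task_trackers)]
            assignment[tracker_name].append(task)
        return assignment

# One job tracker for the cluster, one task tracker per (hypothetical) node.
tracker = ToyJobTracker(task_trackers=["node-a", "node-b", "node-c"])
job_id = tracker.submit(JobConfig("wordcount", "/data/in", "/data/out"))
plan = tracker.schedule(job_id, num_map_tasks=5)
```

The point of the model is the division of roles: the client only describes and submits the job, the single job tracker decides where tasks go, and each task tracker handles the tasks assigned to its node.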
