MapReduce

Definition

MapReduce is a programming model introduced by Google that simplifies the processing of very large amounts of data. The data volume processed can reach several petabytes (1 petabyte = approx. 1,000,000 gigabytes). A MapReduce job is split into three phases: Map, Shuffle and Reduce.1) It is generally used to process data in parallel, because working through such volumes on a single system would take too long. For example, one Google search returns 292 million hits in 0.4 seconds.2)

MapReduce Architecture

The MapReduce architecture corresponds to a three-phase model; the phases are explained below. To run MapReduce, the data must first be uploaded to a computer cluster. In Hadoop, an open-source implementation of MapReduce, the Hadoop Distributed File System (HDFS) splits the data up and stores it on the various data nodes of the cluster. For reliability, HDFS ideally keeps up to three copies on different nodes.3)
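The splitting and replication step can be illustrated with a minimal sketch. It is not how HDFS actually places data (real placement also takes racks and free space into account); the function names `split_into_blocks` and `place_replicas`, the node names and the tiny block size are assumptions made for illustration only.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut the input into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str],
                   replication: int = 3) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplified assumption: plain round-robin, no rack awareness.
    """
    copies = min(replication, len(nodes))
    placement = {}
    for block_id in range(num_blocks):
        start = block_id % len(nodes)
        placement[block_id] = [nodes[(start + k) % len(nodes)]
                               for k in range(copies)]
    return placement


# Hypothetical cluster with four data nodes and a 4-byte block size.
blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
```

Here `blocks` holds three blocks (`b"abcd"`, `b"efgh"`, `b"ij"`), and each block is assigned to three different nodes, so the loss of a single node never removes the only copy of a block.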

Map

First, the user writes a Map function and a Reduce function. MapReduce then creates a Map task on each relevant data node; the task reads the data stored on that node and executes the Map function on it. The result is stored locally on each data node. Because HDFS keeps copies of the data, load balancing is also possible: the Map tasks' computations are divided according to the nodes' capacity and assigned to nodes that hold a copy of the data.4)
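As an illustration of the Map phase, the following sketch uses word counting as the example job. The names `map_word_count` and `run_map_tasks` and the sample data are hypothetical; a dictionary keyed by node name stands in for data that would really be stored on separate machines.

```python
def map_word_count(document: str) -> list[tuple[str, int]]:
    """Map function: emit a (term, 1) pair for every word in the input."""
    return [(word.lower(), 1) for word in document.split()]


def run_map_tasks(node_data: dict[str, list[str]]) -> dict[str, list[tuple[str, int]]]:
    """Run one map task per node; each result stays with its node ("local")."""
    local_results = {}
    for node, documents in node_data.items():
        pairs = []
        for doc in documents:
            pairs.extend(map_word_count(doc))
        local_results[node] = pairs
    return local_results


# Hypothetical input: each node already holds its share of the documents.
node_data = {"node1": ["big data big"], "node2": ["data cluster"]}
map_output = run_map_tasks(node_data)
# map_output["node1"] == [("big", 1), ("data", 1), ("big", 1)]
```

Each map task only touches the data of its own node, which is why the phase can run on all nodes at the same time.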

Shuffle

The Shuffle phase produces a coarse-grained partitioning of the results of the Map phase. The intermediate results are grouped by key and distributed across the network according to a partitioning. Which partitioning is used is not important, as long as all results with the same key end up in the same partition.5)
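The grouping and partitioning can be sketched as follows. Hash partitioning with `zlib.crc32` is only one possible choice, picked here because it gives the same result on every run; the names `partition_for` and `shuffle` and the sample pairs are assumptions for illustration.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    """Choose a partition from a stable hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle(pairs: list[tuple[str, int]], num_partitions: int) -> list[dict[str, list[int]]]:
    """Group values by key and place each group in exactly one partition."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[partition_for(key, num_partitions)][key].append(value)
    return [dict(p) for p in partitions]


# Hypothetical intermediate results collected from all map tasks.
pairs = [("big", 1), ("data", 1), ("big", 1), ("data", 1), ("cluster", 1)]
shuffled = shuffle(pairs, num_partitions=2)
```

Whatever hash function is used, all values for one key land in the same partition, so a later Reduce task sees each group as a whole.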

Reduce

Once the Map and Shuffle phases have completed successfully, the last step is Reduce. First, the Reduce function written earlier is loaded. As in the Map phase, a Reduce task is created on each data node. The Reduce task reads the intermediate results and applies the Reduce function to each partition; the function is called exactly once for each group of n elements. The final results are stored locally, and HDFS copies them to other nodes to guarantee reliability and load balancing. When the Reduce phase ends, the complete result of the MapReduce job is available on the data nodes.6)
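A minimal sketch of the Reduce phase, again using word counting as the assumed job. The names `reduce_sum` and `run_reduce_task` and the two sample partitions are hypothetical.

```python
def reduce_sum(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce function: collapse one group of n values into a single result."""
    return key, sum(values)


def run_reduce_task(partition: dict[str, list[int]]) -> dict[str, int]:
    """Apply the reduce function exactly once to each group in the partition."""
    return dict(reduce_sum(key, values) for key, values in partition.items())


# Hypothetical shuffle output: one partition per reduce task.
partitions = [{"big": [1, 1], "cluster": [1]}, {"data": [1, 1]}]
results = [run_reduce_task(p) for p in partitions]
# results == [{"big": 2, "cluster": 1}, {"data": 2}]
```

The overall result of the job is the union of the per-partition results, which matches the description that the complete result ends up spread across the data nodes.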

Combine Phase

As a variation, a so-called Combine phase can take place between Map and Shuffle. It serves largely the same purpose as the Reduce phase, the difference being that it runs on the Map side. Its purpose is to reduce network load by combining the data to be transferred in the Shuffle phase: 1000 messages (term, 1) become one message (term, 1000).7)
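The example in parentheses can be reproduced with a short sketch. The function name `combine` is hypothetical, and the sketch assumes the job is a sum, where merging counts early does not change the final result; for operations that are not associative and commutative, a combiner like this would not be safe.

```python
from collections import Counter


def combine(map_output: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Pre-aggregate the local map output before it is sent over the network."""
    totals = Counter()
    for term, count in map_output:
        totals[term] += count
    return list(totals.items())


# 1000 local messages (term, 1) from a single map task.
map_output = [("term", 1)] * 1000
combined = combine(map_output)
# combined == [("term", 1000)]
```

Only one message instead of 1000 has to be transferred in the Shuffle phase, while the Reduce phase still arrives at the same total.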

Field of Application

MapReduce is used, for example, to build a subject index from large amounts of data. Google used this algorithm in the past to build the index for its search engine. Google described the process publicly in 2004, and in 2014 it introduced Cloud Dataflow as a more flexible successor.8)
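Building an index can be sketched as a small MapReduce job that runs in a single process. This is an illustration of the idea only, not Google's implementation; the names `map_index`, `shuffle` and `reduce_index` and the two sample documents are assumptions.

```python
from collections import defaultdict


def map_index(doc_id: str, text: str) -> list[tuple[str, str]]:
    """Map: emit (term, doc_id) once for every distinct term in a document."""
    return [(term, doc_id) for term in sorted(set(text.lower().split()))]


def shuffle(pairs: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Shuffle: group all document ids that belong to the same term."""
    groups = defaultdict(list)
    for term, doc_id in pairs:
        groups[term].append(doc_id)
    return groups


def reduce_index(term: str, doc_ids: list[str]) -> tuple[str, list[str]]:
    """Reduce: produce the sorted list of documents that contain the term."""
    return term, sorted(set(doc_ids))


# Hypothetical document collection.
documents = {"d1": "map reduce cluster", "d2": "reduce phase"}
pairs = [p for doc_id, text in documents.items() for p in map_index(doc_id, text)]
index = dict(reduce_index(term, ids) for term, ids in shuffle(pairs).items())
# index["reduce"] == ["d1", "d2"]
```

The resulting `index` maps each term to the documents in which it occurs, which is the basic structure a search index needs in order to answer a query for a term.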

1), 3) Inmon, Linstedt: “Data Architecture: A Primer for the Data Scientist”, p. 149ff.
concepts/map-reduce.txt · Last modified: 2015/03/26 18:46 (external edit)