
Introduction to Apache MapReduce

MapReduce is an algorithm that allows large data sets to be processed in parallel, i.e. on multiple computers simultaneously. This greatly accelerates queries for large data sets.

What do we use MapReduce for?

MapReduce was originally developed by Google to process the huge amounts of data behind its web search efficiently. However, the algorithm only became really well known through its use within the Hadoop framework. Hadoop stores large amounts of data in the Hadoop Distributed File System (HDFS for short) and uses MapReduce for queries or aggregations of information in the range of terabytes or petabytes.

Suppose we have stored all parts of the Harry Potter novels in Hadoop as PDFs and now want to count how often the individual words appear in the books. This is a classic task where splitting the work into a map function and a reduce function can help us.

How was it done before?

Before it was possible to split such complex queries across a whole computer cluster and calculate them in parallel, one was forced to run through the complete data set sequentially. Naturally, the larger the data set, the longer the query time.

Suppose we already have the Harry Potter texts word for word in a Python list:

harry_potter_words = ['Once', 'upon', 'a', 'time', 'Harry', 'Potter', 'met',         
                      'Ron', 'Weasley', 'and', 'Hermine', 'Granger', 'and', 
                      'they', 'became', 'Harry', 'Potters', 'best', 'friends', 
                      '.']

We can count the occurring words by looping through this list with a for loop and feeding each word into the Counter from Python's collections module. This class then does the word counting for us and outputs the ten most frequent words. Using Jupyter's %time magic command, we can display how long our computer needs to execute this function.

import collections

def find_top_words(list_of_words):
    # Count every word with a Counter and return the ten most frequent ones
    cnt = collections.Counter()
    for word in list_of_words:
        cnt.update([word])
    return cnt.most_common(10)

According to the website wordcounter.io, the first Harry Potter part has a total of 76,944 words. Since our example sentence has only 20 words (including the period), we have to repeat it about 3,850 times (76,944 / 20 ≈ 3,847) to get a list with as many words as the first Harry Potter part:

# Part 1 = 76,944 words ~ 3,850 example sentences of length 20
harry_potter_part_1 = harry_potter_words * 3850

%time find_top_words(harry_potter_part_1)

Out: 
Wall time: 312 ms
[('Harry', 7700),
 ('and', 7700),
 ('Once', 3850),
 ('upon', 3850),
 ('a', 3850),
 ('time', 3850),
 ('Potter', 3850),
 ('met', 3850),
 ('Ron', 3850),
 ('Weasley', 3850)]

Our function needs 312 milliseconds to run through all the words of the first part and count how often they occur. If we perform the same query for all Harry Potter parts with a total of 3,397,170 words (source: wordcounter.io), it takes a total of 3.7 seconds.

# All Parts = 3,397,170 words ~ 170,000 example sentences of length 20
harry_potter_all_parts = harry_potter_words * 170000

%time find_top_words(harry_potter_all_parts)

Out: 
Wall time: 3.79 s
[('Harry', 340000),
 ('and', 340000),
 ('Once', 170000),
 ('upon', 170000),
 ('a', 170000),
 ('time', 170000),
 ('Potter', 170000),
 ('met', 170000),
 ('Ron', 170000),
 ('Weasley', 170000)]

This query takes a comparatively long time and naturally becomes longer and longer for larger data sets. On a single machine, the only way to speed up the execution of the function is to equip the computer with a more powerful processor (CPU), i.e. to improve its hardware. Speeding up an algorithm by improving the hardware of the device is called vertical scaling.

How does the MapReduce algorithm work?

With the help of MapReduce, it is possible to significantly speed up such a query by splitting the task into smaller subtasks. This in turn has the advantage that the subtasks can be divided among and executed by many different computers. This means that we do not have to improve the hardware of a single device, but can use many, comparatively less powerful, computers and still reduce the query time. Such an approach is called horizontal scaling.

Let’s get back to our example: up to now, we figuratively proceeded by reading all the Harry Potter parts and simply adding a tally mark to a single, shared tally sheet after each word we read. The problem with this is that we cannot parallelize this approach. Suppose a second person wants to assist us: she cannot, because she needs the tally sheet we are currently working with in order to continue. As long as she doesn't have it, she can't help.

However, she can support us by starting with the second part of the Harry Potter series and creating a separate tally sheet just for the second book. At the end, we merge all the individual tally sheets and, for example, add up the frequency of the word “Harry” across all of them.

MapReduce using the example of word counts in Harry Potter books.

This also makes it relatively easy to scale the task horizontally by having one person work on each Harry Potter book. If we want to work even faster, we can also involve several people and let each person work on a single chapter. In the end, we only have to combine all the results of the individual persons to arrive at an overall result.
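This book-by-book idea can already be sketched in plain Python before we formalize it as map and reduce. The following snippet is only an illustration: the split into four chunks is arbitrary, the helper names are made up for this example, and it reuses the harry_potter_all_parts list from above. Each chunk gets its own tally sheet, and the sheets are merged at the end.

from collections import Counter

def count_chunk(words):
    # One "person" keeps a separate tally sheet for their chunk of words
    return Counter(words)

def merge_tally_sheets(sheets):
    # Combine all individual tally sheets into one overall tally sheet
    total = Counter()
    for sheet in sheets:
        total.update(sheet)
    return total

# Split the full word list into four equal chunks and count each one independently
quarter = len(harry_potter_all_parts) // 4
chunks = [harry_potter_all_parts[i * quarter:(i + 1) * quarter] for i in range(4)]
overall = merge_tally_sheets(count_chunk(chunk) for chunk in chunks)
print(overall.most_common(3))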

MapReduce example in Python

In total, we need two functions: a mapper and a reducer. We define the mapper so that it returns a dictionary for each word it receives, with the word as the key and the value 1:

def mapper(word):
    # Each word becomes its own tiny tally sheet: {word: 1}
    return {word: 1}

Analogous to our example, the mapper returns a tally sheet that says: “The word that was passed to me occurs exactly once.” In the second step, the reducer takes all the individual tally sheets and merges them into one large overall tally sheet. It distinguishes two cases: if the word passed to it already appears on the large tally sheet, it simply adds another tally mark in the corresponding line. If the word does not yet appear, the reducer adds a new line with that word to the large tally sheet.

def reducer(collection, new_word_dict):
    # Merge one small tally sheet (new_word_dict) into the overall tally (collection)
    for word in new_word_dict.keys():
        if word in collection:
            # Word already on the sheet: add another tally mark
            collection[word] = collection[word] + 1
        else:
            # New word: start a new line on the sheet
            collection[word] = 1
    return collection
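To see how the two functions work together, we can try them on a tiny, made-up word list before running the actual measurement: map turns every word into a small tally sheet, and reduce folds these sheets into one dictionary with the reducer.

from functools import reduce

sample_words = ['Harry', 'and', 'Harry']
# Apply the mapper to every word and merge the results, starting from an empty dictionary
word_counts = reduce(reducer, map(mapper, sample_words), {})
print(word_counts)
# Expected output: {'Harry': 2, 'and': 1}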

If we merge these two subtasks, the same query as before takes only 1.5 seconds:

import time
from functools import reduce

if __name__ == '__main__':
    harry_potter_words = ['Once', 'upon', 'a', 'time', 'Harry', 'Potter', 'met',
                          'Ron', 'Weasley', 'and', 'Hermine', 'Granger', 'and',
                          'they', 'became', 'Harry', 'Potters', 'best', 'friends',
                          '.']

    harry_potter_all_parts = harry_potter_words * 170000

    # Save the current time to calculate the overall runtime later
    start = time.process_time()

    # Step 1: Apply the mapper to every word
    mapped = map(mapper, harry_potter_all_parts)

    # Step 2: Apply the reducer to merge all the small tally sheets
    reduced = reduce(reducer, mapped)

    # Print the dictionary with the word counts
    print(reduced)
    # Print the elapsed time since we started the measurement
    print(time.process_time() - start)

Out:
{'Once': 170000, 'upon': 170000, 'a': 170000, 'time': 170000, 'Harry': 340000,
 'Potter': 170000, 'met': 170000, 'Ron': 170000, 'Weasley': 170000, 'and': 340000, 
 'Hermine': 170000, 'Granger': 170000, 'they': 170000, 'became': 170000, 
 'Potters': 170000, 'best': 170000, 'friends': 170000, '.': 170000}
1.5625

Thus, using the MapReduce approach, we were able to more than halve the query time for all Harry Potter books without any horizontal or vertical scaling. However, if the query time of about 1.5 seconds is still too long for us, we can simply split the word list into arbitrary chunks and run the mapper on different computers in parallel to speed up the process even further. Without splitting the task into a map and a reduce step, this would not be possible.
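To give an impression of what such a parallel run could look like on a single machine, here is a sketch using Python's multiprocessing module. It is an illustration only: the four worker processes are an arbitrary choice, it reuses the mapper and reducer defined above, and it should be run as a standalone .py script (inside a notebook the worker processes may not find the functions). Note that merging the partial results has to add up the counts from each chunk, so this step needs a slightly different merge function than the per-word reducer.

from functools import reduce
from multiprocessing import Pool

def count_chunk(chunk):
    # Each worker process applies the per-word mapper and reducer to its own chunk
    return reduce(reducer, map(mapper, chunk), {})

def merge_counts(total, partial):
    # Merging partial results must add up the counts, not just 1 per word
    for word, count in partial.items():
        total[word] = total.get(word, 0) + count
    return total

if __name__ == '__main__':
    harry_potter_words = ['Once', 'upon', 'a', 'time', 'Harry', 'Potter', 'met',
                          'Ron', 'Weasley', 'and', 'Hermine', 'Granger', 'and',
                          'they', 'became', 'Harry', 'Potters', 'best', 'friends',
                          '.']
    harry_potter_all_parts = harry_potter_words * 170000

    # Split the word list into four chunks, one per worker process
    n_workers = 4
    chunk_size = len(harry_potter_all_parts) // n_workers
    chunks = [harry_potter_all_parts[i:i + chunk_size]
              for i in range(0, len(harry_potter_all_parts), chunk_size)]

    # Step 1: run the map (and a local reduce) on every chunk in parallel
    with Pool(n_workers) as pool:
        partial_counts = pool.map(count_chunk, chunks)

    # Step 2: reduce the partial results into one overall dictionary
    print(reduce(merge_counts, partial_counts, {}))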

Disadvantages of MapReduce

Our example has shown impressively that we can use MapReduce to query large amounts of data faster and, at the same time, prepare the algorithm for horizontal scaling. However, MapReduce cannot always be used, and depending on the use case it also brings disadvantages:

  • Some queries cannot be brought into the MapReduce schema.
  • The map functions run isolated from each other. Thus, it is not possible for the processes to communicate with each other.
  • Distributed systems are much more difficult to supervise and control than a single computer. Therefore, it should be carefully considered whether a computing cluster is really needed. The Kubernetes software tool can be used to control a computer cluster.

This is what you should take with you

  • MapReduce is an algorithm that allows large data sets to be processed in parallel and quickly.
  • The MapReduce algorithm splits a large query into several small subtasks that can then be distributed and processed on different computers.
  • Not every application can be converted to the MapReduce scheme, so sometimes it is not even possible to use this algorithm.
  • The paper from Google that first introduced MapReduce can be found here.