What do we use MapReduce for?
MapReduce was originally developed by Google to efficiently process the enormous amounts of data behind its search engine, for example when building the search index. However, the algorithm only became really famous through its use within the Hadoop framework, which stores large amounts of data in the Hadoop Distributed File System (HDFS for short) and uses MapReduce for queries or aggregations of information in the range of terabytes or petabytes.
Suppose we have stored all volumes of the Harry Potter novels in Hadoop as PDFs and now want to count how often each individual word appears in the books. This is a classic task where the division into a map function and a reduce function can help us.
How was it done before?
Before it was possible to split such complex queries across a whole computer cluster and compute them in parallel, one was forced to run through the complete data set sequentially. Naturally, the larger the data set, the longer the query time.
Suppose we already have the Harry Potter texts word for word in a Python list:
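The original notebook is not embedded here, so as a stand-in we can build such a list from a hypothetical example sentence (20 tokens, with the period counted as its own token, matching the arithmetic used below):

```python
# Hypothetical stand-in sentence; the period counts as its own token
text = ("Harry Potter and the Philosopher's Stone is a fantasy novel "
        "written by British author J. K. Rowling in 1997 .")
words = text.split()
print(len(words))  # 20
```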
We can count the words that occur by looping through this list with a for loop and loading each word into the `Counter` from Python's `collections` module. This class then does the word counting for us and outputs the ten most frequent words. Using Python's `time` module, we can measure how long it takes our computer to execute this function.
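A minimal sketch of this counting function might look as follows (the function name and the small stand-in word list are illustrative, not taken from the original notebook):

```python
from collections import Counter
import time

# Small stand-in word list; in the article the full novel text is used
words = ["Harry", "Potter", "and", "the", "Harry", "the", "the"]

def count_words(word_list):
    counter = Counter()
    for word in word_list:          # load each word into the Counter
        counter[word] += 1
    return counter.most_common(10)  # the ten most frequent words

start = time.time()
top_ten = count_words(words)
print(top_ten)
print(f"took {time.time() - start:.4f} seconds")
```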
According to the website wordcounter.io, the first Harry Potter part has a total of 76,944 words. Since our example sentence has only 20 words (including period), this means that we have to repeat this example sentence about 3,850 times (76,944 / 20 ~ 3,847) to get a list with as many words as Harry Potter’s first part:
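Building that list is a single multiplication on the stand-in sentence (again a hypothetical example, assumed to have exactly 20 tokens):

```python
sentence = ("Harry Potter and the Philosopher's Stone is a fantasy novel "
            "written by British author J. K. Rowling in 1997 .").split()

# Repeat the 20-token sentence 3,847 times -> roughly the length of book one
words = sentence * 3847
print(len(words))  # 76,940
```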
Our function needs 64 milliseconds to run through all the words of the first part and count how often they occur. If we perform the same query for all Harry Potter parts with a total of 3,397,170 words (source: wordcounter.io), it takes a total of 2.4 seconds.
This query takes a comparatively long time and naturally becomes longer and longer for larger data sets. The only way to speed up the execution of the function is to equip a computer with a more powerful processor (CPU), i.e. to improve its hardware. When one tries to speed up the execution of an algorithm by improving the hardware of the device, this is called vertical scaling.
How does the MapReduce algorithm work?
With the help of MapReduce, it is possible to significantly speed up such a query by splitting the task into smaller subtasks. This in turn has the advantage that the subtasks can be divided among and executed by many different computers. This means that we do not have to improve the hardware of a single device, but can use many, comparatively less powerful, computers and still reduce the query time. Such an approach is called horizontal scaling.
Let’s get back to our example: Up to now, we had figuratively proceeded in such a way that we read all the Harry Potter books and simply extended the tally sheet by one tally mark after each word we read. The problem with this is that the approach cannot be parallelized. Assuming a second person wants to assist us, she cannot do so, because she needs the tally sheet we are currently working with in order to continue. As long as she does not have it, she cannot help.
However, she can support us by already starting with the second part of the Harry Potter series and creating a separate tally list just for the second book. Finally, we can then merge all the individual tally sheets and, for example, add up the frequency of the word “Harry” on all the tally sheets.
This also makes it relatively easy to scale the task horizontally by having one person work on each Harry Potter book. If we want to work even faster, we can also involve several people and let each person work on a single chapter. In the end, we only have to combine all the results of the individual persons to arrive at an overall result.
How does MapReduce work in Python?
In total, we need two functions: a mapper and a reducer. We define the mapper so that it returns a dictionary for each word it receives, with the word as key and the value 1:
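A sketch of such a mapper, following the description above:

```python
def mapper(word):
    # One-entry tally sheet: "this word occurred exactly once"
    return {word: 1}

print(mapper("Harry"))  # {'Harry': 1}
```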
Analogous to our example, the mapper returns a tally list that says: “The word that was passed to me occurs exactly once”. In the second step, the reducer then takes all the individual tally sheets and merges them into one large overall tally sheet. It distinguishes two cases: If the word that is passed to it already occurs in its large tally list, then it simply adds a dash in the corresponding line. If the new word does not yet appear in its list, the reducer simply adds a new line with the word to the large tally list.
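The two cases described above translate directly into a reducer sketch (the demo tally sheets are illustrative):

```python
def reducer(total, partial):
    # Merge one small tally sheet (a mapper output) into the overall tally
    for word, count in partial.items():
        if word in total:
            total[word] += count   # word already listed: add the tally marks
        else:
            total[word] = count    # unknown word: add a new line to the list
    return total

overall = {}
for sheet in [{"Harry": 1}, {"Potter": 1}, {"Harry": 1}]:
    overall = reducer(overall, sheet)
print(overall)  # {'Harry': 2, 'Potter': 1}
```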
If we merge these two subtasks, the same query as before takes only 1.4 seconds:
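Combining the two with Python's built-in `map` and `functools.reduce` could look like this sketch (the word list is again a small stand-in, and the timing on a small list will of course differ from the figures quoted above):

```python
from functools import reduce
import time

# Small stand-in word list; in the article the full novels are used
words = ["Harry", "Potter", "Harry", "the", "the", "the"] * 1000

def mapper(word):
    return {word: 1}

def reducer(total, partial):
    total_count = total.get  # merge one tally sheet into the overall tally
    for word, count in partial.items():
        total[word] = total_count(word, 0) + count
    return total

start = time.time()
counts = reduce(reducer, map(mapper, words), {})
print(counts["the"])  # 3000
print(f"took {time.time() - start:.4f} seconds")
```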
Thus, using the MapReduce algorithm, we were able to more than halve the query time for all Harry Potter books without doing any horizontal or vertical scaling. However, if the query time of 1.4 seconds is still too long for us, we could simply split the word list into arbitrary chunks and run the mapper on different computers in parallel to speed up the process further. Without the MapReduce algorithm, this would not be possible.
What are the Disadvantages of MapReduce?
Our example has impressively shown that we can use MapReduce to query large amounts of data faster and at the same time prepare the algorithm for horizontal scaling. However, MapReduce cannot always be applied, and depending on the use case it also brings disadvantages:
- Some queries cannot be brought into the MapReduce schema.
- The map functions run isolated from each other. Thus, it is not possible for the processes to communicate with each other.
- Distributed systems are much more difficult to supervise and control than a single computer. Therefore, it should be carefully considered whether a computing cluster is really needed. Software tools such as Kubernetes can be used to manage a computer cluster.
This is what you should take with you
- MapReduce is an algorithm that allows large data sets to be processed in parallel and quickly.
- The MapReduce algorithm splits a large query into several small subtasks that can then be distributed and processed on different computers.
- Not every application can be converted to the MapReduce scheme, so sometimes it is not even possible to use this algorithm.
Thanks to Deepnote for sponsoring this article! Deepnote offers me the possibility to embed Python code easily and quickly on this website and also to host the related notebooks in the cloud.
Other Articles on the Topic of MapReduce
- The paper from Google that first introduced MapReduce can be found here.