Tag Archives: reduce

A JavaScript Implentation of MapReduce’s WordCount

MapReduce WordCount

You can view a javascript implementation of the WordCound Program in Mapreduce at Javascript Implementation of Mapreduce WordCount

One of the big things in the world of Data Science and Cloud Computing is the map-reduce implementation of various algorithms. This is not always a straightforward procedure and so learning to think in terms of map-reduce implementations can be a challenging conversion from thinking in a functional programming frame of mind. In light of this I thought it would be convenient to try to help users “visualize” this concept. This is a challenging task because there are many concepts of cloud computing that I am unable to provide in this environment. However, just as many of the books on MapReduce provide pseudo-code on various implementations of algorithms in a Map-Reduce environment, I will attempt to show how data flows from the input to the mappers to the shuffle and sort phase to the reducers and finally to generate the output. I leave the users the task of actually putting these into the context of a Java MapReduce environment.

I want to first speak about the concept of (key, value) pairs which is a very important in MapReduce programming. I will speak about this in the context of a WordCount program. The purpose of a WordCount program is to count the number of occurrences of each word in a given file. First data is input to the mapper in (key, value) pairs. For our example, the key will be the line number of input (so each line of input will go to a different mapper) and the value will be the text present on that line. Once the mapper has the input, it will perform some operation on it and output data again in (key, value) pairs. In the WordCount example, the mappers will simply output each word that occurs as a new key on that line and the integer “1” as the associated value (note that a single mapper can output multiple (key, value) pairs).

One of the main things to understand in a MapReduce is that there are a number of Mappers running on a given input file and these Mappers cannot interact with one another. Suppose we have two different mappers, lets call them Mapper1 and Mapper2 that are each working on two different lines of input from a file. If both lines of input have the word “apple”, there is no way for Mapper2 to know that Mapper1‘s line of input also has this word. In this setting that’s perfectly fine because the Shuffle and Sort phase is where all the (key, value) pairs that were output by the mappers, compares the keys to one another and if they are equal to one another combines their respective values into a list of values. Unequal keys are sorted.

So if both Mapper1 and Mapper2 contained the word “apple” in their line of text, then the (key, value) pair (apple, 1) will occur twice. So the Shuffle and Sort phase will notice this and output the (key, value) pair (apple, {1, 1}).

Each reducer is then given a key and a list of values that were output by the mappers. The goal will be to perform some operation and again output data in (key, value) pairs. In the WordCount example, we will use what is known as the sumReducer. It gets this name because its job is simply to sum the values in the list of values and output the (key, value) pair that is the original key and this sum of values.

You can view a javascript implementation of this at Javascript Implementation of Mapreduce WordCount

What is a “Hard” Problem?

Throughout our lives, we are introduced to a wide variety of problems. Naturally we tend to think of some problems as more difficult than others. If you were doing the exercises at the end of a chapter in a book, the author generally tends to begin with those exercises that can be completed in a few minutes or directly from the material in the chapter. The exercises in the end tend to be more time consuming and possibly require outside resources. In this sense, these authors tend to compare the difficulty of problems by the amount of time they expect the average reader to take to solve them.

This is a popular way to look at difficulty – in terms of how difficult it is for not only a single person to solve a problem (say you or I), but how long it would take our peers as well as yourself to solve the problem. But how can we measure this? Should I assume that because a problem takes me two years to solve that it would take the average person the same amount of time? Or, if a problem has never been solved, is that because of the difficulty of the problem or could it be for another reason like because no-one had thought of the problem before?

In particular, these questions were being asked in the world of computer science in the 1960s and 1970s. It was around this time that Stephen Cook came up with the concept being able to compare how difficult one problem was in comparison to another. A problem A can be “reduced” to another problem B if a way to solve problem B quickly also gives way to a way to solve problem A quickly.

This concept of reducibility provided a foundation for this concept of difficulty as it was no longer enough to say “I think this problem is hard”, or “the average person would take just as long as me to solve this problem”. No, instead, Cook considered problems that many thought were difficult and showed that the satisfiability problem (the question of whether a given boolean formula contains an assignment to the variables that satisfies all the clauses) could be used to show that all quickly verifiable decision problems (also called problems in NP), where the instances with a “yes” answer have short proofs of the fact that the answer is indeed “yes” could be reduced to this problem.

This meant that the satisfiability problem was at least as hard as every one of these quickly verifiable decision problems, so if this problem could be solved quickly, then every one of these quickly verifiable decision problems could also be solved quickly. So the satisfiability problem is at least as hard as every problem in this class, NP. We call such problems that have this property NP-Hard. Decision Problems that are NP-Hard are called NP-Complete. Richard Karp later published a paper that proposed 21 NP-Complete problems, one of which was the Knapsack Problem I spoke about last time.

Garey Johnson NP-Complete Image

This concept of NP-Complete gives light to the image that became famous in the book by Garey and Johnson “Computers and Intractability: A Guide to the Theory of NP-Completeness”. It shows that the concept of difficulty still builds on the ability of our peers to solve a given problem. But by introducing a class of “hard” problems, and a litmus test for inclusion into that class, researchers could now consider new problems and determine their difficulty by comparing it to known hard problems.

(a quick note. I’ve used the term “quickly” here, but the formal phrase that I mean by that is that the problem is solvable by an algorithm whose worse case running time is bounded by a polynomial on the input parameters of the problem)

Other Blogs that have covered this topic:
Explorations
To Imaging, and Beyond!

The Euclidean Algorithm

I remember being in school and learning about fractions. In particular, I remember the problems we had when trying to add and subtract fractions. This problem also presented itself when we tried to multiply fractions, although we still received partial credit if we couldn’t reduce fractions to their lowest terms.

What I’m referring to is the problem of finding the Greatest Common Divisor (GCD) of two numbers. The GCD of two numbers is the largest number that divides into both numbers with a remainder of zero each time. This problem has many applications, but for most of us we can relate to it because of our frustrations with fractions.

The algorithm that I’m writing about and have written a script for is called the Euclidean Algorithm, which solves precisely this problem. To be more precise, the Euclidean Algorithm finds the greatest common divisor between two integer numbers.

There are different versions of the algorithm, but the one I have implemented finds the GCD by subtracting the smaller number from the larger number, and if the result is greater than 0, the procedure repeats itself with the two lower numbers. Otherwise, the result (the GCD) is the final number that was greater than 0.

Lets see an example:
Consider the number 9 and 21.
21 – 9 = 12. 12 is greater than 0, so we repeat the procedure with 9 and 12.
12 – 9 = 3. 3 is greater than 0, so we repeat the procedure with 3 and 9.
9 – 3 = 6. 6 is greater than 0, so we repeat the procedure with 3 and 6.
6 – 3 = 3. 3 is greater than 0, so we repeat the procedure with 3 and 3.
3 – 3 = 0. 0 is not greater than 0, so can exit the loop portion of the algorithm.
Since 3 was the last positive number that we arrived at in this procedure, we see that the GCD of 9 and 12 is 3.

To learn more and see more examples, check out My Script on The Euclidean Algorithm.