Term Frequency Inverse Document Frequency

[Link: TF-IDF example page]

I recently found myself at a gathering where a presentation was followed by a question and answer session. The presenter explained that recent advances in technology had allowed questions to be submitted online, which led to many more questions than they normally receive. He then stated that rather than attempt to answer each question individually, the questions would be grouped together to cover as many areas as possible. While the presentation and the Q & A session were informative in themselves, I was probably most excited when I heard the speaker present this problem. I immediately began analyzing it, developed a strategy, and decided to speak with the presenter afterwards.

I like to think of three trains of thought when it comes to computers. There are those who watch science fiction movies and believe that computers can do literally anything; there are those who haven’t seen computers tackle complex tasks, so they think of computers as simple “adding machines”; and then there are those of us who work with computers on a day-to-day basis. There is an extremely long list of things that computers can do – a list which far outnumbers the things that people in the second category try to limit computers to. However, the fact that computers have not reached an “I, Robot” level of intelligence does not make them useless.

As a case in point, one of the more popular and thus more important areas of data mining is text analytics. This area combines the fields of computer science and natural language processing to build tools that extract meaning from text documents. Several algorithms have emerged in this field, and the one I chose to write about today is the Term Frequency Inverse Document Frequency Matrix. This is the approach I suggested to help with the question and answer session I mentioned earlier.

The Term Frequency Inverse Document Frequency (TF-IDF) Matrix takes as input a list of documents, each of which consists of words (or terms). Two important quantities are computed for each term: the “term frequency” and the “inverse document frequency”. There are several ways to compute each of these, but the most basic is to let the term frequency tf(w, d) be the number of times the word w appears in the document d. The inverse document frequency idf(w, D), in turn, measures how rare the word w is across the collection of documents D; a common choice is idf(w, D) = log(|D| / number of documents in D containing w), so that words appearing in nearly every document receive a small weight. The TF-IDF matrix itself has a column for each word and a row for each document, and the value in each cell is the product tf(w, d) * idf(w, D).
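To make this construction concrete, here is a minimal Python sketch of the idea, using raw counts for tf, the log-ratio idf above, and simple whitespace tokenization (the tokenization is just a simplifying choice for illustration, not necessarily what my script does):

```python
import math
from collections import Counter

def tfidf_matrix(documents):
    """Build a TF-IDF matrix: one row per document, one column per term."""
    # Simple whitespace tokenization; a fuller script would also handle punctuation, etc.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted(set(word for doc in tokenized for word in doc))

    n_docs = len(tokenized)
    # idf(w, D) = log(|D| / number of documents containing w)
    idf = {w: math.log(n_docs / sum(1 for doc in tokenized if w in doc))
           for w in vocabulary}

    matrix = []
    for doc in tokenized:
        counts = Counter(doc)                      # tf(w, d): raw count of w in d
        matrix.append([counts[w] * idf[w] for w in vocabulary])
    return vocabulary, matrix
```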

Once the TF-IDF matrix has been computed, we can use a metric like cosine similarity to determine how similar two documents are. When every pair of documents is compared using the cosine similarity metric, we obtain a similarity matrix that shows how the documents relate to one another. This matrix can then be clustered via k-means or hierarchical clustering to group similar documents together.
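Cosine similarity itself is easy to compute from two rows of the matrix; here is a small sketch:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (e.g. rows of the TF-IDF matrix)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0          # a document with no weighted terms is treated as dissimilar
    return dot / (norm_a * norm_b)
```

Comparing every pair of rows with this function fills in the similarity matrix, which is what gets handed to the clustering step.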

I’ve written a script that takes a set of quotations, randomly displays a subset of them, and uses the TF-IDF matrix to determine how similar these quotations are. Let me know what you think.

Covariance of Vectors

[Image: Covariance]

Most of the things we think about can be described in many different ways. Take for example a movie. I could describe a movie by its genre, its length, the number of people in the movie, the number of award winners, the length of the explosions, the number of fight scenes, the number of scenes, the rating it was given by a certain critic, and so on. The list goes on and on. How much do these things influence one another? How likely is a person to enjoy a movie, and is that related to the number of award winners in the movie? Answering this type of question can often help us understand things like what might influence a critic’s rating or, more importantly, which movies are worth my $15 ticket price.

Movies are just one example of this. Other areas like sports, traffic congestion, or food and a number of others can be analyzed in a similar manner. With data becoming available at unprecedented rates and areas like cloud computing and data science becoming key buzzwords in industry, the ability to understand these relationships is becoming more and more important.

As a mathematician, I enjoy being able to say with certainty that some known truth is the cause of some other known truth, but it is not always easy (or even possible) to prove the existence of such a relationship. We are left instead with looking at trends in data to see how similar things are to one another over a data set. Measuring the covariance of two or more vectors is one such way of seeking this similarity.

Before delving into covariance though, I want to give a refresher on some other data measurements that are important to understanding covariance.
– Sum of a vector:
If we are given a vector of finite length we can determine its sum by adding together all the elements in this vector. For example, consider the vector v = (1, 4, -3, 22). Then sum(v) = 1 + 4 + -3 + 22 = 24.

– Length of a vector:
If we are given a vector of finite length, we call the number of elements in the vector the length of the vector. So for the example above with the vector v = (1, 4, -3, 22), there are four elements in this vector, so length(v) = 4.

– Mean of a vector:
The mean of a finite vector is determined by calculating the sum and dividing this sum by the length of the vector. So, working with the vector above, we already calculated the sum as 24 and the length as 4, which we can use to calculate the mean as the sum divided by the length, or 24 / 4 = 6.

– Variance of a vector:
Once we know the mean of a vector, we are also interested in how the values of this vector are distributed around that mean. The variance measures this by calculating the average squared deviation from the mean. Here the squared deviation from the mean for the i-th element of the vector v is (v_i - mean(v))^2, and the variance is the average of these values.

So if the vector v has n elements, then the variance of v can be calculated as Var(v) = (1/n) * Σ_{i=1 to n} (v_i - mean(v))^2.

Once again dealing with the vector above with v = (1, 4, -3, 22), where the mean is 6, we can calculate the variance as follows:

v_i     v_i - mean(v)     (v_i - mean(v))^2
1       -5                25
4       -2                4
-3      -9                81
22      16                256

To calculate the mean of this new vector (25, 4, 81, 256), we first calculate its sum as 25 + 4 + 81 + 256 = 366. Since the length of the new vector is the same as the length of the original vector, 4, the mean is 366 / 4 = 91.5, and this is the variance of v.
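For readers who prefer code, here is a small Python sketch that reproduces the sum, length, mean and variance calculations for this vector:

```python
def mean(v):
    # mean = sum / length
    return sum(v) / len(v)

def variance(v):
    # average squared deviation from the mean
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / len(v)

v = [1, 4, -3, 22]
print(sum(v), len(v), mean(v), variance(v))   # 24 4 6.0 91.5
```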

The covariance of two vectors is very similar to this last concept. Instead of looking at how a single vector is distributed around its own mean, as is the case with variance, covariance looks at how two vectors X and Y of the same size vary together around their respective means. What covariance lets us determine is things like how likely a change in one vector is to imply a change in the other. A positive covariance means that as the value of X increases, so does the value of Y. A negative covariance says that as the value of X increases, the value of Y decreases. A covariance of zero means that a change in the vector X is not likely to affect the vector Y.

With that being said, here is the procedure for calculating the covariance of two vectors. Notice that it is very similar to the procedure for calculating the variance of a single vector described above. As I describe the procedure, I will also demonstrate each step with a second vector, x = (11, 9, 24, 4).

1. Calculate the means of the vectors.
As we’ve seen above, the mean of v is 6.
We can similarly calculate the mean of x as (11 + 9 + 24 + 4) / 4 = 48 / 4 = 12.

2. Subtract the mean of each vector from each of its elements, forming the terms (X_i - mean(X)) and (Y_i - mean(Y)).

We did this for v above when we calculated the variance. Below are the values for v and for x as well.

i     v_i     v_i - mean(v)     x_i     x_i - mean(x)
1     1       -5                11      -1
2     4       -2                9       -3
3     -3      -9                24      12
4     22      16                4       -8

3. For each element i, multiply the terms (X_i - mean(X)) and (Y_i - mean(Y)).

This gives us the following vector in our example:
(-5)(-1), (-2)(-3), (-9)(12), (16)(-8) = (5, 6, -108, -128).

4. Sum the elements obtained in step 3 and divide this number by the total number of elements in the vector X (which is equal to the number of elements in the vector Y).

When we sum the vector from step 3, we wind up with 5 + 6 + -108 + -128 = -225
And the result of dividing -225 by 4 gives us -225 / 4 = -56.25.

This final number, which for our example is -56.25, is the covariance.
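Here is a short Python sketch of these four steps, checked against the example above:

```python
def covariance(x, y):
    """Population covariance of two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    # Step 1: the means of the two vectors
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Steps 2-4: average of the products of the mean-adjusted elements
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / len(x)

print(covariance([1, 4, -3, 22], [11, 9, 24, 4]))   # -56.25
```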

Some important things to note are

  • If the covariance of two vectors is positive, then as one variable increases, so does the other.
  • If the covariance of two vectors is negative, then as one variable increases, the other decreases.
  • If the covariance of two vectors is 0, then one variable increasing (decreasing) does not impact the other.
  • The larger the absolute value of the covariance, the more often the two vectors take large steps at the same time.
  • A low covariance does not necessarily mean that the two variables are independent. I’ll give a quick example to illustrate that.
    Consider the vectors x and y given by x = (-3, -2, -1, 0, 1, 2, 3) and y = (9, 4, 1, 0, 1, 4, 9).
    The mean of x is 0, while the mean of y is 28 / 7 = 4.
    The mean-adjusted values are (-3, -2, -1, 0, 1, 2, 3) and (5, 0, -3, -4, -3, 0, 5).
    The element-wise products of these mean-adjusted values are (-15, 0, 3, 0, -3, 0, 15).
    If we sum this last vector, we get 0, which after dividing by 7 still gives a value of 0.
    So the covariance of these two vectors is 0.

    We can easily see that for each value x_i in x, the corresponding y_i is equal to x_i^2, so the two vectors are clearly dependent even though their covariance is 0.

I have written a script to help understand the calculation of the covariance of two vectors.

The Gram-Schmidt Process and Orthogonal Vectors

Suppose I gave you some red finger paint and asked you to make all the colors you could from this paint. You’d probably come up with a diverse collection of pinks, reds and burgundies – going through the range of reds – but you would be unable to produce a color that does not depend solely on red, like purple.

[Image: Shades of Red]

If I were to ask you to produce purple, your reply may be something like “well, give me blue and I’ll be able to color in purple”. When we think in terms of colors, it is easy to understand the concept of the span of a set of colors. In this context, span refers to the set of colors we can create from our original colors.

Now, let’s think in terms of numbers (actually vectors) instead of colors. If I gave you a task similar to the one above, but instead of the color red I gave you the vector (1, 0, 0) and told you to see what other vectors you could get from it (by scalar multiplication and by adding any two vectors already produced), then you would probably come back and show me how you could use the vector (1, 0, 0) to produce the entire real number line in that first coordinate, while leaving the other coordinates at 0.

If (similar to the color example) I asked you to use the vector (1, 0, 0) to produce the vector (1, 1, 0), then you may reply with a similar statement as above: “you give me the vector (0, 1, 0) and I’ll produce (1, 1, 0)”. That’s because the vectors (1, 0, 0) and (1, 1, 0) are linearly independent. This is a mathematical way of saying what we’ve already stated, that you cannot get one vector as a scalar multiple of the other vector for any real number scalar.

If two vectors are linearly independent then neither one belongs to the span of the other. Just as you cannot create blue from red, you cannot create red from blue. So this means that if I were to give you the colors red and blue, then the set of colors that you can create has increased from what you could create from either only red or only blue. Similarly, the vector set {(1, 0, 0), (0, 1, 0)} spans more vectors than just {(1, 0, 0)} or {(0, 1, 0)}.

Consider then the following two sets of vectors: {(1, 0, 0), (0, 1, 0)} and {(1, 0, 0), (1, 1, 0)}. Notice that (1, 1, 0) ∈ span({(1, 0, 0), (0, 1, 0)}) because (1, 1, 0) = (1, 0, 0) + (0, 1, 0). Likewise (0, 1, 0) ∈ span({(1, 0, 0), (1, 1, 0)}) because (0, 1, 0) = (1, 1, 0) – (1, 0, 0). So we can see that these two sets span the same set of vectors. But which is the better set to use?

Let’s go back to colors and think of the set {red, purple}. Here we can think of purple as a simple combination of red and blue (i.e. purple = red + blue). What happens if we mix red and purple? If we think of it in terms of axes, the fact that purple contains red means that as we walk along the purple axis, we are walking along the red axis as well. If we consider the set {red, blue}, we see that this does not happen. As we change the amount of red in our color, the amount of blue is unaffected. Likewise, if we change the amount of blue, the amount of red is not affected. If we were to draw these axes, we could see that this happens because the colors red and blue are perpendicular (or orthogonal) to one another, while red and purple are not.

[Image: Color Axis Example]

Replacing these colors with vectors again, we get the same thing. We have the option of using the vector set {(1, 0, 0), (1, 1, 0)} or {(1, 0, 0), (0, 1, 0)} as our axes, and it is better to use the second set because the two vectors are not just independent of one another, but are also at a 90° angle, i.e. orthogonal to one another.

When given a set of vectors (or a set of colors), an important problem is to determine the span of those vectors and to be able to produce an orthogonal set of vectors that spans that same space. The Gram-Schmidt process provides a procedure for producing these orthogonal vectors.

The way the procedure works is to build an orthogonal set of vectors from the original set, one vector at a time, by subtracting from the current vector its projection onto each of the previous vectors in the orthogonal set. The projection of v onto u is defined as proj_u(v) = (⟨u, v⟩ / ⟨u, u⟩) u, where ⟨·, ·⟩ denotes the dot product. The formula for the i-th vector of the Gram-Schmidt process is

u_i = v_i - Σ_{j=1 to i-1} proj_{u_j}(v_i).
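Here is a minimal Python sketch of this procedure (assuming the input vectors are linearly independent):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt(vectors):
    """Produce an orthogonal set of vectors spanning the same space as the input."""
    orthogonal = []
    for v in vectors:
        u = list(v)
        # Subtract the projection of v onto each previously produced vector.
        for q in orthogonal:
            coeff = dot(q, v) / dot(q, q)
            u = [ui - coeff * qi for ui, qi in zip(u, q)]
        orthogonal.append(u)
    return orthogonal

print(gram_schmidt([[1, 0, 0], [1, 1, 0]]))   # [[1, 0, 0], [0.0, 1.0, 0.0]]
```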

Here is a script I’ve written to help with this process.

Permutation Problems

I love to play with puzzles. When I was in grade school I would spend hours at a time figuring out ways to solve puzzles like Tetris, Minesweeper, Solitaire, and FreeCell. Later I was introduced to puzzles involving numbers, like Sudoku and Nonograms.

These puzzles are often interesting in part because there is generally a very large number of ways that things can be arranged, but only a few of those arrangements are correct. Generally a person solving a puzzle will figure out certain things that must be true or cannot be true, which helps in solving the puzzle and reducing the number of possible cases. Initially, though, we are often left with a situation where we have a new puzzle and our only method is to keep trying every possible solution until we start to notice a pattern (or reach a solution).

For example, consider a Sudoku puzzle. We are given a partially filled-in grid and our job is to fill in the remaining cells subject to the rule that every row, column and marked subgrid must contain the numbers 1 – 9 exactly once. One initial attempt at solving such a puzzle could be to permute the string 1, 2, 3, 4, 5, 6, 7, 8, 9 until we find a permutation that fits the first row, then do the same with the second row, and so on and so forth.

One immediate question is how many ways are there to permute the numbers 1, …, 9? We can answer this by realizing that each permutation is a new string. For each string that we construct, we have 9 choices for the first element of the string. Once that element has been chosen, it cannot appear anywhere else, so there are only 8 possible choices for the second position. Continuing this process, we see that the number of possible permutations we can construct from the string 1, …, 9 is

9 * 8 * 7 * 6 * 5 * 4 * 3 * 2 * 1 = 362880

This is a large number of possible strings to generate just to get one row of a Sudoku, so hopefully you’ll be able to notice the pattern before going through this whole set (because once you’ve generated the first row you still have to do the other 8 rows).

Nonetheless, because there is often great value to be gained from knowing how to run through all possible permutations, I have written three functions that help with this process: Next_Permutation, Previous_Permutation, and Random_Permutation.

Before I give these algorithms, I want to highlight two notions of ordering strings. A string (a_1, a_2, …, a_n) is said to be in lexicographic order (or alphabetical order) if for each i ∈ {1, …, n-1}, a_i ≤ a_{i+1}. Likewise, a string is said to be in reverse lexicographic order if for each i ∈ {1, …, n-1}, a_i ≥ a_{i+1}.

Next Permutation

If the given string is not in reverse lexicographic order, then there exist two indices j1 and j2 such that j1 < j2 and a_{j1} < a_{j2}.
1. The Next_Permutation algorithm first searches for the largest index j1 such that a_{j1} < a_{j1+1}. Since we said the string is not in reverse lexicographic order, this j1 must exist.
2. Once this j1 is found, we search for the smallest element a_{j2} with j2 > j1 such that a_{j2} > a_{j1}. Again, since we know this holds for j1 + 1, such a j2 must exist.
3. We swap the elements a_{j1} and a_{j2}.
4. The elements after position j1 are then placed in lexicographic order (i.e. in order such that a_i ≤ a_{i+1}).

If the given string is in reverse lexicographic order, then we simply reverse the string.
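Here is a Python sketch of these steps. Because the suffix after position j1 is in reverse lexicographic order, taking the rightmost element of the suffix that exceeds a_{j1} gives exactly the smallest such element described in step 2.

```python
def next_permutation(a):
    """Return the permutation that follows a in lexicographic order."""
    a = list(a)
    # Step 1: largest index j1 with a[j1] < a[j1 + 1].
    j1 = next((i for i in range(len(a) - 2, -1, -1) if a[i] < a[i + 1]), None)
    if j1 is None:
        return a[::-1]          # already in reverse lexicographic order: wrap around
    # Step 2: rightmost index j2 > j1 with a[j2] > a[j1].
    j2 = max(i for i in range(j1 + 1, len(a)) if a[i] > a[j1])
    # Step 3: swap the two elements.
    a[j1], a[j2] = a[j2], a[j1]
    # Step 4: put the elements after position j1 into lexicographic order.
    a[j1 + 1:] = reversed(a[j1 + 1:])
    return a

print(next_permutation([1, 2, 3]))   # [1, 3, 2]
print(next_permutation([3, 2, 1]))   # [1, 2, 3]
```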

Previous Permutation

If the given string is not in lexicographic order, then there exist two indices j1 and j2 such that j1 < j2 and a_{j1} > a_{j2}.
1. The Previous_Permutation algorithm first searches for the largest index j1 such that a_{j1} > a_{j1+1}. Since we said the string is not in lexicographic order, this j1 must exist.
2. Once this j1 is found, we search for the largest element a_{j2} with j2 > j1 such that a_{j2} < a_{j1}. Again, since we know this holds for j1 + 1, such a j2 must exist.
3. We swap the elements a_{j1} and a_{j2}.
4. The elements after position j1 are then placed in reverse lexicographic order (i.e. in order such that a_i ≥ a_{i+1}).

If the given string is in lexicographic order, then we simply reverse the string.
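And a mirrored sketch for stepping backwards:

```python
def previous_permutation(a):
    """Return the permutation that precedes a in lexicographic order."""
    a = list(a)
    # Step 1: largest index j1 with a[j1] > a[j1 + 1].
    j1 = next((i for i in range(len(a) - 2, -1, -1) if a[i] > a[i + 1]), None)
    if j1 is None:
        return a[::-1]          # already in lexicographic order: wrap around
    # Step 2: rightmost index j2 > j1 with a[j2] < a[j1].
    j2 = max(i for i in range(j1 + 1, len(a)) if a[i] < a[j1])
    # Steps 3 and 4: swap, then put the suffix into reverse lexicographic order.
    a[j1], a[j2] = a[j2], a[j1]
    a[j1 + 1:] = reversed(a[j1 + 1:])
    return a

print(previous_permutation([1, 3, 2]))   # [1, 2, 3]
```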

Random Permutation

This generates a random string permutation.
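A standard way to do this is the Fisher–Yates shuffle; here is a minimal sketch of that approach:

```python
import random

def random_permutation(a):
    """Return a uniformly random permutation of a (Fisher-Yates shuffle)."""
    a = list(a)
    for i in range(len(a) - 1, 0, -1):
        j = random.randint(0, i)      # pick a random position from the unshuffled prefix
        a[i], a[j] = a[j], a[i]
    return a

print(random_permutation(["red", "green", "blue", "purple"]))
```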

This script can be seen here, set to work on permutations of colors.

My Review of “The Golden Ticket: P, NP, and the Search for the Impossible”

[Image: The Golden Ticket]

I came home from work on Wednesday a bit too tired to go for a run and a bit too energetic to sit and watch TV. So I decided to pace around my place while reading a good book. The question was whether I had a good book to read. I had been reading sci-fi books earlier this month and wanted a break from that, so I looked in my mailbox and noticed that I had just received my copy of “The Golden Ticket: P, NP, and the Search for the Impossible“. At the time, I was of the mindset that I had just gotten off of work and really didn’t want to be reading a textbook as if I were still at work. But I decided to give it a try and at least make it through the first few pages; if it got to be overwhelming, I’d just put it down and do something else.

About three hours later I was finishing the final pages of the book, impressed that the author (Lance Fortnow) was able to treat complexity theory the same way I see physics professors speak about quantum physics and the expanding universe on shows like “The Universe” and “Through the Wormhole with Morgan Freeman”, where complex topics are discussed in everyday terminology. It wouldn’t surprise me to see Dr. Fortnow on shows like “The Colbert Report” or “The Daily Show” introducing the topics in this book to a wider audience.

Below is the review I left on Amazon.

I really enjoyed this book. It was a light enough read to finish in one sitting on a weeknight within a few hours, but it also showed its importance by connecting the dots between the P = NP problem and issues in health care, economics, security, scheduling and a number of other areas. And instead of talking in a "professor-like" tone, the author creates illustrative examples in Chapters 2 and 3 that are easy to grasp. These examples form the basis for many of the problems addressed in the book.

This is a book that needed to be written and needs to be on everyone's bookshelf, particularly for those asking questions like "what is mathematics" or "what is mathematics used for". This book answers those questions, and towards the end gives examples (in plain English) of the different branches of mathematics and theoretical computer science, without making it read like a textbook.

Also, here is a link to the blog that Lance Fortnow and William Gasarch run called “Computational Complexity”, and here is a link to the website of the book, “The Golden Ticket: P, NP, and the Search for the Impossible”