I recently found myself at a gathering where there was a presentation followed by a question and answer session. The presenter spoke about how recent advances in technology had allowed questions to be submitted online, which led to many more questions than they normally receive. He then stated that, rather than attempting to answer each question individually, the questions would be grouped together to cover as many areas as possible. I must say that while the presentation and the Q & A session were informative in themselves, I was probably most excited when I heard the speaker present this problem. I immediately began analyzing the problem and developing a strategy, and decided to speak with the presenter afterwards.
I like to think of three schools of thought when it comes to computers. There are those who watch science fiction movies and believe that computers can do literally anything; there are those who haven’t seen computers tackle complex tasks, so they think of computers as simple “adding machines”; and then there are those of us who work with computers on a day-to-day basis. The list of things that computers can do is extremely long, far longer than what people in the second category give them credit for. The fact that computers have not reached an “I, Robot” level of intelligence does not make them useless.
As a case in point, one of the more popular and thus more important areas of data mining is text analytics. This area combines computer science and natural language processing to build tools that extract meaning from text documents. Several algorithms have emerged in this field, and the one I choose to write about today is the Term Frequency Inverse Document Frequency Matrix. This is the approach I suggested to help with the question and answer session I mentioned earlier.
The Term Frequency Inverse Document Frequency (TF-IDF) Matrix takes as input a list of documents. Each of these documents consists of words (or terms). Two important things are computed for each term: the “term frequency” and the “inverse document frequency”. There are several ways to compute each of these, but the most basic way is to let the term frequency of a given word and document, tf(w, d), be the number of times the word w appears in the document d. The inverse document frequency idf(w, D) measures how rare the word w is across all the documents d in D. A common basic choice is the logarithm of the total number of documents divided by the number of documents that contain w, so a word that appears in every document gets a weight of zero. The calculation of the TF-IDF matrix consists of creating a column corresponding to each word and a row for each document. The tf and idf values are then computed for each cell, and the value in the cell is the product tf(w, d)*idf(w, D).
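To make the calculation concrete, here is a minimal sketch in Python. It is not the script mentioned at the end of this post. It assumes the basic definitions: tf(w, d) is a raw count and idf(w, D) is log(N / df(w)), where N is the number of documents and df(w) is the number of documents containing w. The function names, the simple whitespace tokenizer, and the three sample questions are all made up for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplifying assumption: lowercase, split on whitespace,
    # and strip common punctuation from the ends of each word.
    words = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [w for w in words if w]


def tf_idf_matrix(documents):
    # One row per document, one column per word in the vocabulary.
    tokenized = [tokenize(d) for d in documents]
    vocabulary = sorted({w for doc in tokenized for w in doc})
    n_docs = len(documents)

    # df[w] = number of documents that contain w at least once.
    df = Counter(w for doc in tokenized for w in set(doc))
    # Assumed idf variant: log(N / df(w)); other variants add smoothing.
    idf = {w: math.log(n_docs / df[w]) for w in vocabulary}

    matrix = []
    for doc in tokenized:
        counts = Counter(doc)  # tf(w, d) as a raw count
        matrix.append([counts[w] * idf[w] for w in vocabulary])
    return vocabulary, matrix


# Hypothetical questions, invented for this example.
questions = [
    "How will the new budget affect schools?",
    "Will the budget increase funding for schools?",
    "When does the new library open?",
]
vocabulary, matrix = tf_idf_matrix(questions)
```

With this idf variant, a word such as "the" that appears in every question ends up with a weight of zero, while a word that appears in only one question, such as "library", gets the largest weight.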
Once the TF-IDF matrix has been computed, we can use a metric like cosine similarity to determine how similar two documents are. When every pair of documents is compared using the cosine similarity metric, the result is a similarity matrix that shows how the documents relate to one another. This matrix can then be clustered via k-means clustering or hierarchical clustering to group the similar documents together.
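The pairwise comparison can be sketched as follows. Again, this is an illustration and not the script mentioned at the end of this post. Cosine similarity is the dot product of two rows divided by the product of their lengths. The function names and the small numeric rows are hypothetical stand-ins for rows of a TF-IDF matrix, and returning 0.0 for an all-zero row is a convention I chose for the sketch.

```python
import math


def cosine_similarity(u, v):
    # Dot product of the two vectors divided by the product of their norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumed convention: an all-zero row is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(rows):
    # Entry [i][j] is the cosine similarity between document i and document j.
    return [[cosine_similarity(u, v) for v in rows] for u in rows]


# Hypothetical TF-IDF rows: the first two point in the same direction,
# the third shares no terms with either of them.
rows = [
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],
    [0.0, 0.0, 3.0],
]
sims = similarity_matrix(rows)
```

Here the first two rows have a similarity of essentially 1 because they differ only in scale, and the third row has a similarity of 0 with both. A matrix like `sims` is what would be handed to a clustering step.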
I’ve written a script that takes a set of quotations, randomly displays a subset of them, and uses the TF-IDF matrix to determine how similar they are. Let me know what you think.