# Category Archives: Blog

These are my thoughts on various topics.

# Hierarchical Clustering

Hierarchical Clustering algorithms give a nice introduction for computer science students to unsupervised machine learning. I say this because the bottom-up approach to Hierarchical clustering (which I have implemented here) is very similar to Kruskal’s algorithm for finding the minimum spanning tree of a graph.

In Kruskal’s algorithm, we begin by creating a forest, or a set of trees where each node is its own tree. The algorithm then selects the two trees that are closest together (closest being defined as the minimum cost edge between two distinct trees) and merges those trees together. This process of merging the closest two trees is then repeated until there is only one tree remaining, which is a minimum spanning tree of the graph.

Similarly, bottom-up hierarchical clustering of a group of points begins by treating each point as its own cluster. The clusters are then compared with one another to decide whether two of them should be merged. Generally, there is a stopping criterion, a threshold $\delta$, saying that we do not want to merge two clusters if their distance is greater than $\delta$. So if the minimum distance between two clusters is less than $\delta$, we proceed as in Kruskal’s algorithm and merge these two clusters into one. We repeat this process of merging the closest two clusters until the minimum distance between two clusters is greater than or equal to $\delta$, at which point we stop and the result is a partition of our data set into distinct clusters.
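The merging loop described above can be sketched in a few lines of Python. This is only a minimal illustration, not the code behind my page: the function names are made up for this example, and I am assuming single-linkage distance (the smallest distance between any two points of two clusters), since the description above does not fix how the distance between clusters is measured.

```python
from itertools import combinations
from math import dist


def cluster_distance(c1, c2):
    """Single-linkage distance (an assumption): closest pair of points."""
    return min(dist(p, q) for p in c1 for q in c2)


def agglomerative(points, threshold):
    """Merge the two closest clusters until none are closer than threshold."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest distance between them.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        if cluster_distance(clusters[i], clusters[j]) >= threshold:
            break  # stopping criterion: the closest pair is too far apart
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
print(agglomerative(points, threshold=2))
```

Recomputing every pairwise distance on each pass is slow for large data sets, but it keeps the parallel with Kruskal’s algorithm easy to see.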

Hierarchical clustering is comparable to K-Means Clustering. Here are some differences between the two approaches:

1. K-Means Clustering requires an initial number of desired clusters, while Hierarchical clustering does not.
2. A run of K-Means Clustering will always give K clusters, whereas Hierarchical Clustering can give more or fewer, depending on the distance tolerance we choose.
3. K-Means can undo previous mistakes (assignments of an element to the wrong cluster), while Hierarchical Clustering cannot.

So, here is a link to my page on Hierarchical Clustering. Hope you enjoy.

# Introduction to JavaScript Programming

I received a lot of attention from friends interested in programming after my recent blog post entitled “Introduction to Python Programming”. While many found it interesting, the fact that Python is more useful to mathematicians hindered some of my friends’ desire to learn it as their first language.

In our conversations, my recommendation for a first language was JavaScript. This is a powerful language in the sense that just about anybody who is involved with the internet knows it, and it’s likely to boost a person’s resume. It also has many similarities to more powerful languages like C++ and Java, so while not trivial, it could be a good launch pad into more advanced languages. But my favorite reason is that, unlike many other programming languages that rely on an MS-DOS-like command-line approach for run time interaction, JavaScript’s basic interaction is with the standard internet browsers we use every day. There isn’t even anything you need to download or install. Just create a basic HTML file in a text editor (like notepad, wordpad, or notepad++). This makes it easier to show off your creations, which makes learning more fun.

The script I’ve finished provides examples on writing output, declaring variables, data types, conditionals, loops, and functions. Although I do not go into detail about all the events and objects on an HTML page, I do finish with three examples of more advanced JavaScript programs. Once you’ve selected a program, the code will be revealed in the text area. There is also a button that, when clicked, will execute that script in a new browser tab.

I hope you enjoy, and let me know if you have any suggestions or comments.

With that being said, here is a link to my sample JavaScript code.

# Simple Linear Regression

We live in a world that is filled with patterns – patterns all around us just waiting to be discovered. Some of these patterns are not as easily discovered because of the existence of outside noise.

Consider for example an experiment where a set of people were each given the task of drinking a number of beers and having their blood alcohol level taken afterwards. Some noise factors in this could include the height and weight of the individual, the types of drinks, the amount of food eaten, and the time between drinks. Even with this noise, though, we can still see a correlation between the number of drinks and their blood alcohol level. Consider the following graph showing people’s blood-alcohol level after a given number of drinks. The x-axis represents the number of drinks and the y-axis is the corresponding blood alcohol level.

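| x | 5 | 2 | 9 | 8 | 3 | 7 | 3 | 5 | 3 | 5 | 4 | 6 | 5 | 7 | 1 | 4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| y | 0.1 | 0.03 | 0.19 | 0.12 | 0.04 | 0.095 | 0.07 | 0.06 | 0.02 | 0.05 | 0.07 | 0.1 | 0.085 | 0.09 | 0.01 | 0.05 |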

We can definitely see a correlation, although the data doesn’t quite fit on a straight line. This leads us to ask further questions: can we use this data to build a model that estimates a person’s blood-alcohol level, and how strong is this model?

One of the tools we can use to model this problem is linear regression. A linear regression takes a two-dimensional data set and assumes that one column (generally represented by the x variable) is independent and that the second column (generally represented by the y variable) depends on the first. The relationship between the two columns is assumed to be linear, so it can be represented by the linear equation

$$y = \beta_0 + \beta_1 x + \epsilon.$$

The right hand side of the above equation has three terms. The first two ($\beta_0$ and $\beta_1$) are the parameters of the linear equation (the y-intercept and slope respectively), while the third term ($\epsilon$) is the error term. The error term represents the difference between this linear equation and the y values in the data provided. We are seeking a line that minimizes the error term. That is, we are seeking to minimize

$$D = \sum_{i=1}^{n} \left[y_i - (\beta_0 + \beta_1 x_i)\right]^2$$

There are several ways one could approach this problem. In fact, there are several lines that one could use to build a linear model. The first line that one may use to model these points is the one generated using only the mean of the y values of the points, called the horizontal line regression.

For the data set above, the mean of the y values can be calculated as $\bar{y} = 0.0738$, so we could build a linear model based on this mean that would be $y = 0.0738$. This horizontal line regression model is a horizontal line that predicts the same value (the mean), regardless of the x value. This lack of adjustment means it is generally a poor fit for most data sets. But as we will see later, this horizontal line regression model does serve a purpose in determining how well the model we develop performs.

A second attempt at solving this problem would be to generate the least squares line. This is the line that minimizes the D value listed above. We can see that D is a multi-variable polynomial, and we can find the minimum of such a polynomial using calculus, partial derivatives and Gaussian elimination (I will omit the work here because it distracts from the main point of this blog post; however, Steven J. Miller has a good write-up of this).

The calculus leads us to the following equations:

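$$S_{XY} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$$

$$S_{XX} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$$

$$\beta_1 = \frac{S_{XY}}{S_{XX}}$$

$$\beta_0 = \bar{y} - \beta_1 \bar{x}$$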

To calculate the least squares line for this example, we first need to calculate a few values:

$$\begin{aligned}
\sum_{i=1}^{n} x_i y_i &= 6.98 \\
\sum_{i=1}^{n} x_i^2 &= 443 \\
\sum_{i=1}^{n} x_i &= 77 \\
\sum_{i=1}^{n} y_i &= 1.18 \\
S_{XX} &= 72.44 \\
S_{XY} &= 1.30
\end{aligned}$$

This lets us evaluate that

$$\beta_1 = 0.018$$

and

$$\beta_0 = -0.0127$$

So the resulting linear equation for this data is

$$\hat{y} = -0.0127 + 0.018x$$
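To double-check the arithmetic, here is a small Python sketch that applies the slope and intercept formulas from the calculus to the drinks data in the table above. The function and variable names are my own choices for this example; rounded to the same number of digits, the result should agree with the slope and intercept quoted above.

```python
# Number of drinks (x) and blood alcohol level (y) from the table in this post.
xs = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]
ys = [0.1, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
      0.02, 0.05, 0.07, 0.1, 0.085, 0.09, 0.01, 0.05]


def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    slope = s_xy / s_xx
    intercept = sum_y / n - slope * sum_x / n  # y-bar minus slope times x-bar
    return intercept, slope


intercept, slope = least_squares(xs, ys)
print(f"y_hat = {intercept:.4f} + {slope:.4f} * x")
```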

Below is a graph of the two attempts at building a linear model for this data.

In the above image, the green line represents the horizontal line regression model and the blue line represents the least-squares line. As stated above, the horizontal line regression model is a horizontal line whose slope does not respond to the data. The least-squares line adjusts both its slope and y-intercept to better fit the data provided. The question becomes how well the least-squares line fits the data.

The Sum of Squares Error (SSE) sums the squared deviation of each data point from the least-squares line, where $\hat{y}_i$ is the value the line predicts at $x_i$.

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

A second metric that we are interested in is how well the horizontal line regression model estimates our data. This is called the Total Sum of Squares (SST), where $\bar{y}$ is the mean of the y values.

$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

The horizontal line regression model ignores the independent variable x from our data set. The least-squares line does take x into account and can never fit worse, because the horizontal line is itself one of the lines it is chosen from. Because of this, SST is a worst-case scenario of how poorly our model can perform.

Knowing now that SST is never smaller than SSE, the regression sum of squares (SSR) is the difference between the total sum of squares and the sum of squares error.

SSR = SST – SSE

This tells us how much of the total sum of squares is accounted for by the model.

Finally, the coefficient of determination ($r^2$) is defined by

$$r^2 = \frac{SSR}{SST}$$

This expresses SSR as a fraction of SST, that is, the proportion of the variation in the data that is explained by the model.
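The three sums of squares are easy to compute once the line is known. The sketch below is one way to do it in Python for the drinks data from this post; the function name is hypothetical, and the slope is computed in the centered form, which is algebraically the same as dividing the sum of cross products by the sum of squared x deviations.

```python
xs = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]
ys = [0.1, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
      0.02, 0.05, 0.07, 0.1, 0.085, 0.09, 0.01, 0.05]


def sums_of_squares(xs, ys):
    """Return (SSE, SST, SSR, r^2) for the least-squares line through the data."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    # SSE: squared deviations from the least-squares line.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # SST: squared deviations from the horizontal line at the mean.
    sst = sum((y - y_bar) ** 2 for y in ys)
    ssr = sst - sse
    return sse, sst, ssr, ssr / sst


sse, sst, ssr, r2 = sums_of_squares(xs, ys)
print(f"SSE = {sse:.5f}, SST = {sst:.5f}, SSR = {ssr:.5f}, r^2 = {r2:.3f}")
```

An $r^2$ close to 1 means the line explains most of the variation; a value close to 0 means it does little better than the horizontal line.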

So, check out my script on simple linear regression and let me know what you think.

# Probability: Sample Spaces

I’ve been doing a few games lately (can be seen here, here and here) and, while I think those are very good ways to become interested in some of the avenues of math research, I also have had a few people come to me with questions regarding help with their classes. So I decided to write a script to try to help understand some elementary probability theory, focusing on discrete sample spaces.

In statistics, any process of observation is referred to as an experiment. The set of all possible outcomes of an experiment is called the sample space and it is usually denoted by S. Each outcome in a sample space is called an element of the sample space. An event is a subset of the sample space, namely the set of outcomes for which the event occurs. Two events are said to be mutually exclusive if they have no elements in common.

Similar to set theory, we can form new events by performing operations like unions, intersections and complements on other events. If A and B are any two subsets of a sample space S, then their union A ∪ B is the subset of S that contains all the elements that are in A, in B, or in both; their intersection A ∩ B is the subset of S that contains all the elements that are in both A and B; the complement A’ of A is the subset of S that contains all the elements of S that are not in A.

A probability is a function that assigns real numbers to events of a sample space. The following are the axioms of probability that apply when the sample space is discrete (finite or countable).

Axiom 1: The probability of an event is a non-negative real number; that is P(A) ≥ 0 for any subset A of S.
Axiom 2: The probability of the entire sample space is 1; that is P(S) = 1.
Axiom 3: If $A_1, A_2, A_3, \ldots$ is a finite or infinite sequence of mutually exclusive events of S, then
$$P(A_1 \cup A_2 \cup A_3 \cup \cdots) = P(A_1) + P(A_2) + P(A_3) + \cdots$$

If A and B are any two events in a sample space S and P(A) ≠ 0, the conditional probability of B given A is

$$P(B \mid A) = \frac{P(A \cap B)}{P(A)}$$

Two events A and B are independent if and only if P(A ∩ B) = P(A) ∙ P(B).
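Here is a small Python sketch of these definitions on a concrete discrete sample space: two rolls of a die. It is an illustration only. I am assuming every outcome is equally likely, the event names A, B and C are made up for the example, and independence is tested with the product rule P(A ∩ B) = P(A) ∙ P(B).

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered pairs of faces from two die rolls (assumed equally likely).
S = set(product(range(1, 7), repeat=2))


def prob(event):
    """Probability of an event (a subset of S) under equally likely outcomes."""
    return Fraction(len(event & S), len(S))


def cond_prob(b, a):
    """P(B | A) = P(A and B) / P(A); requires P(A) != 0."""
    return prob(a & b) / prob(a)


def independent(a, b):
    """Product rule: P(A and B) == P(A) * P(B)."""
    return prob(a & b) == prob(a) * prob(b)


A = {s for s in S if s[0] % 2 == 0}  # first roll is even
B = {s for s in S if sum(s) == 7}    # the two rolls total 7
C = {s for s in S if sum(s) >= 10}   # the two rolls total at least 10

print(prob(A), prob(B), prob(C))
print(cond_prob(C, A))
print(independent(A, B), independent(A, C))
```

Using `Fraction` keeps the probabilities exact, so the independence check is a true equality test rather than a floating-point comparison.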

# Dots and Boxes Game

When I was in high school, one of my favorite ways to waste time in class (not recommended) was to play a game called dots and boxes (although at the time we just called it dots). I was very surprised to find later that this game belongs to a class of games called “Impartial Combinatorial Games”. These are games where the moves available to the player depend only on the position of the game, and not the player.

In a game of Dots and Boxes, we start with an initial grid with dots at each row and column intersection. At each player’s turn, they have the option of drawing either a horizontal or vertical line between two neighboring dots (depending on whether the dots are in the same row or column). If a player fills in the last line on a box (the 4th side), we say that player “owns” the box. The game ends when there are no neighboring dots without a line between them. At the conclusion of the game, the player who owns the most boxes is declared the winner.

The game is impartial because there is no restriction on which move a player can make other than the fact that a player cannot redo a move that has already been made (a partisan version of this game would be one where player one could only move horizontally and player two could only move vertically).

I have implemented a JavaScript version of this game. Check it out and let me know what you think.

I also spoke earlier about the discovery that this game in particular is an active area of research. I wanted to provide a link to a paper entitled “Solving Dots and Boxes” by Joseph K. Barker and Richard E. Korf that speaks about winning strategies for each player in a game of dots and boxes.

# Assembly Line Scheduling

I wanted to take a minute to help some users become more familiar with Dynamic Programming, so I decided to write a script on the Assembly Line Scheduling Problem.

To introduce the problem, I want to tell you a story about a friend of mine. Keisha recently started a clothing company that uses two assembly lines to produce articles of clothing. She has separated the process of manufacturing an item of clothing into n steps, so each assembly line is separated into n different stations, with each station performing a specific task (for example, station three’s job may be to add a right sleeve to shirts). The task of a specific station is independent of which line the station occurs on (so if station three’s job is to add a right sleeve to shirts, this will be true on both assembly line 1 and assembly line 2). Let’s denote the jth station (with j = 1, 2, …, n) on line i (where i is 1 or 2) by $S_{i,j}$. Although they’re doing the same job, the time it takes the employee at station $S_{1,j}$ may be different from the time it takes the employee at station $S_{2,j}$. We will denote the time required at station $S_{i,j}$ by $a_{i,j}$. For each line, there is also an amount of time $e_i$ required for the article of clothing to enter assembly line i, and an amount of time $x_i$ required for the article of clothing to exit assembly line i.

One of the reasons that assembly lines are very productive is that stations on the same assembly line are generally in close proximity to one another, resulting in a very low cost of transferring an item from one station to the next on the same assembly line. When we have multiple lines in place, as Keisha has, there is a cost of transferring an item from one line to another, although the transfer may still pay off overall. Let’s denote this cost by $t_{i,j}$, which represents the cost of transferring a partially completed item of clothing away from line i after it has gone through station $S_{i,j}$ (again, i is 1 or 2, and j = 1, 2, …, n-1, since no transfer follows the last station).

The problem that Keisha would like solved is to determine, for each station, whether to use line 1 or line 2 in order to minimize the total time it takes to produce an article of clothing.

Consider the following example:

Our goal is to get the clothing through the 3 stations to produce a final product. What if we initially had the product take the route through station $S_{2,1}$ instead of station $S_{1,1}$? Let’s assume that we then decide to send the article of clothing to stations $S_{2,2}$ and $S_{2,3}$ afterwards. This would result in a solution whose total cost is 3 + 8 + 4 + 6 + 3 = 24. Is this solution optimal (that is, is it the minimum total time through the factory)? Let’s consider what would happen if we had chosen station $S_{1,1}$ instead of $S_{2,1}$. The entry cost for line 1 is 1, the time required at station $S_{1,1}$ is 5 and the transfer time to go to assembly line 2 is 1. So the cost of this new solution is 1 + 5 + 1 + 4 + 6 + 3 = 20, which gives a cheaper solution.

This illustrates the principle of optimality (the optimal substructure property), which states that in order for an overall solution to be optimal, it must also give optimal solutions to every subproblem of the original problem. Solving all of the subproblems may seem like a daunting task at first, but let’s consider the example above again.

Initially, we have a new product and there are two options: either line one or line two. We will need these values in the future, so let’s keep track of both choices in the form of a table.

| | Station 1 |
|---|---|
| $\text{cost}_1$ | $e_1 + a_{1,1}$ |
| $\text{cost}_2$ | $e_2 + a_{2,1}$ |

After this initial step, the question becomes: given the best paths to station j-1, which assembly line can best serve station j? This can be computed for each j > 1 by

$$\text{cost}_1(j) = \min\{\text{cost}_1(j-1) + a_{1,j},\ \text{cost}_2(j-1) + t_{2,j-1} + a_{1,j}\}$$

$$\text{cost}_2(j) = \min\{\text{cost}_2(j-1) + a_{2,j},\ \text{cost}_1(j-1) + t_{1,j-1} + a_{2,j}\}$$

As you can see, the calculation of $\text{cost}_i(j)$ relies on the values $\text{cost}_1(j-1)$ and $\text{cost}_2(j-1)$. By calculating these values in order from station 1 to station n, we are able to simply look up the earlier values in the table instead of having to recalculate them.

These give optimal solutions to each of the subproblems. We repeat this same step for all stages j = 2, …, n, and then we arrive at the final step, where we finish the job. Let’s define total_cost to indicate the cost of the optimal solution.

$$\text{total\_cost} = \min\{\text{cost}_1(n) + x_1,\ \text{cost}_2(n) + x_2\}$$

We’d like to see which value minimizes total_cost. Then we can trace back to find the values that minimized cost1 or cost2 at each step, depending on which assembly line was chosen. The algorithm below does just this, and stores the assembly line chosen at each station in the variable line.

For the above example, the table would be calculated as follows:

| | Station 1 | Station 2 | Station 3 | Total Cost |
|---|---|---|---|---|
| cost1 | 6 | 13 | 18 | 21 |
| cost2 | 11 | 11 | 17 | 20 |

We can reconstruct the optimal path through the assembly lines by seeing that we finish by going through station $S_{2,3}$.
We arrive at station $S_{2,3}$ by going through station $S_{2,2}$.
We arrive at station $S_{2,2}$ by going through station $S_{1,1}$.

This is precisely the path that is highlighted in the image above.

The algorithm to construct these paths and compute the total_cost for such problems is given below.

    Algorithm FastestWay(a, t, e, x, n)
        cost1(1) ← e1 + a1,1
        cost2(1) ← e2 + a2,1
        for j ← 2 to n
            // best way to reach station j on line 1
            if cost1(j-1) + a1,j ≤ cost2(j-1) + t2,j-1 + a1,j
                cost1(j) ← cost1(j-1) + a1,j
                line1(j) ← 1
            else
                cost1(j) ← cost2(j-1) + t2,j-1 + a1,j
                line1(j) ← 2

            // best way to reach station j on line 2
            if cost2(j-1) + a2,j ≤ cost1(j-1) + t1,j-1 + a2,j
                cost2(j) ← cost2(j-1) + a2,j
                line2(j) ← 2
            else
                cost2(j) ← cost1(j-1) + t1,j-1 + a2,j
                line2(j) ← 1

        // add the exit times and pick the cheaper line
        if cost1(n) + x1 ≤ cost2(n) + x2
            total_cost ← cost1(n) + x1
            final_line ← 1
        else
            total_cost ← cost2(n) + x2
            final_line ← 2
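For readers who would rather run something, here is a rough Python sketch of the same dynamic program, including the trace back through the recorded line choices. The function name and the list layout are my own. The example figure is not reproduced in this text, so only some of the numbers come from the worked example (entry times 1 and 3, the station times 5, 8, 4 and 6, the transfer time 1 and the exit time 3 on line 2); the remaining station, transfer and exit times are assumptions chosen so that the output matches the cost table shown earlier.

```python
def fastest_way(a, t, e, x):
    """Assembly line scheduling for two lines.

    a[i][j]: time at station j on line i
    t[i][j]: transfer time when leaving line i after station j
    e[i], x[i]: entry and exit times for line i
    Lines and stations are 0-indexed internally.
    """
    n = len(a[0])
    cost = [[0] * n for _ in range(2)]
    line = [[0] * n for _ in range(2)]  # line used at the previous station
    for i in range(2):
        cost[i][0] = e[i] + a[i][0]
    for j in range(1, n):
        for i in range(2):
            other = 1 - i
            stay = cost[i][j - 1] + a[i][j]
            switch = cost[other][j - 1] + t[other][j - 1] + a[i][j]
            if stay <= switch:
                cost[i][j], line[i][j] = stay, i
            else:
                cost[i][j], line[i][j] = switch, other
    final = 0 if cost[0][n - 1] + x[0] <= cost[1][n - 1] + x[1] else 1
    total = cost[final][n - 1] + x[final]
    # Trace back from the last station to recover the chosen lines.
    path = [final]
    for j in range(n - 1, 0, -1):
        path.append(line[path[-1]][j])
    path.reverse()
    return total, [p + 1 for p in path], cost


# Partly assumed data (see the note above).
a = [[5, 7, 5], [8, 4, 6]]
t = [[1, 2], [2, 3]]
e = [1, 3]
x = [3, 3]
total, path, cost = fastest_way(a, t, e, x)
print(total, path, cost)
```

The returned path lists the line used at each station, so `[1, 2, 2]` means station 1 on line 1, then stations 2 and 3 on line 2.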

Note: I used Introduction to Algorithms by Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein to help with this post.

# Polynomial Arithmetic

With students beginning to attend classes across the nation, I wanted to focus the site towards some of the things they’re going to be addressing. This latest page publicizes some scripts that I wrote to help with polynomial arithmetic. Originally I wrote these as homework exercises for a class in programming, but I have found them useful ever since, both in teaching mathematics classes like college algebra, which spends a lot of attention on polynomials, and in my research life. It’s funny (and sad) the number of simple errors that a person (mathematician or not) can make when performing simple arithmetic, so I found it very useful to have a calculator more advanced than the simple scientific calculators that are so easily available.

I’m not going to spend a lot of time discussing the importance of polynomials, or trying to justify their need. I will bring up some problems that I’d like to address in the future that deal with polynomials. The first is finding the roots of the characteristic polynomial of a matrix. This is useful in research because these roots are the eigenvalues of the matrix and can give many properties of the matrix. There are also some data analysis tools, like Singular Value Decomposition and Principal Component Analysis, that I will probably build up to from this initial set of scripts.

The user interface for the scripts I’ve written generates two polynomials and asks the user what is to be done with those polynomials. The options are to add the two, subtract polynomial 2 from polynomial 1, multiply the two, divide polynomial 1 by polynomial 2, and divide polynomial 2 by polynomial 1. There is also the option to make the calculations more of a tutorial by showing the steps along the way. Users who want new problems can generate a new first or second polynomial and clear the work.

For addition and subtraction, the program works by first ensuring that both polynomials have the same degree. This can be achieved by adding terms with zero coefficient to the lower degree polynomial. Once this has been accomplished, we simply add the terms that have the same exponent.

For multiplication, the program first builds a matrix A, where the element $a_{i,i+j}$ in row i and column i+j of A is obtained by multiplying the ith term of the first polynomial by the jth term of the second polynomial. If an element was not given a value in the matrix, then we put a value of zero in that cell. Once this matrix is formed, we can sum the columns of the matrix to arrive at the final answer.
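The addition and multiplication steps described above can be sketched as follows. This is a simplified Python illustration rather than the script on my page: I am assuming a polynomial is stored as a list of coefficients with the highest-degree term first, and the function names are made up for the example.

```python
def poly_add(p, q):
    """Add two polynomials given as coefficient lists, highest degree first."""
    size = max(len(p), len(q))
    # Pad the lower-degree polynomial with zero coefficients at the front.
    p = [0] * (size - len(p)) + p
    q = [0] * (size - len(q)) + q
    return [a + b for a, b in zip(p, q)]


def poly_mul(p, q):
    """Multiply two polynomials by filling a matrix and summing its columns."""
    width = len(p) + len(q) - 1
    matrix = [[0] * width for _ in p]  # cells that never get a value stay zero
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            matrix[i][i + j] = a * b  # ith term of p times jth term of q
    return [sum(column) for column in zip(*matrix)]


p = [2, 1]     # 2x + 1
q = [1, 1, 3]  # x^2 + x + 3
print(poly_add(p, q))
print(poly_mul(p, q))
```

Subtraction works the same way as addition once every coefficient of the second polynomial is negated.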

The division of two polynomials works first by dividing the first term of the numerator by the first term of the denominator. This answer is then multiplied by the denominator and subtracted from the numerator. Now, the first term in the numerator should cancel and we use the result as the numerator going forward. This process is repeated as long as the numerator’s degree is still equal to or greater than the denominator’s degree.
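A minimal Python sketch of this long division loop is below. As before, it is an illustration under my own assumptions: coefficients are listed highest degree first, the function name is hypothetical, and exact fractions are used so that divisions such as 1/3 do not pick up rounding error.

```python
from fractions import Fraction


def poly_divmod(numerator, denominator):
    """Return (quotient, remainder) coefficient lists, highest degree first."""
    num = [Fraction(c) for c in numerator]
    den = [Fraction(c) for c in denominator]
    if not den or den[0] == 0:
        raise ValueError("denominator needs a nonzero leading coefficient")
    quotient = []
    while len(num) >= len(den):
        # Divide the leading term of the numerator by that of the denominator.
        factor = num[0] / den[0]
        quotient.append(factor)
        # Subtract factor * denominator, aligned at the leading term.
        for k, d in enumerate(den):
            num[k] -= factor * d
        num.pop(0)  # the leading term has cancelled
    return quotient, num


# (x^2 - 3x + 2) / (x - 1)
print(poly_divmod([1, -3, 2], [1, -1]))
```

Whatever is left in the numerator when the loop stops is the remainder, which has lower degree than the denominator.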

Check out the latest page on polynomial arithmetic and let me know what you think.

# Hidden Markov Models: The Baum-Welch Algorithm

Suppose you are at a table at a casino and notice that things don’t look quite right. Either the casino is extremely lucky, or things should have averaged out more than they have. You view this as a pattern recognition problem and would like to understand the number of ‘loaded’ dice that the casino is using and how these dice are loaded. To accomplish this, you set up a number of Hidden Markov Models, where the loaded dice are the latent (hidden) variables, and would like to determine which of these models, if any, the casino is most likely to be using.

First, let’s go over a few things.

We will call each roll of the dice an observation. The observations will be stored in variables $o_1, o_2, \ldots, o_T$, where T is the total number of observations.

To generate a hidden Markov Model (HMM) we need to determine 5 parameters:

• The N states of the model, defined by $S = \{S_1, \ldots, S_N\}$
• The M possible output symbols, defined by $\Sigma = \{1, 2, \ldots, M\}$
• The state transition probability distribution $A = \{a_{ij}\}$, where $a_{ij}$ is the probability that the state at time t+1 is $S_j$, given that the state at time t is $S_i$.
• The observation symbol probability distribution $B = \{b_j(k)\}$, where $b_j(k)$ is the probability that the symbol k is emitted in state $S_j$.
• The initial state distribution $\pi = \{\pi_i\}$, where $\pi_i$ is the probability that the model is in state $S_i$ at the initial time step (t = 1).

The HMMs we’ve generated are based on two questions. For each question, you have provided 3 different answers, which leads to 9 possible HMMs. Each of these models has its corresponding state transition and emission distributions.

• How often does the casino change dice?
    • 0) Dealer Repeatedly Uses Same Dice
    • 1) Dealer Uniformly Changes Die
    • 2) Dealer Rarely Uses Same Dice
• Which sides on the loaded dice are more likely?
    • 0) Larger Numbers Are More Likely
    • 1) Numbers Are Randomly Likely
    • 2) Smaller Numbers Are More Likely
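
Each model is labeled by a pair of answers. Rows give the answer to “How often does the casino change dice?” and columns give the answer to “Which sides on the loaded dice are more likely?”

| | Sides: 0 | Sides: 1 | Sides: 2 |
|---|---|---|---|
| Change: 0 | (0, 0) | (0, 1) | (0, 2) |
| Change: 1 | (1, 0) | (1, 1) | (1, 2) |
| Change: 2 | (2, 0) | (2, 1) | (2, 2) |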

One of the interesting problems associated with Hidden Markov Models is called the Learning Problem, which asks the question “How can I improve an HMM so that it would be more likely to have generated the sequence $O = o_1, o_2, \ldots, o_T$?”

The Baum-Welch algorithm answers this question using an Expectation-Maximization approach. It creates two auxiliary variables $\gamma_t(i)$ and $\xi_t(i, j)$. The variable $\gamma_t(i)$ represents the probability of being in state i at time t, given the entire observation sequence. Likewise $\xi_t(i, j)$ represents the joint probability of being in state i at time t and of being in state j at time t+1, given the entire observation sequence. They can be calculated by

$$\gamma_t(i) = \frac{\alpha_t(i)\,\beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j)\,\beta_t(j)}$$

and

$$\xi_t(i, j) = \frac{\alpha_t(i)\,a_{i,j}\,\beta_{t+1}(j)\,b_j(o_{t+1})}{\sum_{i'=1}^{N}\sum_{j'=1}^{N} \alpha_t(i')\,a_{i',j'}\,\beta_{t+1}(j')\,b_{j'}(o_{t+1})}$$

As you can see, these are a direct result of the calculations of $\alpha$ from the Forward algorithm and $\beta$ from the Backward algorithm. Once we have calculated these variables, we can update the parameters of the model as follows:

$$\bar{\pi}_i = \gamma_1(i)$$

$$\bar{a}_{i,j} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$$

$$\bar{b}_j(o_k) = \frac{\sum_{t=1,\ o_t = o_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$$

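We can iterate this procedure a finite number of times or until it converges. This will generate a new model, $\bar{\lambda} = \{N, \Sigma, \bar{A}, \bar{B}, \bar{\pi}\}$, in which the states and the output alphabet $\Sigma$ are unchanged and $\bar{A}$, $\bar{B}$ and $\bar{\pi}$ are the re-estimated transition, emission and initial state distributions.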
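To make the update step concrete, here is a rough pure-Python sketch of one Baum-Welch iteration: a forward pass, a backward pass, the two auxiliary variables (called `gamma` and `xi` in the code), and the re-estimated parameters. It is not the code behind my page. The two-state fair/loaded model, its starting probabilities and the roll sequence are all made up for illustration, die faces are coded 0 to 5, and no scaling is applied, so it is only suitable for short observation sequences.

```python
def forward(obs, A, B, pi):
    """alpha[t][i]: probability of the first t+1 observations, ending in state i."""
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][o]
                      for j in range(N)])
    return alpha


def backward(obs, A, B):
    """beta[t][i]: probability of the observations after time t, given state i."""
    N = len(A)
    beta = [[1.0] * N]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][o] * nxt[j] for j in range(N))
                        for i in range(N)])
    return beta


def likelihood(obs, A, B, pi):
    """Probability of the whole observation sequence under the model."""
    return sum(forward(obs, A, B, pi)[-1])


def baum_welch_step(obs, A, B, pi):
    """One re-estimation step; returns the updated (A, B, pi)."""
    N, M, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(obs, A, B, pi), backward(obs, A, B)
    gamma = []
    for t in range(T):
        norm = sum(alpha[t][j] * beta[t][j] for j in range(N))
        gamma.append([alpha[t][i] * beta[t][i] / norm for i in range(N)])
    xi = []
    for t in range(T - 1):
        raw = [[alpha[t][i] * A[i][j] * beta[t + 1][j] * B[j][obs[t + 1]]
                for j in range(N)] for i in range(N)]
        norm = sum(map(sum, raw))
        xi.append([[v / norm for v in row] for row in raw])
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1))
              / sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k)
              / sum(gamma[t][j] for t in range(T))
              for k in range(M)] for j in range(N)]
    return new_A, new_B, new_pi


# Hypothetical starting model: state 0 = fair die, state 1 = loaded toward six.
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[1 / 6] * 6, [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]]
pi = [0.5, 0.5]
obs = [5, 5, 2, 5, 0, 5, 5, 3, 5, 5, 1, 4]  # made-up rolls, faces coded 0..5

new_A, new_B, new_pi = baum_welch_step(obs, A, B, pi)
print(likelihood(obs, A, B, pi), likelihood(obs, new_A, new_B, new_pi))
```

Because this is an Expectation-Maximization step, the second likelihood printed should never be lower than the first. For long sequences the forward and backward values underflow, which is why practical implementations rescale them or work with logarithms.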

There is more on this example at LEARNINGlover.com: Hidden Markov Models: The Baum-Welch Algorithm.

Some further reading on Hidden Markov Models: