Tag Archives: math tutor

Floyd-Warshall Shortest Paths

The Floyd Warshall algorithm is an all pairs shortest paths algorithm. This can be contrasted with algorithms like Dijkstra’s which give the shortest paths from a single node to all other nodes in the graph.

Floyd Warshall’s algorithm works by considering first the edge set of the graph. This is the set of all paths of the graph through one edge. Node pairs that are connected to one another through an edge will have their shortest path set to the length of that edge, while all other node pairs will have their shortest path set to infinity. The program then runs through every triplet of nodes (i, j, k) and checks if the path from i to k and the path from k to j is shorter than the current path from i to j. If so, then the distance and the path is updated.

So lets consider an example on the graph in the image above. The edge set of this graph is E = {(0, 1), (0, 2), (0, 3), (1, 3), (3, 4)}. So our initial table is:

  0 1 2 3 4
0 inf (0, 1) (0, 2) (0, 3) inf
1 (0, 1) inf inf (1, 3) inf
2 (0, 2) inf inf inf inf
3 (0, 3) (1, 3) inf inf (3, 4)
4 inf inf inf (3, 4) inf

As we look to update the paths, we first look for routes that go through node 0:

Because node 0 connects to both node 1 and node 2, but node 1 does not connect to node 2, we have the following truth holding in the matrix above:
cost(0, 1) + cost(0, 2) < cost(1, 2), so we can update the shortest path from node 1 to node 2 to be (1, 0, 2).

Because node 0 connects to both node 2 and node 3, but node 2 does not connect to node 3, we have the following truth holding in the matrix above:
cost(0, 2) + cost(0, 3) < cost(2, 3), so we can update the shortest path from node 2 to node 3 to be (2, 0, 3).

Because node 3 connects to both node 0 and node 4, but node 0 does not connect to node 4, we have the following truth holding in the matrix above:
cost(0, 3) + cost(3, 4) < cost(0, 4), so we can update the shortest path from node 0 to node 4 to be (0, 3, 4).

Because node 3 connects to both node 1 and node 4, but node 1 does not connect to node 4, we have the following truth holding in the matrix above:
cost(1, 3) + cost(3, 4) < cost(1, 4), so we can update the shortest path from node 1 to node 4 to be (1, 3, 4).

Because node 3 connects to both node 2 and node 4, but node 2 does not connect to node 4, we have the following truth now holding:
cost(2, 3) + cost(3, 4) < cost(2, 4), so we can update the shortest path from node 2 to node 4 to be (2, 0, 3, 4).

The final table giving the list of shortest paths from every node to every other node is given below.

  0 1 2 3 4
0 inf (0, 1) (0, 2) (0, 3) (0, 3, 4)
1 (0, 1) inf (1, 0, 2) (1, 3) (1, 3, 4)
2 (0, 2) (1, 0, 2) inf (2, 0, 3) (2, 0, 3, 4)
3 (0, 3) (1, 3) (2, 0, 3) inf (3, 4)
4 (0, 3, 4) (1, 3, 4) (2, 0, 3, 4) (3, 4) inf

To see more examples and to help answer questions, check out the script in my examples section on the Floyd-Warshall algorithm

Degree Centrality of a Graph

Degree Centrality Example

I wanted to spend some time on centrality measures of a graph. These are measurements of how important each node (or edge) is to the overall graph. But how do we define, or determine, importance? There is no unique way to answer this question, so there are varying metrics for measuring centrality. Which one you choose depends on several factors including how many other nodes of the graph are included, as well as the run time of the metrics you’re considering.

I have just published a script focusing on the degree centrality metric. The degree centrality metric is called a “walk metric” because it determines how important a node is by how many other nodes that can be reached by walks of up to a certain length. Lets look at the definition of the degree of a node to see if we can understand why it is called a walk metric.

In an undirected graph G = (V, E), the degree of a node u [in] V is the |{v | (u, v) [in] E}|. This is the size of the set of nodes that are connected to node u via a single edge. Another way of describing a single edge is a walk of length one. So the degree metric measures the importance of a node by the number of unique walks of length one.

The normalized degree centrality of a node v in a graph G = (V,E) measures how many nodes are connected to the node v, compared to the maximum possible number of edges that can be connected to this node. Because we are dealing with simple undirected graphs (at most a single edge between any two distinct vertices), this maximum possible number will always be |V – 1|. So the normalized degree can be calculated by dividing the degree of the node (the number of nodes it is connected to) by |V – 1|.

So for the example above, the node 0 has degree 6 because it is connected to nodes 2, 5, 9, 10, 11, and 12. There are 15 total nodes in this graph, so to calculate the normalized degree centrality of the node 0, it will be 6 / 14, which rounds to 0.428571.

To see more examples and to help answer questions, check out the script in my examples section on degree centrality

Lets Learn About XOR Encryption

One of the more common things about this generation is the constant desire to write up (type) their thoughts. So many of the conversations from my high school days were long lasting, but quickly forgotten. Today’s generation is much more likely to blog, tweet, write status updates or simply open up a notepad file and write up their thoughts after such a conversation.

When we feel that our thoughts are not ready for public eyes (maybe you want to run your idea by the Patent and Trademark Office before speaking about it) we may seek some form of security to ensure that they stay private. An old fashioned way of doing this was to write in a diary and enclosed it within a lock and key. The mathematical field of encryption also tries to grant privacy by encoding messages so that only people with the necessary information can read them.

The type of encryption I want to speak about today is called XOR encryption. It is based on the logical operation called “exclusive or” (hence the name XOR). The exclusive or operation is true between two logical statements if exactly one of the two statements is true, but not both statement. This can be represented with the following truth table

Input 1 Input2 XOR Result

XOR Encryption is particularly useful in this day and age because we every character we type is understood by the computer as a sequence of zeros and ones. The current standard encoding that is used is Unicode (also known as UTF-8). Under this encoding the letter ‘a’ is represented as the binary string ‘01100001’. Similarly every letter, number and special character can be represented as its own binary string. These binary strings are just an assignment of numbers to these characters so that we can to help represent them in the computer. The numbers can the be thought of in base 10, which is how we generally think about numbers, or in base 2 which is how computers generally work with numbers (or a number of other ways). The way we would use these binary strings in encoding is first by translating a text from human-readable text to machine readable text via its binary string. For example, the word “Invincible”, we would get the following binary strings:

Letter Unicode in base 10 Unicode in base 2
I 73 01001001
n 110 01101110
v 118 01110110
i 105 01101001
n 110 01101110
c 99 01100011
i 105 01101001
b 98 01100010
l 108 01101100
e 101 01100101

To encrypt the message we need a key to encode the message and will simply perform an XOR operation on the key and every character in the string. Similarly, do decrypt the message we perform XOR operation on the key and every character in the encoded message. This means that the key (much like a normal key to a diary) must be kept private and only those whom the message is to be shared between have access to it.

Here is a link to the script where you can check out XOR Encrpytio. Try it out and let me know what you think.

Topological Sort

One of the things I generally say about myself is that I love learning. I can spend hours upon hours reading papers and algorithms to better understand a topic. Some of these topics are stand alone segments that I can understand in one sitting. Sometimes, however, there is a need to read up on some preliminary work in order to fully understand a concept.

Lets say that I was interested in organizing this information into a new course. The order I present these topics is very important. Knowing which topics depend on one another allows me to use the topological sorting algorithm to determine an ordering for the topics that respects the preliminary work.

The input for the topoligical sorting algorithm is a Directed Acyclic Graph (DAG). This is a set of relationships between pairs of topics, where if topic 1 must be understood before topic 2, we would add the relationship (topic 1, topic 2) to the graph. DAGs can be visualized by a set of nodes (points) representing the topics. Relationships like the one above (topic 1, topic 2) can then represented by a directed arc originating at topic 1 and flowing in the direction of topic 2. We say that the graph is “Acyclic” because there cannot be a cycle in the topic preliminaries. This amounts to us saying that a topic cannot be a prerequisite for itself. An example of a DAG is shown in the image above.

With the topics represented as a DAG, the topologial ordering algorithm works by searching the set of nodes for the one with no arcs coming into it. This node (or these nodes is multiple are present) represents the topic that can be covered next without losing understanding of the material. Such a node is guaranteed to exist by the acyclic property of the DAG. Once the node is selected, we can remove this node as well as all arcs that originate at this node from the DAG. The algorithm then repeats the procedure of searching for a nod with no arcs coming into it. This process repeats until there are no remaining nodes from which to choose.

Now lets see how the topological sort algorithm works on the graph above. We will first need to count the in-degree (the number of arcs coming into) each node.

Node | Indegree
0 | 2
1 | 2
2 | 0
3 | 2
4 | 2
5 | 2
6 | 0
7 | 2
8 | 3

Node to be removed (i.e. node with the minimum indegree): Node 2.
Arcs connected to node 2: (2, 5), (2, 3)
Resulting Indegree Count:
Node | Indegree
0 | 2
1 | 2
3 | 1
4 | 2
5 | 1
6 | 0
7 | 2
8 | 3

Node to be removed: Node 6:
Arcs connected to node 6: (6, 1), (6, 3), (6, 4), (6, 5), (6, 7), (6, 8)
Resulting Indegree Count:
Node | Indegree
0 | 2
1 | 1
3 | 0
4 | 1
5 | 0
7 | 1
8 | 2

Node to be removed: Node 3
Arcs connected to node 3: (3, 0), (3, 8)
Resulting Indegree Count:
Node | Indegree
0 | 1
1 | 1
4 | 1
5 | 0
7 | 1
8 | 1

Node to be removed: Node 5
Arcs connected to node 5: (5, 0), (5, 8)
Resulting Indegree Count:
Node | Indegree
0 | 0
1 | 1
4 | 1
7 | 1
8 | 0

Node to be removed: Node 0:
Arcs connected to node 0: (0, 1), (0, 4)
Resulting Indegree Count:
Node | Indegree
1 | 0
4 | 0
7 | 1
8 | 0

Node to be removed: Node 1
Arcs connected to node 1: none
Resulting Indegree Count:
Node | Indegree
4 | 0
7 | 1
8 | 0

Node to be removed: Node 4
Arcs connected to node 4: none
Resulting Indegree Count:
Node | Indegree
7 | 1
8 | 0

Node to be removed: Node 8
Arcs connected to node 8: (8, 7)
Resulting Indegree Count:
Node | Indegree
7 | 0

Node to be removed: Node 7
Arcs connected to node 7: none
Resulting Indegree Count:
Node | Indegree

Since there are no nodes remaining, we have arrived at a topological ordering. Going through this iteration, we can see that we arrived at the ordering (2, 6, 3, 5, 0, 1, 4, 8, 7). There were several occasions where there were multiple nodes with indegree of 0 and we could have selected an alternative node. This would have given us a different topological ordering of the nodes, but it would still be valid.

There are more learning opportunities and an interactive demonstration of the algorithm at Topological Sort Examples at LEARNINGlover.

The RSA Algorithm

I can remember back when I was in school, still deciding whether I wanted to study pure or applied mathematics. One of the common questions I would receive from those in applied mathematical realms would sound like “What’s the point of doing mathematics with no real world applications?”. Generally my response to these questions was about the intrinsic beauty of mathematics, no different from an artist painting not for some desire to be a millionaire, but because of an burning desire to paint. Whether their paintings would one day be on the walls of a Smithsonian museum or sit on their mother’s refrigerator is generally outside of the thought process of the artist. So too, would I argue about the thought process of a pure mathematician.

When I was an undergrad and learned about the RSA algorithm (named for Ron Rivest, Adi Shamir, and Leonard Adleman who discovered the algorithm) it helped me explain this concept a lot better. The algorithm is based on prime numbers and the problem of finding the divisors of a given number. Many mathematicians throughout the ages have written papers on the beauty of prime numbers (see Euclid, Eratosthenes, Fermat, Goldbach, etc). For a large period in time one of the beautiful things about prime numbers was that they were so interesting in themselves. There were questions about how to check if a number is prime, question of patterns in primes, famous conjectures like the Goldbach conjecture and the twin prime conjecture, quick ways of finding prime numbers or numbers that are almost always prime, etc. In short, this was an active area of research that much of the applied world was not using. This all changed in 1977 when Rivest, Shamir and Adleman published the RSA algorithm.

The algorithm is in the area called public key cryptography. These algorithms differ from many of the previous cryptography algorithms, namely symmetric key cryptography. Whereas symmetric key cryptography depends uses the same device (key) to encode as to decode, public key cryptography creates two keys – one for encoding that is generally shared with others, and another for decoding which is kept private. These two keys in generally relate to a mathematical problem that is very difficult to solve.

In my example script for the RSA Algorithm, I show two people who want to communicate, Alice and Bob. Bob wants people to be able to send him messages securely so he decides to use the RSA algorithm. To do this, he first needs to think of two prime numbers, p1 and p2.
From these, he computes the following:
n = p1 * p2

Next, he computes Euler’s function on this n which can be calculated as
(n) = (p1 – 1) * (p2 – 1)

Then Bob looks for a number e that is relatively prime to . This is what he will use as the encryption key.

One he has e, he can calculate d, which is the multiplicative inverse of e in (n).
This means that e * d = 1 (mod (n)).

The public key that will be used for encryption will be the pair (e, n). This is what he posts publicly via his web page for others to communicate with him securely. Bob also has a private key pair (d, n) that he will use to decrypt messages.

Alice sees Bob’s public key and would like to communicate with him. So she uses it to encode a message. The formula she uses to encrypt her message is c = me mod n, where c is the encrypted message. Once Alice encrypts her message, she sends it to Bob.

Bob receives this encoded message and uses the private key (d, n) to decode the message from Alice. The formula to decrypt is m = cd mod n.

For a more illustrative idea of how this algorithm works as well as examples, be sure to visit Script for the RSA Algorithm.

QR Decomposition

Suppose we have a problem that can be modeled by the system of equations Ax = b with a matrix A and a vector b. We have already shown how to use Gaussian Elimination to solve these methods, but I would like to introduce you to another method – QR decomposition. In addition to solving the general system of equations, this method can also be used in a number of other algorithms including the linear regression analysis to solve the least squares problem and to find the eigenvalues of a matrix.

Generally, when we see a system of equations, the best way to proceed in solving it is Gaussian elimination, or LU decomposition. However, there are some very special matrices where this method can lead to rounding errors. In such cases, we need a better, or more stable algorithm, which is when the QR decomposition method becomes important.

Any matrix A, of dimensions m by n with m >= n, can be represented as the product of two matrices, an m by m orthogonal matrix Q, i.e. QTQ = I, and an m by n upper triangular matrix R with the form [R 0]T. We perform this decomposition by will first converting the columns of the matrix A into vectors. Then, for each vector ak, we will calculate the vectors uk and ek given by

uk = i = 1 to k-1projej ak
ek = uk / ||uk||
Then Q = [e1, …, en] and

R =
<e1, a1> <e1, a2> <e1, a3>
0 <e2, a2> <e2, a3>
0 0 <e3, a3>

Here is a link to the JavaScript program I wrote to show how QR Decomposition works.

The A* Algorithm

As a child I remember traveling on road trips, sitting in the back of a car trying to do my best to keep myself busy for what would occupy the next six to ten hours of my life. One of the things I grew to like were the simple maze books that were sold on the magazine racks at some of the gas stations where we’d stop for food. There are two basic strategies I employed for solving these mazes: For simpler mazes, I could generally just take a look at the overall maze structure, decide upon a path through the maze, then write a path without any mistakes. For more complex mazes though, I would generally begin a route that looks the most promising. If that route reaches a point where I can see that it will be impossible to finish, then I’d go back to where the decision was made, exclude the route I had just tried, and select the “most promising” remaining route. I’d continue this process until I had completed the maze, or until it became simple enough for me to solve the maze using only my memory.

The A* Algorithm works in a similar manner to the second approach I just described. We begin at a starting point, and consider where to move next from that starting point. The set of possible options for this next move is determined by the neighbor function for a given cell. For each neighbor the algorithm estimates the length of the route through that cell by calculating the sum of two values, f(n) = g(n) + h(n), where

g(n) is the (known) cost to travel from the starting point to the cell n.
h(n) is the (approximate) cost to travel from the cell n to the final cell.

The sum g(n) + h(n) allows us to approximate the total cost of a route through the cell n.

The A* algorithm begins at the starting position of the maze. There are two sets we will be considering throughout the process of determining the optimal route, called the closedSet and the openSet. The elements of closedSet are the nodes whose total distance from the starting position has been calculated, whereas the elements of openSet represents nodes whose total distance is still under consideration. There is also a map called prev which is used to reconstruct the path. Below is how the algorithm operates:

A* Algorithm Pseudocode
closedSet is the empty set
openSet = {start}
prev is the empty set

While there are still elements in openSet,
     Find the element c* in openSet with the lowest value f(c).
     If c* is the target position
          Reconstruct the path.
     Else If c* is not the target position,
          Remove c* from openSet
          Add c* to closedSet
          For each neighbor n of c* that is not in closedSet,
               Calculate a temporary g value, temp_g(n) = g(c*) + dist(c*, n).
               If n is not in openSet, or if n is in openSet and temp_g(n) < g(n),                     Set prev(n) = c*                     Set g(n) = temp_g(n)                     Set f(n) = g(n) + h(n).                     if n is not in openSet                          add n to openSet.      End If End While End Algorithm An important question becomes what makes a good heuristic function to approximate the distance to the goal. This can lead to an in depth discussion based on the word "good", but the necessary condition for any heuristic is that it NEVER over-estimates the cost of the path from the cell to the goal. Some examples of possible heuristics for mazes are the Euclidean Distance (the square root of the sum of the squares of the horizontal and vertical differences in distances, i.e. dE(x, y) = (i(xi – yi)2) and TaxiCab Distance (the sum of the differences in the horizontal and vertical dimensions, i.e. dT(x, y) = i|xi – yi|). Both of these are feasible metrics for heuristics on a maze. Other herusitics, like h(n) = 0 for all n are possible, but doing this would make the algorithm treat all cells equally and ignore the heuristic part of the A* algorithm, turning it into Dijkstra’s algorithm.

I’ve written a script that generates random mazes and uses the A* algorithm to find the optimal path through this maze. For this script, I used the taxicab distance heuristic.

Check it out and let me know what you think.

Naive Bayesian Classification

Suppose there is a sequence of events that took place e1, …, ene, with each event belonging to a certain classification group g1, …, gng. Then the problem of determining which of these groups a new event belongs is the classification problem.

A naive Bayes classifier will determine to which of the possible classification groups a new observation belongs with the (naive) assumption that every feature of this new observation has no relationship to any other feature of this observation.

This assumption of independence of the columns of the feature vector allows us to use Bayes’ Theorem to determine the probability that the new observation will belong to each of the observation groups. Bayes’ Theorem will then say that the probability this new observation belongs to a classification group, given the features is equal to the probability of the occurrence of that classification group in the observed data (i.e. P(C)) multiplied by the conditional probability of the joint distribution of the features given the same classification group P(F1, …, Fnf. The naive assumption allows us to quickly calculate the joint distribution of the features, given the classification group as the product of each feature given that same classification group.

This can be written as:

P(C | F1, …, Ffn) =
P(C) P(F1, …, Ffn | C)
P(F1, …, Ffn)
= P(C) P(F1 | C) * … * P(Ffn | C
P(C) P(F1, …, Ffn | C)
P(F1, …, Ffn)
= P(C) i = 1 to nfP(Fi | C)
P(C) P(F1, …, Ffn | C)
P(F1, …, Ffn)

So suppose we have observations that give the following data:
P(F1 | N) = 0.286
P(F2 | N) = 0.143
P(F3 | N) = 0.429
P(F4 | N) = 0.429

P(F1 | Y) = 0.143
P(F2 | Y) = 0.571
P(F3 | Y) = 0.571
P(F4 | Y) = 0.571

P(Y) = 0.5
P(N) = 0.5

Then P(N | F1, F2, F3, F4) = (0.5 * 0.286 * 0.143 * 0.429 * 0.429)
= 0.008

Then P(Y | F1, F2, F3, F4) = (0.5 * 0.143 * 0.571 * 0.571 * 0.571)
= 0.001

After we normalize the two terms, we wind up with Then P(N | F1, F2, F3, F4) = .2204


Then P(Y | F1, F2, F3, F4) = .7796

So the naive Bayes classifier says that the likely classification group for this observation is Y.

Check out my examples page for more examples of the naive Bayes classifier.

Simple Linear Regression

I finished a script that helps explain the concepts of simple linear regression.

We live in a world that is filled with patterns – patterns all around us just waiting to be discovered. Some of these patterns are not as easily discovered because of the existence of outside noise.

Consider for example an experiment where a set of people were each given the task of drinking a number of beers and having their blood alcohol level taken afterwards. Some noise factors in this could include the height and weight of the individual, the types of drinks, the amount of food eaten, and the time between drinks. Even with this noise, though, we can still see a correlation between the number of drinks and their blood alcohol level. Consider the following graph showing people’s blood-alcohol level after a given number of drinks. The x-axis represents the number of drinks and the y-axis is the corresponding blood alcohol level.

Example of Linear Regression

x 5 2 9 8 3 7 3 5 3 5 4 6 5 7 1 4
y 0.10 0.03 0.19 0.12 0.04 0.095 0.070 0.060 0.020 0.050 0.070 0.10 0.085 0.090 0.010 0.050

We can definitely see a correlation, and although the data doesn’t quite fit on a straight line. It leads us to ask further questions like can we use this data to build a model that estimates a person’s blood-alcohol level and how strong is this model?

One of the tools we can use to model this problem is linear regression. A linear regression takes a two-dimensional data set, with the assumption that one column (generally represented by the x variable) is independent and the second column (generally represented by the y variable) being dependent on the first column. The assumption is that the relationship between the two columns is linear and can be represented by the linear equation

y = 0 + 1x + e.

The right hand side of the above equation has three terms. The first two (0 and 1) are the parameters of the linear equation (the y-intercept and slope respectively), while the third term of the right hand side of the above equation represents the error term. The error term represents the difference between this linear equation and the y values in the data provided. We are seeking a line that minimizes the error term. That is, we are seeking to minimize

D = i = 1 to n [yi – (0 + 1xi)]2

There are several ways one could approach this problem. In fact, there are several lines that one could use to build a linear model. The first line that one may use to model these points is the one generated by only mean of the y values of the points, called the horizontal line regression.

For the data set above, the mean of the y values can be calculated as = 0.0738, so we could build a linear model based on this mean that would be y = 0.738. This horizontal line regression model is a horizontal line that predicts the same score (the mean), regardless of the x value. This lack of adjustments means it is generally a poor fit for most models. But as we will see later, this horizontal line regression model does serve a purpose in determining how well the model we develop performs.

A second attempt at solving this problem would be to generate the least squares line. This is the line that minimizes the D value listed above. We can see that D is a multi-variable polynomial, and we can find the minimum of such a polynomial using calculus, partial derivatives and Gaussian elimination (I will omit the work here because it deters us from the main point of this blog post, however Steven J. Miller has a good write-up of this).

The calculus leads us to the following equations:

SXY = i = 1 to n(xy) –
(i = 1 to nx)(i = 1 to ny)

SXX = i = 1 to n(x2) –
(i = 1 to nx)2

1 =

0 = 1

To calculate the least squares line for this example, we first need to calculate a few values:
i = 1 to n(xy) = 6.98
i = 1 to n(x2) = 443
i = 1 to nx = 77
i = 1 to ny = 1.18
Sxx = 72.44
Sxy = 1.30

This lets us evaluate that
1 = 0.018
0 = -0.0127

So the resulting linear equation for this data is

= -0.0127 + 0.018*x

Below is a graph of the two attempts at building a linear model for this data.

Example of Models of Linear Regression

In the above image, the green line represents the horizontal line regression model and the blue line represents the least-squares line. As stated above, the horizontal line regression model is a horizontal line that does not adjust as the data changes. The least-squares line adjusts both the slope and y-intercept of this line according to the data provided to better fit the data provided. The question becomes how well does the least-squares line fit the data.

The Sum of Squares Error (SSE) sums the deviation at each point of our data from the least-squares line.

SSE = i = 1 to n(yii)2

A second metric that we are interested in is how well the horizontal line regression linear model estimates our data. This is called the Total Sum of Squares (SST).

SST = i = 1 to n(yi)2

The horizontal line regression model ignores the independent variable x from our data set and thus any line that takes this independent variable into account will be an improvement on the horizontal line regression model. Because of this, the SST sum is a worse case scenario of how poorly our model can perform.

Knowing now that SST is always greater than SSE, the regression sum of squares (SSR) is the difference between the total sum of squares and the sum of squares error.


This tells us how much of the total sum of squares is accounted for by the model.

Finally, the coefficient of determination (r2) is defined by

r2 = SSR / SST

This tells SSR as a percentage of SST, or the amount of the variation in the data that is explained by the model.

So, check out my script on simple linear regression and let me know what you think.