Tag Archives: examples

Simple Linear Regression

I finished a script that helps explain the concepts of simple linear regression.

We live in a world that is filled with patterns – patterns all around us just waiting to be discovered. Some of these patterns are not as easily discovered because of the existence of outside noise.

Consider for example an experiment where a set of people were each given the task of drinking a number of beers and having their blood alcohol level taken afterwards. Some noise factors in this could include the height and weight of the individual, the types of drinks, the amount of food eaten, and the time between drinks. Even with this noise, though, we can still see a correlation between the number of drinks and their blood alcohol level. Consider the following graph showing people’s blood-alcohol level after a given number of drinks. The x-axis represents the number of drinks and the y-axis is the corresponding blood alcohol level.

Example of Linear Regression

x5298373535465714
y0.100.030.190.120.040.0950.0700.0600.0200.0500.0700.100.0850.0900.0100.050

We can definitely see a correlation, and although the data doesn’t quite fit on a straight line. It leads us to ask further questions like can we use this data to build a model that estimates a person’s blood-alcohol level and how strong is this model?

One of the tools we can use to model this problem is linear regression. A linear regression takes a two-dimensional data set, with the assumption that one column (generally represented by the x variable) is independent and the second column (generally represented by the y variable) being dependent on the first column. The assumption is that the relationship between the two columns is linear and can be represented by the linear equation

y = 0 + 1x + e.

The right hand side of the above equation has three terms. The first two (0 and 1) are the parameters of the linear equation (the y-intercept and slope respectively), while the third term of the right hand side of the above equation represents the error term. The error term represents the difference between this linear equation and the y values in the data provided. We are seeking a line that minimizes the error term. That is, we are seeking to minimize

D = i = 1 to n [yi – (0 + 1xi)]2

There are several ways one could approach this problem. In fact, there are several lines that one could use to build a linear model. The first line that one may use to model these points is the one generated by only mean of the y values of the points, called the horizontal line regression.

For the data set above, the mean of the y values can be calculated as = 0.0738, so we could build a linear model based on this mean that would be y = 0.738. This horizontal line regression model is a horizontal line that predicts the same score (the mean), regardless of the x value. This lack of adjustments means it is generally a poor fit for most models. But as we will see later, this horizontal line regression model does serve a purpose in determining how well the model we develop performs.

A second attempt at solving this problem would be to generate the least squares line. This is the line that minimizes the D value listed above. We can see that D is a multi-variable polynomial, and we can find the minimum of such a polynomial using calculus, partial derivatives and Gaussian elimination (I will omit the work here because it deters us from the main point of this blog post, however Steven J. Miller has a good write-up of this).

The calculus leads us to the following equations:

SXY = i = 1 to n(xy) –
(i = 1 to nx)(i = 1 to ny)

n
SXX = i = 1 to n(x2) –
(i = 1 to nx)2

n
1 =
SXY

SXX
0 = 1

To calculate the least squares line for this example, we first need to calculate a few values:
i = 1 to n(xy) = 6.98
i = 1 to n(x2) = 443
i = 1 to nx = 77
i = 1 to ny = 1.18
Sxx = 72.44
Sxy = 1.30

This lets us evaluate that
1 = 0.018
and
0 = -0.0127

So the resulting linear equation for this data is

= -0.0127 + 0.018*x

Below is a graph of the two attempts at building a linear model for this data.

Example of Models of Linear Regression

In the above image, the green line represents the horizontal line regression model and the blue line represents the least-squares line. As stated above, the horizontal line regression model is a horizontal line that does not adjust as the data changes. The least-squares line adjusts both the slope and y-intercept of this line according to the data provided to better fit the data provided. The question becomes how well does the least-squares line fit the data.

The Sum of Squares Error (SSE) sums the deviation at each point of our data from the least-squares line.

SSE = i = 1 to n(yii)2

A second metric that we are interested in is how well the horizontal line regression linear model estimates our data. This is called the Total Sum of Squares (SST).

SST = i = 1 to n(yi)2

The horizontal line regression model ignores the independent variable x from our data set and thus any line that takes this independent variable into account will be an improvement on the horizontal line regression model. Because of this, the SST sum is a worse case scenario of how poorly our model can perform.

Knowing now that SST is always greater than SSE, the regression sum of squares (SSR) is the difference between the total sum of squares and the sum of squares error.

SSR = SST – SSE

This tells us how much of the total sum of squares is accounted for by the model.

Finally, the coefficient of determination (r2) is defined by

r2 = SSR / SST

This tells SSR as a percentage of SST, or the amount of the variation in the data that is explained by the model.

So, check out my script on simple linear regression and let me know what you think.

Introduction to Python Programming

Here is a link to my sample Python code.

One of the somewhat unforeseen consequences of taking a career in applied mathematics, particularly in this day and age, is that you will eventually need to write computer programs that implement the mathematical algorithms. There are several languages in which one can do this, each with its own positives and negatives and you will find that things that are simple in some are difficult in another. After speaking with a number of people, both students and professionals who work with mathematics on a regular basis, I reasoned that it may be helpful to provide some source code examples to help mathematicians get started with programming in some of these languages. I decided to start with Python because its a powerful language, available for free, and its learning curve isn’t too steep.

I will be working with Python 2.7, which can be downloaded from https://www.python.org/download/releases/2.7/. I understand, however, that a limitation to coding is the required setup often necessary before one can even write their first line of code. So while I do encourage you to download Python, I will also provide a link to the online Python compiler at Compile Online, which should allow users to simply copy and paste the code into a new tab in their browser and by simply clicking the “Execute Script” command in the upper left corner, see the output of the code.

With that being said, here is a link to my sample Python code.

My Review of “The Golden Ticket: P, NP, and the Search for the Impossible”

The Golden Ticket Image

I came home from work on Wednesday a bit too tired to go for a run and a bit too energetic to sit and watch TV. So I decided to pace around my place while reading a good book. The question was did I have a good book to read. I had been reading sci-fi type books earlier this month and wanted a break from that, so I looked in my mailbox and noticed that I had just received my copy of “The Golden Ticket: P, NP, and the Search for the Impossible“. At the time, I was of the mindset that I had just gotten off of work and really didn’t want to be reading a text book as if I was still at work. But I decided to give it a try and at least make it through the first few pages and if it got to be overwhelming, I’d just put it down and do something else.

About three hours later I was finishing the final pages of the book and impressed that the author (Lance Fortnow) was able to treat complexity theory the same way that I see physics professors speaking about quantum physics and the expanding universe on shows like “the Universe” and “Through the Wormhole with Morgan Freeman” where complex topics are spoken about with everyday terminology. It wouldn’t surprise me to see Dr. Fortnow on shows like “The Colbert Report” or “The Daily Show” introducing the topics in this book to a wider audience.

Below is the review I left on Amazon.

I really enjoyed this book. It was a light enough read to finish in one sitting on a weeknight within a few hours, but also showed its importance by being able to connect the dots between the P = NP problem to issues in health care, economics, security, scheduling and a number of other problems. And instead of talking in a "professor-like" tone, the author creates illustrative examples in Chapters 2 and 3 that are easy to grasp. These examples form the basis for much of the problems addressed in the book.

This is a book that needed to be written and needs to be on everyone's bookshelf, particularly for those asking questions like "what is mathematics" or "what is mathematics used for". This book answers those questions, and towards the end gives examples (in plain English) of the different branches of mathematics and theoretical computer science, without making it read like a text book.

Also, here is a link to the blog that Lance Fortnow and William Gasarch run called “Computational Complexity”, and here is a link to the website of the book, “The Golden Ticket: P, NP, and the Search for the Impossible”

Learn Math Through Set Relations

This is an image of a script I wrote to help users understand mathematics through set theory and relations.

I have just finished a script that helps users understand mathematics through set theory and relations.

Much of our world deals with relationships – both in the sense of romantic ones or ones that show some interesting property between two sets. When mathematicians think of set theory, a relation between the set A and the set B is a set of ordered pairs, where the first element of the ordered pair is from the set A and the second element of the ordered pair id from the set B. So if we say that R is a relation on the sets A and B, that would mean that R consists of elements that look like (a, b) where a is in A and b is in B. Another way of writing this is that R is a subset of A x B. For more on subsets and cross product, I refer you to my earlier script work on set operations.

Relations can provide a useful means of relating an abstract concept to a real world one. I think of things like the QB rating system in the NFL as an example. We have a set of all quarterbacks in the NFL (or really all people who have thrown a pass) and we would like some means of saying that one QB is performing better than another. The set of statistics kept on a QB is a large set, so attempting to show that one QB is better by showing that every year that they played one is better in every statistical category can be (a) exhaustive, and (b) will lead to very few interesting comparisons. Most of the really good QBs have some areas that they are really good and others that they are not. The QB rating system provides a relation between the set of all QBs in the NFL and the set of real numbers. Once this relation was defined, we can say that one QB is performing better than another if his QB rating is higher. Similarly we can compare a QB to his own statistics at different points in his career to see the changes and trends.

This is just one example, and there are countless others that I could have used instead.

Once we understand what a relation is, we have several properties that we are interested in. Below I list four, although there are many more.

Properties of Relations:
A relation R is symmetric if whenever an element (a, b) belongs to R, then so does (b, a).

A relation R is reflexive if for every element a in the universe of the relation, the element (a, a) belongs to R.

A relation R is transitive if for every pair of elements (a, b) and (c, d) and b = c, then the element (a, d) belongs to R.

A relation R is anti-symmetric if the elements (a, b) and (b, a) do not belong to the relation whenever a is not equal to b.

Once we understand what a relation is, there are a few common ones that we are interested in. Below I list four, but again, I want to stress that these are some of the more common ones, but there are several others.

Types of Relations:
A relation R is a function (on its set of defined elements) if there do not exist elements (a, b) and (a, c) which both belong to R.

A relation R is an equivalence relation if R is symmetric, reflexive and transitive.

A relation R is a partial order set if R is anti-symmetric, reflexive and transitive.

A relation R is a total order set if it is a partial order set and for every pair of elements a and b, either (a, b) is in R or (b, a) is in R.

A partial order is just an ordering, but not everything can be compared to everything else. Think about the Olympics, and a sport like gymnastics. Consider the floor and the balance beam. One person can win gold on the floor and another person wins gold on the balance beam. That puts each of them in the “top” of the order for their particular section, but there’s no way of comparing the person who won the floor exercise to the person who won the balance beam. So we say the set is “partially ordered”. More formally, lets say that two people (person X and person Y) relate if they competed in the same event and the the first person (in this case person X) received an equal or higher medal in that event than the second person (in this case person Y). Obviously any person receives the same medal as themselves, so this relation is reflexive. And if Jamie received an equal or higher medal than Bobby and Bobby received an equal or higher medal than Chris, then Jamie must have received an equal or higher medal than Chris so this relation is transitive. To test this relation for anti-symmetry, suppose that Chuck received an equal or higher medal than Charlie and Charlie received an equal or higher medal than Chuck. This means that they must have received the same medal, but since only one medal is awarded at each color for each event (meaning one gold, one silver and one bronze…if this is not true, assume it is), this must mean that Chuck and Charlie are the same person, and this relation is thus anti-symmetric.

If we have a partial ordering where we can compare everything, then we say that the set is “totally ordered”.

An equivalence relation tries to mimic equality on our relation. So, staying with that example of the Olympics, an example of an equivalence relation could be to say that two athletes relate to one another if they both received the same color medal in their event (for the sake of argument lets assume that no athlete competes in more than one event). Then obviously an athlete receives the same medal as themselves, so this relation is reflexive. If two people received the same medal, then it doesn’t matter if we say Chris and Charlie or Charlie and Chris, so the relation is symmetric. And Finally if Chris received the same medal as Charlie an if Charlie received the same medal as Jesse, then all three people received the same medals, so Chris and Jesse received the same medals and this relation is transitive. Because this relation has these three properties, it is called an equivalence relation.

Sieve of Eratosthenes

Prime numbers are an important concept in Number Theory and Cryptography which often uses the difficulty of finding prime numbers as a basis for building encryption systems that are difficult to break without going through all (or a very large number of) possible choices.

Remember that a prime number is a number greater than 1 whose only divisors are 1 and that number itself. One of the most famous algorithms for searching for prime numbers is the Sieve of Eratosthenes. I added a script which implements the Sieve of Eratosthenes to my Examples page.

This algorithm prints out all prime numbers less than a given number by first canceling out all multiples of 2 (the smallest prime), then all multiples of 3 (the second smallest prime), then all multiples of 5 (the third smallest prime – multiples of 4 do not need to be considered because they are also multiples of 2), etc until we have reached a number which cannot be a divisor of this maximum number.

So if we are given a number, n, the first step of the algorithm is write out a table that lists all the numbers that are less than n. For example lets run this Sieve on 50. So all numbers less than 50 are

1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50.

So since 1 is not a prime number (by the definition of prime numbers), we cancel that number out.

1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50.

Next, we look at the list and the first number that is not crossed out is a prime. That number is 2. We will put a mark by 2 and cancel out all of 2’s multiples.

1, 2*, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50.

Again, we look at the list and the first number that is not marked or crossed out is 3, so that number is prime. We will put a mark by 3 and cancel out all of 3’s multiples.

1, 2*, 3*, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50.

Once again, we look at the list and the first number that is not marked or crossed out is 5, so that number is prime. We will put a mark by 5 and cancel out all of 5’s multiples.

1, 2*, 3*, 4, 5*, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50.

We look at the list and the first number that is not marked or crossed out is 7, so that number is prime. We will put a mark by 7 and cancel out all of 7’s multiples.

1, 2*, 3*, 4, 5*, 6, 7*, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50.

Now, we look and the first number that is not crossed out is 11. However, since 11 is greater than sqrt(50) we know that each of 11’s multiples that are less than 50 will have been cancelled out by a previous prime number. So we have finished the algorithm.

Check out my script which implements the Sieve of Eratosthenes for more examples.