
Simple Linear Regression

I finished a script that helps explain the concepts of simple linear regression.

We live in a world that is filled with patterns, patterns all around us just waiting to be discovered. Some of these patterns, though, are harder to discover because of outside noise.

Consider for example an experiment where a set of people were each given the task of drinking a number of beers and having their blood alcohol level taken afterwards. Some noise factors in this could include the height and weight of the individual, the types of drinks, the amount of food eaten, and the time between drinks. Even with this noise, though, we can still see a correlation between the number of drinks and their blood alcohol level. Consider the following graph showing people’s blood-alcohol level after a given number of drinks. The x-axis represents the number of drinks and the y-axis is the corresponding blood alcohol level.

[Figure: Example of Linear Regression, a scatter plot of the data below]

x: 5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4
y: 0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.070, 0.060, 0.020, 0.050, 0.070, 0.10, 0.085, 0.090, 0.010, 0.050

We can definitely see a correlation, even though the data doesn't quite fit on a straight line. This leads us to ask further questions: can we use this data to build a model that estimates a person's blood-alcohol level, and how strong would such a model be?

One of the tools we can use to model this problem is linear regression. A linear regression takes a two-dimensional data set, with the assumption that one column (generally represented by the x variable) is independent and that the second column (generally represented by the y variable) is dependent on the first. The relationship between the two columns is assumed to be linear and can be represented by the linear equation

y = β₀ + β₁x + ε.

The right hand side of the above equation has three terms. The first two (β₀ and β₁) are the parameters of the linear equation (the y-intercept and slope respectively), while the third term, ε, is the error term. The error term represents the difference between this linear equation and the y values in the data provided. We are seeking a line that minimizes the error term. That is, we are seeking to minimize

D = Σᵢ₌₁ⁿ [yᵢ – (β₀ + β₁xᵢ)]²

There are several ways one could approach this problem. In fact, there are several lines that one could use to build a linear model. The first line one might use to model these points is the one generated by only the mean of the y values, called the horizontal line regression.

For the data set above, the mean of the y values can be calculated as ȳ = 0.0738, so we could build a linear model based on this mean: ŷ = 0.0738. This horizontal line regression model is a horizontal line that predicts the same score (the mean) regardless of the x value. This lack of adjustment means it is generally a poor fit for most data sets. But as we will see later, the horizontal line regression model does serve a purpose in determining how well the model we develop performs.
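As a quick sketch of this idea, here is a minimal Python example (my own illustration, not the script from the post) that builds the horizontal line regression model from the data above:

    # Data from the table above: number of drinks (x), blood alcohol level (y).
    x = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]
    y = [0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.070, 0.060,
         0.020, 0.050, 0.070, 0.10, 0.085, 0.090, 0.010, 0.050]

    # The horizontal line regression predicts the mean of y for every x.
    y_bar = sum(y) / len(y)  # 0.07375, the 0.0738 quoted above after rounding

    def horizontal_model(x_value):
        return y_bar  # the same prediction regardless of x

    print(horizontal_model(5), horizontal_model(9))  # identical outputs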

A second attempt at solving this problem would be to generate the least squares line. This is the line that minimizes the D value listed above. We can see that D is a multi-variable polynomial, and we can find the minimum of such a polynomial using calculus, partial derivatives and Gaussian elimination (I will omit the work here because it detracts from the main point of this blog post; however, Steven J. Miller has a good write-up of this).

The calculus leads us to the following equations:

Sxy = Σᵢ₌₁ⁿ xᵢyᵢ – (Σᵢ₌₁ⁿ xᵢ)(Σᵢ₌₁ⁿ yᵢ) / n

Sxx = Σᵢ₌₁ⁿ xᵢ² – (Σᵢ₌₁ⁿ xᵢ)² / n

β₁ = Sxy / Sxx

β₀ = ȳ – β₁x̄

where x̄ and ȳ are the means of the x and y values respectively.

To calculate the least squares line for this example, we first need to calculate a few values:
Σᵢ₌₁ⁿ xᵢyᵢ = 6.98
Σᵢ₌₁ⁿ xᵢ² = 443
Σᵢ₌₁ⁿ xᵢ = 77
Σᵢ₌₁ⁿ yᵢ = 1.18
Sxx = 72.44
Sxy = 1.30

This lets us evaluate that
β₁ = 0.018
and
β₀ = -0.0127

So the resulting linear equation for this data is

ŷ = -0.0127 + 0.018x
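To make the arithmetic above concrete, here is a short Python sketch (again my own illustration) that computes these sums and the least-squares coefficients:

    # Same data as in the horizontal line regression sketch above.
    x = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]
    y = [0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.070, 0.060,
         0.020, 0.050, 0.070, 0.10, 0.085, 0.090, 0.010, 0.050]
    n = len(x)

    sum_x = sum(x)                                 # 77
    sum_y = sum(y)                                 # 1.18
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))  # 6.98
    sum_x2 = sum(xi * xi for xi in x)              # 443

    s_xy = sum_xy - sum_x * sum_y / n  # about 1.30
    s_xx = sum_x2 - sum_x ** 2 / n     # about 72.44

    beta1 = s_xy / s_xx                    # about 0.018
    beta0 = sum_y / n - beta1 * sum_x / n  # about -0.0127

    print(f"y-hat = {beta0:.4f} + {beta1:.4f}x")  # y-hat = -0.0127 + 0.0180x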

Below is a graph of the two attempts at building a linear model for this data.

[Figure: Example of Models of Linear Regression]

In the above image, the green line represents the horizontal line regression model and the blue line represents the least-squares line. As stated above, the horizontal line regression model is a horizontal line that does not adjust as the data changes. The least-squares line adjusts both the slope and the y-intercept according to the data provided to achieve a better fit. The question becomes: how well does the least-squares line fit the data?

The Sum of Squares Error (SSE) sums the squared deviation at each point of our data from the least-squares line.

SSE = Σᵢ₌₁ⁿ (yᵢ – ŷᵢ)²

A second metric that we are interested in is how well the horizontal line regression model estimates our data. This is called the Total Sum of Squares (SST).

SST = Σᵢ₌₁ⁿ (yᵢ – ȳ)²

The horizontal line regression model ignores the independent variable x in our data set, so any line that takes this independent variable into account will be an improvement on it. Because of this, the SST is a worst-case scenario of how poorly our model can perform.

Knowing now that SST is always at least as large as SSE, the regression sum of squares (SSR) is defined as the difference between the total sum of squares and the sum of squares error.

SSR = SST – SSE

This tells us how much of the total sum of squares is accounted for by the model.

Finally, the coefficient of determination (r²) is defined by

r² = SSR / SST

This expresses SSR as a percentage of SST: the proportion of the variation in the data that is explained by the model.
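Continuing the Python sketch above (with x, y, beta0 and beta1 already defined), these quantities take only a few lines:

    # Continues the least-squares sketch: x, y, beta0 and beta1 as above.
    y_bar = sum(y) / len(y)

    sse = sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    ssr = sst - sse

    r_squared = ssr / sst  # fraction of the variation explained by the model
    print(round(r_squared, 2))  # about 0.8 for this data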

So, check out my script on simple linear regression and let me know what you think.

Learn Duality in Linear Programming

I have just completed a script to help learn about duality in linear programming.

Many real-world problems can be formulated as linear programming problems. There are often many different ways to formulate a single problem. Some of these alternative formulations can easily be shown to be equivalent via simple algebra and arithmetic. For example, one person may see a problem as maximizing profit while another may see the same problem as minimizing losses. The relationship between the two alternative formulations can then be shown to be that one simply has the negative objective function value of the other.

Sometimes, though, this relationship between alternative formulations is not as easy to detect. Two alternative formulations that arise regularly in linear programming problems are a primal problem and a dual problem. The dual of a linear programming problem is the problem of finding the best bound on the objective function in terms of the constraints. This dual is formulated so that every feasible solution to the dual is a bound on the primal objective function. The Weak Duality Theorem for Linear Programming says that the optimal solution for a minimization problem is always an upper bound for its dual (which is a corresponding maximization problem). Correspondingly, the optimal solution for a maximization problem is always a lower bound for its dual. This leads to the Strong Duality Theorem for Linear Programming which says that if we can find feasible solutions to the primal and dual problems with matching objective function values, then both of these solutions must be optimal for their respective problems.

This shows the importance of duality. What my script provides is a means of formulating the dual of a given problem to help understand the concept.

The rules for constructing a dual linear program are as follows:

Primal:                             Dual:
The ith constraint is ≤             Variable yᵢ is ≥ 0
The ith constraint is ≥             Variable yᵢ is ≤ 0
The ith constraint is =             Variable yᵢ is unbounded
Variable xⱼ is ≥ 0                  The jth constraint is ≥
Variable xⱼ is ≤ 0                  The jth constraint is ≤
Variable xⱼ is unbounded            The jth constraint is =
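To show how mechanical these rules are, here is a small Python sketch (a hypothetical helper of my own, not part of the script) that applies the table, assuming the primal is a maximization problem max cᵀx with constraints Ax compared against b, so that the dual is a minimization problem:

    # Applies the table above: max c^T x, Ax (sense) b  ->  min b^T y, A^T y (sense) c.

    def dual_of_max(c, A, b, senses, signs):
        # Columns of the primal matrix become rows of the dual matrix.
        A_T = [list(col) for col in zip(*A)]

        # Rows 1-3 of the table: a primal constraint's sense fixes the
        # sign restriction on the corresponding dual variable y_i.
        y_signs = [{'<=': '>= 0', '>=': '<= 0', '=': 'unbounded'}[s]
                   for s in senses]

        # Rows 4-6 of the table: a primal variable's sign fixes the
        # sense of the corresponding dual constraint.
        con_senses = [{'>= 0': '>=', '<= 0': '<=', 'unbounded': '='}[s]
                      for s in signs]

        return {'minimize': b, 'matrix': A_T, 'rhs': c,
                'constraint_senses': con_senses, 'variable_signs': y_signs}

    # Example: max 3x1 + 5x2 s.t. x1 <= 4, 2x2 <= 12, 3x1 + 2x2 <= 18, x >= 0.
    print(dual_of_max([3, 5], [[1, 0], [0, 2], [3, 2]], [4, 12, 18],
                      ['<=', '<=', '<='], ['>= 0', '>= 0']))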

Other Blogs that have covered this topic:
A Narrow Margin
Optimization and data mining

Linear Search Algorithm

I have published code that shows examples of the Linear Search Algorithm.

The linear search algorithm iterates through each item in our data structure in search of a specific value. If the current item matches, we can return; otherwise, we must continue to the next item.

In the worst case, this requires that we search through all items, because in an unsorted structure we cannot say whether an untested value is the value we are searching for.
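Here is a minimal Python version of the algorithm (my own sketch, not the published code):

    def linear_search(items, target):
        """Return the index of target in items, or -1 if it is not present."""
        for index, item in enumerate(items):
            if item == target:  # the current item matches
                return index
        return -1  # searched every item without finding a match

    print(linear_search([4, 2, 7, 1, 9], 7))  # 2
    print(linear_search([4, 2, 7, 1, 9], 3))  # -1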

Other Blogs that have covered this topic:
Dream.In.Code
Coding Bot