# ID3 Algorithm Decision Trees

As I grow LEARNINGlover.com, I’m always thinking of different ways to expose my own personality through the site. This is partially because it is easier for me to talk about subjects where I am already knowledgeable, but it is more-so being done to help make some of these algorithms and concepts I encode more understandable, and sometimes relating foreign concepts to everyday life makes them easier to understand.

Today, I’d like to write about decision trees, and the ID3 algorithm for generating decision trees in particular. This is a machine learning algorithm that builds a model from a training data set consisting of a feature vector and an outcome. Because our data set consists of an outcome element, this falls into the category of supervised machine learning.

The model that the ID3 algorithm builds is called a decision tree. It builds a tree based on the features, or columns of the data set with a possible decision corresponding to each value that the feature can have. The algorithm selects the next feature by asking “which feature tells me the most about our data set?” This question can be answered first by asking how much “information” is in the data set, and then comparing that result with the amount of information in each individual feature.

In order to execute this algorithm we need a way to measure both the amount the information in outcomes of the overall data set as well as how much each feature tells us about the data set. For the first, we will use entropy, which comes from the field of information theory and encoding. Entropy is based on the question of how many bits are necessary to encode the information in a set. The more information, the higher the entropy, and the more bits required to encode that information. Although we are not encoding, the correlation between high information and high entropy suits our purposes.

To understand how much each feature tells us about the outcomes of the data set we will build on the concept of entropy to define the information gain of a feature. Each feature has multiple options, so the dataset can be partitioned based on each possible value of this feature. Once we have this partition, we can calculate the entropy of each subset of the rows of data. We define the information gain of a feature as the sum over all possible outcomes of that feature can have of the entropy of that outcome multiplied by the probability of that outcome.

To illustrate this algorithm, I decided to relate it to the question of whether we think of a character in a novel as a hero or villain. This is interesting because I try to read at least one book a month and as I’m reading, I often find myself asking this question about characters based on the traits of the characters as well as characters I’ve read about. In order to build an interactive script for this problem, I considered 25 possible character traits that could be present. A subset of these 25 character traits will be selected and a row will be generated grading a fictional character on a scale of 0 to 3 (0 meaning that they do not possess the trait at all, 3 meaning that the trait is very strong in their personality), and users will be asked whether they think a character with the given character traits should be listed as a hero or a villain. Then there is a button at the bottom of the script with the text “Build Tree” that executes the ID3 Algorithm and shows a decision tree that could be used to reach the set of decisions given by the user.

The possible features are:
Abstract, Adaptable, Aggressive, Ambition, Anxiety, Artistic, Cautious, Decisive, Honesty, Dutiful, Fitness, Intellect, Independent, Introverted, Lively, Open-minded, Orderly, Paranoid, Perfectionist, Romantic, Sensitive, Stable, Tension, Warmth and Wealthy

Once users select the option to build the tree, there will be several links outlining each step in the process to build this tree. These links will allow for users to expand the information relating to that step and minimize that information when done. Hopefully this will help users to understand each step more. I must say that as much fun as it has been writing this program, there were several questions when trying to explain it to others. Hopefully users get as much fun from using this tool as I had in creating it. As always feel free to contact me with any comments and or questions.

Ok, so here’s a link to the ID3 Algorithm Page. Please check it out and let me know what you think.

I had been meaning to write a script and blog post on descriptive statistics for some time now, but with work and winter weather and the extra work that winter weather brings, and now that the winter weather is over trying to get back into an exercise routine (running up a hill is such a challenging experience, but when I get to the top of that hill I feel like Rocky Balboa on the steps at the steps at the entrance of the Philadelphia Museum of Art), I haven’t had the time to devote to this site that I would have liked. Well, that’s not entirely true. I have still been programming in my spare time. I just haven’t been able to share it here. I went to a conference in February and in my down time, I was able to write a script on descriptive statistics that I think gives a nice introduction to the area.

Before I go into descriptive statistics though, lets talk about statistics, which is concerned with the collection, analysis, interpretation and presentation of data. Statistics can generally be broken down into two categories, descriptive statistics and infernalinferential statistics, depending on what we would like to do with that data. When we are concerned with visualizing and summarizing the given data, descriptive statistics gives methods to operate on this data set. On the other hand, if we wish to draw conclusions about a larger population from our sample, then we would use methods from inferential statistics.

In the script on descriptive statistics I’ve written, I consider three different types of summaries for descriptive statistics:

Measures of Central Tendency
Mean – the arithmetic average of a set of values
Median – the middle number in a set of values
Mode – the most used number in a set of values

Dispersion
Maximum – the largest value in the data set
Minimum – the smallest value in the data set
Standard Deviation – the amount of variation in a set of data values
Variance – how far a set of numbers is spread out

Shape
Kurtosis – how peaked or flat a data set is
Skewness – how symmetric a data set is

Plots
Histogram Plots – a bar diagram where the horizontal axis shows different categories of values, and the height of each bar is related to the number of observations in the corresponding category.
Box and Whisker Plots – A box-and-whisker plot for a list of numbers consists of a rectangle whose left edge is at the first quartile of the data and whose right edge is at the third quartile, with a left whisker sticking out to the smallest value, and a right whisker sticking out to the largest value.
Stem and Leaf Plots – A stem and leaf plot illustrates the distribution of a group of numbers by arranging the numbers in categories based on the first digit.

# Naive Bayesian Classification

Suppose there is a sequence of events that took place e1, …, ene, with each event belonging to a certain classification group g1, …, gng. Then the problem of determining which of these groups a new event belongs is the classification problem.

A naive Bayes classifier will determine to which of the possible classification groups a new observation belongs with the (naive) assumption that every feature of this new observation has no relationship to any other feature of this observation.

This assumption of independence of the columns of the feature vector allows us to use Bayes’ Theorem to determine the probability that the new observation will belong to each of the observation groups. Bayes’ Theorem will then say that the probability this new observation belongs to a classification group, given the features is equal to the probability of the occurrence of that classification group in the observed data (i.e. P(C)) multiplied by the conditional probability of the joint distribution of the features given the same classification group P(F1, …, Fnf. The naive assumption allows us to quickly calculate the joint distribution of the features, given the classification group as the product of each feature given that same classification group.

This can be written as:

P(C | F1, …, Ffn) =
 P(C) P(F1, …, Ffn | C) P(F1, …, Ffn)
= P(C) P(F1 | C) * … * P(Ffn | C
 P(C) P(F1, …, Ffn | C) P(F1, …, Ffn)
= P(C) i = 1 to nfP(Fi | C)
 P(C) P(F1, …, Ffn | C) P(F1, …, Ffn)

So suppose we have observations that give the following data:
P(F1 | N) = 0.286
P(F2 | N) = 0.143
P(F3 | N) = 0.429
P(F4 | N) = 0.429

P(F1 | Y) = 0.143
P(F2 | Y) = 0.571
P(F3 | Y) = 0.571
P(F4 | Y) = 0.571

P(Y) = 0.5
P(N) = 0.5

Then P(N | F1, F2, F3, F4) = (0.5 * 0.286 * 0.143 * 0.429 * 0.429)
= 0.008

Then P(Y | F1, F2, F3, F4) = (0.5 * 0.143 * 0.571 * 0.571 * 0.571)
= 0.001

After we normalize the two terms, we wind up with Then P(N | F1, F2, F3, F4) = .2204

and

Then P(Y | F1, F2, F3, F4) = .7796

So the naive Bayes classifier says that the likely classification group for this observation is Y.

Check out my examples page for more examples of the naive Bayes classifier.

# Simple Linear Regression

We live in a world that is filled with patterns – patterns all around us just waiting to be discovered. Some of these patterns are not as easily discovered because of the existence of outside noise.

Consider for example an experiment where a set of people were each given the task of drinking a number of beers and having their blood alcohol level taken afterwards. Some noise factors in this could include the height and weight of the individual, the types of drinks, the amount of food eaten, and the time between drinks. Even with this noise, though, we can still see a correlation between the number of drinks and their blood alcohol level. Consider the following graph showing people’s blood-alcohol level after a given number of drinks. The x-axis represents the number of drinks and the y-axis is the corresponding blood alcohol level.

 x 5 2 9 8 3 7 3 5 3 5 4 6 5 7 1 4 y 0.1 0.03 0.19 0.12 0.04 0.095 0.07 0.06 0.02 0.05 0.07 0.1 0.085 0.09 0.01 0.05

We can definitely see a correlation, and although the data doesn’t quite fit on a straight line. It leads us to ask further questions like can we use this data to build a model that estimates a person’s blood-alcohol level and how strong is this model?

One of the tools we can use to model this problem is linear regression. A linear regression takes a two-dimensional data set, with the assumption that one column (generally represented by the x variable) is independent and the second column (generally represented by the y variable) being dependent on the first column. The assumption is that the relationship between the two columns is linear and can be represented by the linear equation

y = 0 + 1x + e.

The right hand side of the above equation has three terms. The first two (0 and 1) are the parameters of the linear equation (the y-intercept and slope respectively), while the third term of the right hand side of the above equation represents the error term. The error term represents the difference between this linear equation and the y values in the data provided. We are seeking a line that minimizes the error term. That is, we are seeking to minimize

D = i = 1 to n [yi – (0 + 1xi)]2

There are several ways one could approach this problem. In fact, there are several lines that one could use to build a linear model. The first line that one may use to model these points is the one generated by only mean of the y values of the points, called the horizontal line regression.

For the data set above, the mean of the y values can be calculated as = 0.0738, so we could build a linear model based on this mean that would be y = 0.738. This horizontal line regression model is a horizontal line that predicts the same score (the mean), regardless of the x value. This lack of adjustments means it is generally a poor fit for most models. But as we will see later, this horizontal line regression model does serve a purpose in determining how well the model we develop performs.

A second attempt at solving this problem would be to generate the least squares line. This is the line that minimizes the D value listed above. We can see that D is a multi-variable polynomial, and we can find the minimum of such a polynomial using calculus, partial derivatives and Gaussian elimination (I will omit the work here because it deters us from the main point of this blog post, however Steven J. Miller has a good write-up of this).

The calculus leads us to the following equations:

SXY = i = 1 to n(xy) –
 (i = 1 to nx)(i = 1 to ny) n
SXX = i = 1 to n(x2) –
 (i = 1 to nx)2 n
1 =
 SXY SXX
 0 = – 1

To calculate the least squares line for this example, we first need to calculate a few values:
i = 1 to n(xy) = 6.98
i = 1 to n(x2) = 443
i = 1 to nx = 77
i = 1 to ny = 1.18
Sxx = 72.44
Sxy = 1.30

This lets us evaluate that
1 = 0.018
and
0 = -0.0127

So the resulting linear equation for this data is

= -0.0127 + 0.018*x

Below is a graph of the two attempts at building a linear model for this data.

In the above image, the green line represents the horizontal line regression model and the blue line represents the least-squares line. As stated above, the horizontal line regression model is a horizontal line that does not adjust as the data changes. The least-squares line adjusts both the slope and y-intercept of this line according to the data provided to better fit the data provided. The question becomes how well does the least-squares line fit the data.

The Sum of Squares Error (SSE) sums the deviation at each point of our data from the least-squares line.

SSE = i = 1 to n(yii)2

A second metric that we are interested in is how well the horizontal line regression linear model estimates our data. This is called the Total Sum of Squares (SST).

SST = i = 1 to n(yi)2

The horizontal line regression model ignores the independent variable x from our data set and thus any line that takes this independent variable into account will be an improvement on the horizontal line regression model. Because of this, the SST sum is a worse case scenario of how poorly our model can perform.

Knowing now that SST is always greater than SSE, the regression sum of squares (SSR) is the difference between the total sum of squares and the sum of squares error.

SSR = SST – SSE

This tells us how much of the total sum of squares is accounted for by the model.

Finally, the coefficient of determination (r2) is defined by

r2 = SSR / SST

This tells SSR as a percentage of SST, or the amount of the variation in the data that is explained by the model.

So, check out my script on simple linear regression and let me know what you think.

# A JavaScript Implentation of MapReduce’s WordCount

You can view a javascript implementation of the WordCound Program in Mapreduce at Javascript Implementation of Mapreduce WordCount

One of the big things in the world of Data Science and Cloud Computing is the map-reduce implementation of various algorithms. This is not always a straightforward procedure and so learning to think in terms of map-reduce implementations can be a challenging conversion from thinking in a functional programming frame of mind. In light of this I thought it would be convenient to try to help users “visualize” this concept. This is a challenging task because there are many concepts of cloud computing that I am unable to provide in this environment. However, just as many of the books on MapReduce provide pseudo-code on various implementations of algorithms in a Map-Reduce environment, I will attempt to show how data flows from the input to the mappers to the shuffle and sort phase to the reducers and finally to generate the output. I leave the users the task of actually putting these into the context of a Java MapReduce environment.

I want to first speak about the concept of (key, value) pairs which is a very important in MapReduce programming. I will speak about this in the context of a WordCount program. The purpose of a WordCount program is to count the number of occurrences of each word in a given file. First data is input to the mapper in (key, value) pairs. For our example, the key will be the line number of input (so each line of input will go to a different mapper) and the value will be the text present on that line. Once the mapper has the input, it will perform some operation on it and output data again in (key, value) pairs. In the WordCount example, the mappers will simply output each word that occurs as a new key on that line and the integer “1” as the associated value (note that a single mapper can output multiple (key, value) pairs).

One of the main things to understand in a MapReduce is that there are a number of Mappers running on a given input file and these Mappers cannot interact with one another. Suppose we have two different mappers, lets call them Mapper1 and Mapper2 that are each working on two different lines of input from a file. If both lines of input have the word “apple”, there is no way for Mapper2 to know that Mapper1‘s line of input also has this word. In this setting that’s perfectly fine because the Shuffle and Sort phase is where all the (key, value) pairs that were output by the mappers, compares the keys to one another and if they are equal to one another combines their respective values into a list of values. Unequal keys are sorted.

So if both Mapper1 and Mapper2 contained the word “apple” in their line of text, then the (key, value) pair (apple, 1) will occur twice. So the Shuffle and Sort phase will notice this and output the (key, value) pair (apple, {1, 1}).

Each reducer is then given a key and a list of values that were output by the mappers. The goal will be to perform some operation and again output data in (key, value) pairs. In the WordCount example, we will use what is known as the sumReducer. It gets this name because its job is simply to sum the values in the list of values and output the (key, value) pair that is the original key and this sum of values.

You can view a javascript implementation of this at Javascript Implementation of Mapreduce WordCount

# Hidden Markov Models: The Baum-Welch Algorithm

Suppose you are at a table at a casino and notice that things don’t look quite right. Either the casino is extremely lucky, or things should have averaged out more than they have. You view this as a pattern recognition problem and would like to understand the number of ‘loaded’ dice that the casino is using and how these dice are loaded. To accomplish this you set up a number of Hidden Markov Models, where the loaded die are the latent (hidden) variables, and would like to determine which of these, if any is more likely to be using.

First lets go over a few things.

We will call each roll of the dice an observation. The observations will be stored in variables o1, o2, …, oT, where T is the number of total observations.

To generate a hidden Markov Model (HMM) we need to determine 5 parameters:

• The N states of the model, defined by S = {S1, …, SN}
• The M possible output symbols, defined by = {1, 2, …, M}
• The State transition probability distribution A = {aij}, where aij is the probability that the state at time t+1 is Sj, given that the state at time t is Si.
• The Observation symbol probability distribution B = {bj(k)} where bj(k) is the probability that the symbol k is emitted in state Sj.
• The initial state distribution = {i}, where i is the probability that the model is in state Si at time t = 0.

The HMMs we’ve generated are based on two questions. For each question, you have provided 3 different answers which leads to 9 possible HMMs. Each of these models has its corresponding state transition and emission distributions.

• How often does the casino change dice?
• 0) Dealer Repeatedly Uses Same Dice
• 1) Dealer Uniformly Changes Die
• 2) Dealer Rarely Uses Same Dice
• Which sides on the loaded dice are more likely?
• 0) Larger Numbers Are More Likely
• 1) Numbers Are Randomly Likely
• 2) Smaller Numbers Are More Likely
How often does the casino change dice?
Which sides on
are more likely?
 (0, 0) (0, 1) (0, 2) (1, 0) (1, 1) (1, 2) (2, 0) (2, 1) (2, 2)

One of the interesting problems associated with Hidden Markov Models is called the Learning Problem, which asks the question “How can I improve a HMM so that it would be more likely to have generated the sequence O = o1, o2, …, oT?

The Baum-Welch algorithm answers this question using an Expectation-Maximization approach. It creates two auxiliary variables t(i) and t(i, j). The variable t(i) represents the probability of being in state i at time t, given the entire observation sequence. Likewise t(i, j) represents the joint probability of being in state i at time t and of being in state j at time t+1, given the entire observation sequence. They can be calculated by

t(i) =  (t(i) * t(i) ) j = 1 to N(t(j) * t(j))

and

t(i, j) =  (t(i) * ai, j * t+1(j) * bj(ot+1) ) i’ = 1 to Nj’ = 1 to N(t(i’) * ai’, j’ * t+1(j’) * bj’(ot+1) )

As you can see, these are a direct result of calculations of from the Forward algorithm and from the Backwards algorithm. Once we have calculated these variables, we can update the parameters of the model as follows:

i = 1(i)

i,j = t = 1 to T-1(t(i)) t = t to T-1 (t(i, j))

// [b bar]_{j, k} = Sigma_{t = 1 to T, o_t = o_k} gamma_{t, j} / Sigma_{t = 1 to T} gamma_{t, j}, 1 <= j <= N, 1 <= k <= M

j(ok) =  t = 1 to T-1, ot = ok t(j) t = 1 to T-1 t(j)

We can iterate this procedure a finite number of times or until it converges. This will generate a new model, = {N, , , , }.

There is more on this example at LEARNINGlover.com: Hidden Marokv Models: The Baum-Welch Algorithm.

Some further reading on Hidden Markov Models:

# Hidden Markov Models: The Viterbi Algorithm

I just finished working on LEARNINGlover.com: Hidden Marokv Models: The Viterbi Algorithm. Here is an introduction to the script.

Suppose you are at a table at a casino and notice that things don’t look quite right. Either the casino is extremely lucky, or things should have averaged out more than they have. You view this as a pattern recognition problem and would like to understand the number of ‘loaded’ dice that the casino is using and how these dice are loaded. To accomplish this you set up a number of Hidden Markov Models, where the loaded die are the latent variables, and would like to determine which of these, if any is more likely to be using.

First lets go over a few things.

We will call each roll of the dice an observation. The observations will be stored in variables o1, o2, …, oT, where T is the number of total observations.

To generate a hidden Markov Model (HMM) we need to determine 5 parameters:

• The N states of the model, defined by S = {S1, …, SN}
• The M possible output symbols, defined by = {1, 2, …, M}
• The State transition probability distribution A = {aij}, where aij is the probability that the state at time t+1 is Sj, given that the state at time t is Si.
• The Observation symbol probability distribution B = {bj(k)} where bj(k) is the probability that the symbol k is emitted in state Sj.
• The initial state distribution = {i}, where i is the probability that the model is in state Si at time t = 0.

The HMMs we’ve generated are based on two questions. For each question, you have provided 3 different answers which leads to 9 possible HMMs. Each of these models has its corresponding state transition and emission distributions.

• How often does the casino change dice?
• 0) Dealer Repeatedly Uses Same Dice
• 1) Dealer Uniformly Changes Die
• 2) Dealer Rarely Uses Same Dice
• Which sides on the loaded dice are more likely?
• 0) Larger Numbers Are More Likely
• 1) All Numbers Are Equally Likely
• 2) Smaller Numbers Are More Likely
How often does the casino change dice?
Which sides on
are more likely?
 (0, 0) (0, 1) (0, 2) (1, 0) (1, 1) (1, 2) (2, 0) (2, 1) (2, 2)

One of the interesting problems associated with Hidden Markov Models is called the Decoding Problem, which asks the question “What is the most likely sequence of states that the HMM would go through to generate the sequence O = o1, o2, …, oT?

The Viterbi algorithm finds answers this question using Dynamic Programming. It creates an auxiliary variable t(i) which has the highest probability that the partial observation sequence o1, …, ot can have, given that the current state is i. This variable can be calculated by the following formula:

t(i) = maxq1, …, qt-1 p{q1, …, qt-1, qt = i, o1, …, ot | }.

1(j) = jbj(o1), for 1 j N.

Once we have calculated t(j) we also keep a pointer to the max state. We can then find the optimal path by looking at arg max 1 j N T(j) and then backtrack the sequence of states using the pointer.

There is more on this example at LEARNINGlover.com: Hidden Marokv Models: The Viterbi Algorithm.

Some further reading on Hidden Markov Models:

# Covariance of Vectors

Most of the things we think about have many different ways we could describe them. Take for example a movie. I could describe a movie by its genre, its length, the number of people in the movie, the number of award winners, the length of the explosions, the number of fight scenes, the number of scenes, the rating it was given by a certain critic, etc. The list goes on and on. How much do these things influence one another? How likely is a person to enjoy a movie? Is that related to the number of award winners in the movie? Answering this type of a question can often help understand things like what might influence a critics rating or more importantly which movies are worth my \$15 ticket price.

Movies are just one example of this. Other areas like sports, traffic congestion, or food and a number of others can be analyzed in a similar manner. With data becoming available at unprecedented rates and areas like cloud computing and data science becoming key buzzwords in industry, the ability to understand these relationships is becoming more and more important.

As a mathematician, I enjoy being able to say with certainty that some known truth is the cause of some other known truth, but it not always easy (or even possible) to prove the existence of such a relationship. We are left instead with looking at trends in data to see how similar things are to one another over a data set. Measuring the covariance of two or more vectors is one such way of seeking this similarity.

Before delving into covariance though, I want to give a refresher on some other data measurements that are important to understanding covariance.
– Sum of a vector:
If we are given a vector of finite length we can determine its sum by adding together all the elements in this vector. For example, consider the vector v = (1, 4, -3, 22). Then sum(v) = 1 + 4 + -3 + 22 = 24.

– Length of a vector:
If we are given a vector of finite length, we call the number of elements in the vector the length of the vector. So for the example above with the vector v = (1, 4, -3, 22), there are four elements in this vector, so length(v) = 4.

– Mean of a vector:
The mean of a finite vector is determined by calculating the sum and dividing this sum by the length of the vector. So, working with the vector above, we already calculated the sum as 24 and the length as 4, which we can use to calculate the mean as the sum divided by the length, or 24 / 4 = 6.

– Variance of a vector:
Once we know the mean of a vector, we are also interested in determining how the values of this vector are distributed across its domain. The variance measures this by calculating the average deviation from the mean. Here we calculate the deviation from the mean for the ith element of the vector v as (vi)2. We can get the average deviation from the mean then by computing the average of these values.

So if the vector v has n elements, then the variance of v can be calculated as Var(v) = (1/n)i = 1 to n((vi)2).

Once again dealing with the vector above with v = (1, 4, -3, 22), where the mean is 6, we can calculate the variance as follows:

 vi vi – (vi – )2 1 -5 25 4 -2 4 -3 -9 81 22 16 256

To calculate the mean of this new vector (25, 4, 81, 324), we first calculate the sum as 25 + 4 + 81 + 256 = 366. Since the length of the new vector is the same as the length of the original vector, 4, we can calculate the mean as 366 / 4 = 91.5

The covariance of two vectors is very similar to this last concept. Instead of being interested in how one vector is distributed across its domain as is the case with variance, covariance is interested in how two vectors X and Y of the same size are distributed across their respective means. What we are able to determine with covariance is things like how likely a change in one vector is to imply change in the other vector. Having a positive covariance means that as the value of X increases, so does the value of Y. Negative covariance says that as the value of X increases, the value of Y decreases. Having zero covariance means that a change in the vector X is not likely to affect the vector Y.

With that being said, here is the procedure for calculating the covariance of two vectors. Notice that it is very similar to the procedure for calculating the variance of two vectors described above. As I describe the procedure, I will also demonstrate each step with a second vector, x = (11, 9, 24, 4)

1. Calculate the means of the vectors.
As we’ve seen above, the mean of v is 6.
We can similarly calculate the mean of x as 11 + 9 + 24 + 4 = 48 / 4 = 12

2. Subtract the means of the vectors from each element of the vector (xiX) and (YiY).

We did this for v above when we calculated the variance. Below are the values for v and for x as well.

vi – meanv

-5

-2

-9

16

 i vi xi xi – meanx 1 1 11 -1 2 4 9 -3 3 -3 24 12 4 22 4 -8

3. For each element i, multiply the terms (xiX) and (YiY).

This gives us the following vector in our example:
(-5)(-1), (-2)(-3), (-9)(12), (16)(-8) = (5, 6, -108, -128).

4. Sum the elements obtained in step 3 and divide this number by the total number of elements in the vector X (which is equal to the number of elements in the vector Y).

When we sum the vector from step 3, we wind up with 5 + 6 + -108 + -128 = -225
And the result of dividing -225 by 4 gives us -225/4 = – 56.25.

This final number, which for our example is -56.25, is the covariance.

Some important things to note are

• If the covariance of two vectors is positive, then as one variable increases, so does the other.
• If the covariance of two vectors is negative, then as one variable increases, the other decreases.
• If the covariance of two vectors is 0, then one variable increasing (decreasing) does not impact the other.
• The larger the absolute value of the covariance, the more often the two vectors take large steps at the same time.
• A low covariance does not necessarly mean that the two variables are independent. I’ll give a quick example to illustrate that.
Consider the vectors x and y given by x = (-3, -2, -1, 0, 1, 2, 3) and y = (9, 4, 1, 0, 1, 4, 9).
The mean of x is 0, while the mean of y is 7.
The mean adjusted values are (-3, -2, -1, 0, 1, 2, 3) and (2, -3, -6, -7, -6, -3, 2).
The product of these mean adjusted values is (-6, 6, 6, 0, -6, -6, 6).
If we sum this last vector, we get 0, which after dividing by 7 still gives a value of 0.
So the covariance of these two vectors is 0.

We can easily see that for each value xi in x, the corresponding yi is equal to xi2

I have written a script to help understand the calculation of two vectors.