Tag Archives: tree

ID3 Algorithm Decision Trees

ID3 Algorithm Page Image

As I grow LEARNINGlover.com, I’m always thinking of different ways to expose my own personality through the site. This is partially because it is easier for me to talk about subjects where I am already knowledgeable, but it is more-so being done to help make some of these algorithms and concepts I encode more understandable, and sometimes relating foreign concepts to everyday life makes them easier to understand.

Today, I’d like to write about decision trees, and the ID3 algorithm for generating decision trees in particular. This is a machine learning algorithm that builds a model from a training data set consisting of a feature vector and an outcome. Because our data set consists of an outcome element, this falls into the category of supervised machine learning.

The model that the ID3 algorithm builds is called a decision tree. It builds a tree based on the features, or columns of the data set with a possible decision corresponding to each value that the feature can have. The algorithm selects the next feature by asking “which feature tells me the most about our data set?” This question can be answered first by asking how much “information” is in the data set, and then comparing that result with the amount of information in each individual feature.

In order to execute this algorithm we need a way to measure both the amount the information in outcomes of the overall data set as well as how much each feature tells us about the data set. For the first, we will use entropy, which comes from the field of information theory and encoding. Entropy is based on the question of how many bits are necessary to encode the information in a set. The more information, the higher the entropy, and the more bits required to encode that information. Although we are not encoding, the correlation between high information and high entropy suits our purposes.

Entropy Formula

To understand how much each feature tells us about the outcomes of the data set we will build on the concept of entropy to define the information gain of a feature. Each feature has multiple options, so the dataset can be partitioned based on each possible value of this feature. Once we have this partition, we can calculate the entropy of each subset of the rows of data. We define the information gain of a feature as the sum over all possible outcomes of that feature can have of the entropy of that outcome multiplied by the probability of that outcome.

Information Gain Formula

To illustrate this algorithm, I decided to relate it to the question of whether we think of a character in a novel as a hero or villain. This is interesting because I try to read at least one book a month and as I’m reading, I often find myself asking this question about characters based on the traits of the characters as well as characters I’ve read about. In order to build an interactive script for this problem, I considered 25 possible character traits that could be present. A subset of these 25 character traits will be selected and a row will be generated grading a fictional character on a scale of 0 to 3 (0 meaning that they do not possess the trait at all, 3 meaning that the trait is very strong in their personality), and users will be asked whether they think a character with the given character traits should be listed as a hero or a villain. Then there is a button at the bottom of the script with the text “Build Tree” that executes the ID3 Algorithm and shows a decision tree that could be used to reach the set of decisions given by the user.

The possible features are:
Abstract, Adaptable, Aggressive, Ambition, Anxiety, Artistic, Cautious, Decisive, Honesty, Dutiful, Fitness, Intellect, Independent, Introverted, Lively, Open-minded, Orderly, Paranoid, Perfectionist, Romantic, Sensitive, Stable, Tension, Warmth and Wealthy

Once users select the option to build the tree, there will be several links outlining each step in the process to build this tree. These links will allow for users to expand the information relating to that step and minimize that information when done. Hopefully this will help users to understand each step more. I must say that as much fun as it has been writing this program, there were several questions when trying to explain it to others. Hopefully users get as much fun from using this tool as I had in creating it. As always feel free to contact me with any comments and or questions.

Ok, so here’s a link to the ID3 Algorithm Page. Please check it out and let me know what you think.

Visualizing Huffman Coding Trees

Huffman Coding
Image of Huffman Tree

Here is a link to a script I finished to help visualize the Huffman Coding Algorithm.

What would you do if you wanted to transfer a message, say one written in English but you only had a limited set of characters. Suppose these characters are 0 and 1. The only way of doing this is by writing some type of procedure to transfer from our 26 letter alphabet to the 0-1 binary alphabet. There are several ways of developing these encoding functions, but we will focus on those that attempt to translate each individual character into a sequence of 0s and 1s. One of the more popular such codes today is the ASCII code, which maps each character to a binary string (of 0s and 1s) of length 8. For example, here is the ASCII code for the upper and lower case alphanumeric characters.

ASCII English
01100001 a
01100010 b
01100011 c
01100100 d
01100101 e
01100110 f
01100111 g
01101000 h
01101001 i
01101010 j
01101011 k
01101100 l
01101101 m
01101110 n
01101111 o
01110000 p
01110001 q
01110010 r
01110011 s
01110100 t
01110101 u
01110110 v
01110111 w
01111000 x
01111001 y
01111010 z

What you notice from this is that each of these encodings beings with “011”, which amounts to a lot of wasted space. ASCII code doesn’t care about this because the fixed length of each binary string allows for easy lookup of particular characters (i,e, you can start almost anywhere in the string with your decomposition as long as you start at a multiple of 8).

But what if we were interested in minimizing the total bits used by the encoded string? This is where the Huffman coding algorithm gains its fame. Unlike the ASCII coding scheme, Huffman codes assign shorter codes to the more frequently occurring characters in your string. Huffman was able to prove this tactic would guarantee the shortest possible encoding.

The Huffman Coding procedure operates as follows:
1. Input string to be encoded -> Input
2. For each character in the input string, calculate the frequency of that character (i.e. the number of times it occurs in the input)
3. Sort the array of characters in the input by their decreasing frequencies
4. Place the array of characters into the queue with each one represented by a node.
5. While there are two or more nodes remaining in the queue.
6. Remove the nodes representing the two characters with the lowest frequency from the queue.
7. Create a node which points to the two nodes just removed from the queue (node -> left points to one node; node -> right points to the other).
8. Insert this new node into the queue, with the frequency equal to the sum of the frequencies of the nodes it points to.
9. If the length of the queue is greater than 1, then goto 5.

Other Blogs that have covered this topic:
Techno Nutty
billatnapier

Learn About Binary Search Trees

Binary Search Tree Script
An Image of My Binary Search Tree Script

I just finished a script that shows the properties of the binary search tree data structure.

These data structures are organized such that the data lies in “nodes” and each node connects directly to up to two new nodes. These new nodes are called the children of the node, and the original node is called the parent. Because there are up to two children, we designate one child as the “left” child, and the other as the “right” child with the properties that the value stored in the left child is less than the value in the parent, which in itself is less than the value of its right child. If a parent has less than two children, then one (or both) of its children are given the value of null.

The insert and delete procedures need to make sure that they keep the elements of a binary search tree in sorted order.
To insert into a BST, we must first find the correct location where the new element will be placed. This means comparing the value of the new element to the current head of the tree, resulting in three possible outcomes.
if the head is null, then insert the new node at the current position because there is no subtree to compare it to.
if the value of the new element is less than the value at the head node, run the insert procedure on the left child of head.
if the value of the new element is greater than the value at the head node, run the insert procedure on the right child of head.

Similarly, the remove procedure for a binary search tree must first find the element to be removed. Once that element is found, there are three cases depending on the type of node we are dealing with.
if the node has no children, then simply remove the node from the tree.
if the node has only one child (either a left child or a right child), then have the parent of the node point to the child of the node (thus bypassing the node itself).
if the node has two children, then we have two options, either replace the node with the minimum value of the right subtree or the maximum value of the left subtree. The nodes that have these minimum and maximum values will have at most one child because by definition a value less than the minimum value in a right subtree would be a left child and thus would be less than the minimum value, contradicting the meaning of a minimum value. Because these nodes have at most one child, we can now use the procedures above to remove these nodes from the tree.
Because a binary search tree is different than a standard array, there are different methods for viewing the its contents. Three common such methods are preorder, inorder, and postorder traversal.
Preorder traversal visits the nodes of a binary search tree in the order (node), (left child), (right child).
Inorder traversal visits the nodes of a binary search tree in the order (left child), (node), (right child).
Postorder traversal visits the nodes of a binary search tree in the order (left child), (right child), (node).

We are also interested in the depth of a tree, which amounts to the amount of layers or levels of the tree. This can be computed by counting the longest path from the root of the tree to a leaf node (a node with no children) in the tree.

Other Blogs that have covered this topic:
Stoimen’s Web Log

New Years is a LEARNINGlover Thing!

Starting this web site has served as the perfect opportunity to unwind. In particular, two blogs I wrote recently have served different purposes. The first was “The Degrees of Consciousness of a Black Nerd“, where I spoke about many of the things I think about being who I am and relating the the (somewhat unique) set of people that I communicate with on a daily basis. The other was what I’ve been working on since mid December. Its the blog entry I wrote on Sudoku and the Sudoku program that I wrote last month using the Dancing Links Algorithm. Since originally writing that, I’ve updated the program with a lot of Sudoku problems, as well as two types of “hints”. One generates the “possibilities matrix” which basically just shows what is possible for each cell. The other scans the possibilities matrix and searches for isolated cells (cells where some number can only go in one row/column/subgrid). Both those additions were extremely fun and provided a nice opportunity to program in my spare time.

So I’ve been thinking about these two things and how much I enjoyed the two, but for different reasons. The Degrees of Consciousness of a Black Nerd brought much attention on facebook where the idea was both accepted and rejected and I was able to explain the ideas further and hear similar stories from others who had similar experiences. Sudoku, on the other hand hasn’t generated as much conversation. A few friends have told me that they liked the program, but I don’t know of too many people outside of myself using it. That’s OK. I didn’t really create the program for publication, moreso because I enjoy Sudoku and took it as a challenge to write a program to solve a puzzle.

That being said, I’ve been thinking about some other things that I would like to do – hopefully they’ll be more than just my opinion, but have some programming, operations research, or mathematical context to them as well. But here’s a list of things that I want to write about in the near future.

  • Connections between math and football (or sports in general). Anybody who knows me knows that I’m a huge sports fan. At other sites, I’ve posted stuff on QB rating systems and the flaws/inaccuracies in them. Part of me would like to look further into this stuff and either do a comparison, create a new rating system, or just try to understand (explain) different scouting metrics, particularly for QBs.
  • I wrote the Sudoku program, but that’s a well studied problem so it was easy to find research that helped me to understand other stuff. I also studied some problems in Ramsey Theory that can be represented by exact covering algorithms and it would be interesting to try to represent these as exact cover problems and to try to use the Dancing Links algorithm to solve these problems as well, maybe to find a comparative analysis to benchmarks.
  • One of the things I’ve picked up on lately is machine learning. While I’ve added flash cards on Bayesian Networks, I would like to add some programs on things like k-means clustering.
  • I added my sorting algorithms a few weeks ago, but would like to also add something on data structures (arrays, linked lists, trees, heaps, hash tables, etc). In undergrad this was called the class that weeded people out of computer science majors, so doing something on this type of stuff I think could be helpful to those who want to understand it better.
  • I have been in the process of writing a linear programming (Simplex) implementation for a while. I would like to get back to that to allow for a solver that can do at least simple algorithms.
  • I’ve written a few drafts that connect math to different areas that I’m interested in (music, sports, philosophy, religion, etc), but I need to find a better way to present the stuff because as they’re currently stated, this stuff can easily be misinterpreted.

The beautiful thing about this site is that I’m not constrained by any advisor or boss or deadlines. Its more of a how am I feeling right now kind of a thing and these are the things I’m feeling right now. So this list is kind of my “New Years Resolutions” for LEARNINGlover.com, and while I reserve the right to change my priorities any time I feel that something else deserves my attention more, these are the things I’m planning to spend time on in the next few weeks. 

Dijkstra’s Algorithm

I’ve just written a script that executes Dijkstra’s algorithm that seeks to find the shortest path tree on a randomly generated graph.

Given a weighted graph G and a vertex s, the single node shortest path problem seeks to find the shortest path from s to every other vertex in this graph such that the sum of the weights of the edges along these paths is minimized. One famous algorithm for this problem is Dijkstra’s Algorithm, which constructs this shortest path tree using techniques of greedy algorithms and dynamic programming.

Dijkstra’s Algorithm works as follows.

  • Initially we have an empty path tree T, and assume that the distance to every vertex in the graph has some maximum cost, say infinity, i.e. w(v) = infinity for all v in V.
  • Add the node s to the tree, and give the associated path cost a value of zero, i.e. w(s) = 0.
  • Find the edge which is adjacent to T that adds the vertex whose cost is minimum, i.e. we look for an e = (u, v) such that u is in T, and v is not in T and w(u) + cost(u, v) < w(v) is minimum. If no such edge exists go to 5.
  • Add the corresponding edge and vertex to the tree, and update the weight vector of the vertex v. Go to 3.
  • The path tree T now represents the shortest path from the vertex s to all other vertices reachable in the graph G. The weight vector w represents the costs of these paths.

For an example of Dijkstra’s algorithm executed on the Graph with the following adjacency matrix:

0 1 2 3 4 5 6 7
0 2 4 20
1 11 11 7 5
2 2 10 7
3 4 11 2
4 11 10
5 7 14
6 20 5 7
7 2 14

Graph with 8 Nodes

Suppose we are interested in discovering the shortest paths from the node 0.

Initially Dijkstra’s algorithm constructs an empty path tree, T = {}.

Because we want to discover the shortest paths from the node 0, our first step will be to add this node to the tree and update the weight vector.
T = {0}
w(0) = 0.

Now we will consider the edges adjacent to T, {(0, 2), (0, 3), and (0, 6)}. Our assumption is that the shortest path of every vertex in T has already been computed correctly and we will seek the edge that minimizes the value w(u) + c(u, v), where u is a member of T and (u, v) is an edge adjacent to T. The edge that does that currently is (0, 2) since w(0) = 0 and c(0, 2) = 2. We add the node 2 to T and update the weight vector.
T = {0, 2}
w(2) = 2.

The edges adjacent to T are now {(0, 3), (0, 6), (2, 4), (2, 6)}.
The associated path costs are:
w(0) + c(0, 3) = 0 + 4 = 4
w(0) + c(0, 6) = 0 + 20 = 20
w(2) + c(2, 4) = 2 + 10 = 12
w(2) + c(2, 6) = 2 + 7 = 9
We can see that the edge that minimizes the value w(u) + c(u, v), where u is a member of T and (u, v) is an edge adjacent to T is (0, 3). We add the node 3 to T and update the weight vector.
T = {0, 2, 3}
w(3) = 4

The edges adjacent to T are now {(0, 6), (2, 4), (2, 6), (3, 1), (3, 7)}.
The associated path costs are:
w(0) + c(0, 6) = 0 + 20 = 20
w(2) + c(2, 4) = 2 + 10 = 12
w(2) + c(2, 6) = 2 + 7 = 9
w(3) + c(3, 1) = 4 + 11 = 15
w(3) + c(3, 7) = 4 + 2 = 6
We can see that the edge that minimizes the value w(u) + c(u, v), where u is a member of T and (u, v) is an edge adjacent to T is (3, 7). We add the node 7 to T and update the weight vector.
T = {0, 2, 3, 7}
w(7) = 6

The edges adjacent to T are now {(0, 6), (2, 4), (2, 6), (3, 1), (5, 7)}.
The associated path costs are:
w(0) + c(0, 6) = 0 + 20 = 20
w(2) + c(2, 4) = 2 + 10 = 12
w(2) + c(2, 6) = 2 + 7 = 9
w(3) + c(3, 1) = 4 + 11 = 15
w(7) + c(5, 7) = 6 + 14 = 20
We can see that the edge that minimizes the value w(u) + c(u, v), where u is a member of T and (u, v) is an edge adjacent to T is (2, 6). We add the node 6 to T and update the weight vector.
T = {0, 2, 3, 6, 7}
w(6) = 9

The edges adjacent to T are now {(2, 4), (3, 1), (5, 7) (6, 1)}.
The associated path costs are:
w(2) + c(2, 4) = 2 + 10 = 12
w(3) + c(3, 1) = 4 + 11 = 15
w(7) + c(5, 7) = 6 + 14 = 20
w(6) + c(6, 1) = 9 + 5 = 14
We can see that the edge that minimizes the value w(u) + c(u, v), where u is a member of T and (u, v) is an edge adjacent to T is (2, 4). We add the node 4 to T and update the weight vector.
T = {0, 2, 3, 4, 6, 7}
w(4) = 12

The edges adjacent to T are now {(3, 1), (5, 7), (6, 1), (4, 1)}.
The associated path costs are:
w(3) + c(3, 1) = 4 + 11 = 15
w(7) + c(5, 7) = 6 + 14 = 20
w(6) + c(6, 1) = 9 + 5 = 14
w(4) + c(4, 1) = 12 + 11 = 23
We can see that the edge that minimizes the value w(u) + c(u, v), where u is a member of T and (u, v) is an edge adjacent to T is (6, 1). We add the node 1 to T and update the weight vector.
T = {0, 1, 2, 3, 4, 6, 7}
w(1) = 14

The edges adjacent to T are now {(5, 7), (1, 5)}.
The associated path costs are:
w(7) + c(5, 7) = 6 + 14 = 20
w(1) + c(1, 5) = 14 + 7 = 21
We can see that the edge that minimizes the value w(u) + c(u, v), where u is a member of T and (u, v) is an edge adjacent to T is (5, 7). We add the node 5 to T and update the weight vector.
T = {0, 1, 2, 3, 4, 5, 6, 7}
w(5) = 20

Now that T contains all the nodes from the tree, we know the shortest path from node 0 to all other nodes and and have solved the problem. The associated weight vector is w = [0, 14, 2, 4, 12, 20, 9, 6].

To learn more and see more examples, view Dijkstra’s Algorithm at LEARNINGlover.com