Tag Archives: supervised

ID3 Algorithm Decision Trees

ID3 Algorithm Page Image

As I grow LEARNINGlover.com, I’m always thinking of different ways to expose my own personality through the site. This is partially because it is easier for me to talk about subjects where I am already knowledgeable, but it is more-so being done to help make some of these algorithms and concepts I encode more understandable, and sometimes relating foreign concepts to everyday life makes them easier to understand.

Today, I’d like to write about decision trees, and the ID3 algorithm for generating decision trees in particular. This is a machine learning algorithm that builds a model from a training data set consisting of a feature vector and an outcome. Because our data set consists of an outcome element, this falls into the category of supervised machine learning.

The model that the ID3 algorithm builds is called a decision tree. It builds a tree based on the features, or columns of the data set with a possible decision corresponding to each value that the feature can have. The algorithm selects the next feature by asking “which feature tells me the most about our data set?” This question can be answered first by asking how much “information” is in the data set, and then comparing that result with the amount of information in each individual feature.

In order to execute this algorithm we need a way to measure both the amount the information in outcomes of the overall data set as well as how much each feature tells us about the data set. For the first, we will use entropy, which comes from the field of information theory and encoding. Entropy is based on the question of how many bits are necessary to encode the information in a set. The more information, the higher the entropy, and the more bits required to encode that information. Although we are not encoding, the correlation between high information and high entropy suits our purposes.

Entropy Formula

To understand how much each feature tells us about the outcomes of the data set we will build on the concept of entropy to define the information gain of a feature. Each feature has multiple options, so the dataset can be partitioned based on each possible value of this feature. Once we have this partition, we can calculate the entropy of each subset of the rows of data. We define the information gain of a feature as the sum over all possible outcomes of that feature can have of the entropy of that outcome multiplied by the probability of that outcome.

Information Gain Formula

To illustrate this algorithm, I decided to relate it to the question of whether we think of a character in a novel as a hero or villain. This is interesting because I try to read at least one book a month and as I’m reading, I often find myself asking this question about characters based on the traits of the characters as well as characters I’ve read about. In order to build an interactive script for this problem, I considered 25 possible character traits that could be present. A subset of these 25 character traits will be selected and a row will be generated grading a fictional character on a scale of 0 to 3 (0 meaning that they do not possess the trait at all, 3 meaning that the trait is very strong in their personality), and users will be asked whether they think a character with the given character traits should be listed as a hero or a villain. Then there is a button at the bottom of the script with the text “Build Tree” that executes the ID3 Algorithm and shows a decision tree that could be used to reach the set of decisions given by the user.

The possible features are:
Abstract, Adaptable, Aggressive, Ambition, Anxiety, Artistic, Cautious, Decisive, Honesty, Dutiful, Fitness, Intellect, Independent, Introverted, Lively, Open-minded, Orderly, Paranoid, Perfectionist, Romantic, Sensitive, Stable, Tension, Warmth and Wealthy

Once users select the option to build the tree, there will be several links outlining each step in the process to build this tree. These links will allow for users to expand the information relating to that step and minimize that information when done. Hopefully this will help users to understand each step more. I must say that as much fun as it has been writing this program, there were several questions when trying to explain it to others. Hopefully users get as much fun from using this tool as I had in creating it. As always feel free to contact me with any comments and or questions.

Ok, so here’s a link to the ID3 Algorithm Page. Please check it out and let me know what you think.

Hierarchical Clustering

Hierarchical Clustering algorithms give a nice introduction for computer science students to unsupervised machine learning. I say this because the bottom-up approach to Hierarchical clustering (which I have implemented here) is very similar to Kruskal’s algorithm for finding the minimum spanning tree of a graph.

In Kruskal’s algorithm, we begin by creating a forest, or a set of trees where each node is its own tree. The algorithm then selects the two trees that are closest together (closest being defined as the minimum cost edge between two distinct trees) and merges those trees together. This process of merging the closest two trees is then repeated until there is only one tree remaining, which is a minimum spanning tree of the graph.

Similarly, bottom-up hierarchical clustering of a group of points begins by saying that each point is its own cluster. Then the clusters are compared to one another to check if two clusters will be merged into one. Generally, there will be some stopping criteria, , saying that we do not want to merge two clusters together if their distance is greater than . So if the minimum distance between two clusters is less than we will proceed as in Kruskal’s algorithm by merging these two clusters together into one cluster. We repeat this process of merging the closest two clusters together until we find that the minimum distance between two clusters is greater than or equal to , in which case we can stop and the result is a partition of our data set into distinct clusters.

Hierarchical clustering is comparable to K-Means Clustering. Here are some differences between the two approaches:

  1. K-Means Clustering requires an initial number of desired clusters, while Hierarchical clustering does not.
  2. A run of K-Means Clustering will always give K clusters, whereas Hierarchical Clustering can give more or less, depending on our tolerance .
  3. K-Means can undo previous mistakes (assignments of an element to the wrong cluster), while Hierarchical Clustering cannot.

So, here is a link to my page on Hierarchical Clustering. Hope you enjoy.

K-Means Clustering

This is how the data looks before the clustering algorithm is run. This is how the data looks after the clustering algorithm is run.

I have now uploaded my K-Means Clustering Script. The script generates a set of random numbers (as ordered pairs) and asks the user how many clusters we should divide the numbers into and a maximum number of iterations to go through before we stop.

One of the largest problems that we face today is understanding data. Before we even get to the point of trying to interpret what the data means and making decisions based on that data there is often a problem with the general amount of data. Clustering algorithms seek to solve this problem by defining some notion of similarity and using that notion to group the data into sets or ‘clusters’, where two elements belong to the same cluster if they are considered similar. Once elements are placed into clusters, we can analyze this (generally smaller) set of clusters instead of the entire data set, which should help in understanding the data.

Finding an exact solution to this problem is computationally difficult. Instead, we can approximate a solution rather quickly using the k-means clustering algorithm. This algorithm attempts to separate a given data set into a user specified (k) number of groups. The k signifies the number of clusters that we will generate. The algorithm works by initially selecting k elements of the data to serve as the “center” of each cluster. Every element of the data is then compared to each cluster center and assigned to the cluster with the closest cluster center. Once every element in the data is assigned to a cluster, the cluster centers may have changed. So the next step is to measure the elements inside each cluster and determine the new cluster center. The process of assigning elements to (new) clusters and determining (new) cluster centers is repeated until either no element changes cluster or we have reached some maximum number of iteration that the user specifies that they do not want to exceed.

K-Means Clustering can be thought of as an algorithm in the area of unsupervised machine learning. Machine learning is a field of artificial intelligence that focuses on computer programs that have the ability to learn without being explicitly programmed. Unsupervised machine learning seeks to make interpret data without any knowledge of what a “correct” interpretation is. In comparison, supervised machine learning algorithms are useful for data that has been separated into categories. These algorithms generally divide the data into a training set and a test set and seek to produce a function that agrees with the results on the training set.