I have finished a script that runs the Apriori algorithm.

When we are given a large set of transactions, we are often interested in discovering patterns inside these transactions. The Apriori algorithm provides a means for formulating what are known as “association rules” for the set of transactions. An association rule is an observation from the database between different items inside a transaction. For example, the statement “If a customer buys chips they are 60% more likely to also purchase dip” could be an association rule based on data from supermarket purchases. The number 60% is called the confidence we have in the rule and we are generally interested in rules with higher confidence.

The Apriori algorithm takes as input a transaction database and a “threshold”. The initial pass through the database performs a count on each single item in the database and checks how many transactions contain each item. The algorithm proceeds by the Apriori property which states that “any subset of a frequent itemset must also be frequent”. What this means is that when checking the subsets of length 2 (and greater), we can ignore those subsets that contain an element that does not meet the minimum support, as it cannot be a part of a frequent itemset.

So lets look at an example. Suppose our list of transactions for the items Chips, Dip, Soda, Napkins and Paper Plates are as follows:

(Chips, Dip, Soda, Napkins, Paper Plates)

1 1 0 1 0

0 0 0 1 0

1 1 1 0 1

0 1 1 0 1

1 0 0 1 0

1 0 0 0 1

1 1 1 0 1

0 0 1 0 1

1 0 0 0 0

1 0 1 1 0

0 1 1 1 0

1 1 0 0 0

1 0 1 1 0

0 0 0 1 1

1 1 1 1 0

0 0 0 0 0

0 0 0 0 0

1 1 1 0 0

0 0 0 1 1

1 1 1 0 1

And lets suppose that we’re interested in finding the collections that occur more than a quarter (25%) of the time.

What we see from simply summing the columns and dividing by the number of columns is that 60% of the people purchased chips, 45% purchased dip, 50% purchased soda, 45% purchased napkins, and 40% purchased paper plates. Since these are all above our minimum threshold, they are all possible as elements of future collections.

When looking at larger collections, we see that:

– 35% of the people purchased both chips and dip

– 35% of the people purchased both chips and soda

– 25% of the people purchased both chips and napkins

– 35% of the people purchased both dip and soda

– 25% of the people purchased both soda and paper plates

– no other size two collections are above the minimum threshold.

You’ll note that since only 20% of the people purchased chips and paper plates, it is not above the minimum threshold and so we can ignore it in future collections. From the set of size 2 itemsets, we can derive our list of size three itemsets.

In particular, we see that:

– 25% of the people purchased all three of chips, dip, and soda.

– no other size three collections are above the minimum threshold.

The hierarchy of the itemsets is displayed in the image above.

To play around with this algorithm and understand more of its properties, visit Apriori algorithm.

Other Blogs that have covered this topic:

Analytics and Visualization of Big Data

Statistical Research