
How does the Apriori Algorithm work?

The Apriori algorithm tries to learn association rules, i.e. logical relationships, from transactions in a database. The goal is to find frequently occurring itemsets, e.g. products that are frequently bought together.

In everyday life, we often encounter product combinations that can be bought cheaper in a bundle. In many fast-food restaurants, for example, the main course is sold together with a side dish and a drink at a discounted price. The principle behind this is always the same: products that are often bought together are offered at a lower price if they are bought directly as a bundle. This has not only caught on in fast-food restaurants. It can now also be found in supermarkets and fashion online stores, as it makes shopping easier for the customer and leads to higher sales for the seller.

Apriori Algorithm & E-Commerce

To be able to find such bundles, the Apriori algorithm is used, among other things. For this purpose, the order databases of the providers are searched and an attempt is made to establish association rules. Before we can continue with the exact procedure of the algorithm, we have to clarify a few terms that are used in this context.

Explanation of the Terms

The Apriori algorithm uses some concepts and terms that are not common in everyday language. Therefore, we briefly explain the most important ones before turning to the flow of the algorithm.

What is an Itemset?

A set of elements is called an itemset. A set with k elements is also called a k-itemset. An itemset can represent, for example, the shopping cart of a purchase: a trip to the supermarket might result in the itemset consisting of bread, butter, milk, and eggs.

Before we can look for frequently occurring bundles, we need metrics that measure how often certain itemsets are bought together. For this purpose, we use confidence and support.

Support measures the relative frequency with which a product or itemset occurs in all transactions:

\[\text{Support}(A) = \frac{\text{Number of transactions in which A occurs}}{\text{Number of all transactions}}\]

Confidence, on the other hand, measures how often B is purchased in the transactions that already contain A. For this, the support of all transactions containing both A and B is divided by the support of A:

\[\text{Confidence}(A \rightarrow B) = \frac{\text{Support}(A \cup B)}{\text{Support}(A)} = \frac{\text{Number of transactions in which A and B occur}}{\text{Number of transactions in which A occurs}}\]

For a supermarket, these ratios would be calculated as follows:

Assume the database includes 500 purchases in total. In 300 cases, the purchase included a chocolate bar. So the support for chocolate is:

\[\text{Support (Chocolate)} = \frac{300}{500} = 60 \%\]

In 100 of these purchases, milk was bought in addition to chocolate. The confidence of the rule Chocolate -> Milk therefore results in:

\[\text{Confidence}(\text{Chocolate} \rightarrow \text{Milk}) = \frac{100}{300} \approx 33.3 \%\]

This means that chocolate is purchased in 60% of all transactions, and in one-third of these chocolate transactions, milk is also purchased.
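
Both metrics can be computed directly from a list of transactions. The following Python sketch shows this with a small, made-up set of shopping carts; the helper functions support and confidence are named only for this illustration.

```python
def support(itemset, transactions):
    """Share of transactions that contain every item of the itemset."""
    hits = sum(1 for t in transactions if set(itemset) <= set(t))
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of antecedent and consequent together, divided by the support of the antecedent."""
    return support(set(antecedent) | set(consequent), transactions) / support(antecedent, transactions)

# Made-up mini database: 5 purchases, 3 of them contain chocolate, 1 of those also milk
transactions = [
    {"Chocolate"},
    {"Chocolate", "Bread"},
    {"Chocolate", "Milk"},
    {"Bread"},
    {"Milk"},
]

print(support({"Chocolate"}, transactions))               # 0.6
print(confidence({"Chocolate"}, {"Milk"}, transactions))  # 0.333...
```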

What is an Association Rule?

An association rule describes a regularity between two or more elements, for example "if A is bought, B is often bought as well", that reaches a given level of support and confidence.

What is a Frequent Itemset?

A frequent itemset is a set of items that frequently occur together and reach a predefined level of support. Association rules can then be derived from these frequent itemsets.

How does the Apriori Algorithm work?

The Apriori algorithm is one of the methods to find frequent itemsets in a dataset. It works in two steps, namely “Join” and “Prune”, which are executed iteratively, i.e. several times in a row.

Join: In this step, itemsets of size K are formed, where K stands for the current iteration step.

Prune: In this step, all itemsets that do not reach the predefined support threshold and are therefore considered rare are removed.

The Apriori algorithm makes use of the so-called antimonotonic property. In simple words, it means that if an itemset does not reach the minimum support, then all of its supersets (all larger sets that contain this itemset) cannot reach the minimum support either. In our example, this means that if the itemset (milk, cookies) does not reach the minimum support, then the itemset (milk, cookies, chocolate) cannot exceed the minimum support either.

In each new iteration, the itemsets are extended by one element, and the “Join” and “Prune” steps are executed again. The algorithm terminates when an iteration no longer yields any new frequent itemsets and only the itemsets from the previous step remain.
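
How the iterative Join and Prune steps could look in code is sketched below. This is a simplified illustration of the idea rather than an optimized implementation, and all function and variable names are chosen only for this example.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Simplified Apriori sketch: returns all frequent itemsets as frozensets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def is_frequent(itemset):
        count = sum(1 for t in transactions if itemset <= t)
        return count / n >= min_support

    # K = 1: keep every single product that reaches the minimum support (Prune)
    items = {item for t in transactions for item in t}
    levels = [{frozenset([i]) for i in items if is_frequent(frozenset([i]))}]

    k = 1
    while levels[-1]:
        # Join: combine frequent k-itemsets into candidate (k+1)-itemsets
        candidates = {a | b for a in levels[-1] for b in levels[-1] if len(a | b) == k + 1}
        # Antimonotonic property: every k-element subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in levels[-1] for s in combinations(c, k))}
        # Prune: keep only the candidates that reach the minimum support
        levels.append({c for c in candidates if is_frequent(c)})
        k += 1

    return [itemset for level in levels for itemset in level]
```

With the six supermarket transactions from the following example and min_support=0.5, this sketch should yield the same frequent itemsets that are derived step by step below.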

Apriori by Example

For our example, let’s come back to our supermarket. There are a total of six transactions in its database.

Transaction    Shopping Cart
T_1            Milk, Chocolate, Noodles
T_2            Chocolate, Noodles, Rice
T_3            Rice, Bread
T_4            Milk, Chocolate, Rice
T_5            Milk, Chocolate, Noodles, Bread
T_6            Milk, Chocolate, Noodles, Rice
Transactions in a Supermarket

For our association rule, we want to achieve a support of 50% and a confidence of 70%. These values are completely arbitrary.

Step 1 (K=1): In the first step, we search for itemsets of size 1, i.e. we count how often each individual product has been purchased in total. This corresponds to the Join step of the first stage.

Itemset        Number of Purchases
Milk           4
Chocolate      5
Noodles        4
Rice           4
Bread          2
Itemsets in the first level

Step 2 (K=1): The Join step is followed by the Prune step, in which we remove all itemsets that do not reach the minimum support. We have chosen a minimum support of 50%. According to the support formula, this means that with six transactions an itemset must occur at least three times for the support to be fulfilled. All other itemsets can be dropped:

Itemset        Number of Purchases
Milk           4
Chocolate      5
Noodles        4
Rice           4
Itemsets in the first level after Pruning

Therefore, the product/itemset “Bread” falls out, because it occurs only twice.
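
These first two steps can be reproduced with a few lines of Python. The snippet below is a small sketch using the six transactions from the table above; the variable names are chosen only for illustration.

```python
from collections import Counter

# The six shopping carts from the table above
transactions = [
    {"Milk", "Chocolate", "Noodles"},
    {"Chocolate", "Noodles", "Rice"},
    {"Rice", "Bread"},
    {"Milk", "Chocolate", "Rice"},
    {"Milk", "Chocolate", "Noodles", "Bread"},
    {"Milk", "Chocolate", "Noodles", "Rice"},
]

min_support = 0.5
min_count = min_support * len(transactions)  # an itemset must occur in at least 3 transactions

# Join (K=1): count how often every single product was purchased
counts = Counter(item for t in transactions for item in t)

# Prune: keep only the products that reach the minimum support
frequent_1 = {item: count for item, count in counts.items() if count >= min_count}
print(frequent_1)  # Bread is dropped because it occurs only twice
```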

Step 3 (K=2): After the Join and Prune steps of the first stage, we move on to the second stage. The itemset size is now 2, and the candidate itemsets are all combinations of the products remaining from the previous stage.

Itemset               Number of Purchases
Milk, Chocolate       4
Milk, Noodles         3
Milk, Rice            2
Chocolate, Noodles    4
Chocolate, Rice       3
Noodles, Rice         2
Itemsets in the second level

Step 4 (K=2): In the Prune step, the itemsets that do not occur in at least three transactions are removed again. Thus, the combinations (Milk, Rice) and (Noodles, Rice) are dropped:

Itemset               Number of Purchases
Milk, Chocolate       4
Milk, Noodles         3
Chocolate, Noodles    4
Chocolate, Rice       3
Itemsets in the second level after Pruning

Step 5 (K=3): In this step, we form the itemsets of size 3 that can be formed from the remaining products:

Itemset                      Number of Purchases
Milk, Chocolate, Noodles     -
Milk, Chocolate, Rice        -
Milk, Noodles, Rice          -
Noodles, Chocolate, Rice     -
Itemsets in the third level

The purchases for these itemsets have not been counted yet, but we can already check whether they could reach the minimum support. For this purpose, we make use of the antimonotonic property. The itemset (Milk, Chocolate, Rice) contains the subsets (Milk, Chocolate), (Chocolate, Rice), and (Milk, Rice). However, the itemset (Milk, Rice) did not reach the minimum support, so the larger itemset (Milk, Chocolate, Rice) cannot be frequent either, even without counting it.

With the same reasoning, the itemsets (Milk, Noodles, Rice) and (Noodles, Chocolate, Rice) are also dropped because the itemset (Noodles, Rice) does not occur often enough. The last remaining itemset, (Milk, Chocolate, Noodles), can now be used to derive association rules using the confidence.
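
This subset check is also easy to express in code. The following sketch, based on the counts from this example, discards every size-3 candidate that contains an infrequent size-2 subset:

```python
from itertools import combinations

# Frequent 2-itemsets from the pruned second level
frequent_2 = {frozenset(pair) for pair in [
    ("Milk", "Chocolate"), ("Milk", "Noodles"),
    ("Chocolate", "Noodles"), ("Chocolate", "Rice"),
]}

# Candidate 3-itemsets from the third level
candidates_3 = [
    ("Milk", "Chocolate", "Noodles"), ("Milk", "Chocolate", "Rice"),
    ("Milk", "Noodles", "Rice"), ("Noodles", "Chocolate", "Rice"),
]

# Antimonotonic property: a candidate survives only if all of its 2-item subsets are frequent
surviving = [c for c in candidates_3
             if all(frozenset(pair) in frequent_2 for pair in combinations(c, 2))]
print(surviving)  # [('Milk', 'Chocolate', 'Noodles')]
```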

Step 6: We can check the following association rules:

(Milk, Chocolate) -> Noodles:

\[\text{Confidence}((\text{Milk, Chocolate}) \rightarrow \text{Noodles}) = \frac{\text{Support (Milk, Chocolate, Noodles)}}{\text{Support (Milk, Chocolate)}} = \frac{3}{4} = 75 \%\]

(Milk, Noodles) -> Chocolate:

\[\text{Confidence}((\text{Milk, Noodles}) \rightarrow \text{Chocolate}) = \frac{\text{Support (Milk, Chocolate, Noodles)}}{\text{Support (Milk, Noodles)}} = \frac{3}{3} = 100 \%\]

(Chocolate, Noodles) -> Milk:

\[\text{Confidence}((\text{Chocolate, Noodles}) \rightarrow \text{Milk}) = \frac{\text{Support (Milk, Chocolate, Noodles)}}{\text{Support (Chocolate, Noodles)}} = \frac{3}{4} = 75 \%\]

(Milk) -> (Chocolate, Noodles):

\[\text{Confidence}(\text{Milk} \rightarrow (\text{Chocolate, Noodles})) = \frac{\text{Support (Milk, Chocolate, Noodles)}}{\text{Support (Milk)}} = \frac{3}{4} = 75 \%\]

(Chocolate) -> (Milk, Noodles):

\[\text{Confidence}(\text{Chocolate} \rightarrow (\text{Milk, Noodles})) = \frac{\text{Support (Milk, Chocolate, Noodles)}}{\text{Support (Chocolate)}} = \frac{3}{5} = 60 \%\]

Noodles -> (Milk, Chocolate):

\[\text{Confidence}(\text{Noodles} \rightarrow (\text{Milk, Chocolate})) = \frac{\text{Support (Milk, Chocolate, Noodles)}}{\text{Support (Noodles)}} = \frac{3}{4} = 75 \%\]

After running the Apriori algorithm, a total of five association rules emerge that meet our confidence threshold of 70%. These include the rule “(Milk, Chocolate) -> Noodles”. This means that if milk and chocolate have already been purchased, the purchase of noodles is also very likely.
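
These six rules are simply all possible ways of splitting the frequent itemset (Milk, Chocolate, Noodles) into an antecedent and a consequent. A small Python sketch of this last step, using the purchase counts from the tables above, could look like this:

```python
from itertools import combinations

# Purchase counts taken from the tables above
counts = {
    frozenset(["Milk"]): 4,
    frozenset(["Chocolate"]): 5,
    frozenset(["Noodles"]): 4,
    frozenset(["Milk", "Chocolate"]): 4,
    frozenset(["Milk", "Noodles"]): 3,
    frozenset(["Chocolate", "Noodles"]): 4,
    frozenset(["Milk", "Chocolate", "Noodles"]): 3,
}

itemset = frozenset(["Milk", "Chocolate", "Noodles"])
min_confidence = 0.7

# Every non-empty proper subset of the frequent itemset is a possible antecedent
for size in (1, 2):
    for antecedent in map(frozenset, combinations(itemset, size)):
        consequent = itemset - antecedent
        conf = counts[itemset] / counts[antecedent]
        if conf >= min_confidence:
            print(f"{set(antecedent)} -> {set(consequent)}: {conf:.0%}")
```

Only the rule with chocolate alone as the antecedent falls below the 70% threshold, which matches the five remaining rules above.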

What are the Advantages and Disadvantages of using the Apriori Algorithm?

The Apriori algorithm is a good way to find association rules in large databases with many transactions. Furthermore, the Join and Prune steps are relatively easy to implement in programming languages such as Python. In addition, there are also modules that provide ready-made implementations of the Apriori algorithm.
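
As an example of such a module, the following sketch uses the mlxtend package on the supermarket transactions from above. This is only one possible option, and the exact parameter names may differ between versions of the library.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Transactions from the supermarket example
transactions = [
    ["Milk", "Chocolate", "Noodles"],
    ["Chocolate", "Noodles", "Rice"],
    ["Rice", "Bread"],
    ["Milk", "Chocolate", "Rice"],
    ["Milk", "Chocolate", "Noodles", "Bread"],
    ["Milk", "Chocolate", "Noodles", "Rice"],
]

# One-hot encode the shopping carts into a boolean DataFrame
te = TransactionEncoder()
df = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Frequent itemsets with at least 50% support, rules with at least 70% confidence
frequent_itemsets = apriori(df, min_support=0.5, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```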

However, the algorithm becomes very computationally intensive for large databases. As our example has shown, many combinations and steps have to be calculated, which takes a lot of time and resources even for such a small dataset. Before an implementation, it must therefore be ensured that the effort is worthwhile.

What Applications use this Algorithm?

There are several fields of application in which association rules are sought and the Apriori algorithm is therefore used. Here are some examples:

  • Education: In education, associations are sought, for example, when trying to find out why some students do better in subjects than others.
  • Medicine: In the development of new drugs, association rules are studied to determine how a new drug interacts with physical characteristics. To do this, the sample of test subjects who received the drug is examined in more detail.
  • Biology: Around the world, forest fires are becoming more frequent. Therefore, researchers are investigating which early warning factors, such as extreme drought or high temperatures, are associated with forest fires in order to detect and prevent them as early as possible.
  • E-commerce & Recommendation: We learned about the classic example of the Apriori algorithm in this article: shopping cart analysis. Based on the purchasing behavior of previous customers, it attempts to identify associations between products and then uses these for product recommendations.
Shopping Cart Analysis in e-commerce

How can the Apriori Algorithm be improved?

With a large database, the Apriori algorithm can be very inefficient and take a lot of time. Therefore, there are currently some methods to help make the algorithm more efficient and faster. These include:

  1. Hash-based Itemset Counting: This method tries to speed up the creation of itemsets. For this purpose, a so-called hash table is used, which assigns each product or transaction a unique number. This means that memory-intensive strings no longer have to be stored and processed, but only the more efficient hashes.
  2. Transaction Reduction: Before the Apriori algorithm is executed, the database is searched for repeated transactions, which are then excluded. In the same way, rare transactions are filtered out at an early stage, since they do not play a role in the calculation of the algorithm (a small sketch of this idea follows after this list).
  3. Partitioning: With this method, the number of database scans is significantly reduced. The basic idea is that an itemset can only be frequent in the entire database if it is frequent in at least one of the database partitions. This means that the whole database only has to be scanned twice, which can be much more efficient.
  4. Sampling: Instead of searching for frequent itemsets in the entire database, this approach takes only a sample of the database and examines it. However, there is a risk of missing itemsets that are actually frequent. To reduce this risk, a comparatively low minimum support should be used.
  5. Dynamic Itemset Counting: With this method, new potential itemsets can already be found while the database is being scanned.
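
As announced in the second point, here is a minimal sketch of the transaction reduction idea: after the first pass, transactions and items that contain no frequent products can no longer contribute to any frequent itemset and can be skipped in later scans. The code is an illustration under this interpretation, not a reference implementation.

```python
def reduce_transactions(transactions, frequent_items):
    """Drop transactions without frequent items; keep only frequent items in the rest."""
    reduced = []
    for t in transactions:
        kept = set(t) & frequent_items
        if kept:                      # transactions without frequent items cannot
            reduced.append(kept)      # contribute to any frequent itemset
    return reduced

# Hypothetical usage with the supermarket example: Bread is not frequent,
# so it is stripped from every shopping cart before the next scan.
transactions = [
    {"Milk", "Chocolate", "Noodles"}, {"Chocolate", "Noodles", "Rice"}, {"Rice", "Bread"},
    {"Milk", "Chocolate", "Rice"}, {"Milk", "Chocolate", "Noodles", "Bread"},
    {"Milk", "Chocolate", "Noodles", "Rice"},
]
print(reduce_transactions(transactions, {"Milk", "Chocolate", "Noodles", "Rice"}))
```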

This is what you should take with you

  • The Apriori algorithm tries to learn association rules, i.e. logical relationships, from transactions in a database. The goal is to find frequently occurring itemsets, e.g. products that are frequently bought together.
  • The Apriori algorithm works in two steps (“Join” and “Prune”), which are executed iteratively one after the other.
  • In each step, candidate itemsets are created that are larger by one product than in the previous step (“Join”). Then it is checked which of these itemsets reach the previously defined minimum support and therefore remain (“Prune”).
  • Although the Apriori algorithm is relatively easy to understand and implement, it can be very inefficient for large databases and require a lot of computing power. Therefore, there are already some approaches to improve the performance of the algorithm.
  • In practice, the Apriori algorithm is used whenever association rules are sought. This includes, for example, medicine, biology, or e-commerce.
