Have you ever wondered why you are shown suggestions for other products when making an online purchase? How is the merchant's website able to efficiently pick the items you may be interested in? Similarly, when you walk into a shop for your groceries, you find most of the items you are looking for in the same aisle. If you were running a shop, how would you decide which items should be placed together? The answer to all of these questions is Market Basket Analysis (MBA).
MBA is a technique that identifies the strength of association between pairs of items and finds patterns of co-occurrence. Mapped back to the real-world applications above, it answers the question: 'Which other products are likely to be purchased when a customer buys a particular product?'
Many libraries implement the MBA technique, but we will focus on the two main algorithms: Apriori and Fpgrowth. Both are algorithms for frequent itemset mining and association rule learning over a dataset. They are implemented in mlxtend, a library of useful tools for day-to-day data science tasks. As it is an extension to regular Python, it must be installed separately with this command: pip install --quiet mlxtend.
In this article, we will apply these algorithms to a dataset downloaded from Kaggle. The dataset contains members' grocery shopping details over a period of time.
Apriori is a level-wise, candidate-generation algorithm. Basically, it generates candidate associations and prunes them with support-count filtering. The support count is the threshold, i.e. the minimum number of times a candidate must appear in the dataset to be considered frequent.
Wikipedia gives a very high-level but understandable pseudocode for Apriori.
T is the transaction dataset and epsilon is the support count. k is the size of the itemsets at the current level, i.e. the number of items (products) associated together. L_k is the set of frequent itemsets of size k, and C_k is the candidate set for level k.
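The level-wise idea behind that pseudocode can be sketched in plain Python. This is a simplified illustration, not mlxtend's optimized implementation:

```python
from itertools import combinations

def apriori_sketch(T, eps):
    """Simplified Apriori: T is a list of transactions (sets of items),
    eps is the minimum support count. Returns frequent itemsets per level k."""
    # Level 1: count single items and keep those meeting the support count
    counts = {}
    for t in T:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    L = {1: {s for s, c in counts.items() if c >= eps}}
    k = 2
    while L[k - 1]:
        # Candidate generation: combine items from frequent (k-1)-itemsets,
        # pruning any candidate with an infrequent (k-1)-subset
        items = set().union(*L[k - 1])
        Ck = {frozenset(c) for c in combinations(sorted(items), k)
              if all(frozenset(s) in L[k - 1] for s in combinations(c, k - 1))}
        # Count the surviving candidates against T and filter by eps
        counts = {c: sum(1 for t in T if c <= t) for c in Ck}
        L[k] = {c for c, n in counts.items() if n >= eps}
        k += 1
    del L[k - 1]  # drop the empty last level
    return L

T = [{"milk", "bread"}, {"milk", "butter"}, {"bread", "butter"},
     {"milk", "bread", "butter"}, {"milk", "bread"}]
levels = apriori_sketch(T, eps=2)
print(levels)
```

With eps=2, all three single items and all three pairs survive, while the triple {milk, bread, butter} appears only once and is filtered out.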
Fpgrowth is another way to find frequent itemsets, but without candidate generation, which improves performance compared to Apriori and makes it suitable for large datasets. Basically, it uses a frequent-pattern tree (FP-tree) structure that stores the association information and prunes with the support-count value. A complete explanation of this algorithm is available on Wikipedia.
Both algorithms, Apriori and Fpgrowth, share the same metrics.
Support indicates how frequently the itemset appears in the whole dataset.
Confidence shows the percentage of transactions containing X in which Y is also bought.
Lift is the ratio of the observed support to the support expected if X and Y were independent.
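A small worked example on toy transactions shows how the three metrics relate for a candidate rule {milk} → {bread}:

```python
# Five toy baskets; we evaluate the rule {milk} -> {bread}
transactions = [
    {"milk", "bread"}, {"milk", "butter"}, {"bread", "butter"},
    {"milk", "bread", "butter"}, {"milk", "bread"},
]
n = len(transactions)

support_milk = sum("milk" in t for t in transactions) / n             # 4/5 = 0.8
support_bread = sum("bread" in t for t in transactions) / n           # 4/5 = 0.8
support_both = sum({"milk", "bread"} <= t for t in transactions) / n  # 3/5 = 0.6

# Confidence: of the baskets with milk, how many also contain bread
confidence = support_both / support_milk  # 0.6 / 0.8 = 0.75

# Lift: observed support vs. what independence would predict
lift = confidence / support_bread         # 0.75 / 0.8 = 0.9375
```

Here lift is below 1, meaning milk and bread co-occur slightly less often than independent purchases would predict, despite the high confidence.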
Note that Confidence already incorporates Support, so in practice you can focus on Confidence and Lift as your metrics.
The full project is available on GitHub.