Market Basket Analysis in Python for Grocery Dataset¶

Introduction & Data Commentary¶

The full dataset can be found on Kaggle here: https://www.kaggle.com/datasets/irfanasrullah/groceries?select=groceries+-+groceries.csv

It is fairly basic and no strong changes need to be performed in advance; I chose to drop the '# Item(s)' column in place below but you could easily do so in Excel or another spreadsheet tool beforehand.

Our first stop is to make data transformations, preparing our data for onehot encoding, which is essentially turning it into a grid with values of "True" or "False" (or "0" or "1") representing the products. This is done in the form of transactions, as is the intent of the dataset overall.

Having data in this format allows us to explore it at length and create various visualizations to help identify rules between given pairs of grocery store items and assess them on various metrics so we can figure out which rules are most closely correlated.

As an example, this would allow you to know through data if milk and eggs were actually closely correlated or if that is simply an assumption that people make without legitimacy behind it.

In this example, we're not looking to identify or answer questions about specific pairings although we could if we wanted to. What we do instead is to seek to trim the amount of rules present in order to understand which products are closely related and therefore how they could be marketed or have their positions relative to one another improved in-store.

I won't go into great detail about each metric here as that information is available online, ubiquitously.

Importing Packages Needed for Analysis¶

In [9]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import os
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
from pandas.plotting import parallel_coordinates

Checking and setting active directory to avoid file issues.

In [6]:
print(os.getcwd())
path = "F:\[Personal]\Data Analytics Portfolio\Market Basket Analysis\mba"
os.chdir(path)
print(os.getcwd())
F:\[Personal]\Data Analytics Portfolio\Market Basket Analysis\mba
F:\[Personal]\Data Analytics Portfolio\Market Basket Analysis\mba

Read In Data and Initial Transformations¶

In [39]:
df = pd.read_csv(r'F:\[Personal]\Data Analytics Portfolio\Market Basket Analysis\mba\groceries - groceries.csv')

# Our data has a column that we do not need for analysis so we will be dropping it altogether

df.drop(columns = df.columns[0], axis = 1, inplace = True)
In [31]:
# Transforming the dataframe into a list of lists representing transactions

transactions = df.values.astype(str).tolist()

# Delete the nans in the list of lists

transactions = [[item for item in row if item != 'nan'] for row in transactions]

# Print first 5 lists to check our work

print(transactions[:5])
[['citrus fruit', 'semi-finished bread', 'margarine', 'ready soups'], ['tropical fruit', 'yogurt', 'coffee'], ['whole milk'], ['pip fruit', 'yogurt', 'cream cheese', 'meat spreads'], ['other vegetables', 'whole milk', 'condensed milk', 'long life bakery product']]

Calculating the Support Metric¶

In [32]:
# First, we instantiate the transition encoder which will enable us to onehot encode the data

encoder = TransactionEncoder().fit(transactions)

# Perform onehot encoding (could be chained above but calling out separately for readability)

onehot = encoder.transform(transactions)

# Convert onehot to a DataFrame

onehot = pd.DataFrame(onehot, columns = encoder.columns_)

# Check output

print(onehot)
      Instant food products  UHT-milk  abrasive cleaner  artif. sweetener  \
0                     False     False             False             False   
1                     False     False             False             False   
2                     False     False             False             False   
3                     False     False             False             False   
4                     False     False             False             False   
...                     ...       ...               ...               ...   
9830                  False     False             False             False   
9831                  False     False             False             False   
9832                  False     False             False             False   
9833                  False     False             False             False   
9834                  False     False             False             False   

      baby cosmetics  baby food   bags  baking powder  bathroom cleaner  \
0              False      False  False          False             False   
1              False      False  False          False             False   
2              False      False  False          False             False   
3              False      False  False          False             False   
4              False      False  False          False             False   
...              ...        ...    ...            ...               ...   
9830           False      False  False          False             False   
9831           False      False  False          False             False   
9832           False      False  False          False             False   
9833           False      False  False          False             False   
9834           False      False  False          False             False   

       beef  ...  turkey  vinegar  waffles  whipped/sour cream  whisky  \
0     False  ...   False    False    False               False   False   
1     False  ...   False    False    False               False   False   
2     False  ...   False    False    False               False   False   
3     False  ...   False    False    False               False   False   
4     False  ...   False    False    False               False   False   
...     ...  ...     ...      ...      ...                 ...     ...   
9830   True  ...   False    False    False                True   False   
9831  False  ...   False    False    False               False   False   
9832  False  ...   False    False    False               False   False   
9833  False  ...   False    False    False               False   False   
9834  False  ...   False     True    False               False   False   

      white bread  white wine  whole milk  yogurt  zwieback  
0           False       False       False   False     False  
1           False       False       False    True     False  
2           False       False        True   False     False  
3           False       False       False    True     False  
4           False       False        True   False     False  
...           ...         ...         ...     ...       ...  
9830        False       False        True   False     False  
9831        False       False       False   False     False  
9832        False       False       False    True     False  
9833        False       False       False   False     False  
9834        False       False       False   False     False  

[9835 rows x 169 columns]
In [33]:
# Computing Support for Single Items

print(onehot.mean())
Instant food products    0.008033
UHT-milk                 0.033452
abrasive cleaner         0.003559
artif. sweetener         0.003254
baby cosmetics           0.000610
                           ...   
white bread              0.042095
white wine               0.019014
whole milk               0.255516
yogurt                   0.139502
zwieback                 0.006914
Length: 169, dtype: float64

Using the Aprori Algorithm¶

In [34]:
frequent_itemsets = apriori(onehot, min_support = 0.001, use_colnames = True)

print(len(frequent_itemsets))
13492
In [35]:
rules = association_rules(frequent_itemsets, metric = 'support', min_threshold = 0.001)
print(rules[:5])
               antecedents              consequents  antecedent support  \
0          (bottled water)  (Instant food products)            0.110524   
1  (Instant food products)          (bottled water)            0.008033   
2  (Instant food products)                 (butter)            0.008033   
3                 (butter)  (Instant food products)            0.055414   
4           (citrus fruit)  (Instant food products)            0.082766   

   consequent support   support  confidence      lift  leverage  conviction  
0            0.008033  0.001017    0.009200  1.145296  0.000129    1.001178  
1            0.110524  0.001017    0.126582  1.145296  0.000129    1.018386  
2            0.055414  0.001220    0.151899  2.741145  0.000775    1.113765  
3            0.008033  0.001220    0.022018  2.741145  0.000775    1.014301  
4            0.008033  0.001118    0.013514  1.682347  0.000454    1.005556  
In [36]:
print(frequent_itemsets.nlargest(n = 15, columns = 'support'))
      support            itemsets
154  0.255516        (whole milk)
96   0.193493  (other vegetables)
115  0.183935        (rolls/buns)
130  0.174377              (soda)
155  0.139502            (yogurt)
10   0.110524     (bottled water)
116  0.108998   (root vegetables)
147  0.104931    (tropical fruit)
125  0.098526     (shopping bags)
122  0.093950           (sausage)
99   0.088968            (pastry)
28   0.082766      (citrus fruit)
9    0.080529      (bottled beer)
89   0.079817        (newspapers)
18   0.077682       (canned beer)
In [129]:
plt.figure(figsize=(12,6))
plt.xticks(rotation = 90)
sns.barplot(x = 'itemsets', y = 'support', data = frequent_itemsets.nlargest(n = 15, columns = 'support'))
Out[129]:
<Axes: xlabel='itemsets', ylabel='support'>

Whole milk obtains the most Support, followed by root vegetables and bread products before finally getting into preserved/canned goods in this set of 15.

Exploring Our Newly Made Rules¶

In [37]:
print(rules.columns)
Index(['antecedents', 'consequents', 'antecedent support',
       'consequent support', 'support', 'confidence', 'lift', 'leverage',
       'conviction'],
      dtype='object')
In [38]:
print(rules[['antecedents','consequents']])
                    antecedents  \
0               (bottled water)   
1       (Instant food products)   
2       (Instant food products)   
3                      (butter)   
4                (citrus fruit)   
...                         ...   
102761     (whipped/sour cream)   
102762         (tropical fruit)   
102763        (root vegetables)   
102764                 (yogurt)   
102765             (whole milk)   

                                              consequents  
0                                 (Instant food products)  
1                                         (bottled water)  
2                                                (butter)  
3                                 (Instant food products)  
4                                 (Instant food products)  
...                                                   ...  
102761  (other vegetables, tropical fruit, root vegeta...  
102762  (other vegetables, whipped/sour cream, root ve...  
102763  (other vegetables, whipped/sour cream, tropica...  
102764  (other vegetables, whipped/sour cream, tropica...  
102765  (other vegetables, whipped/sour cream, tropica...  

[102766 rows x 2 columns]

Visualization: Heatmap¶

Making a new frequent_itemsets and a new rules allows me to tinker with the numbers to generate a heatmap which isn't overwhelming.

Finding the right support levels to limit data to be comprehensible is more of an art than a science and can be better inferred in a business where there is domain knowledge that would inform numerical choices in analysis.

Here, the focus in on a readable visualization and to demonstrate skill and ability to extend this framework to any such analysis.

In [146]:
frequent_itemsets2 = apriori(onehot, min_support = 0.0001, use_colnames = True, max_len = 2)
rules2 = association_rules(frequent_itemsets2)
In [147]:
# Convert antecedents and consequents into strings

rules2['antecedents'] = rules2['antecedents'].apply(lambda a: ','.join(list(a)))
rules2['consequents'] = rules2['consequents'].apply(lambda a: ','.join(list(a)))

# Print to check

print(rules2[['antecedents','consequents']])
              antecedents             consequents
0               baby food             brown bread
1               baby food                  butter
2               baby food             butter milk
3               baby food                cake bar
4               baby food               chocolate
5               baby food                    curd
6               baby food                 dessert
7               baby food       finished products
8               baby food       frozen vegetables
9               baby food   fruit/vegetable juice
10              baby food        other vegetables
11              baby food                   pasta
12              baby food                  pastry
13              baby food               pip fruit
14              baby food              rolls/buns
15              baby food             salty snack
16              baby food                 sausage
17              baby food           shopping bags
18              baby food             soft cheese
19              baby food                   soups
20              baby food          tropical fruit
21              baby food                 waffles
22              baby food      whipped/sour cream
23              baby food             white bread
24              baby food                  yogurt
25   sound storage medium           bottled water
26   sound storage medium                   candy
27   sound storage medium                cat food
28  preservation products            citrus fruit
29   sound storage medium  frozen potato products
30   sound storage medium                     ham
31   sound storage medium                  pastry
32  preservation products      whipped/sour cream
33   sound storage medium           shopping bags
34   sound storage medium                    soda
35   sound storage medium             white bread
In [164]:
# Transform the antecedent, consequent, and support columns into a matrix

support_table = rules2.pivot(index = 'consequents', columns = 'antecedents', values = 'support')

# Generate heatmap of Support

fig, ax = plt.subplots(figsize=(3, 5))
sns.heatmap(support_table, cmap = 'rocket_r', cbar_kws={'label': 'Support'})
plt.show()

From this heatmap, we immediately see high Support for pairs of preservation products with bread products and citrus fruit. We'll see similar results for other rules below, indicating the strength of non-perishable items and their antecedent -> consequent pairings.

Visualization: Scatterplot¶

Scatterplots in this type of analysis allow us to identify natural thresholds in our data and we can see an entire dataset worth of rules, not just a few at a time.

The goal should be to use the findings to help us prune different rules in order to indentify the best ones.

We don't so much see individual information here but playing with the plot and values allow us to prune in a visual fashion.

In [48]:
sns.scatterplot(x = 'antecedent support', y = 'consequent support', size = 'lift', data = rules2)
plt.show()

Visualization: Parallel Coordinates Plot¶

In [97]:
# To make this plot, we must convert our rules to coordinates
# To keep things clean and separate, let's use a 3 to denote our code

frequent_itemsets3 = apriori(onehot, min_support = 0.0001, use_colnames = True, max_len = 2)
rules3 = association_rules(frequent_itemsets3)

# Extract the first item in each set and store it as a scalar value. This allows us to use coords properly next.

rules3['antecedent'] = rules3['antecedents'].apply(lambda antecedent: list(antecedent)[0])
rules3['consequent'] = rules3['consequents'].apply(lambda consequent: list(consequent)[0])

rules3['rule'] = rules3.index

# Define coordinates and label

coords = rules3[['antecedent','consequent','rule']]
In [130]:
# Generate the plot, indicating Support

import matplotlib as mpl

fig, ax = plt.subplots(figsize=(3, 10))
cmap = plt.get_cmap('rainbow')
rev_map = cmap.reversed()
bounds = np.arange(rules3['rule'].min(), rules3['rule'].max())
norm = mpl.colors.BoundaryNorm(bounds, 256)
parallel_coordinates(coords, 'rule', colormap = rev_map, ax = ax)
ax.legend_.remove()
plt.colorbar(mpl.cm.ScalarMappable(norm=norm, cmap=rev_map), ax=ax, orientation = 'horizontal', label='Support')
plt.show()

Here we can see strong support for preservation products with other such products and baby food as an atecedent with varying degrees of support for a wide variety of food items.

Neither of those findings are particularly surprising but this does help offer a look at specifics when considering which products to group together in-store and in advertising campaigns.

In [139]:
# Generate the plot, indicating Lift

frequent_itemsets4 = apriori(onehot, min_support = 0.0001, use_colnames = True, max_len = 2)
rules4 = association_rules(frequent_itemsets4, metric = 'lift', min_threshold = 100)

# Extract the first item in each set and store it as a scalar value. This allows us to use coords properly next.

rules4['antecedent'] = rules4['antecedents'].apply(lambda antecedent: list(antecedent)[0])
rules4['consequent'] = rules4['consequents'].apply(lambda consequent: list(consequent)[0])

rules4['rule'] = rules4.index

# Define coordinates and label

coords = rules4[['antecedent','consequent','rule']]

import matplotlib as mpl

fig, ax = plt.subplots(figsize=(3, 10))
cmap = plt.get_cmap('rainbow')
rev_map = cmap.reversed()
bounds = np.arange(rules4['rule'].min(), rules4['rule'].max())
norm = mpl.colors.BoundaryNorm(bounds, 256)
parallel_coordinates(coords, 'rule', colormap = rev_map, ax = ax)
ax.legend_.remove()
plt.colorbar(mpl.cm.ScalarMappable(norm=norm, cmap=rev_map), ax=ax, orientation = 'horizontal', label='Lift')
plt.show()

Recall that Lift is the Support of both X and Y over the Support for X multiplied by the Support for Y.

We see similar rules by strength of their Lift here as well with preservation products and other preservable goods or non-perishables being linked together.

Results & Discussion¶

Generating rules and then pruning them is what you make of it; the support thresholds that you choose and narrow in on will dictate which pairings show prominently in analysis.

This is more of an art than a science because, despite narrowing down to rules with higher metrics like Support and Lift, it is ultimately on the analyst and leadership to determine where commonalities thrive in a broader sense.

In other words, we identified clear linkages between non-perishable food items, the specifics of which ones might fit together most closely indicated by analysis.

What we found through the different visualizations and exploration that we did, is that preservation products, baby food, and sound storage mediums make excellent antecedents to a variety of products as seen in the parallel coordinates plot.

In terms of pairings on a more granular level, bread products and citrus fruit were identifed as having strong support, as we saw in the heatmap.

Earlier on, we looked at support over all frequent items, defined by the Support metric, and found that whole milk, vegetables, bread products, and non-perishable food items were largely found within the top 15 items.

The scatterplot was a useful visualization to help pare down rules over time, although I had already been experimenting with different thresholds. Still, it was a good verification that I was able to eliminate much of the "noise" that would have been present before such pruning.

All of these findings inform us about the links between these product types and the specific products, which enables us to suggest the aforementioned pairings be advertised and placed closely together in a brick-and-mortar store.