instacart-notebooks

by khanhnamle1994

notebooks/Customer-Segments-with-PCA.ipynb/

In this notebook I will try to find a possible customer segmentation that makes it possible to classify customers according to their different purchases. I hope this information will be useful for the next prediction task. Since there are thousands of products in the dataset, I will rely on aisles, which represent categories of products. Even with aisles there would be too many features, so I will use Principal Component Analysis to find new dimensions along which clustering is easier. I will then try to find possible explanations for the identified clusters.

Initial Exploration

import numpy as np 
import pandas as pd 

from subprocess import check_output
print(check_output(["ls"]).decode("utf8"))
Customer-Segments-with-PCA.ipynb
Exploratory-Data-Analysis.ipynb
aisles.csv
departments.csv
order_products__prior.csv
order_products__train.csv
orders.csv
products.csv

orders = pd.read_csv('orders.csv')
orders.head()
order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order
0 2539329 1 prior 1 2 8 NaN
1 2398795 1 prior 2 3 7 15.0
2 473747 1 prior 3 3 12 21.0
3 2254736 1 prior 4 4 7 29.0
4 431534 1 prior 5 4 15 28.0
prior = pd.read_csv('order_products__prior.csv')
prior.head()
order_id product_id add_to_cart_order reordered
0 2 33120 1 1
1 2 28985 2 1
2 2 9327 3 0
3 2 45918 4 1
4 2 30035 5 0
train = pd.read_csv('order_products__train.csv')
train.head()
order_id product_id add_to_cart_order reordered
0 1 49302 1 1
1 1 11109 2 1
2 1 10246 3 0
3 1 49683 4 0
4 1 43633 5 1

This is my understanding of the dataset structure:

  • Users are identified by user_id in the orders.csv file. Each row of the orders.csv file represents an order made by a user. Orders are identified by order_id.
  • Each order of a user is characterized by an order_number, which specifies when it was made with respect to the user's other orders.
  • Each order consists of a set of products, each characterized by an add_to_cart_order feature representing the sequence in which it was added to the cart in that order.
  • For each user we have n - 1 prior orders and 1 train order, OR n - 1 prior orders and 1 test order in which we have to state which products have been reordered (a quick check of this structure is sketched below).
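A quick sanity check of this structure (a minimal sketch using the orders frame loaded above):

# Every user should have exactly one non-prior ('train' or 'test') order
last_orders = orders[orders['eval_set'] != 'prior']
print(last_orders.groupby('user_id').size().max())  # expected: 1
print(orders['eval_set'].value_counts())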
order_prior = pd.merge(prior, orders, on='order_id')
order_prior = order_prior.sort_values(by=['user_id','order_id'])
order_prior.head()
order_id product_id add_to_cart_order reordered user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order
4089398 431534 196 1 1 1 prior 5 4 15 28.0
4089399 431534 12427 2 1 1 prior 5 4 15 28.0
4089400 431534 10258 3 1 1 prior 5 4 15 28.0
4089401 431534 25133 4 1 1 prior 5 4 15 28.0
4089402 431534 10326 5 0 1 prior 5 4 15 28.0
products = pd.read_csv('products.csv')
products.head()
product_id product_name aisle_id department_id
0 1 Chocolate Sandwich Cookies 61 19
1 2 All-Seasons Salt 104 13
2 3 Robust Golden Unsweetened Oolong Tea 94 7
3 4 Smart Ones Classic Favorites Mini Rigatoni Wit... 38 1
4 5 Green Chile Anytime Sauce 5 13
aisles = pd.read_csv('aisles.csv')
aisles.head()
aisle_id aisle
0 1 prepared soups salads
1 2 specialty cheeses
2 3 energy granola bars
3 4 instant foods
4 5 marinades meat preparation
print(aisles.shape)
(134, 2)
_mt = pd.merge(prior, products, on='product_id')
_mt = pd.merge(_mt, orders, on='order_id')
mt = pd.merge(_mt, aisles, on='aisle_id')
mt.head(10)
order_id product_id add_to_cart_order reordered product_name aisle_id department_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order aisle
0 2 33120 1 1 Organic Egg Whites 86 16 202279 prior 3 5 9 8.0 eggs
1 26 33120 5 0 Organic Egg Whites 86 16 153404 prior 2 0 16 7.0 eggs
2 120 33120 13 0 Organic Egg Whites 86 16 23750 prior 11 6 8 10.0 eggs
3 327 33120 5 1 Organic Egg Whites 86 16 58707 prior 21 6 9 8.0 eggs
4 390 33120 28 1 Organic Egg Whites 86 16 166654 prior 48 0 12 9.0 eggs
5 537 33120 2 1 Organic Egg Whites 86 16 180135 prior 15 2 8 3.0 eggs
6 582 33120 7 1 Organic Egg Whites 86 16 193223 prior 6 2 19 10.0 eggs
7 608 33120 5 1 Organic Egg Whites 86 16 91030 prior 11 3 21 12.0 eggs
8 623 33120 1 1 Organic Egg Whites 86 16 37804 prior 63 3 12 3.0 eggs
9 689 33120 4 1 Organic Egg Whites 86 16 108932 prior 16 1 13 3.0 eggs
mt['product_name'].value_counts()[0:10]
Banana                    472565
Bag of Organic Bananas    379450
Organic Strawberries      264683
Organic Baby Spinach      241921
Organic Hass Avocado      213584
Organic Avocado           176815
Large Lemon               152657
Strawberries              142951
Limes                     140627
Organic Whole Milk        137905
Name: product_name, dtype: int64
len(mt['product_name'].unique())
49677
prior.shape
(32434489, 4)

Clustering Customers

We are dealing with 134 types of products (aisles).

len(mt['aisle'].unique())
134

Fresh fruits and fresh vegetables are the best-selling goods.

mt['aisle'].value_counts()[0:10]
fresh fruits                     3642188
fresh vegetables                 3418021
packaged vegetables fruits       1765313
yogurt                           1452343
packaged cheese                   979763
milk                              891015
water seltzer sparkling water     841533
chips pretzels                    722470
soy lactosefree                   638253
bread                             584834
Name: aisle, dtype: int64

I want to find possible clusters among the different customers and substitute the single user_id with the cluster to which each user is assumed to belong. Hopefully this will eventually improve the performance of the next prediction model.

The first thing to do is to create a dataframe with all the purchases made by each user.

cust_prod = pd.crosstab(mt['user_id'], mt['aisle'])
cust_prod.head(10)
aisle air fresheners candles asian foods baby accessories baby bath body care baby food formula bakery desserts baking ingredients baking supplies decor beauty beers coolers ... spreads tea tofu meat alternatives tortillas flat bread trail mix snack mix trash bags liners vitamins supplements water seltzer sparkling water white wines yogurt
user_id
1 0 0 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 1
2 0 3 0 0 0 0 2 0 0 0 ... 3 1 1 0 0 0 0 2 0 42
3 0 0 0 0 0 0 0 0 0 0 ... 4 1 0 0 0 0 0 2 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 1 0 0
5 0 2 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 3
6 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 2 0 0 0 ... 0 0 0 0 0 0 0 0 0 5
8 0 1 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 6 0 2 0 0 0 ... 0 0 0 0 0 0 0 2 0 19
10 0 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 2

10 rows × 134 columns

cust_prod.shape
(206209, 134)

We can then apply Principal Component Analysis to the resulting dataframe. This will reduce the number of features from the number of aisles to 6, the number of principal components I have chosen.

from sklearn.decomposition import PCA
pca = PCA(n_components=6)
pca.fit(cust_prod)
pca_samples = pca.transform(cust_prod)
ps = pd.DataFrame(pca_samples)
ps.head()
0 1 2 3 4 5
0 -24.215659 2.429427 -2.466370 -0.145694 0.269080 -1.432736
1 6.463208 36.751116 8.382553 15.097532 -6.920947 -0.978333
2 -7.990302 2.404383 -11.030064 0.672219 -0.442314 -2.823111
3 -27.991129 -0.755823 -1.921732 2.091887 -0.288229 0.926190
4 -19.896394 -2.637225 0.533229 3.679229 0.612819 -1.624025
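To see how much information the six components retain, we can inspect the explained variance of the fitted pca object (a quick sketch):

# Fraction of the total variance captured by each of the 6 components, and their sum
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())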

I have plotted several pairs of components looking for the one that, in my opinion, is most suitable for KMeans clustering, and I have chosen the (PC4, PC1) pair. Since each component is a projection of all the points of the original dataset, I think each component is representative of the dataset.
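That exploration can be reproduced as a grid of scatter plots over all 15 pairs of the 6 components (a minimal sketch, assuming the ps dataframe above):

# Scatter plot of every pair of principal components, to eyeball which pair separates customers best
import itertools
from matplotlib import pyplot as plt

fig, axes = plt.subplots(3, 5, figsize=(20, 12))
for ax, (i, j) in zip(axes.ravel(), itertools.combinations(range(6), 2)):
    ax.scatter(ps[i], ps[j], s=2, alpha=0.3)
    ax.set_xlabel('PC' + str(i))
    ax.set_ylabel('PC' + str(j))
plt.tight_layout()
plt.show()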

from matplotlib import pyplot as plt
%matplotlib inline
from mpl_toolkits.mplot3d import Axes3D
from mpl_toolkits.mplot3d import proj3d
tocluster = pd.DataFrame(ps[[4,1]])
print (tocluster.shape)
print (tocluster.head())

fig = plt.figure(figsize=(8,8))
plt.plot(tocluster[4], tocluster[1], 'o', markersize=2, color='blue', alpha=0.5, label='class1')

plt.xlabel('x_values')
plt.ylabel('y_values')
plt.legend()
plt.show()
(206209, 2)
          4          1
0  0.269080   2.429427
1 -6.920947  36.751116
2 -0.442314   2.404383
3 -0.288229  -0.755823
4  0.612819  -2.637225
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

clusterer = KMeans(n_clusters=4,random_state=42).fit(tocluster)
centers = clusterer.cluster_centers_
c_preds = clusterer.predict(tocluster)
print(centers)
[[ -0.11868919   0.09644088]
 [-11.26759021  65.248165  ]
 [ -4.71387736 -40.63421033]
 [ 76.82339894  26.26358548]]
print (c_preds[0:100])
[0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 2
 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 2 0 0 0 0 0 0 0 0 0 0 2 0 0 2 1 0 0 0 0 0 0 0 0 0 0]
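The silhouette_score imported above is never actually used; here is a minimal sketch of how it could help validate the choice of four clusters, scored on a random subsample since the full 206,209 points would be slow:

# Silhouette score on a 10,000-point subsample (sketch)
sample_idx = np.random.RandomState(42).choice(len(tocluster), 10000, replace=False)
print(silhouette_score(tocluster.iloc[sample_idx], c_preds[sample_idx]))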

Here is how our clusters appear:

%matplotlib inline
fig = plt.figure(figsize=(8,8))
colors = ['orange','blue','purple','green']
colored = [colors[k] for k in c_preds]
print (colored[0:10])
plt.scatter(tocluster[4],tocluster[1],  color = colored)
for ci,c in enumerate(centers):
    plt.plot(c[0], c[1], 'o', markersize=8, color='red', alpha=0.9, label=''+str(ci))

plt.xlabel('x_values')
plt.ylabel('y_values')
plt.legend()
plt.show()
['orange', 'blue', 'orange', 'orange', 'orange', 'orange', 'orange', 'orange', 'orange', 'orange']

We have found a possible clustering for our customers. Let's check whether we can also find some interesting patterns beneath it.

clust_prod = cust_prod.copy()
clust_prod['cluster'] = c_preds

clust_prod.head(10)
aisle air fresheners candles asian foods baby accessories baby bath body care baby food formula bakery desserts baking ingredients baking supplies decor beauty beers coolers ... tea tofu meat alternatives tortillas flat bread trail mix snack mix trash bags liners vitamins supplements water seltzer sparkling water white wines yogurt cluster
user_id
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
2 0 3 0 0 0 0 2 0 0 0 ... 1 1 0 0 0 0 2 0 42 1
3 0 0 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 2 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 0 0 0 1 0 0 0
5 0 2 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 3 0
6 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 2 0 0 0 ... 0 0 0 0 0 0 0 0 5 0
8 0 1 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 6 0 2 0 0 0 ... 0 0 0 0 0 0 2 0 19 0
10 0 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 2 0

10 rows × 135 columns

print (clust_prod.shape)
(206209, 135)
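Before averaging per cluster, it is worth checking how customers are distributed across the four clusters (a quick sketch; judging from the scatter plot above, cluster 0 is likely by far the largest):

# Number of customers assigned to each cluster
print(clust_prod['cluster'].value_counts())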
%matplotlib inline
# Mean number of purchases per aisle for each of the four clusters, one bar chart per cluster
f,arr = plt.subplots(2,2,sharex=True,figsize=(15,15))

c0 = clust_prod[clust_prod['cluster']==0].drop('cluster',axis=1).mean()
arr[0,0].bar(range(len(clust_prod.drop('cluster',axis=1).columns)),c0)

c1 = clust_prod[clust_prod['cluster']==1].drop('cluster',axis=1).mean()
arr[0,1].bar(range(len(clust_prod.drop('cluster',axis=1).columns)),c1)

c2 = clust_prod[clust_prod['cluster']==2].drop('cluster',axis=1).mean()
arr[1,0].bar(range(len(clust_prod.drop('cluster',axis=1).columns)),c2)

c3 = clust_prod[clust_prod['cluster']==3].drop('cluster',axis=1).mean()
arr[1,1].bar(range(len(clust_prod.drop('cluster',axis=1).columns)),c3)
plt.show()

Let's check out the top 10 goods bought by people in each cluster. We will rely first on the absolute data and then on the percentages among the top 8 products for each cluster.

c0.sort_values(ascending=False)[0:10]
aisle
fresh fruits                     12.997293
fresh vegetables                 11.264617
packaged vegetables fruits        6.532016
yogurt                            4.838682
packaged cheese                   3.754675
milk                              3.303355
water seltzer sparkling water     3.168569
chips pretzels                    2.782964
soy lactosefree                   2.349505
bread                             2.279440
dtype: float64
c1.sort_values(ascending=False)[0:10]
aisle
fresh fruits                     84.445473
yogurt                           62.984685
packaged vegetables fruits       28.129081
water seltzer sparkling water    25.795860
fresh vegetables                 22.891787
milk                             22.726523
chips pretzels                   19.449680
packaged cheese                  19.042915
energy granola bars              19.022383
refrigerated                     16.012959
dtype: float64
c2.sort_values(ascending=False)[0:10]
aisle
fresh vegetables                 96.941836
fresh fruits                     51.419980
packaged vegetables fruits       27.925411
fresh herbs                      11.318104
packaged cheese                  10.646082
yogurt                            9.926398
soy lactosefree                   8.805224
milk                              8.353379
frozen produce                    7.815187
water seltzer sparkling water     6.770039
dtype: float64
c3.sort_values(ascending=False)[0:10]
aisle
baby food formula             90.031453
fresh fruits                  72.334056
fresh vegetables              50.059111
packaged vegetables fruits    34.557484
yogurt                        33.242950
packaged cheese               24.305315
milk                          23.996746
bread                         12.200651
chips pretzels                11.457701
crackers                      11.247831
dtype: float64

A first analysis of the clusters confirms the initial hypothesis that:

  • fresh fruits
  • fresh vegetables
  • packaged vegetables fruits
  • yogurt
  • packaged cheese
  • milk
  • water seltzer sparkling water
  • chips pretzels

are products which are generically bought by the majority of customers.

What we can inspect here is whether the clusters differ in quantities and proportions with respect to these goods, or whether a cluster is characterized by some goods not included in this list. For instance, we can already see that cluster 3 is characterized by the 'Baby Food Formula' product, which is a significant difference with respect to the other clusters.

from IPython.display import display, HTML
cluster_means = [[c0['fresh fruits'],c0['fresh vegetables'],c0['packaged vegetables fruits'], c0['yogurt'], c0['packaged cheese'], c0['milk'],c0['water seltzer sparkling water'],c0['chips pretzels']],
                 [c1['fresh fruits'],c1['fresh vegetables'],c1['packaged vegetables fruits'], c1['yogurt'], c1['packaged cheese'], c1['milk'],c1['water seltzer sparkling water'],c1['chips pretzels']],
                 [c2['fresh fruits'],c2['fresh vegetables'],c2['packaged vegetables fruits'], c2['yogurt'], c2['packaged cheese'], c2['milk'],c2['water seltzer sparkling water'],c2['chips pretzels']],
                 [c3['fresh fruits'],c3['fresh vegetables'],c3['packaged vegetables fruits'], c3['yogurt'], c3['packaged cheese'], c3['milk'],c3['water seltzer sparkling water'],c3['chips pretzels']]]
cluster_means = pd.DataFrame(cluster_means, columns = ['fresh fruits','fresh vegetables','packaged vegetables fruits','yogurt','packaged cheese','milk','water seltzer sparkling water','chips pretzels'])
HTML(cluster_means.to_html())
fresh fruits fresh vegetables packaged vegetables fruits yogurt packaged cheese milk water seltzer sparkling water chips pretzels
0 12.997293 11.264617 6.532016 4.838682 3.754675 3.303355 3.168569 2.782964
1 84.445473 22.891787 28.129081 62.984685 19.042915 22.726523 25.795860 19.449680
2 51.419980 96.941836 27.925411 9.926398 10.646082 8.353379 6.770039 5.795979
3 72.334056 50.059111 34.557484 33.242950 24.305315 23.996746 10.527657 11.457701
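The same table can be built more compactly with a groupby (a sketch equivalent to the cell above):

# Per-cluster mean purchases for the top 8 aisles, via groupby
top8 = ['fresh fruits','fresh vegetables','packaged vegetables fruits','yogurt',
        'packaged cheese','milk','water seltzer sparkling water','chips pretzels']
cluster_means_alt = clust_prod.groupby('cluster')[top8].mean().reset_index(drop=True)
HTML(cluster_means_alt.to_html())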

The following table depicts the percentage of each of these goods with respect to the other top 8 in each cluster. It is easy to spot some interesting differences among the clusters.

It seems that people in cluster 2 buy more fresh vegetables than people in any other cluster. As the absolute data shows, clusters 1, 2 and 3 all contain customers who buy far more goods than the bulk of customers in cluster 0.

People in cluster 1 buy more yogurt than people in the other clusters.

The absolute data shows that people in cluster 3 buy a lot of 'Baby Food Formula', which is not even listed in the overall top 8 products but mainly characterizes this cluster. Coherently (I think) with this observation, they also buy more milk than the others.

cluster_perc = cluster_means.iloc[:, :].apply(lambda x: (x / x.sum())*100,axis=1)
HTML(cluster_perc.to_html())
fresh fruits fresh vegetables packaged vegetables fruits yogurt packaged cheese milk water seltzer sparkling water chips pretzels
0 26.720216 23.158130 13.428710 9.947504 7.718970 6.791135 6.514038 5.721298
1 29.581621 8.019094 9.853741 22.063813 6.670817 7.961201 9.036403 6.813309
2 23.611072 44.513837 12.822815 4.558012 4.888477 3.835712 3.108672 2.661403
3 27.769415 19.217949 13.266795 12.762139 9.330935 9.212474 4.041622 4.398670

I think other interesting information may come from looking at the 10th to 15th most-bought products for each cluster, which will not include the generic products (i.e. vegetables, fruits, water, etc.) bought by everyone.

c0.sort_values(ascending=False)[10:15]
aisle
refrigerated      2.168735
ice cream ice     2.082699
frozen produce    2.001447
eggs              1.778217
crackers          1.766064
dtype: float64
c1.sort_values(ascending=False)[10:15]
aisle
soy lactosefree    13.437227
bread              11.515146
crackers           10.998149
cereal              9.971054
candy chocolate     9.347526
dtype: float64
c2.sort_values(ascending=False)[10:15]
aisle
eggs                        6.176555
canned jarred vegetables    6.099542
bread                       6.015169
chips pretzels              5.795979
refrigerated                5.281124
dtype: float64
c3.sort_values(ascending=False)[10:15]
aisle
soy lactosefree                  11.003254
frozen produce                   10.577007
water seltzer sparkling water    10.527657
refrigerated                      8.530369
eggs                              8.318330
dtype: float64

As you can note, by taking more products into account the clusters start to differ significantly.
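Finally, to feed this segmentation into the next prediction task (the goal stated at the beginning), each user_id can be mapped to its cluster label; a minimal sketch:

# Attach each user's cluster label so it can be used as a feature later on
user_cluster = pd.DataFrame({'user_cluster': c_preds}, index=cust_prod.index)
orders_clustered = orders.merge(user_cluster, left_on='user_id', right_index=True, how='left')
orders_clustered.head()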