instacart-notebooks

by khanhnamle1994

notebooks/Instacart-Simple-Data-Exploration.ipynb/

Introduction

The Dataset is anonymized and contains a sample of over 3 million grocery orders from more than 200,000 Instacart users. The goal is then to predict which previously purchased products will be in a user’s next order. Now let's jump straight into the data and do some exploratory analysis.

# import the needed librairies
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

%matplotlib inline
import matplotlib.pyplot as plt  # Matlab-style plotting

import seaborn as sns
color = sns.color_palette()

import warnings
warnings.filterwarnings('ignore') #Supress unnecessary warnings for readability and cleaner presentation

pd.set_option('display.float_format', lambda x: '%.3f' % x) #Limiting floats output to 3 decimal points
# Now let's get and put the data in  pandas dataframe
order_products_train = pd.read_csv('order_products__train.csv')
order_products_prior = pd.read_csv('order_products__prior.csv')
orders = pd.read_csv('orders.csv')
products = pd.read_csv('products.csv')
aisles = pd.read_csv('aisles.csv')
departments = pd.read_csv('departments.csv')

Orders Products

Firstly, let's explore order_products_train and order_products_prior files. These files specify which products were purchased in each order. More specifically, order_products_prior contains previous order contents for all customers and order_products_train contains the last orders for some customers only.

print("The order_products_train size is : ", order_products_train.shape)
print("The order_products_prior size is : ", order_products_prior.shape)
The order_products_train size is :  (1384617, 4)
The order_products_prior size is :  (32434489, 4)
# display first five rows.
order_products_train.head(5)
order_id product_id add_to_cart_order reordered
0 1 49302 1 1
1 1 11109 2 1
2 1 10246 3 0
3 1 49683 4 0
4 1 43633 5 1
# display first five rows. 
order_products_prior.head(5)
order_id product_id add_to_cart_order reordered
0 2 33120 1 1
1 2 28985 2 1
2 2 9327 3 0
3 2 45918 4 1
4 2 30035 5 0

Now, let's concatenate them in the same order_products dataframe.

order_products_all = pd.concat([order_products_train, order_products_prior], axis=0)
print("The order_products_all size is : ", order_products_all.shape)
The order_products_all size is :  (33819106, 4)
# display first five rows.
order_products_all.head(5)
order_id product_id add_to_cart_order reordered
0 1 49302 1 1
1 1 11109 2 1
2 1 10246 3 0
3 1 49683 4 0
4 1 43633 5 1

Missing Data

total = order_products_all.isnull().sum().sort_values(ascending=False)
percent = (order_products_all.isnull().sum()/order_products_all.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total Missing', 'Percent'])
missing_data
Total Missing Percent
reordered 0 0.000
add_to_cart_order 0 0.000
product_id 0 0.000
order_id 0 0.000

There is no missing data in order_products_all dataset.

# Check the number of unique orders and unique products
orders_Unique = len(set(order_products_all.order_id))
products_Unique = len(set(order_products_all.product_id))
print("There are %s orders for %s products" %(orders_Unique, products_Unique))
There are 3346083 orders for 49685 products

Number of products that people usually order:

grouped = order_products_all.groupby("order_id")["add_to_cart_order"].aggregate("max").reset_index()
grouped = grouped.add_to_cart_order.value_counts()

sns.set_style('whitegrid')
f, ax = plt.subplots(figsize=(15, 12))
plt.xticks(rotation='vertical')
sns.barplot(grouped.index, grouped.values)

plt.ylabel('Number of Orders', fontsize=13)
plt.xlabel('Number of products added in order', fontsize=13)
plt.show()

We can observe that people usually order around 5 products.

Most ordered Products

Now let's identify which products are ordered the most.

grouped = order_products_all.groupby("product_id")["reordered"].aggregate({'Total_reorders': 'count'}).reset_index()
grouped = pd.merge(grouped, products[['product_id', 'product_name']], how='left', on=['product_id'])
grouped = grouped.sort_values(by='Total_reorders', ascending=False)[:10]
grouped
product_id Total_reorders product_name
24849 24852 491291 Banana
13173 13176 394930 Bag of Organic Bananas
21134 21137 275577 Organic Strawberries
21900 21903 251705 Organic Baby Spinach
47205 47209 220877 Organic Hass Avocado
47762 47766 184224 Organic Avocado
47622 47626 160792 Large Lemon
16794 16797 149445 Strawberries
26206 26209 146660 Limes
27842 27845 142813 Organic Whole Milk

Fruits like banana, strawberries...are the most ordered products.

grouped  = grouped.groupby(['product_name']).sum()['Total_reorders'].sort_values(ascending=False)

sns.set_style('darkgrid')
f, ax = plt.subplots(figsize=(12, 10))
plt.xticks(rotation='vertical')
sns.barplot(grouped.index, grouped.values)
plt.ylabel('Number of Reorders', fontsize=13)
plt.xlabel('Most ordered Products', fontsize=13)
plt.show()

Do people usually reorder the same previous ordered products?

grouped = order_products_all.groupby("reordered")["product_id"].aggregate({'Total_products': 'count'}).reset_index()
grouped['Ratios'] = grouped["Total_products"].apply(lambda x: x /grouped['Total_products'].sum())
grouped
reordered Total_products Ratios
0 0 13863746 0.410
1 1 19955360 0.590

59% of ordered products are previously ordered by customers.

grouped  = grouped.groupby(['reordered']).sum()['Total_products'].sort_values(ascending=False)

sns.set_style('whitegrid')
f, ax = plt.subplots(figsize=(5, 8))
sns.barplot(grouped.index, grouped.values, palette='RdBu_r')
plt.ylabel('Number of Products', fontsize=13)
plt.xlabel('Reordered or Not Reordered', fontsize=13)
plt.ticklabel_format(style='plain', axis='y')
plt.show()

Most Reordered Products

Which products are usually reordered?

grouped = order_products_all.groupby("product_id")["reordered"].aggregate({'reorder_sum': sum,'reorder_total': 'count'}).reset_index()
grouped['reorder_probability'] = grouped['reorder_sum'] / grouped['reorder_total']
grouped = pd.merge(grouped, products[['product_id', 'product_name']], how='left', on=['product_id'])
grouped = grouped[grouped.reorder_total > 75].sort_values(['reorder_probability'], ascending=False)[:10]
grouped
product_id reorder_sum reorder_total reorder_probability product_name
2074 2075 84 90 0.933 Serenity Ultimate Extrema Overnight Pads
27737 27740 94 102 0.922 Chocolate Love Bar
35601 35604 93 104 0.894 Maca Buttercups
38248 38251 99 111 0.892 Benchbreak Chardonnay
36798 36801 88 99 0.889 Organic Blueberry B Mega
10233 10236 114 131 0.870 Fragrance Free Clay with Natural Odor Eliminat...
20595 20598 99 114 0.868 Thousand Island Salad Snax
5455 5457 78 90 0.867 Classic Carbonated Natural Mineral Water
35493 35496 394 457 0.862 Real2 Alkalized Water 500 ml
9289 9292 2580 2995 0.861 Half And Half Ultra Pasteurized
grouped  = grouped.groupby(['product_name']).sum()['reorder_probability'].sort_values(ascending=False)

sns.set_style('darkgrid')
f, ax = plt.subplots(figsize=(12, 10))
plt.xticks(rotation='vertical')
sns.barplot(grouped.index, grouped.values)
plt.ylim([0.85,0.95])
plt.ylabel('Reorder probability', fontsize=13)
plt.xlabel('Most reordered products', fontsize=12)
plt.show()

Orders

Let's explore now the orders.csv file.

print("The orders data size is : ", orders.shape)
print("Columns in orders data are: ", orders.columns.values)
The orders data size is :  (3421083, 7)
Columns in orders data are:  ['order_id' 'user_id' 'eval_set' 'order_number' 'order_dow'
 'order_hour_of_day' 'days_since_prior_order']
# display first five rows of our dataset.
orders.head(5)
order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order
0 2539329 1 prior 1 2 8 nan
1 2398795 1 prior 2 3 7 15.000
2 473747 1 prior 3 3 12 21.000
3 2254736 1 prior 4 4 7 29.000
4 431534 1 prior 5 4 15 28.000

Missing Data

orders_na = (orders.isnull().sum() / len(orders)) * 100
orders_na = orders_na.drop(orders_na[orders_na == 0].index).sort_values(ascending=False)
orders_na
days_since_prior_order   6.028
dtype: float64

The only feature with missing values is days_since_prior_order with 6.028% missing.

Time of orders

Time at which people usually order products. Hours of Order in a Day:

grouped = orders.groupby("order_id")["order_hour_of_day"].aggregate("sum").reset_index()
grouped = grouped.order_hour_of_day.value_counts()

sns.set_style('darkgrid')
f, ax = plt.subplots(figsize=(15, 10))
sns.barplot(grouped.index, grouped.values)
plt.ylabel('Number of orders', fontsize=13)
plt.xlabel('Hours of order in a day', fontsize=13)
plt.show()

People mostly order between 8 and 19 (anonimyzed hours and probably between 8 a.m and 7 p.m.)

Days of Orders in a week:

grouped = orders.groupby("order_id")["order_dow"].aggregate("sum").reset_index()
grouped = grouped.order_dow.value_counts()

f, ax = plt.subplots(figsize=(10, 10))
sns.barplot(grouped.index, grouped.values)
plt.ylabel('Number of orders', fontsize=13)
plt.xlabel('Days of order in a week', fontsize=13)
plt.show()

People usually order at days 0 and 1 (anonimyzed days and probably the week end)

Period of Reorders:

grouped = orders.groupby("order_id")["days_since_prior_order"].aggregate("sum").reset_index()
grouped = grouped.days_since_prior_order.value_counts()

from matplotlib.ticker import FormatStrFormatter
f, ax = plt.subplots(figsize=(15, 10))
sns.barplot(grouped.index, grouped.values)
ax.xaxis.set_major_formatter(FormatStrFormatter('%.0f'))
plt.ylabel('Number of orders', fontsize=13)
plt.xlabel('Period of reorder', fontsize=13)
plt.show()

People usually reorder either after 1 week or after 1 month.

Orders in the whole dataset

Number and ratio of orders from the three datasets (prior, train, test).

grouped = orders.groupby("eval_set")["order_id"].aggregate({'Total_orders': 'count'}).reset_index()
grouped['Ratio'] = grouped["Total_orders"].apply(lambda x: x /grouped['Total_orders'].sum())
grouped
eval_set Total_orders Ratio
0 prior 3214874 0.940
1 test 75000 0.022
2 train 131209 0.038
grouped  = grouped.groupby(['eval_set']).sum()['Total_orders'].sort_values(ascending=False)

sns.set_style('whitegrid')
f, ax = plt.subplots(figsize=(8, 8))
sns.barplot(grouped.index, grouped.values, palette='coolwarm')
plt.ylabel('Number of Orders', fontsize=13)
plt.xlabel('datasets', fontsize=13)
plt.show()

Customers in the whole dataset

Let's check the total number of unique customers in the three datasets (prior, train, test).

print("Number of unique customers in the whole dataset: ",len(set(orders.user_id)))
Number of unique customers in the whole dataset:  206209
grouped = orders.groupby("eval_set")["user_id"].apply(lambda x: len(x.unique()))

plt.figure(figsize=(7,8))
sns.barplot(grouped.index, grouped.values, palette='coolwarm')
plt.ylabel('Number of users', fontsize=13)
plt.xlabel('Eval set', fontsize=13)
plt.title("Number of unique customers in each dataset")
plt.show()

Orders made by each customer

Let's check the number of orders made by each costumer in the whole dataset.

grouped = orders.groupby('user_id')['order_id'].apply(lambda x: len(x.unique())).reset_index()
grouped = grouped.groupby('order_id').aggregate("count")

sns.set_style("whitegrid")
f, ax = plt.subplots(figsize=(15, 12))
sns.barplot(grouped.index, grouped.user_id)
plt.ylabel('Numbers of Customers')
plt.xlabel('Number of Orders per customer')
plt.xticks(rotation='vertical')
plt.show()

We can observe that most customers made 4 orders.

Now let's explore the items datasets (products, departments and aisles files).

products.head(5)
product_id product_name aisle_id department_id
0 1 Chocolate Sandwich Cookies 61 19
1 2 All-Seasons Salt 104 13
2 3 Robust Golden Unsweetened Oolong Tea 94 7
3 4 Smart Ones Classic Favorites Mini Rigatoni Wit... 38 1
4 5 Green Chile Anytime Sauce 5 13
departments.head(5)
department_id department
0 1 frozen
1 2 other
2 3 bakery
3 4 produce
4 5 alcohol
aisles.head(5)
aisle_id aisle
0 1 prepared soups salads
1 2 specialty cheeses
2 3 energy granola bars
3 4 instant foods
4 5 marinades meat preparation

Now let's combine them in a single dataframe.

items = pd.merge(left =pd.merge(left=products, right=departments, how='left'), right=aisles, how='left')
items.head()
product_id product_name aisle_id department_id department aisle
0 1 Chocolate Sandwich Cookies 61 19 snacks cookies cakes
1 2 All-Seasons Salt 104 13 pantry spices seasonings
2 3 Robust Golden Unsweetened Oolong Tea 94 7 beverages tea
3 4 Smart Ones Classic Favorites Mini Rigatoni Wit... 38 1 frozen frozen meals
4 5 Green Chile Anytime Sauce 5 13 pantry marinades meat preparation

Most important Departments (by number of products)

grouped = items.groupby("department")["product_id"].aggregate({'Total_products': 'count'}).reset_index()
grouped['Ratio'] = grouped["Total_products"].apply(lambda x: x /grouped['Total_products'].sum())
grouped.sort_values(by='Total_products', ascending=False, inplace=True)
grouped
department Total_products Ratio
17 personal care 6563 0.132
20 snacks 6264 0.126
16 pantry 5371 0.108
3 beverages 4365 0.088
10 frozen 4007 0.081
7 dairy eggs 3449 0.069
11 household 3085 0.062
6 canned goods 2092 0.042
9 dry goods pasta 1858 0.037
19 produce 1684 0.034
2 bakery 1516 0.031
8 deli 1322 0.027
14 missing 1258 0.025
12 international 1139 0.023
4 breakfast 1115 0.022
1 babies 1081 0.022
0 alcohol 1054 0.021
18 pets 972 0.020
13 meat seafood 907 0.018
15 other 548 0.011
5 bulk 38 0.001
grouped = grouped.groupby(['department']).sum()['Total_products'].sort_values(ascending=False)

sns.set_style("darkgrid")
f, ax = plt.subplots(figsize=(12, 15))
plt.xticks(rotation='vertical')
sns.barplot(grouped.index, grouped.values)
plt.ylabel('Number of products', fontsize=13)
plt.xlabel('Departments', fontsize=13)
plt.show()

Most important Aisles in each Department (by number of Products)

grouped = items.groupby(["department", "aisle"])["product_id"].aggregate({'Total_products': 'count'}).reset_index()

grouped.sort_values(by='Total_products', ascending=False, inplace=True)

fig, axes = plt.subplots(7,3, figsize=(20,45), gridspec_kw =  dict(hspace=1.4))

for (aisle, group), ax in zip(grouped.groupby(["department"]), axes.flatten()):
    g = sns.barplot(group.aisle, group.Total_products , ax=ax)
    ax.set(xlabel = "Aisles", ylabel=" Number of products")
    g.set_xticklabels(labels = group.aisle,rotation=90, fontsize=12)
    ax.set_title(aisle, fontsize=15)

Most important Aisles over all Departments (by number of Products)

grouped = items.groupby("aisle")["product_id"].aggregate({'Total_products': 'count'}).reset_index()
grouped['Ratio'] = grouped["Total_products"].apply(lambda x: x /grouped['Total_products'].sum())
grouped = grouped.sort_values(by='Total_products', ascending=False)[:20]
grouped
aisle Total_products Ratio
85 missing 1258 0.025
18 candy chocolate 1246 0.025
71 ice cream ice 1091 0.022
130 vitamins supplements 1038 0.021
133 yogurt 1026 0.021
25 chips pretzels 989 0.020
125 tea 894 0.018
93 packaged cheese 891 0.018
59 frozen meals 880 0.018
31 cookies cakes 874 0.018
42 energy granola bars 832 0.017
67 hair care 816 0.016
122 spices seasonings 797 0.016
75 juice nectars 792 0.016
32 crackers 747 0.015
118 soup broth bouillon 737 0.015
4 baby food formula 718 0.014
28 coffee 680 0.014
110 refrigerated 675 0.014
26 cleaning products 655 0.013
grouped  = grouped.groupby(['aisle']).sum()['Total_products'].sort_values(ascending=False)

f, ax = plt.subplots(figsize=(12, 15))
plt.xticks(rotation='vertical')
sns.barplot(grouped.index, grouped.values)
plt.ylabel('Number of products', fontsize=13)
plt.xlabel('Aisles', fontsize=13)
plt.show()