Experimenting with a Dataset of E-commerce Transactions Using Dask on Saturn Cloud
By: Dan Roth, New York Times
Part 1: Exploratory Data Analysis
The current state of the data science industry requires practitioners to employ a wide variety of tools to achieve their aims. These range from mainstays like SQL and Python to a slew of machine learning libraries like scikit-learn, as well as open-source offerings such as the advanced models produced by HuggingFace. It is, at times, dizzying to sort through the many approaches a data analyst can take to reach their end goals. Thankfully, tools like Saturn Cloud seek to consolidate rather than complicate the current web of offerings. Saturn Cloud currently allows users to host Jupyter Labs on local machines or AWS instances (with more cloud providers planned).
Saturn Cloud’s true power lies in its implementation of Dask as a well-integrated framework for distributing computational loads across preset clusters. Much of the system infrastructure is already set in place by the Saturn Cloud platform, which cuts out many of the minutiae that accompany preparing containerized environments using services like Docker and Kubernetes.
Dask does not simply expedite computation; it also lets the user stay within a Pythonic workflow when they might otherwise be torn between SQL and Spark, which are typically used in tandem to speed up database querying and ETL pipelines. There is a lot to be gained when a data practitioner can comfortably stay within one language throughout their entire workflow and still feel computationally efficient.
I am currently a data reporting analyst who primarily works with subscriber data. My biggest concerns typically involve the transformation of raw transactional data into actionable insights through well-defined metrics and strong visualizations. With that in mind, I decided to test the capabilities of Saturn Cloud using a dataset of over 110 million unique user events on a multi-category e-commerce website. The data includes monthly tables covering October 2019 through January 2020. This experiment uses the October 2019 dataset for simplicity, but additional months can be appended along the same lines demonstrated here.
I begin in the Saturn Cloud interface, where I configure and launch a new Jupyter Lab instance. You can assign a name, disk space, and cluster size, among other controls like auto-shutoff and choice of image version. I chose 40 GB of disk space on a 2XLarge cluster with 8 cores and 64 GB RAM; users can even deploy the largest cluster, a 16XLarge that comprises 64 cores and 512 GB RAM. A pricing estimate is listed below the configuration options so you know exactly what costs you may incur. I then launched the cluster and attached Dask to it in another tab of the Saturn Cloud interface. Just like that, I have a cloud instance of Jupyter Lab with my chosen specifications, already equipped to take advantage of Dask.
I then prepare the materials I will need for my data exploration. It is very simple to use the Jupyter Lab interface to upload files to the cloud instance. If a user requires greater flexibility, it is easy to access the Python console within the Lab in order to use command line functions. I cannot resist testing some of the speed I have gained by using Dask, so I decide to time the simple act of loading the data into a Dask dataframe. This dataset is very large, so I place it into an S3 bucket and then load it directly into Dask from the S3 URL.
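It is immediately apparent that I have gained a great amount of computational speed by using Saturn Cloud’s built-in Dask implementation, and we have not even gotten to the more complicated aggregations and analyses! Take note that Pandas takes well over a minute to load the data on average, as opposed to Dask’s 124 millisecond load time. Part of that gap reflects Dask’s lazy evaluation: its read call returns almost immediately because it only sets up the work to be done, and the bulk of the file is read later, when a computation is triggered.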
It is always fruitful to begin data work with exploratory data analysis (EDA); I perform a series of statistical aggregations and calculations and start to build a better understanding of the e-commerce data. I start with a deep dive into customer behavior as well as some of the attributes of the product brands on the e-commerce website. Some simple mean calculations show that users purchase a little over 2 items on average and that the average price of a purchased item is around $309.56.
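The two averages can be sketched as follows. This is one plausible reading of the calculation, not the notebook's code: the column names `event_type`, `user_id`, and `price`, the `"purchase"` label, and the function name are all assumptions, and the sample rows are made up. The sketch is written against pandas; a Dask dataframe offers the same filter, groupby, and mean calls, but they return lazy results that need `.compute()`.

```python
import pandas as pd


def purchase_summary(events):
    """Summarize purchase behavior from a table of user events.

    Assumes columns `event_type`, `user_id`, and `price` (assumed names).
    """
    purchases = events[events["event_type"] == "purchase"]
    # Items bought per user, counting only users with at least one purchase.
    items_per_buyer = purchases.groupby("user_id").size()
    return {
        "mean_items_per_buyer": float(items_per_buyer.mean()),
        "mean_purchase_price": float(purchases["price"].mean()),
    }


# Tiny made-up sample, only to show the expected shape of the input.
sample = pd.DataFrame(
    {
        "event_type": ["view", "purchase", "purchase", "purchase", "purchase"],
        "user_id": [1, 1, 1, 1, 2],
        "price": [10.0, 100.0, 200.0, 300.0, 400.0],
    }
)
print(purchase_summary(sample))
```

Note that averaging over buyers only differs from averaging over all visitors; which denominator is intended changes the number considerably.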
The most popular product categories in terms of views are smartphones, clocks, laptops, televisions, and headphones. Smartphones account for an overwhelming majority of product views, while the other leading categories all have similar view counts. Interestingly, this order shifts slightly when examining purchase patterns instead of view statistics. Smartphones remain dominant, but the ranking behind them changes: headphones, televisions, clocks, and then washers make up the top purchase categories. Headphones may be the most popular accessory to buy with the website’s popular smartphones. Laptops drop out of the top categories list completely; despite their high view counts, they may be unappealing on this website compared to smartphones.
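A ranking like this can be produced with a filter and a frequency count. The sketch below is illustrative only: the column names `event_type` and `category_code`, the event labels, and the function name are assumptions, and the sample rows are invented.

```python
import pandas as pd


def top_categories(events, event_type, n=5):
    """Return the n most frequent categories for one event type."""
    subset = events[events["event_type"] == event_type]
    # value_counts ignores missing category codes by default.
    return subset["category_code"].value_counts().nlargest(n)


# Invented sample: three categories viewed, two purchased.
sample = pd.DataFrame(
    {
        "event_type": ["view"] * 6 + ["purchase"] * 3,
        "category_code": [
            "electronics.smartphone",
            "electronics.smartphone",
            "electronics.smartphone",
            "electronics.clocks",
            "electronics.clocks",
            "computers.notebook",
            "electronics.smartphone",
            "electronics.smartphone",
            "electronics.audio.headphone",
        ],
    }
)
print(top_categories(sample, "view", n=3))
print(top_categories(sample, "purchase", n=3))
```

Running the same function once per event type makes it easy to compare the two rankings side by side.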
Products like clocks, which have far more views than purchases, may be desirable enough to merit a curious look without being compelling enough to buy. In the case of clocks, it may be related to the fact that clocks are now vanity items, as they have largely been replaced by smartphones, although smartwatches are an outlier in this case. Upon further examination, I find that clock prices are about $294.45 on average, which is lower than the average price of a purchased item. This suggests that price may not be what is inhibiting clock sales. There may be more nuanced factors at play, such as the selection and appeal of the clocks themselves or a market orientation that the e-commerce website has taken towards gadgets like smartphones rather than physical clocks.
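One way to put views, purchases, and price side by side per category is sketched below. The purchases-per-view ratio is an illustrative addition rather than a metric reported above, and the column names (`category_code`, `event_type`, `price`), event labels, and function name are assumptions. The sketch uses pandas; with Dask, the grouped counts are small enough to compute first and reshape in pandas.

```python
import pandas as pd


def category_profile(events):
    """Per-category view count, purchase count, mean price, and purchases per view."""
    counts = (
        events.groupby(["category_code", "event_type"])
        .size()
        .unstack(fill_value=0)
        # Guarantee both columns exist even if one event type is absent.
        .reindex(columns=["view", "purchase"], fill_value=0)
    )
    profile = counts.rename(columns={"view": "views", "purchase": "purchases"})
    # Mean price over all events in the category (an assumed definition).
    profile["mean_price"] = events.groupby("category_code")["price"].mean()
    # A category with zero views yields inf or NaN here rather than an error.
    profile["purchases_per_view"] = profile["purchases"] / profile["views"]
    return profile


# Invented sample: clocks are viewed often but rarely bought.
sample = pd.DataFrame(
    {
        "category_code": ["clocks"] * 5 + ["smartphone"] * 3,
        "event_type": ["view"] * 4 + ["purchase"] + ["view", "view", "purchase"],
        "price": [100.0] * 5 + [300.0] * 3,
    }
)
print(category_profile(sample))
```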
I drill down further into product analysis by examining behavior for the popular product brands on the website. While it may seem obvious that the more views a product has, the more likely it is to be purchased, it is always a good idea to confirm with data insights. There does seem to be a roughly linear relationship between views and purchases across brands, as is confirmed by the following plots. The five brands with the most views (Samsung, Apple, etc.) are the same five brands that dominate product sales.
This may indicate that either these brands are strong predictors of a consumer’s interest in a product or that products with higher views have a greater chance of being purchased; perhaps both are true. It also makes sense that these are the top brands, as smartphones are the most popular product category on the website. It is notable that while views are distributed somewhat more evenly among brands, there is a larger disparity when it comes to product purchases. Samsung and Apple take a larger share of the purchase transactions, likely due to their strong brand positioning in the smartphone market. Users may be viewing alternate brands to properly vet their options but may end up returning to reliable and well-known smartphone producers.
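The comparison of how views and purchases are distributed across brands can be sketched as a table of shares. This is a hedged illustration, not the notebook's code: the column names `brand` and `event_type`, the event labels, and the function name are assumptions, and the sample data is invented.

```python
import pandas as pd


def brand_shares(events):
    """Share of all views and of all purchases attributed to each brand."""
    is_view = events["event_type"] == "view"
    is_purchase = events["event_type"] == "purchase"
    view_share = events.loc[is_view, "brand"].value_counts(normalize=True)
    purchase_share = events.loc[is_purchase, "brand"].value_counts(normalize=True)
    shares = pd.concat(
        [view_share, purchase_share],
        axis=1,
        keys=["view_share", "purchase_share"],
    ).fillna(0.0)  # brands that were viewed but never purchased get a 0 share
    return shares.sort_values("purchase_share", ascending=False)


# Invented sample with three placeholder brands.
sample = pd.DataFrame(
    {
        "brand": ["A", "A", "B", "B", "C", "A", "A", "A", "B"],
        "event_type": ["view"] * 5 + ["purchase"] * 4,
    }
)
shares = brand_shares(sample)
print(shares)
# Pearson correlation between the two columns as a rough check on linearity.
print(shares["view_share"].corr(shares["purchase_share"]))
```

Comparing the two share columns shows whether purchases are more concentrated in the leading brands than views are.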
Lastly, the transaction timestamps can be used to understand some of the time-based components of customer interactions. Pandas can easily be used to convert these timestamps into weekdays and hours to gather general insights. After some simple aggregations, I learn that Tuesday, Wednesday, and Thursday are the most popular days for all user transactions (filtering by view or purchase transactions yields the same three days although in slightly altered orders).
The most popular hours for user activity range from 3 to 5 PM, with 4 PM coming out slightly ahead. There is a gradual rise in transactions throughout the morning with a dip during lunch hours, culminating in the peak during the mid-afternoon. It appears users are primarily using the website in the middle of the week towards the end of the workday. Users might be burning time at the end of work by shopping online; these insights can greatly help the business understand the profiles and habits of their primary clientele.
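The weekday and hour breakdown can be sketched with the pandas datetime accessor. The column name `event_time` and the function name are assumptions, the timestamps are assumed to parse cleanly and are treated as UTC, and the sample rows are invented. Hours come out in UTC here, so a conversion to the shoppers' local time zone would be needed before reading them as times of day.

```python
import pandas as pd


def activity_by_time(events):
    """Count events per weekday name and per hour of day."""
    timestamps = pd.to_datetime(events["event_time"], utc=True)
    # Most active weekday first.
    by_day = timestamps.dt.day_name().value_counts()
    # Hours in chronological order, 0 through 23 where present.
    by_hour = timestamps.dt.hour.value_counts().sort_index()
    return by_day, by_hour


# Invented sample: two events on a Tuesday afternoon, one on a Wednesday morning.
sample = pd.DataFrame(
    {
        "event_time": [
            "2019-10-01 16:05:00",
            "2019-10-01 16:30:00",
            "2019-10-02 09:00:00",
        ]
    }
)
by_day, by_hour = activity_by_time(sample)
print(by_day)
print(by_hour)
```

Filtering the events to views or purchases before calling the function gives the per-event-type breakdowns.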
Our exploration of this e-commerce dataset demonstrates the flexibility of working in a Saturn Cloud environment. It is easy to maintain consistent, readable code by staying in one language, Python, for the entirety of the notebook. This is in spite of the fact that the dataset we are using is quite large (5 GB+). Throughout the experiment, we are able to rely on common APIs that are familiar to many data scientists, such as pandas for data manipulation and matplotlib/seaborn for visualization. The code here may appear straightforward, but that is mostly thanks to Saturn Cloud’s highly accessible environment for large-scale data science work. Typically, exploring datasets this large would require much more complicated processing pipelines, which can now be swept aside thanks to the convenience of Saturn Cloud’s preset infrastructure.
Saturn Cloud’s collaboration features make it straightforward to publish my notebook and share my work with collaborators if the need arises. In fact, the notebook I used is linked below; feel free to look through the code for more thorough documentation of the examples we explored here.
While my professional expertise largely lies in the realm of data analysis, I would be remiss if I did not explore some of the powerful machine learning capabilities that Saturn Cloud can provide. Keep an eye out for the second part of this series where we will walk through a machine learning workflow that uses this e-commerce dataset and many of the insights we derived here.
View the published Jupyter Notebook here.