Guest post: Vin Vashishta
Data wrangling. We all do it. In Python, we mostly do it with Pandas. It’s the go-to because it makes data easier to use, analyze, and transform. We rarely get data that is clean or formatted for our purposes. If you’re not familiar with Pandas, here’s a good article to kickstart your learning.
In a perfect world, we wouldn’t need data loading, wrangling, or memory management to be part of our daily work, but that’s not what most of us get. Access to data and machine learning engineers is growing. Even so, using local or cloud resources without relying on additional staff and heavy infrastructure is still part of the data scientist’s workload.
Under the hood, Pandas creates objects that are stored in memory for quick access and manipulation. Nobody likes waiting for operations to complete, and keeping data in memory is how Pandas avoids that. However, most data scientists don’t have a strong understanding of what’s happening under the hood. A lot of those hidden components have a large impact on performance and memory usage.
That’s where best practices for Pandas become useful. We’re often working with data locally, and even systems with a lot of memory have their limits. Reducing the number of projects or proofs of concept that require cloud resources makes sense.
Most projects will require cloud resources at some point, either in later stages of development or maintenance. We work with terabyte+ size datasets. It’s unavoidable. Dask combined with Pandas makes using cloud resources more efficient. Dask scales Pandas with a simple to use framework that is integrated with both Python and Pandas so the learning curve is minimal. Saturn Cloud is an Amazon Marketplace platform that makes the data engineering side of Dask and Pandas easier.
Pandas Best Practices
Few dive into how Pandas allocates memory because it’s usually presented in an ugly, overly granular way. I’ll be diving deep but avoiding information overload.
Know your Python implementation. Are you using Anaconda? Many of us are. It’s big from an installation standpoint but there are a lot of reasons why it’s popular. Anaconda is a distribution built on CPython, the reference (and effectively default) implementation. A plain CPython install has a smaller installation footprint with many of the same optimization benefits.
PyPy is one to avoid because it’s a memory hog with imperfect handling of the data science Python stack. I’ve just made a lot of people very angry with that statement. However, it’s important to understand that the implementation can be problematic with regard to its interaction with NumPy and, by extension, Pandas.
I won’t dive too deeply into others but here’s a good overview of each Python implementation if you’re looking to really dive into this topic.
The easiest best practice is to know your data. Prune before loading when possible. That isn’t always easy, so look at importing a very small sample of the larger dataset first using nrows. After reviewing the sample, prune rows with skiprows and fields with usecols. Get to know your import parameters.
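To make those import parameters concrete, here is a minimal sketch of the sample-then-prune approach. The CSV text and column names are made up for illustration and stand in for a much larger file on disk; nrows, skiprows, and usecols are standard read_csv parameters.

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for a much larger file on disk.
csv_text = (
    "order_id,region,product_type,units,notes\n"
    "1,East,Widget,3,ok\n"
    "2,West,Gadget,5,rush\n"
    "3,East,Widget,2,ok\n"
    "4,North,Gizmo,7,ok\n"
)

# Step 1: read only the first two data rows to learn the columns and types.
sample = pd.read_csv(io.StringIO(csv_text), nrows=2)
print(sample.columns.tolist())

# Step 2: reload, keeping only the columns the analysis needs.
orders = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["order_id", "region", "units"],
)
print(orders.shape)

# skiprows takes file line numbers (0 is the header), so this drops
# the first two data rows and keeps the header.
later_orders = pd.read_csv(io.StringIO(csv_text), skiprows=range(1, 3))
print(later_orders["order_id"].tolist())
```

With a real file you would pass its path instead of the StringIO object. The point is that the full dataset never has to sit in memory just to find out which columns matter.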
You can also specify the data types to use for each column during the loading process. This is typically where things get too granular and I’m going to avoid that.
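For readers who do want to see it, a small sketch of setting column types at load time looks like this. The data and column names are invented for the example, and the chosen types assume you already know the range of values in each column.

```python
import io

import pandas as pd

# Hypothetical CSV contents; the column names are illustrative only.
csv_text = (
    "order_id,region,units,price\n"
    "1,East,3,9.99\n"
    "2,West,5,14.50\n"
    "3,East,2,9.99\n"
)

# The dtype mapping tells read_csv which type to use for each column,
# so Pandas never builds the larger default types in the first place.
orders = pd.read_csv(
    io.StringIO(csv_text),
    dtype={
        "order_id": "int32",
        "region": "category",
        "units": "int16",
        "price": "float32",
    },
)
print(orders.dtypes)
```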
The DataFrame is a well-understood concept and Pandas will do a good job of assigning variable types on its own. Numerical data types are stored using NumPy. There are additional data types that are implemented or extended by Pandas. To see which data types are in use, check the dtypes attribute or call info() for a more detailed view. A detailed tutorial about converting data types in Pandas can be found here.
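As a quick illustration, the following sketch inspects a small, made-up DataFrame. It also uses memory_usage(deep=True), which the text above doesn’t mention but which is a handy companion for seeing how many bytes each column takes.

```python
import pandas as pd

# A tiny illustrative DataFrame; the column names are hypothetical.
df = pd.DataFrame(
    {
        "units": [3, 5, 2],
        "price": [9.99, 14.50, 9.99],
        "region": ["East", "West", "East"],
    }
)

# dtypes gives one line per column with the type Pandas inferred.
print(df.dtypes)

# info() adds row counts and non-null counts; memory_usage="deep" asks
# Pandas to measure the real size of string data, not just the pointers.
df.info(memory_usage="deep")

# Per-column memory in bytes (the result also includes the index).
bytes_per_column = df.memory_usage(deep=True)
print(bytes_per_column)
```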
Looking at integer and float values, it’s worthwhile to optimize these when a column contains millions of records or more. Reducing data types to the smallest memory footprint possible (changing int64 to int16, or using smaller unsigned ints for columns that are never negative) drops the memory required to hold the DataFrame. The gains per value are small, but over a large number of rows or columns they add up. It’s always a good idea to weigh the level of effort against the value of the returns.
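One low-effort way to do this is to let Pandas pick the smallest type that fits, using pd.to_numeric with its downcast option. The sketch below uses synthetic columns with invented names; with real data, check the value ranges first, since a value arriving later that doesn’t fit the smaller type will cause trouble.

```python
import numpy as np
import pandas as pd

# Synthetic data: both columns start out as int64.
df = pd.DataFrame(
    {
        "units": np.arange(1_000, dtype="int64") % 500,        # values 0..499
        "temperature": np.arange(1_000, dtype="int64") - 500,  # values -500..499
    }
)
before = df.memory_usage(deep=True).sum()

# downcast="unsigned" picks the smallest unsigned int that holds the data;
# downcast="integer" does the same with signed ints.
df["units"] = pd.to_numeric(df["units"], downcast="unsigned")
df["temperature"] = pd.to_numeric(df["temperature"], downcast="integer")
after = df.memory_usage(deep=True).sum()

print(df.dtypes)
print(before, after)
```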
Changing strings to categories is where the largest gains can be realized. In Pandas, categories use a lot less memory than strings. Fields like product types, job titles, regional locations, etc. are common candidates for the category data type.
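Here is a minimal sketch of that conversion on a synthetic column with only four distinct values. The exact byte counts depend on your Pandas version and string storage, so treat the printed numbers as indicative rather than a benchmark.

```python
import pandas as pd

# A synthetic low-cardinality column: 100,000 rows, four distinct values.
regions = pd.Series(["East", "West", "North", "South"] * 25_000, name="region")

# Memory with the default string storage.
as_strings = regions.memory_usage(deep=True)

# Memory after converting to the category data type, which stores each
# distinct value once plus a small integer code per row.
region_categories = regions.astype("category")
as_category = region_categories.memory_usage(deep=True)

print(as_strings, as_category)
print(region_categories.cat.categories.tolist())
```

The fewer distinct values a column has relative to its length, the bigger the saving. A column of mostly unique strings, such as free-text notes, is not a good candidate.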
What About Bigger Datasets? Dask DataFrames
Dask DataFrames are an extension of Pandas DataFrames with most of the same functionality and syntax. You import the Dask DataFrame instead of the Pandas DataFrame. That makes usage simple and requires little additional learning because the differences between the two are minimal. Dask adds a few pieces of functionality that allow data scientists to leverage distributed/parallel resources.
Dask can run on a single machine and provide memory and compute optimization the same way it does in a distributed environment. A Dask DataFrame is split into partitions, each of which is a regular Pandas DataFrame, and that reduces the overhead of managing the data. That’s all you really need to know from a memory management standpoint.
There are notable gains in processing times when dealing with the NumPy aspects of Pandas DataFrames but Pandas itself is constructed to be (mostly) limited to a single thread. Dask’s biggest advantage is the introduction of multithreading to reduce execution times.
It gets complicated here and a deeper review is not necessary. In the most recent version of Python, the single thread limitation has a good workaround but, again, let’s avoid getting too far into the weeds. Suffice it to say that the processing gains from Dask with Pandas are about to get a lot higher, which is another reason to look at Dask DataFrames.
Where Does Saturn Cloud Fit?
Deploying Dask into a distributed environment like AWS can require a lot of work. Pandas best practices are in the data scientist’s wheelhouse. DevOps really shouldn’t be. Dask provides simple access to using resources more efficiently. Saturn Cloud provides simple access to deploying and managing the distributed environment.
Saturn Cloud is an inexpensive Amazon Marketplace add-on that provides infrastructure management and end-to-end analytics. Just as Dask works under the covers to reduce the level of effort, so does Saturn Cloud. There’s no additional setup required to start using Saturn Cloud.
As you can see, Pandas best practices are technical but straightforward. Exploring the deeper workings of Python optimization is where topics often become more about software development than data scientists need to understand. It’s easy to get lost in the weeds and spend more time optimizing than analyzing data or developing models.
Pandas best practices reduce the need for a high-end development environment. They will make your laptop’s resources stretch farther. As memory and compute become limiting factors, Dask DataFrames are an easy way to stretch those resources further. Dask’s ability to scale in a distributed environment with little additional coding is useful for larger datasets. Saturn Cloud manages the distributed environment, so all the data scientist has to focus on is data science and machine learning.