Guest post: Vin Vashishta
Captain Obvious statement: deep learning models are becoming more complex. So how do we manage that complexity? Python has significant limitations when it comes to parallelizing training and inference, so finding performance gains takes more than a small toolset.
I’ve seen what happens when a team of machine learning data scientists runs complex training, either locally or distributed on AWS, without a stack to manage resources: a lot of single threads running on expensive hardware. Training takes longer than it should because each iteration isn’t making full use of the available memory and processing power.
There is a lot of productivity to be gained from solving this problem. I talked about how tools like Dask can manage parallelization on CPUs in my last post. However, there’s more to be gained by switching complex models to GPUs.
GPUs provide a real performance boost to deep learning model training and inference. There are many machine learning data scientists who still train and deploy models relying on CPUs only. We’ve all read about the improvements that come from GPUs. However, implementation can be a barrier to adoption. How much will GPUs really improve training times? They’re expensive. High cost and learning curve make justifying GPUs difficult.
In this post, I’m going to discuss using a simple stack that provides GPU support without straying too far from the familiar Python machine learning libraries. Dask, in some cases with the help of RAPIDS, can accomplish this goal. Saturn Cloud manages deployment without additional coding or DevOps work. Working as a stack, these tools reduce the number of hours GPU instances are needed for training and the level of effort required to manage those clusters.
I’ve done a detailed write-up of Dask for CPUs in my last post. This one focuses on implementing Dask with GPUs.
What Can You Gain From GPUs?
Like many topics around optimizing deep learning, explanations become over-complicated quickly. However, it’s important to be able to explain why GPUs provide value. Getting buy-in requires more than a vague slide, so here’s a simple explanation of why there’s a performance boost.
There is a foundational difference between how a GPU works and how a CPU works. You’d expect the processing difference, but memory usage/architecture is different as well. Both sides play into the benefits of GPUs. They also explain why the Dask, RAPIDS, Saturn Cloud stack provides value.
GPUs handle highly parallel math better than CPUs do. The first benefit comes from the number of cores on a GPU. There are thousands. While those cores don’t handle a wide range of processing tasks, they are built to run the same math operations across large amounts of data at once. That capability works in tandem with the CPU. The repetitive math can be offloaded to the GPU, which frees up the CPU to handle the more diverse instructions that go along with deep learning training and inference.
Memory is the next difference. Data moves from memory to a GPU’s cores at a much higher rate than it moves from system memory to a CPU. GPU memory is dedicated, sits on the same card, and is optimized for being accessed repeatedly. GPUs were not originally designed for deep learning, but their architecture suits it well.
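To make "repetitive math" concrete, here is a minimal sketch of a single dense layer written two ways. It runs on the CPU with NumPy; the function names, shapes, and random inputs are my own illustrative assumptions. The point is only that every output value is an independent dot product, which is the kind of work that can be spread across many cores.

```python
import numpy as np


def dense_layer_loop(x, w, b):
    """Compute one output at a time; no dot product depends on another."""
    outputs = []
    for j in range(w.shape[1]):
        outputs.append(float(np.dot(x, w[:, j]) + b[j]))
    return np.array(outputs)


def dense_layer_vectorized(x, w, b):
    """Same math as a single array expression over the whole weight matrix."""
    return x @ w + b


# Made-up inputs, for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(size=8)        # one input sample with 8 features
w = rng.normal(size=(8, 4))   # weights for 4 output units
b = rng.normal(size=4)        # one bias per output unit

loop_out = dense_layer_loop(x, w, b)
vec_out = dense_layer_vectorized(x, w, b)
print(np.allclose(loop_out, vec_out))
```

Both functions return the same values. The vectorized form is the one that maps onto a GPU: array libraries with a NumPy-like interface, such as CuPy, can evaluate an expression like this on the device instead of looping on the CPU.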
Dask and Saturn Cloud
Parallel operations are necessary to realize the gains from GPU instances or the GPU in your local environment. This is where Spark typically enters the conversation. Spark isn’t perfect for every team or use case:
There’s a learning curve with Spark for deep learning.
Debugging failed jobs requires Java knowledge.
Existing models require recoding to use Spark.
Spark is a great tool, but it can be DevOps heavy. Dask’s DataFrames work almost the same way as Pandas DataFrames, so there’s a lower learning curve and little recoding for existing models.
Companies like Walmart and Capital One use Dask for GPU optimization because it provides their teams with a simple way to train their models. They’ve added RAPIDS to handle the conversion from Pandas or NumPy so the same work can run on GPUs. Both Dask and RAPIDS use APIs that are similar enough to the originals that the learning curve is minimal.
Dask also works alongside PyTorch or TensorFlow to manage distributed deep learning with GPUs. For TensorFlow, Dask can handle data preparation and cleaning as well as setting up the TensorFlow network. The larger the dataset, the greater the gains from Dask.
While Dask works with either GPU or CPU, the greatest compute gains come from distributed GPU environments for complex deep learning model training. Training time on some models has been reduced by over 90% using Dask on GPUs. That optimization allows for more iterations or a faster project delivery time.
Saturn Cloud comes in to manage the distributed infrastructure. A lot of money can be wasted on GPU instances that are unnecessarily spun up or poorly managed. Saturn Cloud manages the cluster. There’s no additional code required, and the DevOps piece is handled without taking up additional time.
AWS P3/G4 GPU instances are expensive. A data scientist’s time is more expensive. With Saturn Cloud, the GPU cluster run time is optimized with no additional labor cost. The Dask and Saturn Cloud stack makes justifying GPU usage for deep learning model training a lot easier.
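The cost argument can be put into rough numbers. The sketch below is back-of-the-envelope arithmetic, and every figure in it is a hypothetical placeholder rather than real AWS pricing or a real salary: it assumes a data scientist is blocked for the full length of a training run and that the GPU run is ten times shorter than the CPU run.

```python
def training_run_cost(run_hours, instance_rate, labor_rate):
    """Cost of one training run: instance time plus the time spent waiting on it.

    All rates are per hour. Assumes the data scientist is blocked for the
    whole run, which is a simplification.
    """
    instance_cost = run_hours * instance_rate
    labor_cost = run_hours * labor_rate
    return instance_cost + labor_cost


# Hypothetical placeholder figures, not real prices.
LABOR_RATE = 60.0        # cost of one hour of a data scientist's time
CPU_INSTANCE_RATE = 1.0  # cost of one CPU instance hour
GPU_INSTANCE_RATE = 4.0  # cost of one GPU instance hour

cpu_run = training_run_cost(run_hours=10.0, instance_rate=CPU_INSTANCE_RATE, labor_rate=LABOR_RATE)
gpu_run = training_run_cost(run_hours=1.0, instance_rate=GPU_INSTANCE_RATE, labor_rate=LABOR_RATE)

print(cpu_run)  # 610.0
print(gpu_run)  # 64.0
```

With these made-up inputs the GPU instance costs four times as much per hour, yet the run as a whole is far cheaper because the waiting time dominates. Plug in your own rates and run times before using this in a budget conversation.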