Saturn Cloud Hosted Has Launched: GPU Data Science for Everyone!

October 16, 2020

GPU computing is the future of data science. Packages such as RAPIDS, TensorFlow, and PyTorch enable lightning-fast processing for all facets of data science: data cleaning, feature engineering, machine learning, deep learning, and more. The challenge with taking advantage of GPU computing is that it requires either an investment in on-premise hardware or an infrastructure build-out to use GPUs in the cloud.

Today, Saturn Cloud is announcing the launch of Saturn Cloud Hosted, a cloud-hosted solution for end-to-end GPU data science that fits the needs of startups, small teams, students, researchers, and tinkering data scientists.

TL;DR: Saturn Cloud Hosted is live and allows anyone to sign up and launch machines for GPU-enabled data science with the click of a button. You can get started with a free trial today!

Within seconds of signing up, you can spin up a JupyterLab instance with pre-configured environments for the most popular GPU data science packages, backed by an NVIDIA T4 or V100 GPU. When your data outgrows the memory of a single GPU, you can easily scale out to a cluster of multiple GPU machines. Hundreds even! Saturn Cloud Hosted takes care of all the hardware provisioning, environment setup, and cluster communication challenges so data scientists can get straight to work.

Change the world with GPUs on Saturn Cloud Hosted

The vision of Saturn Cloud Hosted with NVIDIA GPUs is to bring the world’s fastest data science and machine learning capabilities to everyone, regardless of budget, resources, or time. GPU-accelerated tooling used to be a luxury, but falling prices, cloud availability, and the infrastructure provided by Saturn Cloud now make it a practical tool for everyday users.

Faster random forest on GPUs

Let’s explore implementations of distributed random forest training on clusters of CPU machines using Apache Spark and compare that to the performance of training on clusters of GPU machines using RAPIDS and Dask.

TL;DR: We trained a random forest model using 300 million instances: Spark took 37 minutes on a 20-node CPU cluster, whereas RAPIDS took 1 second on a 20-node GPU cluster. That’s over 2,000x faster with GPUs.

You can read about this benchmark in more depth here. We trained a random forest model on 300,700,143 instances of NYC taxi data on Spark (CPU) and RAPIDS (GPU) clusters. Both clusters had 20 worker nodes and approximately the same hourly price. Here are the results for each portion of the workflow.

https://gist.github.com/rikturr/b881bef327716e0140937dc13b237916

That’s 37 minutes with Spark vs. 1 second for RAPIDS!

GPUs crushed it, and that’s why you’re going to be so thrilled to have them now. Think about how much faster you can iterate and improve your model when you don’t have to wait over 30 minutes for a single fit. Once you add in hyperparameter tuning or testing different models, those waits can easily add up to hours or days.
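
To make the arithmetic behind those numbers concrete, here is a small sketch in plain Python. The 37-minute and 1-second fit times come from the benchmark above; the count of 50 fits for a tuning run is a hypothetical figure chosen only for illustration.

```python
def speedup(slow_seconds, fast_seconds):
    """Return how many times faster the fast run is than the slow run."""
    return slow_seconds / fast_seconds


def total_hours(seconds_per_fit, n_fits):
    """Return the wall-clock hours needed for n_fits sequential fits."""
    return seconds_per_fit * n_fits / 3600


spark_fit = 37 * 60  # 37 minutes on the 20-node CPU cluster, in seconds
rapids_fit = 1       # 1 second on the 20-node GPU cluster

print(f"Speedup: {speedup(spark_fit, rapids_fit):.0f}x")  # Speedup: 2220x

n_fits = 50  # hypothetical number of fits in a hyperparameter search
print(f"Spark:  {total_hours(spark_fit, n_fits):.1f} hours")   # Spark:  30.8 hours
print(f"RAPIDS: {rapids_fit * n_fits} seconds")                # RAPIDS: 50 seconds
```

Under that assumption, a search that would tie up the CPU cluster for more than a day finishes in under a minute on the GPU cluster.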

Need to see it to believe it? You can find the notebooks here! Or continue reading to see how to set up a project in Saturn Cloud Hosted and run it for yourself.

Accelerating data science with GPUs on Saturn Cloud Hosted

It’s easy to get started with GPUs on Saturn Cloud Hosted, and we’ll walk through the above random forest model training exercise using a sample of the data. The example uses NYC Taxi data to train a random forest model that classifies rides into “high tip” or “low tip” rides. The notebooks come pre-loaded into an “examples-gpu” project when you create an account, or you can grab the notebooks yourself here.

We start by loading a CSV file into a dataframe, but since we’re using the RAPIDS cudf package, the dataframe gets loaded into GPU memory:

https://gist.github.com/rikturr/d811a321efac5d8a8b28110c81927e73

Then, after some feature processing, we train our random forest model:

https://gist.github.com/rikturr/b9694fe7a9f0dca8c494ca9e9bae5909

The great part about Saturn Cloud Hosted is that this code “just works”. The environment is set up for you, the GPU is hooked up properly, and now you can focus on training a model.
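
As a rough sketch of what the load-and-train steps above might look like (this is not the notebook's exact code), the function below reads a CSV with `cudf` and fits a `cuml` random forest. The column names, the 20% tip cut-off for a “high tip” ride, and the hyperparameters are all assumptions made for illustration, and running it requires a RAPIDS environment with an NVIDIA GPU.

```python
# Hypothetical column names and threshold, chosen for illustration only.
FEATURES = ["passenger_count", "trip_distance", "fare_amount"]
HIGH_TIP_FRACTION = 0.2  # assumed cut-off: tip above 20% of the fare


def train_high_tip_model(csv_path):
    """Load taxi rides into GPU memory and fit a random forest classifier."""
    # Imported lazily so this file can be imported without RAPIDS installed.
    import cudf
    from cuml.ensemble import RandomForestClassifier

    taxi = cudf.read_csv(csv_path)  # the dataframe lives in GPU memory
    taxi = taxi.dropna(subset=FEATURES + ["tip_amount"])
    taxi = taxi[taxi["fare_amount"] > 0]  # avoid dividing by zero below

    # Cast to float32 features and int32 class labels for cuML.
    X = taxi[FEATURES].astype("float32")
    tip_fraction = taxi["tip_amount"] / taxi["fare_amount"]
    y = (tip_fraction > HIGH_TIP_FRACTION).astype("int32")

    model = RandomForestClassifier(n_estimators=100, max_depth=10)
    model.fit(X, y)
    return model
```

Calling `train_high_tip_model("taxi_sample.csv")` (a hypothetical file name) returns the fitted model, and `model.predict(...)` then runs on the GPU as well.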

If your dataset is large, using a single GPU may not be enough, because the dataset and subsequent processing must fit into GPU memory. That’s where a Dask cluster on Saturn Cloud Hosted comes in! You can define a cluster from the UI or from within a notebook, making sure to choose a GPU size for the Dask workers:

https://gist.github.com/rikturr/8b5bcead7c4cbd6674c48b28fbfa2c77
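
To get a feel for when a single GPU stops being enough, a back-of-the-envelope estimate can help. The sketch below assumes 4-byte (float32) values, a 16 GB GPU such as the T4, and a rule of thumb that only about half of GPU memory should hold raw data so that processing has room to work. The 10-column width and the 50% headroom are assumptions, not figures from the benchmark.

```python
import math


def estimate_gpu_workers(n_rows, n_columns, bytes_per_value=4,
                         gpu_memory_gb=16.0, usable_fraction=0.5):
    """Estimate raw data size in GB and how many GPU workers could hold it.

    usable_fraction is an assumed safety margin that leaves GPU memory
    free for intermediate results during processing and training.
    """
    data_gb = n_rows * n_columns * bytes_per_value / 1e9
    usable_gb = gpu_memory_gb * usable_fraction
    workers = max(1, math.ceil(data_gb / usable_gb))
    return data_gb, workers


# 300,700,143 rows is the benchmark size; 10 columns is a hypothetical width.
data_gb, workers = estimate_gpu_workers(300_700_143, 10)
print(f"{data_gb:.2f} GB of raw data -> at least {workers} GPU workers")
# 12.03 GB of raw data -> at least 2 GPU workers
```

This is only a lower bound on cluster size: it ignores the CSV parsing overhead and any extra columns created during feature processing.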

Then it’s a matter of importing the proper RAPIDS modules and sub-modules for distributed GPU processing:

https://gist.github.com/rikturr/d0a20653ad251d52c5227a3d33ce2128
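
For comparison with the single-GPU version, here is a hedged sketch of how the distributed variant might look using `dask_cudf` and `cuml.dask`. It takes an already-connected Dask `client` as an argument rather than creating the Saturn Cloud cluster itself, and the column names, tip cut-off, and hyperparameters are again assumptions for illustration rather than the notebook's actual code.

```python
# Hypothetical column names and threshold, chosen for illustration only.
FEATURES = ["passenger_count", "trip_distance", "fare_amount"]
HIGH_TIP_FRACTION = 0.2  # assumed cut-off: tip above 20% of the fare


def train_high_tip_model_distributed(client, csv_glob):
    """Fit a random forest across the GPU workers attached to `client`."""
    # Imported lazily so this file can be imported without RAPIDS installed.
    import dask_cudf
    from dask.distributed import wait
    from cuml.dask.ensemble import RandomForestClassifier

    # Each partition becomes a cudf dataframe on one of the GPU workers.
    taxi = dask_cudf.read_csv(csv_glob)
    taxi = taxi.dropna(subset=FEATURES + ["tip_amount"])
    taxi = taxi[taxi["fare_amount"] > 0]  # avoid dividing by zero below

    X = taxi[FEATURES].astype("float32")
    tip_fraction = taxi["tip_amount"] / taxi["fare_amount"]
    y = (tip_fraction > HIGH_TIP_FRACTION).astype("int32")

    # Materialize the partitions in worker GPU memory before training.
    X, y = client.persist([X, y])
    wait([X, y])

    model = RandomForestClassifier(n_estimators=100, max_depth=10, client=client)
    model.fit(X, y)
    return model
```

The modeling code barely changes from the single-GPU case: the main differences are the `dask_cudf` reader, the explicit persist step, and the `cuml.dask` import path.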

Do you want an easy way to get super-fast GPU data science?

Yes! You can get going on a GPU cluster in seconds with Saturn Cloud Hosted. Saturn Cloud handles all the tooling infrastructure, security, and deployment headaches to get you up and running with RAPIDS right away. Click here for a free trial of Saturn Cloud Hosted!

If you are part of a company that requires a virtual private cloud solution, Saturn Cloud also offers an Enterprise solution that you can find here.