Tutorials

Step-by-step walkthroughs that demonstrate how to use Saturn Cloud.

Overview

The tutorials in this section are step-by-step guides to common tasks in Saturn Cloud. Each is accompanied by sample code in https://github.com/saturncloud/examples.


Loading Data with Dask

Our large dataset for this notebook will be NYC taxi data from all of 2019. Rather than load the data with pandas’ pd.read_csv, we will use Dask’s dd.read_csv method. We’ll also look at how to load messier data.

Hyperparameter Tuning with Scikit-Learn and Dask

Hyperparameter searching is an example of a compute-bound workload. The data fits comfortably into the memory of the Jupyter server, but the grid search still takes some time to execute. Let’s take this workflow and parallelize it with Dask!
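
One common way to do this, shown here only as a minimal sketch, is to run scikit-learn's GridSearchCV under joblib's Dask backend. The synthetic dataset, the SVC estimator, and the parameter grid are assumptions chosen for illustration, and the local in-process Client stands in for a real Dask cluster.

```python
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Local in-process client as a stand-in for a Dask cluster (assumption).
client = Client(processes=False)

# Small synthetic dataset that easily fits in memory.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical grid; each parameter combination and fold is an independent fit.
param_grid = {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=3)

# Route scikit-learn's internal joblib parallelism to the Dask workers.
with joblib.parallel_backend("dask"):
    search.fit(X, y)

best_params = search.best_params_
client.close()
```

Because the individual fits do not depend on each other, they can be spread across workers without changing the scikit-learn code itself.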

Scheduled Data Pipelines

Scheduled deployments run on a cron schedule. They are typically (but do not have to be) Prefect flows. They can also be paired with a Dask cluster.

Training with Large Datasets

Use Dask-ML to train a linear model that predicts tip_fraction, then save the model for later use.

Fault-Tolerant Data Pipelines with Prefect Cloud

Prefect is an open source workflow orchestration framework written in Python. It can integrate with Dask to speed up data processing pipelines by taking advantage of parallelism. Prefect Cloud is a high-availability, fault-tolerant hosted offering that handles orchestration of these pipelines.

Predict Over Large Datasets

We can’t load a large dataset into a single pandas DataFrame. To predict over one, we could split it into batches and collect the results, but it’s easier to use the same pandas interface with a dask.DataFrame.

XGBoost Training with Dask

XGBoost is an open source library which implements a custom gradient-boosted decision tree (GBDT) algorithm. It has built-in distributed training which can be used to decrease training time or to train on more data. This article describes distributed XGBoost training with Dask.