Hyperparameter searching is an example of a compute-bound workload. The data fits comfortably into the memory of the Jupyter server, but the grid search still takes a long time to execute. Let’s take this workflow and parallelize it with Dask!
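As a minimal sketch of what that parallelization can look like, the joblib "dask" backend ships each grid-search fit to Dask workers. This assumes a Dask cluster is reachable; here a local cluster stands in for a remote one, and the model and grid are illustrative only.

```python
# Sketch: parallelizing a scikit-learn grid search with Dask via joblib.
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

client = Client()  # local Dask cluster standing in for a remote one

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
param_grid = {"C": [0.1, 1.0, 10.0], "solver": ["lbfgs", "liblinear"]}
search = GridSearchCV(LogisticRegression(max_iter=500), param_grid, cv=3)

# Inside this context, each of the 6 x 3 fits runs on a cluster worker.
with joblib.parallel_backend("dask"):
    search.fit(X, y)

print(search.best_params_)
client.close()
```

Only the backend context changes; the scikit-learn code itself is untouched.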
The tutorials in this section are intended to be used as step-by-step guides to doing things in Saturn Cloud. They are accompanied by the sample code in https://github.com/saturncloud/examples.
LightGBM is an open source library which implements a custom gradient-boosted decision tree (GBDT) algorithm. It has built-in distributed training which can be used to decrease training time or to train on more data. This article describes distributed LightGBM training with Dask.
Our large dataset for this notebook will be NYC taxi data from all of 2019. Rather than load the data with pandas’ pd.read_csv, we will use Dask’s dd.read_csv method. We’ll also look at how to load messier data.
Use Dask ML to train a linear model to predict tip_fraction, and save the model for later use.
Prefect is an open source workflow orchestration framework written in Python. It can integrate with Dask to speed up data processing pipelines by taking advantage of parallelism. Prefect Cloud is a high-availability, fault-tolerant hosted offering that handles orchestration of these pipelines.
We can’t load a large dataset into a single pandas DataFrame. If we needed to predict over a large dataset, we could batch it and collect the results, but it’s easier to use the same pandas interface with a dask.DataFrame.
Scheduled deployments run on a cron schedule. They are typically (but do not have to be) Prefect flows. They can also be paired with a Dask cluster.
XGBoost is an open source library which implements a custom gradient-boosted decision tree (GBDT) algorithm. It has built-in distributed training which can be used to decrease training time or to train on more data. This article describes distributed XGBoost training with Dask.