Connect to Dask from SageMaker
At Saturn Cloud, one of our passions is to help users build their productivity and accelerate their machine learning from whatever working environment they prefer. This poses lots of interesting challenges for us, of course, but we really believe in making the experience of our customers as convenient as possible. Many of us are data scientists ourselves, and have struggled with having great tools that just don’t work for our practice.
As a result, you can use Jupyter Lab in our cloud product, SSH from your IDE into Jupyter Lab, or create and use machine clusters directly from your local IDE, with no Jupyter server required.
This last functionality is brilliant, because it opens up so many possibilities for connecting with powerful Dask resource clusters in so many other tools and workspaces. In this post, I’m going to show you how you can combine Saturn Cloud with AWS Sagemaker to get all the power of Dask clusters in the Sagemaker environment. If you’re a regular Sagemaker user, but want to add Dask parallelism to your workflow, read on!
If you’re not familiar with Dask or cluster computing, here’s a brief overview. Dask allows parallelization of Python code, including across many machines in clusters.
As this diagram illustrates, the pieces in the gray box constitute a machine cluster, and in this example, that’s what will be hosted on Saturn Cloud. Instead of the pink box (the Client) being a Jupyter server also on Saturn Cloud, this will be your Sagemaker instance. Your code will be transmitted from Sagemaker to the cluster Scheduler, which will distribute tasks to the workers.
Log in to your Sagemaker environment and open a Jupyter instance. For this example, I’m using Sagemaker Studio, as shown in the screenshot below.
Inside Sagemaker Studio, open a new Notebook, and you’re ready to begin! You’ll be asked to select a kernel, and for this we recommend the “Python 3 (Data Science)” kernel.
This kernel won’t be complete for our needs, however. Whenever you use our direct machine cluster access functionality, you’ll want to pay attention to the working environments. If your local workspace has a different image, including different packages or versions, than the Saturn resources, you’ll need to resolve that before running Dask code or using your cluster.
To fix this easily, the first thing we recommend is checking that your Sagemaker notebook has the same versions of certain key libraries that your Saturn Cloud cluster image does, after you get things set up as shown below. These are the libraries that ought to be installed or updated if you use the Sagemaker “Python 3 (Data Science)” kernel.
- pandas: upgrade to 1.2.3 or better
- dask: install 2.30.0 or better
- distributed: install 2.30.1 or better
- dask-saturn: install 0.2.2 or better
pandas will likely be installed, but the version may be quite old in the kernel. Upgrading this is vital for Dask to work well for you.
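To see where your kernel stands against the minimums listed above, a small check along these lines may help. This is a minimal sketch, assuming Python 3.8 or later (for `importlib.metadata`); `parse_version` and `check_versions` are hypothetical helper names, and the version parsing is deliberately simple (it ignores suffixes such as `rc1`).

```python
from importlib import metadata

# Minimum versions from the list above.
MINIMUM_VERSIONS = {
    "pandas": "1.2.3",
    "dask": "2.30.0",
    "distributed": "2.30.1",
    "dask-saturn": "0.2.2",
}


def parse_version(version):
    """Turn '2.30.1' into (2, 30, 1); stop at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    # Pad so that '2.30' compares equal to '2.30.0'.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def check_versions(minimums=MINIMUM_VERSIONS):
    """Return {package: (installed_or_None, minimum, ok)} for each package."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        ok = installed is not None and parse_version(installed) >= parse_version(minimum)
        report[name] = (installed, minimum, ok)
    return report


for name, (installed, minimum, ok) in check_versions().items():
    status = "ok" if ok else "needs install/upgrade"
    print(f"{name}: installed={installed}, minimum={minimum} -> {status}")
```

Any package reported as "needs install/upgrade" is one to install or update before connecting to the cluster.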
All of this can be done with pip. In Sagemaker Jupyter Notebooks, you can use the %pip magic in regular code chunks to run these commands, so for me, it looks like the first chunk in this screenshot.
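In a notebook cell, the install could be a single line such as `%pip install --upgrade "pandas>=1.2.3" "dask>=2.30.0" "distributed>=2.30.1" "dask-saturn>=0.2.2"`. If you would rather do the same from plain Python, here is a hedged sketch; `build_pip_command` and `install` are hypothetical helper names, and the version pins simply restate the list above.

```python
import subprocess
import sys

# "X or better" from the list above, written as pip version specifiers.
REQUIREMENTS = [
    "pandas>=1.2.3",
    "dask>=2.30.0",
    "distributed>=2.30.1",
    "dask-saturn>=0.2.2",
]


def build_pip_command(requirements):
    """Build a pip command that targets the interpreter running this kernel."""
    return [sys.executable, "-m", "pip", "install", "--upgrade", *requirements]


def install(requirements=REQUIREMENTS):
    """Run pip; raises subprocess.CalledProcessError if the install fails."""
    subprocess.run(build_pip_command(requirements), check=True)


# Show the command without running it; call install() to actually run it.
print(" ".join(build_pip_command(REQUIREMENTS)))
```

Using `sys.executable -m pip` keeps the install tied to the kernel's own Python rather than whichever `pip` happens to be first on the path. After upgrading packages that are already imported, restarting the kernel is usually needed for the new versions to take effect.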
To find out about some conflicts early, you can run client.get_versions(check=True) after you set up your Saturn client object. (I’ll explain that in a moment!) But that check won’t tell you about pandas conflicts, so don’t forget pandas!
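Because pandas is not covered by that check, one possible way to compare it yourself is to ask each worker for its pandas version and compare against the local one. The sketch below assumes a connected `client` (set up later in this post) and uses `client.run`, which runs a function on every worker and returns a dictionary keyed by worker address. `find_mismatches` and `pandas_version` are hypothetical helpers, and the addresses and versions in the example are made up for illustration.

```python
def find_mismatches(local_version, worker_versions):
    """Return {worker_address: version} for workers whose version differs."""
    return {
        worker: version
        for worker, version in worker_versions.items()
        if version != local_version
    }


def pandas_version():
    """Report the pandas version wherever this runs (locally or on a worker)."""
    import pandas
    return pandas.__version__


# With a connected client:
#   worker_versions = client.run(pandas_version)
#   print(find_mismatches(pandas_version(), worker_versions))

# Illustrative values only, not real cluster output:
example = {"tcp://10.0.0.1:8786": "1.2.3", "tcp://10.0.0.2:8786": "1.1.5"}
print(find_mismatches("1.2.3", example))
```

An empty dictionary means every worker reports the same pandas version as your notebook.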
Connect to a Saturn Cloud project
If you have not yet created a Saturn Cloud account, go to saturncloud.io and click “Start For Free” on the upper right corner. It’ll ask you to create a login.
Once you have done so, you’ll be brought to the Saturn Cloud projects page. Click “Create Custom Project”.
Give the project a name (ex: “sagemaker-demo”), but you can leave all other settings as their defaults. Then click “Create”.
After the project is created you’ll be brought to that project’s page. At this point you’ll need to retrieve two ID values:
project_id - the ID for this particular project. You can get this from the URL of the project page. For example:
https://app.community.saturnenterprise.io/dash/projects/a753517c0d4b40b598823cb759a83f50 has the project_id a753517c0d4b40b598823cb759a83f50
user_id - the ID that identifies you as a valid user in Saturn Cloud. Go to https://app.community.saturnenterprise.io/api/user/token and save the page as token.json, then upload that file to the Sagemaker Studio workspace. Do not share this file with others.
Protect your user token, as it allows access to your account!
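If you prefer to copy the whole project page URL rather than pick the ID out by hand, a small helper can pull out the last path segment. This is a sketch that assumes the URL has the same shape as the example above; `project_id_from_url` is a hypothetical name, not part of any Saturn Cloud library.

```python
from urllib.parse import urlparse


def project_id_from_url(url):
    """Return the last path segment of a Saturn Cloud project page URL."""
    path = urlparse(url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]


url = "https://app.community.saturnenterprise.io/dash/projects/a753517c0d4b40b598823cb759a83f50"
project_id = project_id_from_url(url)
print(project_id)
```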
You can now load the token inside Sagemaker Studio in a notebook, as shown.
    # Load the token saved earlier as token.json
    # (adjust the path to wherever you uploaded the file)
    import json

    with open('token.json') as f:
        data = json.load(f)
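If you want a clearer error when the file is not what you expect, the loading step could be wrapped in a small function. This is a sketch only: `load_saturn_token` is a hypothetical helper, and it assumes the saved file is JSON with the token stored under a `token` key, as the `data['token']` lookup used in this post suggests.

```python
import json
from pathlib import Path


def load_saturn_token(path):
    """Read the saved token file and return the value stored under 'token'."""
    with Path(path).open() as f:
        data = json.load(f)
    if "token" not in data:
        raise KeyError(f"no 'token' key found in {path}")
    return data["token"]


# Usage, once the file has been uploaded next to your notebook:
#   saturn_token = load_saturn_token("token.json")
```

Avoid printing the returned value in a notebook you plan to share, since cell output is saved along with the notebook.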
Connect to your Project
Now you are ready to connect your Sagemaker Studio workspace to your Saturn Cloud project, allowing you to interact with it from this notebook. Your user_id is required (here shown as data['token']), as well as the project_id discussed earlier.
    from dask_saturn.external import ExternalConnection
    from dask_saturn import SaturnCluster
    import dask_saturn
    from dask.distributed import Client, progress

    # Example ID taken from the project URL above; replace it with your own
    project_id = 'a753517c0d4b40b598823cb759a83f50'

    conn = ExternalConnection(
        project_id=project_id,
        base_url='https://app.community.saturnenterprise.io',
        saturn_token=data['token']
    )
    conn
    #> <dask_saturn.external.ExternalConnection at 0x7f04d067e0d0>
Set Up Cluster
Finally, you are ready to set up a cluster in this project! You’ll see info messages logging here until the cluster is started and ready to use.
If you have a cluster already created on the project, here you can just start it up without creating a new one, using this same code. You can also ask it to change size using cluster.scale(). For more details, we have documentation about managing clusters.
    cluster = SaturnCluster(
        external_connection=conn,
        n_workers=4,
        worker_size='8xlarge',
        scheduler_size='2xlarge',
        nthreads=32,
        worker_is_spot=False
    )
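Resizing a running cluster might look something like the sketch below. It assumes that cluster.scale() takes the desired number of workers, as Dask cluster objects generally do, and that a client is already connected (see the next step). `resize_and_wait` is a hypothetical helper, not part of dask-saturn.

```python
def resize_and_wait(cluster, client, n_workers):
    """Request n_workers workers, then block until they have joined."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    cluster.scale(n_workers)             # ask for the new size
    client.wait_for_workers(n_workers)   # wait until that many are connected
    return n_workers


# Usage, once both objects exist:
#   resize_and_wait(cluster, client, 8)
```

Waiting after the resize avoids submitting a large job while some of the new workers are still starting up.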
Create Client Object
This lets us connect from our Sagemaker environment to this new cluster, and when we display the object, it gives us a link to the Dask Dashboard for that cluster. We can open this link to watch how the cluster is behaving.
    client = Client(cluster)
    client.wait_for_workers(4)
    client
At this point, you are able to load data and complete whatever analysis you want. You can monitor the performance of your cluster at the link described earlier, or you can log in to Saturn Cloud and see the Dask dashboard, logs for the cluster workers, and other useful information.
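As a quick smoke test that work really is being sent from Sagemaker to the cluster, you could run a toy computation like the one below. It is a minimal sketch: `sum_of_squares` is a hypothetical function, and it relies only on the `client.map` and `client.gather` methods of the Dask client.

```python
def sum_of_squares(client, n):
    """Square 0..n-1 on the cluster, then add the results up locally."""
    futures = client.map(lambda x: x * x, range(n))  # one task per number
    results = client.gather(futures)                 # pull results back
    return sum(results)


# Usage with the client created above:
#   sum_of_squares(client, 10)
```

With `n=10` the answer should be 285, and while it runs you should see the individual tasks appear on the Dask dashboard.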
Need help, or have more questions? Contact us:
- On Intercom, using the icon at the bottom right corner of the screen