Dispatching Jobs to Saturn Cloud for Research Workflows

A Saturn Cloud Job is a piece of code that can be set to run in one of four ways:

  • By pressing the “start button” within Saturn Cloud
  • By running on a preset schedule
  • Via an HTTP POST request to Saturn Cloud for programmatic running
  • Via the Saturn Cloud CLI (which sends the same HTTP POST mentioned above)
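
The HTTP trigger means a job can be started from anywhere that can make a web request. A hedged sketch using curl, assuming an API token in the SATURN_TOKEN environment variable; the endpoint URL and job ID placeholder are illustrative and may differ on your installation:

$ curl -X POST \
    -H "Authorization: token $SATURN_TOKEN" \
    "https://app.community.saturnenterprise.io/api/jobs/<job-id>/start"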

Most people think about jobs in the context of productionizing data science, for example ETL jobs or model re-training jobs. However, jobs are also useful in interactive research.

  1. Users may opt to do research and development on smaller, cheaper machines, and then dispatch jobs to more powerful ones.
  2. Users may opt to dispatch a job, and then shut down their development machine while the job runs overnight or over the weekend.
  3. Users may opt to dispatch parallel jobs in order to run parameter scans, or to run many experiments and simulations.

This article discusses workflows around dispatching jobs in support of interactive research.

Creating a new job

Saturn Cloud recipes are the recommended way to work with jobs for research. A basic job recipe looks like this:

type: job
spec:
  name: hello-world
  description: ''
  image: community/saturncloud/saturn-python:2023.09.01
  instance_type: large
  environment_variables: {}
  working_directory: /home/jovyan/workspace
  start_script: ''
  git_repositories: []
  secrets: []
  shared_folders: []
  start_dind: false
  command: echo "hello world"
  scale: 1
  use_spot_instance: false
  schedule: null

This recipe defines a job that executes echo "hello world" using the saturn-python:2023.09.01 image. (The full image name is community/saturncloud/saturn-python:2023.09.01 because we are on the community instance of Saturn Cloud.)

Save the above to recipe.yaml and then you can submit the job as follows:

$ sc apply recipe.yaml --start

This will create the job in Saturn Cloud and start it. After this, you can view the job and its logs from the Saturn Cloud UI, but you can also work with it from the command line (more on this later).
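
The schedule field in the recipe is how the preset-schedule trigger mentioned earlier is configured. A hedged sketch, assuming the field accepts a cron-style expression (check your installation’s recipe schema for the exact format):

spec:
  schedule: '0 2 * * *'  # assumed cron syntax: run daily at 02:00 UTC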

Creating jobs by cloning other resources

If you’re a Saturn Cloud user, it’s generally easier to clone an existing workspace than to create a job recipe from scratch. Usually you already have a workspace where you can run your job code interactively. We always recommend that people run their job code interactively before trying to deploy it.

You can clone a workspace as a job with the following command:

$ sc clone workspace ops-devel job my-job --command "echo 'hello-world'"

It is often useful to write that recipe to a file, so that you can modify it:

$ sc get job my-job > /tmp/recipe.yaml

Afterwards, you can start the job via:

$ sc start job my-job

If you have modified the recipe and you would like to apply it:

$ sc apply /tmp/recipe.yaml --start
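
Writing the recipe to a file also enables the parallel-research workflow from the introduction: generate one recipe per parameter value and apply each one. A minimal sketch, assuming the cloned recipe above, and that your job code reads a hypothetical SEED environment variable (the parameter name and values are illustrative):

# One job per seed: copy the recipe with a unique name and parameter, then dispatch it
for seed in 1 2 3; do
  sed "s/name: my-job/name: my-job-seed-$seed/; s/environment_variables: {}/environment_variables: {SEED: '$seed'}/" /tmp/recipe.yaml > /tmp/recipe-$seed.yaml
  sc apply "/tmp/recipe-$seed.yaml" --start
done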

Source code and data used by jobs

Most Saturn Cloud resources get their code from Git repositories you have configured in Saturn Cloud, and load their data from networked resources such as S3, shared folders, or databases like Snowflake and Redshift.

For research, it can be convenient to be able to synchronize files from your development environment to the job. This may be because you have code changes that you aren’t ready to push to Git, or because you have data files locally that don’t exist on networked storage.

The following workflow synchronizes arbitrary files with your job:

$ sc apply /tmp/recipe.yaml --sync /home/jovyan/workspace/my-repo --sync /home/jovyan/my-data

This command archives /home/jovyan/workspace/my-repo and /home/jovyan/my-data into tar.gz files, uploads them to internally hosted networked storage (SaturnFS), and generates start script commands in your job that download and extract the files to the appropriate locations. For example, after applying the above command, the resulting recipe includes this additional block:

spec:
  start_script: >
    ### BEGIN SATURN_CLIENT GENERATED CODE

    saturnfs cp
    sfs://internal/hugo/ops-devel-run/home/jovyan/workspace/my-repo/data.tar.gz
    /tmp/data.tar.gz

    mkdir -p /home/jovyan/workspace/my-repo

    tar -xvzf /tmp/data.tar.gz -C /home/jovyan/workspace/my-repo

    saturnfs cp
    sfs://internal/hugo/ops-devel-run/home/jovyan/my-data/data.tar.gz
    /tmp/data.tar.gz

    mkdir -p /home/jovyan/my-data/

    tar -xvzf /tmp/data.tar.gz -C /home/jovyan/my-data/

    ### END SATURN_CLIENT GENERATED CODE

Job output

Saturn Cloud jobs are dispatched to new machines, and these machines are automatically torn down after the job completes. The only output Saturn Cloud captures is the job logs; all other output files your job produces should be saved to a network location.
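
One convenient network location is SaturnFS, the same internal storage the sync workflow above uses. A hedged sketch of a recipe command that uploads a result file once the work succeeds; the script name and sfs:// destination are illustrative, and it assumes saturnfs cp also supports local-to-remote copies:

spec:
  command: python train.py && saturnfs cp /home/jovyan/results.csv sfs://internal/hugo/results/results.csv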

Job status and logs

The following command tells me the current status of the job.

$ sc list job hello-world

owner            name           resource_type    status     instance_type    scale    id
----------------------------------------------------------------------------------------------------------------------
internal/hugo    hello-world    job              pending    large            1        a387e542a27a4e689f16a0fac48901de

The following command lists all invocations (pods) of this job.

$ sc pods job hello-world
pod_name                                                           status       source        start_time                   end_time
----------------------------------------------------------------------------------------------------------------------------------------------------
id-hugo-hello-world-a387e542a27a4e689f16a0fac48901de-nz-0-hplcq    pending      live          2024-04-01T15:29:36+00:00
id-hugo-hello-world-a387e542a27a4e689f16a0fac48901de-wc-0-4ptfv    completed    historical    2024-04-01T01:29:46+00:00    2024-04-01T01:32:31+00:00
id-hugo-hello-world-a387e542a27a4e689f16a0fac48901de-hd-0-w5qx4    completed    historical    2024-04-01T01:15:48+00:00    2024-04-01T01:18:34+00:00

You can then request the logs for each pod. Note that Saturn Cloud captures both live and historical logs. Live logs are stored on the machine where the job is running; they disappear when the machine is torn down. Historical logs are an archive of the live logs, but there may be a delay of a few minutes before logs end up in the historical log store. As a result, the CLI lets you specify which source you would like for the logs. If you omit the source, the client attempts to pick the best one.

$ sc logs job hello-world id-hugo-hello-world-a387e542a27a4e689f16a0fac48901de-nz-0-hplcq
$ sc logs job hello-world id-hugo-hello-world-a387e542a27a4e689f16a0fac48901de-nz-0-hplcq --source live
$ sc logs job hello-world id-hugo-hello-world-a387e542a27a4e689f16a0fac48901de-nz-0-hplcq --source historical
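
These commands also script well, which pairs nicely with the shut-down-your-machine workflow from the introduction. A minimal sketch that polls until the job is no longer in flight, assuming pending and running are the only in-flight status values:

$ while sc list job hello-world | grep -qE 'pending|running'; do sleep 60; done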