Prefect Cloud

Saturn Cloud offers a built-in integration with Prefect Cloud, a hosted version of prefect. Prefect is an open-source workflow management framework written in Python.

This page describes the integration between Saturn Cloud and Prefect Cloud. For a step-by-step tutorial using this integration, see “Fault-Tolerant Data Pipelines with Prefect Cloud”.

The same team that maintains the prefect core library runs a cloud service called Prefect Cloud. Prefect Cloud is a hosted, high-availability, fault-tolerant service that handles all the orchestration responsibilities for running data pipelines.

Prefect Orchestration Concepts

Before continuing, it’s important to have a high-level understanding of the key components in Prefect Cloud. For much, much more detail on these topics, see the Prefect “Orchestration” docs.

Flows

A “flow” is a container for multiple tasks which understands the relationship between those tasks. Tasks are arranged in a directed acyclic graph (DAG), which just means that there is always a notion of which tasks depend on which other tasks, and those dependencies are never circular.
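
To make the DAG idea concrete, here is a minimal sketch using only Python's standard library (`graphlib`, available in Python 3.9+). It is not Prefect's API or implementation, and the task names (`extract`, `transform`, `load`, `report`) are hypothetical. It shows that a valid run order always exists for a DAG, and that a circular dependency is rejected.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical tasks: each key maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"transform"},
}

# A valid run order always places a task after everything it depends on.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)

# A circular dependency is rejected, which is what "acyclic" rules out.
circular = {"a": {"b"}, "b": {"a"}}
try:
    list(TopologicalSorter(circular).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print("cycle detected:", has_cycle)
```

The relative order of `load` and `report` is not fixed, because neither depends on the other; only the dependency constraints are guaranteed.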

For more, see the prefect docs.

Flow Runs

Each time the code in a flow is executed, that represents one “flow run”.

Prefect Core Server

Prefect Core Server is a service that keeps track of all your flows and knows how to run them. It is also responsible for keeping track of schedules: if you set up a flow to run once an hour, Prefect Core Server makes sure that happens.
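
As a rough illustration of what "keeping track of schedules" involves, the sketch below computes the upcoming run times for an hourly schedule. This is an assumption-laden toy, not Prefect's scheduler: the function name `next_run_times` and the dates are hypothetical.

```python
from datetime import datetime, timedelta


def next_run_times(start, interval, now, count=3):
    """Return the next `count` scheduled times strictly after `now`."""
    if now < start:
        first = start
    else:
        # Whole intervals elapsed since `start`, plus one to move past `now`.
        elapsed = (now - start) // interval
        first = start + (elapsed + 1) * interval
    return [first + i * interval for i in range(count)]


start = datetime(2021, 1, 1, 0, 0)  # hypothetical schedule start
now = datetime(2021, 1, 1, 9, 20)   # hypothetical current time
upcoming = next_run_times(start, timedelta(hours=1), now)
print(upcoming)
```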

Flow Versions

When a flow changes, a new “flow version” is created. Example changes include:

  • some tasks have been added or removed
  • the dependencies between tasks have changed
  • the flow is using a different execution mechanism (like a Dask cluster instead of a local Python process)

Prefect Core Server keeps track of all these versions, and knows to link all versions of the same flow into one “flow group”.
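
One way to picture flow versions and flow groups is a registry that creates a new version only when a flow's definition changes. The sketch below is a toy model under that assumption; `FlowRegistry` and the dictionary layout of a "definition" are hypothetical and do not reflect how Prefect stores flows.

```python
import hashlib
import json


class FlowRegistry:
    """Toy registry: one flow group per flow name, one version per change."""

    def __init__(self):
        # flow name -> list of definition fingerprints, oldest first
        self.flow_groups = {}

    def register(self, name, definition):
        # Fingerprint the definition so an unchanged flow is recognized.
        payload = json.dumps(definition, sort_keys=True).encode("utf-8")
        fingerprint = hashlib.sha256(payload).hexdigest()
        versions = self.flow_groups.setdefault(name, [])
        if not versions or versions[-1] != fingerprint:
            versions.append(fingerprint)
        return len(versions)  # version number, starting at 1


registry = FlowRegistry()
local_flow = {"tasks": ["extract", "load"], "executor": "local"}
dask_flow = {"tasks": ["extract", "load"], "executor": "dask"}

v1 = registry.register("etl", local_flow)
v1_again = registry.register("etl", local_flow)  # unchanged, same version
v2 = registry.register("etl", dask_flow)         # executor changed
print(v1, v1_again, v2)
```

Both versions live under the single `"etl"` entry, which plays the role of the flow group.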

Prefect Agents

A Prefect Agent is a small service responsible for running flows and reporting their logs and statuses back to Prefect Core Server. Prefect Agents are always “pull-based”: they are configured to point at an instance of Prefect Core Server, and every few seconds or less they ask it, “hey, is there anything you want me to do?”

When Prefect Core Server responds and says “yes, please run this flow”, the agent is responsible for inspecting the details of the flow and then kicking off a flow run.

It looks at these details:

  • storage: where can the flow be retrieved from?
    • In most cases, “the flow” means a binary file which can be turned into a Python object (prefect.Flow) using cloudpickle
  • environment: what infrastructure needs to be set up to run the flow?
  • executor: what engine will be used to run all the Python code in the flow?
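
The pull-based pattern described above can be sketched with an in-memory stand-in for the server. Everything here is hypothetical (`FakeServer`, `FlowRunRequest`, `poll`, and the example values such as `"webhook"`); it is not the Prefect Agent implementation, only an illustration of "ask for work, inspect storage, environment and executor, then start a run".

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class FlowRunRequest:
    flow_name: str
    storage: str      # where the flow can be retrieved from
    environment: str  # what infrastructure to set up
    executor: str     # what engine runs the flow's Python code


class FakeServer:
    """In-memory stand-in for the orchestration server."""

    def __init__(self, pending):
        self._pending = deque(pending)

    def get_work(self):
        # Return the next requested flow run, or None if nothing is waiting.
        return self._pending.popleft() if self._pending else None


def poll(server, max_polls):
    """Ask the server for work `max_polls` times and record what was started."""
    started = []
    for _ in range(max_polls):
        request = server.get_work()
        if request is None:
            continue  # a real agent would wait briefly before asking again
        started.append(
            f"{request.flow_name}: load from {request.storage}, "
            f"set up {request.environment}, run with {request.executor}"
        )
    return started


server = FakeServer([FlowRunRequest("etl", "webhook", "kubernetes-job", "dask")])
started = poll(server, max_polls=3)
print(started)
```

Because the agent initiates every request, the server never needs a network route into the environment where flows run.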

Saturn Cloud + Prefect Cloud Architecture

Using Saturn Cloud and Prefect Cloud together looks like this:

  1. Using credentials from your Prefect Cloud account, you create an Agent running in Saturn Cloud.
    • NOTE: Saturn does not charge you for this.
  2. You create a Saturn Cloud “project” which defines all the dependencies your code needs.
  3. In a Jupyter server with all those dependencies set up, you write flow code in Python using the prefect library.
  4. In your Python code, you use the prefect-saturn library to “register” your flow with Saturn, and the prefect library to register it with Prefect Cloud.
  5. prefect-saturn adds the following features to your flow code:
  6. When Prefect Cloud tells your Prefect Agent in Saturn to run the flow, it looks for the right Dask cluster and creates it if it doesn’t find it. Then your flow runs in that Dask cluster.
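
Step 4 might look roughly like the sketch below. It follows the usage shown in the prefect-saturn README for the Prefect 0.x series, but treat the exact names and arguments (`PrefectCloudIntegration`, `register_flow_with_saturn`, the `"saturn-cloud"` label) as assumptions that may differ between versions. The wrapper function `register_flow` is hypothetical, and the code only works inside Saturn Cloud with Prefect Cloud credentials configured.

```python
def register_flow(flow, project_name):
    """Register `flow` with Saturn Cloud, then with Prefect Cloud.

    Assumes Prefect 0.x and prefect-saturn are installed and that this runs
    inside Saturn Cloud. The import is deferred so that defining this
    function does not itself require those libraries.
    """
    from prefect_saturn import PrefectCloudIntegration

    integration = PrefectCloudIntegration(
        prefect_cloud_project_name=project_name
    )
    # Register with Saturn: the flow is returned with Saturn-specific
    # details attached.
    flow = integration.register_flow_with_saturn(flow)
    # Register with Prefect Cloud. The label is an assumption taken from
    # the prefect-saturn README; check the version you have installed.
    flow.register(project_name=project_name, labels=["saturn-cloud"])
    return flow
```

Here `flow` would be a `prefect.Flow` object you built in your notebook, and `project_name` the name of an existing Prefect Cloud project.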

Features of this Design

  1. All flow runs for one flow run in the same Dask cluster
  2. Flow runs from different flows each run in their own Dask cluster
  3. All of your sensitive data, code, and credentials stay within Saturn. Prefect Cloud only gets a minimal description of the flow, without any sensitive information
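
The first two features amount to a "get or create" lookup keyed by flow. The toy model below illustrates that behavior; `ClusterPool` and the cluster names are hypothetical, and no real Dask cluster is created.

```python
class ClusterPool:
    """Toy model: one cluster per flow, created on first use, then reused."""

    def __init__(self):
        self._clusters = {}
        self.created = 0

    def cluster_for(self, flow_name):
        # Create a cluster only if this flow does not already have one.
        if flow_name not in self._clusters:
            self.created += 1
            self._clusters[flow_name] = f"dask-cluster-{self.created}"
        return self._clusters[flow_name]


pool = ClusterPool()
first = pool.cluster_for("etl")
second = pool.cluster_for("etl")        # another run of the same flow
other = pool.cluster_for("reporting")   # a different flow
print(first, second, other)
```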

Division of Responsibilities

In using this integration, you’ll write code with the prefect library which talks to Saturn Cloud and Prefect Cloud. Their responsibilities are as follows:

  • prefect library
    • describe the work to be done in a flow
    • tell Prefect Cloud about the flow, including when to run it (on a schedule? on demand?)
    • store that flow somewhere so it can be retrieved and run later
  • Saturn Cloud
    • provide a hosted JupyterLab experience where you can author and test flows, and a library for easily deploying them (prefect-saturn)
    • run an Agent that checks Prefect Cloud for new work
    • when Prefect Cloud says “yes run something”, retrieve flows from storage and run them
    • automatically start up a Dask cluster to run your flow, making sure that every Dask worker:
      • is the size you asked for
      • has a GPU your code can take advantage of, if you need that
      • has the exact same environment as the Jupyter notebook where you wrote your code
      • has all of the code for your project (like other libraries you wrote)
      • has all of the credentials and secrets you’ve configured (like AWS credentials or SSH keys)
    • send logs and task statuses back to Prefect Cloud, so you have all the information you need to react if anything breaks
  • Prefect Cloud
    • keep track of all the flows you’ve registered
    • when it’s time to run those flows (either on demand or based on a schedule), tell Agents to run them
    • display a history of all flow runs, including success / failure of individual tasks and logs from all tasks
    • allow you to kick off a flow on-demand using a CLI, Python library, or clicking buttons in the UI

Learn and Experiment!

Now that you’ve read this overview, try it out yourself with the tutorial “Fault-Tolerant Data Pipelines with Prefect Cloud”.