Setting Up Your Data Science & Machine Learning Capability in Python

Python is a great language to base your DS/ML framework on, and it lets you avoid being locked into a single vendor-specific framework.

Why Python?

Python is the clear winner among programming languages in data science & machine learning (DSML). With its rich and dynamic open-source software ecosystem, Python stands unmatched in how adoptable, reliable, and functional it is. If you disagree with this premise, then please take a quick detour here.

The Purpose of Your Data Science & Machine Learning Capability

Your goal as a lead of a DSML team is to deliver the best return on investment to the business. The business invests in the DSML capability with a budget for staff and resources, while your job is to deliver the maximum business impact you can.

Your business impact can be measured in many ways. The highest-level objectives are cost optimization, risk optimization, and revenue growth. You may focus on a variety of specific metrics within each objective, such as customer acquisition cost optimization, churn prediction, fraud detection, patient health outcomes, or personalized product recommendations.

Python uses by industry

Anything diverting goal-setting, budget, and execution from this purpose drives down the ROI your team can deliver. Where the attention goes, the energy flows, to quote a self-improvement guru.

Renting vs. Owning

This is a reframing of the classic Buy vs. Build discussion, in the context of many DSML platforms now offering “pay as you go” pricing, much like Amazon Web Services. I feel it’s necessary to rephrase the discussion because, unlike “Buying,” where you pay a fixed cost whether or not you use the product, “Renting” implies that you pay only when you use it. This is much more convenient for the end user.
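To make the difference between a fixed cost and usage-based pricing concrete, here is a minimal sketch of the break-even point between the two. The annual fee and hourly rate are hypothetical numbers chosen for illustration, not prices from any vendor.

```python
def fixed_cost(hours_used, annual_fee):
    """'Buying': the fee is owed regardless of how much you use."""
    return annual_fee


def pay_as_you_go_cost(hours_used, hourly_rate):
    """'Renting': cost scales with actual usage."""
    return hours_used * hourly_rate


def break_even_hours(annual_fee, hourly_rate):
    """Usage level at which both pricing models cost the same."""
    return annual_fee / hourly_rate


# Hypothetical prices, for illustration only.
ANNUAL_FEE = 50_000   # assumed fixed license fee per year
HOURLY_RATE = 5       # assumed pay-as-you-go price per compute hour

threshold = break_even_hours(ANNUAL_FEE, HOURLY_RATE)
for hours in (2_000, 10_000, 20_000):
    rent = pay_as_you_go_cost(hours, HOURLY_RATE)
    buy = fixed_cost(hours, ANNUAL_FEE)
    cheaper = "rent" if rent < buy else "buy" if buy < rent else "tie"
    print(f"{hours:>6} hours: rent=${rent:,} buy=${buy:,} -> {cheaper}")
```

Below the break-even usage, renting is cheaper because you never pay for idle capacity; above it, a fixed fee can win, which is one reason to rent first and measure real usage before committing.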

As you begin to set up your DSML platform in Python, you can own the internal architecture or you can rent it from a vendor. I’ll use Saturn Cloud as the example vendor because, as you would expect, I am biased.

The Hidden Cost of Owning

Owning a DSML capability carries inherent “scope creep” issues that are not in plain view at the outset. It is all too easy to assume that owning the capability simply means integrating your favorite open source tools: Jupyter, Dask or PySpark, Prefect or Airflow, Kubernetes, NVIDIA RAPIDS, Bokeh, Plotly, Streamlit, etc.

Here is a short list of “scope creep” dealbreakers we hear from our customers who have previously tried to own a DSML capability:

  • Setting up and managing cloud hosting and support for AWS, Azure, GCP, or on-premise
  • Ensuring enterprise-grade security of code and data; even more burdensome if you are in a highly regulated industry
  • Configuration: executing work on the proper infrastructure which exposes the appropriate resources and libraries for the task at hand
  • Monitoring, e.g., ensuring minimal downtime
  • User management: managing employee access to systems and information
  • Access control: controlling what users can do and see within an application
  • Managing existing OSS package versioning and integrating new OSS packages
  • Support for end-users; managing consultations with OSS experts

Each of these bullets has a list of further burdens that may not be attractive. In fact, some of it is so painful that our Saturn Cloud co-founder and CTO, Hugo Shi, wrote an article on Kubernetes just to vent.

The Obvious Cost of Owning

Here are the cost components of ownership that you need to consider as you build your DSML capability.

Example 1: Owning Results in Higher Total Cost

Your team is tasked with developing a customer churn model. If you could predict churn, sales could take proactive measures to retain more accounts. Your company generates $100M in annual sales, and there’s an opportunity to reduce churn from 10% to 5%, or by $5M annually. To keep it simple, we’ll assume you’re a SaaS company with 100% gross margins.

Figure 1: Renting = Automated DevOps

Renting vs. automated DevOps. Assumes an FTE cost of $150K.

Given the cost savings in automating DevOps, the renting scenario generates higher ROI due to less total spend.
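The arithmetic behind this comparison can be sketched in a few lines of Python. The $100M in sales, the churn reduction from 10% to 5%, and the $150K FTE cost come from the example; the team sizes and the infrastructure and platform spend are illustrative assumptions, not figures read off the chart.

```python
# Figures taken from the example in the text.
ANNUAL_SALES = 100_000_000
CHURN_BENEFIT = ANNUAL_SALES * (10 - 5) // 100  # churn 10% -> 5%, 100% gross margin
FTE_COST = 150_000

# ASSUMPTIONS for illustration only (not from the article's figure):
OWN_FTES = 9             # assumed: 6 data scientists + 3 DevOps engineers
RENT_FTES = 6            # assumed: 6 data scientists, DevOps automated by the vendor
OWN_INFRA = 200_000      # assumed raw cloud spend when owning
RENT_PLATFORM = 400_000  # assumed platform fee plus cloud spend when renting


def total_cost(ftes, other_spend):
    """Labor plus infrastructure or platform spend."""
    return ftes * FTE_COST + other_spend


def roi(benefit, cost):
    """Return on investment: net gain divided by cost."""
    return (benefit - cost) / cost


own_cost = total_cost(OWN_FTES, OWN_INFRA)
rent_cost = total_cost(RENT_FTES, RENT_PLATFORM)

print(f"Owning:  cost=${own_cost:,}  ROI={roi(CHURN_BENEFIT, own_cost):.2f}x")
print(f"Renting: cost=${rent_cost:,}  ROI={roi(CHURN_BENEFIT, rent_cost):.2f}x")
```

With the same benefit in both cases, the comparison comes down entirely to total spend, so you can substitute your own headcount and platform costs to see whether the conclusion holds for your team.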

Example 2: Owning Carries High Opportunity Cost

Now let’s assume in both scenarios your team is 9 FTEs, but in the renting scenario, all 9 are dedicated to Data Science & ML. A team of 9 FTEs can produce 50% more output than a team of 6 FTEs, so with the spare capacity, you take on a second project around customer personalization. Let’s assume this project could result in 5% higher software sales in year 1.

Figure 2: Renting = Force Multiplier

Renting vs owning

Notice that in the renting scenario, you’re actually spending more money, but with the same team size, you can generate higher ROI. By shifting labor spend from DevOps to Data Science & ML, your team is more efficient and can tackle more positive-ROI projects in the same time period. The owning scenario carries an inherent opportunity cost that the renting scenario avoids.
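The opportunity-cost argument can be sketched the same way. The 9-FTE team, the $150K FTE cost, the $5M churn benefit, and the 5% sales uplift come from the examples; applying that 5% to the $100M sales base, the 6 + 3 split of the owning team, and the infrastructure and platform spend are assumptions made for illustration.

```python
# Figures taken from the examples in the text.
ANNUAL_SALES = 100_000_000
FTE_COST = 150_000
TEAM_SIZE = 9
CHURN_BENEFIT = ANNUAL_SALES * 5 // 100            # churn reduced from 10% to 5%
PERSONALIZATION_BENEFIT = ANNUAL_SALES * 5 // 100  # assumed: 5% uplift on the $100M base

# ASSUMPTIONS for illustration only (not from the article's figure):
OWN_INFRA = 200_000      # assumed raw cloud spend when owning
RENT_PLATFORM = 400_000  # assumed platform fee plus cloud spend when renting


def roi(benefit, cost):
    """Return on investment: net gain divided by cost."""
    return (benefit - cost) / cost


# Owning: assumed 6 of 9 FTEs do data science, 3 do DevOps -> one project.
own_cost = TEAM_SIZE * FTE_COST + OWN_INFRA
own_benefit = CHURN_BENEFIT

# Renting: all 9 FTEs do data science -> capacity for a second project.
rent_cost = TEAM_SIZE * FTE_COST + RENT_PLATFORM
rent_benefit = CHURN_BENEFIT + PERSONALIZATION_BENEFIT

print(f"Owning:  cost=${own_cost:,}  benefit=${own_benefit:,}  ROI={roi(own_benefit, own_cost):.2f}x")
print(f"Renting: cost=${rent_cost:,}  benefit=${rent_benefit:,}  ROI={roi(rent_benefit, rent_cost):.2f}x")
```

Under these assumptions renting costs more in absolute terms, yet the second project more than pays for the difference. The result is sensitive to whether the extra capacity really lands a second project of similar value.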

In both examples, the ROI of renting outperforms that of owning a DSML capability. It is also worth noting that cloud computing pricing has dropped significantly over the past decade, whereas labor costs for data science, machine learning, and DevOps have increased significantly.

A Cautionary Tale

Not every organization needs to rent DSML architecture. But it is much easier and less risky to rent first before you own.

“Rent before you own”

I have spoken with hundreds of DSML leaders in the past couple of years. A good portion of them led their teams into owning DSML architecture without renting first, and without assessing the obvious and hidden costs of owning. All too often, they turn back halfway, realizing that renting is cheaper, easier, and more flexible, and that it allows them to stay focused. Furthermore, many developers on those teams expected to help only with building the architecture up front, but later had to serve in full-time support roles, spending much less time on the interesting scientific projects they joined the company for!

It’s Somebody Else’s Problem Now

…is what you’ll be saying when you rent the architecture. Yes, all the integration of open source tools, open-source version management, building state-of-the-art security around data and code, building enterprise administration architecture, cloud hosting, support services, open-source expert consultations — say it with me — 👏 somebody 👏 else’s 👏 problem!

Not only is that offloaded, but you get some pretty great benefits from a dedicated team working on it.

  • Greater Performance: Saturn’s tooling offers up to 100x faster runtime than Apache Spark, Pandas and other data processing tools
  • Instant Delivery: You subscribe, you have it immediately in your virtual private cloud
  • Expert Support: Leading committers of Python OSS available to support you.
  • Smooth Experience: Immediate integration and updating of open source tools
  • Native Integrations: Amazon Web Services, Snowflake, and other cloud services
  • Seamless Teamwork Tools: Interactive and Collaborative DSML Capabilities
  • Automation: Data Pipelines and Workflow Orchestration with Prefect
  • Beautiful: Intuitive, State-of-the-art User Interface
  • Flexibility: Pay As You Go and Cancel Whenever

Concluding: Your Pythonic DSML Capability

Ownership Model: Team and budget are divided between using the DSML capability to create value and supporting the DSML capability.

Ownership model diagram

Rent Model: Entire team and budget are streamlined towards using rented DSML capability to create value.

Rent model diagram

The purpose of your DSML capability is to maximize its ROI. You want as much of your budget as possible going toward that target, whether the endpoint is faster stock market trading decisions, recommendations for new marketing investment, more drug discovery models, and so on.

My advice is:

  • Choose Python for its unmatched open source ecosystem
  • Choose to rent before you own

If you want an easy way to scale Python and get super-fast GPU data science…

Saturn Cloud Hosted handles all the tooling infrastructure, security, and deployment headaches to get you up and running with RAPIDS right away. Click here for our free version.

If you are part of a company that requires a virtual private cloud solution, Saturn Cloud also offers an Enterprise solution.


About Saturn Cloud

Saturn Cloud is your all-in-one solution for data science & ML development, deployment, and data pipelines in the cloud. Spin up a notebook with 4TB of RAM, add a GPU, connect to a distributed cluster of workers, and more. Join today and get 150 hours of free compute per month.