Focusing on data science instead of DevOps

Jupyter in the Cloud

Introduction

When starting out in data science, DevOps tasks are the last thing you should be worrying about. Mastering all (or even most) aspects of data science already requires a tremendous amount of time and practice. Nevertheless, if you attend a boot camp or some other type of school, you will very likely have to complete group projects sooner or later, and coordinating these without any DevOps knowledge can prove to be quite a challenge. How do we share code? How do we deal with very expensive computations? How do we make sure everyone is using the same environment? Questions like these can easily stall the progress of any data science project.

To someone familiar with GitHub and cloud computing, these questions might seem very straightforward. Nonetheless, based on my personal experience attending the NYC Data Science Academy’s 12-week boot camp, they are extremely relevant to people getting their start in data science who are either dealing with extremely expensive computations or working in groups. While it is true that one could just use AWS, Azure, GCP (Google Cloud Platform), or any of the other cloud service providers for most of these problems, mastering data science and cloud computing at the same time is an unnecessarily difficult task. Why not use familiar tools with extensions instead?

Jupyter Notebooks in the Cloud

Jupyter Notebooks are extremely popular when starting out in data science, and rightfully so. They provide users with a neat interface and are easy to use. Basically, they let users focus on just what they want to do: write some code and execute it right away. Nevertheless, there are limitations. For one, Jupyter Notebooks run on your local machine by default, making the computational power available to you entirely dependent on your computer’s CPU, GPU, and RAM. While most laptops are more than enough for the basic tasks one encounters when starting out in data science, you might quickly hit barriers once you start running machine learning tasks, especially deep learning, on your local machine. Furthermore, when working on group projects, one needs to find a way to share Jupyter Notebooks with teammates. While GitHub enables you to share your code, it requires you to know how to use GitHub, and it does not let you simply share a link that gives your teammates the ability to see and edit your work. As a side note, GitHub also requires all of your teammates to remember to commit and push their newest versions, which is easily forgotten if you have never worked on data science group projects before.

Therefore, one either needs to know how to work with GitHub and one of the cloud service providers, or one ends up with delayed group projects and very hot laptops.

Saturn Cloud

Saturn Cloud allows you to quickly spin up Jupyter Notebooks in the cloud and scale them according to your needs. Basically, it lets you run your Jupyter Notebook on a VM inside AWS, Azure, or GCP without you having to know how to set up and use these services. It also has a few very nice features that distinguish it from other offerings out there, such as the ability to specify a conda environment, a requirements.txt, or a Docker image in order to standardize environments across all teammates. You can also share your Jupyter Notebooks with the public or with teammates using links. This eliminates the need to understand how to work with GitHub for basic data science projects. If you do know how to use GitHub, it still offers a fast and convenient way of testing and developing code with others. As a result, (aspiring) data scientists can focus on data science instead of DevOps and finish their projects more quickly than they otherwise could.
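
To make the idea of a standardized environment more concrete, here is a minimal sketch of what "everyone uses the same package versions" means in practice. It is not part of Saturn Cloud; it only uses Python's standard library to compare the pins in a requirements.txt against what is installed on the current machine. The function names and the example pins are hypothetical and chosen purely for illustration, and the parser deliberately handles only simple `name==version` lines.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Return {package: version} for every 'name==version' line."""
    pins = {}
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # skip blank lines and unpinned requirements
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def installed_version(package):
    """Look up the installed version, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Return {package: (wanted, found)} for every pin that is not satisfied."""
    mismatches = {}
    for package, wanted in pins.items():
        found = lookup(package)
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    # Hypothetical pins, for illustration only.
    example = """
    pandas==1.0.3
    scikit-learn==0.22.2
    """
    for package, (wanted, found) in find_mismatches(parse_pins(example)).items():
        print(f"{package}: wanted {wanted}, found {found}")
```

If every teammate runs a check like this against the same file and gets no mismatches, they are working with the same pinned versions. A shared environment definition removes the need for such a check, because everyone starts from the same image in the first place.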

Besides that, Saturn Cloud enables you to deploy a Spark or Dask cluster with just one click. This makes distributed computing readily available and simplifies dealing with very expensive computations. Saturn Cloud also automates version control for you, obviating potential issues arising from teammates not committing their newest versions.

Since I personally prefer demonstrations to descriptions, I am going to quickly show you how to spin up a Jupyter Notebook using Saturn Cloud.

Spinning up a Jupyter Notebook on Saturn Cloud

First, you name your notebook and define the VM (virtual machine) you would like to use. All you need to do is specify the disk space and RAM. By default, Saturn Cloud automatically terminates your VM if you do not use it for 10 minutes. You can, however, change that to whatever time period you would like. Using the advanced options, you can define the environment you would like your team to use. After you hit create, your Jupyter Notebook in the cloud will be launched.

Once it is up and running, you will see the following:

(Screenshot: myjupyter-status)

By clicking on ‘Go To Jupyter Notebook’, you can access your Jupyter Notebook in the cloud and start coding. And that is basically all you need to do for most simple tasks. If you would like to use more advanced options, such as deploying a Spark/Dask cluster or using GPUs for your workloads, you can do so within your existing Jupyter Notebook by clicking on the blue button above and customizing your VM(s) as follows:

Adding a GPU to your VM

Why would you want to add a GPU (Graphics Processing Unit) to your VM? To put it very simply: think of the algorithm you want to train (e.g. a neural network) as a long series of mathematical calculations, many of which do not depend on one another. A GPU has a very large number of cores and can perform many of these independent computations at the same time, whereas a CPU, with far fewer cores, works through them largely one after another. That is, in essence, why GPUs are the better choice for expensive computations, especially in machine learning.
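
The difference can be sketched in a few lines of plain Python. The example below multiplies a small matrix by a vector, which is the kind of calculation a neural network layer performs. Each output value depends only on one row, so the rows can be processed one after another or handed to several workers at once. Note that this is only an illustration of the idea: the thread pool stands in for the GPU's many cores and will not actually run faster in Python, and the function names and numbers are made up for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def dot(row, vector):
    """One independent unit of work: a single row times the vector."""
    return sum(a * b for a, b in zip(row, vector))


def matvec_sequential(matrix, vector):
    """CPU-style: compute one output element after another."""
    return [dot(row, vector) for row in matrix]


def matvec_parallel(matrix, vector, workers=4):
    """GPU-style in spirit: hand the rows to a pool of workers at once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the order of the rows in its results.
        return list(pool.map(lambda row: dot(row, vector), matrix))


# Made-up weights and inputs, for illustration only.
weights = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
inputs = [1, 0, -1]

# Both strategies produce the same result; only the scheduling differs.
assert matvec_sequential(weights, inputs) == matvec_parallel(weights, inputs)
```

Because no row needs the result of any other row, the order in which the rows are computed does not matter. That independence is what lets a GPU spread the work across its cores.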

As you can see, getting started with Jupyter Notebooks in the cloud is very intuitive using Saturn Cloud. Once your notebook is running, you can also easily share it from within the notebook with the public or just your teammates. To demonstrate this, I have published a notebook visualizing rat sightings in NYC that you can access using the following link:

https://www.saturncloud.io/published/lksfr/rats-in-nyc/Rats/rats_for_nbviewer_only.ipynb?source=lf-1

Conclusion

DevOps can be a real obstacle when trying to get data science group projects off the ground. Hosting Jupyter Notebooks with Saturn Cloud, which also takes care of things like versioning and lets you scale in or out as needed, can tremendously simplify your life. Personally, I find the ability to quickly and easily share notebooks truly helpful. First and foremost, however, you should carefully evaluate and define your needs before starting any project.