When it comes to data science solutions, there is always a need for fast prototyping. Whether it is a sophisticated face recognition algorithm or a simple regression model, having an environment that lets you easily test and validate ideas is incredibly valuable. Many data science problems require specially crafted solutions because of their complicated nature, which means the data scientists working on them will eventually need to improvise. Not having to wait for an additional feature column to be recalculated on the dataset every time you execute your script becomes a crucial productivity gain. Since this has been a long-standing problem in programming circles, the community has built a solution: enter Project Jupyter.
Source: Project Jupyter
What is Jupyter?
Jupyter, which grew out of the IPython project, is an interactive programming environment focused mostly on data science and scientific computing. Even though Project Jupyter includes many components that are very useful in certain situations, the most famous one is the Notebook. Jupyter Notebook is an open-source web application that allows combining code with text, visualizations, equations, and analysis. It has proven useful in many cases, such as sharing data analysis results with embedded interactive gadgets or building tutorials around complex topics.
Advantages of Jupyter Notebooks
Thanks to their simple interface, notebooks provide a great starting point for many complex projects to evolve. A notebook essentially consists of cells called “inputs”, and each input cell has a corresponding “output” cell. The following image shows how this looks in a very basic form:
A very simple Jupyter Notebook cell.
Fast data exploration
One of the most important advantages of Jupyter Notebooks comes from their cell-by-cell nature: splitting the analysis into small logical steps lets you explore the data at hand in a very interactive manner:
– Every small cell can be deleted or modified without impacting the rest of the analysis.
– Any change to a cell can be reverted, which allows you to move quickly without worrying too much about the consequences of the code.
– The inputs and outputs of previous cells remain visible and can easily be displayed at any point while exploring the data.
Thanks to these advantages, it is incredibly simple to laser-focus on a specific part of the analysis without worrying about external side effects or input-output matching.
Each cell being responsible solely for itself allows the notebook to automatically keep cell outputs for future reference, without needing to run the whole script to get to a certain point. Imagine you make a slow API call to an external service, and then do a bunch of data cleaning operations before you can start working on the data. You can pull the data in Cell 3, iterate on the data cleaning in Cell 4 without re-downloading the data from the API, and still refer back to the cleaned-up data from Cell 18. Being able to do this increases productivity so drastically that once you get used to it, you’ll start looking for it everywhere.
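As a rough sketch of that workflow, the three cells might contain something like the following (the fetch function and record fields here are made up for illustration; a real Cell 3 would call the external API):

```python
# Cell 3 - slow data pull: run once, the result stays in memory.
# (Simulated with a local function instead of a real API call.)
def fetch_records():
    return [{"name": " Ada ", "score": "91"}, {"name": "Grace", "score": "88"}]

raw_records = fetch_records()

# Cell 4 - data cleaning: re-run and tweak freely, no re-download needed.
cleaned = [
    {"name": r["name"].strip(), "score": int(r["score"])}
    for r in raw_records
]

# Cell 18 - later analysis can still refer back to `cleaned`.
average_score = sum(r["score"] for r in cleaned) / len(cleaned)
print(average_score)
```

Because `raw_records` stays in the kernel's memory, editing and re-running Cell 4 never triggers the slow fetch in Cell 3 again.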
Jupyter Notebooks are essentially HTML pages, which means that in the common case where you need to share the output of your work, the resulting analysis notebook can easily be rendered to an HTML file. Once you have the file, you can either share it with the recipients directly or serve it from a remote host: there is no need to set up Python or any programming environment on that host.
Various export options of Jupyter Notebooks
In addition to the HTML format, Jupyter Notebooks can be exported to many other formats, such as AsciiDoc, LaTeX, or Markdown, using the jupyter nbconvert command (for example, jupyter nbconvert --to html analysis.ipynb).
One of the very simple but effective features of Jupyter Notebook is its documentation pop-up. Once you have typed a function name and opened the parenthesis, hitting Shift + Tab opens a popup that displays the documentation of the function.
Popup documentation of Jupyter Notebook
Just like any other tool, Jupyter Notebooks also have best practices that make the life of a data scientist easier and enable working on complex projects without any trouble. As always, these are not hard rules and may change based on the project; they are there to provide guidance.
Use version control systems
This might be a very basic best practice for a regular software development lifecycle, but adoption of version control systems like Git still seems low among data scientists. Adopting one is very low-hanging fruit that provides a definite productivity boost. Version control systems:
- allow you to time-travel between different versions of the code.
- let you change stuff without being afraid of losing code.
- allow you to review projects with co-workers.
- track who changed what, and when.
Due to these advantages, version control systems have been one of the essentials in software development projects. They should also be utilized while developing Jupyter Notebook analyses.
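A minimal workflow for putting a notebook under version control could look like this (the project and notebook names are hypothetical, and the identity settings are placeholders you would replace with your own):

```shell
# Initialize a repository for the project
git init analysis-project
cd analysis-project

# Configure an identity for this repository (required for committing)
git config user.email "you@example.com"
git config user.name "Your Name"

# Create and commit the first version of the notebook
echo '{"cells": [], "nbformat": 4, "nbformat_minor": 5}' > analysis.ipynb
git add analysis.ipynb
git commit -m "Add initial analysis notebook"

# Later: review what changed before committing again
git diff analysis.ipynb
git log --oneline
```

From here, every meaningful change to the analysis gets its own commit, so any earlier state of the notebook can be recovered.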
Try to Use New Variables
Jupyter Notebooks enable working with data in a cell-by-cell structure, which means you can use the variables defined in a previous cell and build on top of those values, but one thing to be careful about is not overwriting those variables. This best practice mainly depends on the problem at hand and may not be applicable when working with a big dataset that would not fit in memory multiple times, but it is a good practice in general.
Consider downloading a CSV file from a remote server in your notebook and storing it in a variable called data_csv. At this point, you want to modify this data and change its structure so that it is more suitable for your analysis. There are two ways you can go about this: you can transform the data and store everything in the data_csv variable again, or you can run all your transformations into another variable, formatted_data maybe. The advantage of the first approach is more efficient memory usage, since further in the analysis you’ll probably need the formatted data, not the raw CSV data. However, it has an important drawback: every time you make a mistake and need to reset the data, you have to re-download the CSV and run all the steps again until you reach the current point in the notebook. If you had followed the second approach, you would only reset the assignment of your formatted_data variable and you would be good to go. No need to download the CSV again. No need to run all the cells from scratch.
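The second approach can be sketched as follows (the CSV content is a made-up stand-in for data downloaded from a remote server):

```python
import csv
import io

# Stand-in for a CSV file downloaded from a remote server (hypothetical data).
raw_text = "city,population\nOslo,700000\nBergen,290000\n"
data_csv = list(csv.DictReader(io.StringIO(raw_text)))

# Transform into a NEW variable instead of overwriting data_csv.
formatted_data = [
    {"city": row["city"], "population": int(row["population"])}
    for row in data_csv
]

# If a transformation turns out to be wrong, just rebuild formatted_data;
# data_csv is still intact, so there is nothing to re-download.
formatted_data = [row for row in formatted_data if row["population"] > 300_000]
print(formatted_data)
```

Since data_csv is never touched after the download, any mistake in the transformation steps only costs a re-run of the cheap cells, not the expensive download.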
Learn Keyboard Shortcuts
Learning your tool always goes a long way in any discipline, and it is no different in data science. In addition to knowing the general capabilities of Jupyter Notebooks, it is also very useful to know the keyboard shortcuts. The more you know, the faster you’ll get at working with notebooks.
In order to see the keyboard shortcuts, use the Help > Keyboard Shortcuts menu item, which will open a popup that displays the shortcuts. In addition to that, you can also change the shortcuts through the Help > Edit Keyboard Shortcuts menu item.
Example from the keyboard shortcuts popup
Document Your Analysis
Notebooks are supposed to be an interactive environment where you can combine code with plain text, and notebook cells allow you to write Markdown as well. This lets you write all the documentation you’d like and render it in a nice text format. While writing the documentation, follow a logical hierarchy supported by proper headings, explanations, code blocks, and examples.
Before render, this is how you write markdown in a cell.
The markdown mode of the cell can be enabled by selecting the Markdown option in the Cell > Cell Type menu.
And this is how it looks when the cell is executed.
Documenting the analysis makes it easy to follow and provides guidance for future modifications.
JupyterLab Instead of Plain Notebooks
JupyterLab is the next-generation web-based interface for Jupyter. In addition to notebooks, it also supports various file types, text editors, terminals, and other custom viewers. Think of it as a notebook environment on steroids: it has a file explorer, allows you to split your work into tabs, and lets you run further analysis and execution right from your browser without leaving your notebook.
In order to run JupyterLab, just run jupyter lab in your terminal, which will launch the Lab interface.
Keep the Notebook Simple
Even though the notebooks themselves are supposed to be full analyses, it is wise to keep utilizing general software development practices, especially in order to manage complexity. The analysis over a certain topic might get quite complex, including various data fetching operations, manipulations, clean-ups, and visualizations, but it is important to keep the notebook clean and organized. Notebooks encourage exploration, but any serious analysis or model needs to be properly tested and organized. Here are some suggestions you may want to follow:
- Keep your code organized with Python modules. Ideally, notebooks should not contain complex logic, and they should be easy to follow. Instead of keeping all the code in the notebook, extract it into modules and use those modules from your notebook.
- Write tests. Writing tests is not an easy habit to gain, but once you get the hang of it, you’ll notice how far it improves the quality of your work. You don’t have to start with tests, but once you have a stable function or module, writing a simple test to cover its behavior will go a long way. It also makes the functionality safe to reuse in the future, including when taking a model live.
- Remove dead code. This dead code may be forgotten print statements, a commented-out version of an old model, or an unused loop. Regardless of what it is, remove the code if it is not relevant anymore. Every line of code in the notebook has a mental overhead for the data scientist to keep track of, which means the less code there is, the less mental load you will have.
- Utilize widely-accepted coding standards like PEP8. Having a standard in place will allow you to navigate easily between various projects of yours, as well as reduce the cognitive load of trying to decide what style to use here and there.
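To sketch the first two suggestions, the notebook-side code can shrink to an import plus a call, while the logic and its test live outside the notebook (the module name, function, and test are made up for illustration; writing the module to disk here just keeps the example self-contained):

```python
import importlib.util
import pathlib

# In practice this would be a separate cleaning.py file next to the notebook;
# here we write it to disk so the example runs on its own.
module_source = '''
def normalize_scores(scores):
    """Scale a list of numbers into the 0-1 range."""
    low, high = min(scores), max(scores)
    return [(s - low) / (high - low) for s in scores]
'''
path = pathlib.Path("cleaning.py")
path.write_text(module_source)

# Notebook cell: just import and call, no complex logic inline.
spec = importlib.util.spec_from_file_location("cleaning", path)
cleaning = importlib.util.module_from_spec(spec)
spec.loader.exec_module(cleaning)

result = cleaning.normalize_scores([10, 20, 30])
print(result)

# A minimal test that would normally live in test_cleaning.py.
assert cleaning.normalize_scores([0, 5, 10]) == [0.0, 0.5, 1.0]
```

In a real project you would simply write `import cleaning` in the notebook and run the tests with a test runner such as pytest; the notebook itself stays short and readable.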
Even though the suggestions here are generic enough to be implemented in many projects, your own style of work will determine which you decide to utilize. Some of them might be useful for you while some of them may slow you down. In general, by utilizing the power of Jupyter Notebooks, you can save yourself a great amount of time to focus more on the core of your business rather than small details.
Guest post: Burak Karakan
Stay up to date with Saturn Cloud on LinkedIn and Twitter.
Further Reading Suggestions:
- Created by Project Jupyter, the Jupyter Notebook is an open-source web application that enables the creation and sharing of documents containing equations, visualizations, narrative text, and live code. The Jupyter Notebook can be used for machine learning, statistical modeling, data visualization, numerical simulation, data cleaning and transformation, and much more.
- Jupyter Notebooks are available for free. JupyterLab, or the classic Jupyter Notebook, can be installed using conda or pip (for example, pip install jupyterlab). Python 3.3 or greater, or Python 2.7, is required for installation.
- In 2014, Project Jupyter was announced as a spin-off project from IPython. The notebook interface was moved to Project Jupyter, along with other language-agnostic aspects of IPython. Before this, notebooks were part of IPython.
- Jupyter Notebook is an IDE that allows for Markdown, the addition of HTML components and media, data cleaning and transformation, numerical simulation, statistical modeling, data visualization, and much more. This IDE is a great platform for beginners and experts alike.
- You can run your Jupyter Notebook locally by opening your terminal, navigating to the directory where you want to save the notebook, and typing the command “jupyter notebook”. This will start a local server.