Jupyter = Julia + Python + R
If you work as a data scientist, you are likely recording your complete analysis process daily, much in the same way other scientists use a lab notebook to record tests, progress, results, and conclusions. What tools are you using to do this? I use Jupyter Notebook every day, so let me introduce it to you briefly.
- What is a Jupyter Notebook?
- Why is it useful for data analysis?
- What are the features of Jupyter Notebook?
- How can we perform simple data analysis for machine learning?
Introduction to Jupyter Notebooks
What is a Jupyter Notebook?
The Jupyter Project¹ is a spin-off of the IPython project, which initially provided an interface only for the Python language and continues to maintain the canonical Python kernel for Jupyter. The name Jupyter itself is derived from the combination of Julia, Python, and R.
Why is it useful?
Project Jupyter exists to develop an open-source platform, open standards, and services for interactive computing across many programming languages such as Python, R, and MATLAB.
Jupyter is available as a web application on a cloud ecosystem from a number of places, such as Saturn Cloud². It can also be used locally over a wide variety of installations which contain live code, equations, figures, interactive apps, and Markdown text.
Features of Jupyter Notebooks
A Jupyter Notebook is fundamentally a JSON file with a number of annotations. There are three main parts of the Notebook:
- Metadata: a data dictionary of definitions used to set up and display the notebook.
- Notebook format: version numbers of the software used to create the notebook. The version number is used for backward compatibility.
- List of cells: there are three different types of cells: markdown (display), code (to execute), and output.
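To make this concrete, here is a rough, hand-written sketch of that JSON skeleton: a notebook with one markdown cell and one code cell, built as a plain Python dictionary (field names follow the nbformat 4 schema; this is an illustration, not output generated by Jupyter itself):

```python
import json

# A minimal sketch of the .ipynb JSON structure: metadata,
# notebook format version numbers, and a list of cells.
minimal_notebook = {
    "metadata": {"kernelspec": {"name": "python3", "display_name": "Python 3"}},
    "nbformat": 4,        # major format version, used for backward compatibility
    "nbformat_minor": 5,  # minor format version
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# My analysis"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["print('hello')"]},
    ],
}

# The file on disk is just this dictionary serialized as JSON.
print(json.dumps(minimal_notebook, indent=1).splitlines()[0])
```

Everything a notebook contains, including cell outputs and figures, lives somewhere in this one JSON document.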
How will we work with Jupyter notebooks?
There are four steps:
- First step: Create a new notebook for data analysis.
- Second step: Add your analysis steps, coding, and output.
- Third step: Surround your analysis with organizational and presentational markdown to communicate an entire story.
- Last step: Interactive notebooks will then be used by others to modify parameters and data to note the effects of their changes.
Getting Jupyter Notebooks with Saturn Cloud
One of the quickest ways to get a Jupyter Notebook is to register an account on Saturn Cloud. It allows you to quickly spin up Jupyter notebooks in the cloud and scale them according to your needs.
- It deploys in your cloud so there’s no need to migrate your data. Use the whole Python ecosystem via Jupyter.
- Easily build environments and import packages (Pandas, NumPy, SciPy, etc).
- You can publish notebooks and easily collaborate on cloud-hosted Jupyter.
- Scalable Dask from laptop to server to cluster.
See further: https://www.saturncloud.io
Can we convert a Jupyter Notebook to a Python script?
Yes, you can convert a Jupyter Notebook to a Python script. This is equivalent to copying and pasting the contents of each code block (cell) into a single .py file. The markdown sections are also included as comments.
The conversion can be done in the command line:
jupyter nbconvert --to=python notebook-name.ipynb
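Conceptually, the conversion is just a walk over the notebook's JSON cell list: code cell sources are concatenated, and markdown sources become comments. A stdlib-only sketch of that idea (using a small in-memory notebook dictionary rather than a real file, so the example is self-contained):

```python
def notebook_to_script(nb: dict) -> str:
    """Sketch of the --to=python idea: join code cells into one
    script, and turn markdown cells into '#' comments."""
    lines = []
    for cell in nb["cells"]:
        source = "".join(cell["source"])
        if cell["cell_type"] == "code":
            lines.append(source)
        elif cell["cell_type"] == "markdown":
            lines.extend("# " + ln for ln in source.splitlines())
    return "\n".join(lines) + "\n"

# A tiny in-memory notebook with one markdown and one code cell.
nb = {"cells": [
    {"cell_type": "markdown", "source": ["## Load data"]},
    {"cell_type": "code", "source": ["x = 1\n", "print(x)"]},
]}
print(notebook_to_script(nb))
```

The real nbconvert tool handles outputs, magics, and many target formats, but the core transformation is this simple.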
An example of using Jupyter Notebooks for ML
Let's assume that you are a doctor evaluating data for ten people and predicting whether somebody could get coronavirus.
We will go step by step to evaluate our algorithm by calculating metrics such as TP, TN, FP, FN, TPR, TNR, PPV, NPV, FPR and ACC. Let us assume that you are familiar with those metrics (if not, read further here⁴).
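For reference, all of these metrics are simple ratios of the four confusion-matrix counts. A plain-Python sketch (no scikit-learn needed), using the same two lists of labels that appear in the walkthrough below:

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives/negatives for binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)

tpr = tp / (tp + fn)           # sensitivity / recall
tnr = tn / (tn + fp)           # specificity
ppv = tp / (tp + fp)           # precision
npv = tn / (tn + fn)           # negative predictive value
fpr = fp / (fp + tn)           # false positive rate
acc = (tp + tn) / len(y_true)  # accuracy

print(tp, tn, fp, fn)  # 5 4 1 0
```

The step-by-step walkthrough below computes the same quantities one at a time with scikit-learn.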
First of all, we create a new Jupyter Notebook file.
You predict six people will get coronavirus.
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
By the end of the season, you find only five people had coronavirus.
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
We create a confusion matrix and display it.
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print('Here is the confusion matrix')
print(confusion_matrix(y_true, y_pred, labels=[1, 0]))
And here is the confusion matrix: TP = 5, TN = 4, FP = 1, FN = 0.
We calculate the percentage of sick people who are correctly identified as having the condition (also called sensitivity).
sensitivity = tp / (tp + fn)
print('The percentage of sick people who are correctly identified as having the condition')
print('Sensitivity : %7.3f %%' % (sensitivity * 100), '\n')
The result shows that you can 100% correctly predict the people who will get coronavirus. That means sensitivity is 100%.
We also calculate the percentage of healthy people who are correctly identified as not having the condition (also called specificity).
specificity = tn / (tn + fp)
print('The percentage of healthy people who are correctly identified as not having the condition')
print('Specificity : %7.3f %%' % (specificity * 100))
The result shows that you can correctly predict at a rate of 80% that people will not get coronavirus, that is, specificity is 80%.
Next, we calculate the precision of this algorithm.
from sklearn.metrics import precision_score

print('The ratio of correctly predicted positive classifications to the total predicted positive classifications.')
print(precision_score(y_true, y_pred, average=None))
The algorithm predicts 'no-coronavirus' cases with 100% precision, but 'coronavirus' cases with only about 83% precision (5 of the 6 positive predictions are correct).
We calculate the probability that records with a negative predicted result truly should be negative (called NPV metric).
npv = tn / (tn + fn)
print('The probability that records with a negative predicted result truly should be negative: %7.3f %%' % (npv * 100))
It shows that NPV = 100%, which is very good.
We calculate the proportion of actual positives that yield negative prediction outcomes with the specific model (also called the miss rate or FNR).
fnr = fn / (fn + tp)
print('The proportion of positives that yield negative prediction outcomes: %7.3f %%' % (fnr * 100))
It shows that FNR = 0%: the model misses none of the patients who actually get coronavirus, which is consistent with the 100% sensitivity above.
Then, we also calculate the false discovery rate (also called FDR).
fdr = fp / (fp + tp)
print('False discovery rate: %7.3f %%' % (fdr * 100))
It shows that nearly 17% of the predicted positives are actually negative, meaning about 1 in 6 positive predictions is wrong.
Finally, we calculate the accuracy (also called ACC), the proportion of all predictions that are correct.
acc = (tp + tn) / (tp + tn + fp + fn)
print('Accuracy: %7.3f %%' % (acc * 100))
This will be reported as 90% accuracy. This is a good outcome for our coronavirus model.
We learned how to get a Jupyter Notebook in the cloud with Saturn Cloud, were exposed to the notebook structure, saw the typical workflow used when developing a notebook, and finally did some simple ML data analysis.
Guest Post: Trung Anh Dang
You may also be interested in: Best Practices for Jupyter Notebooks
- Jupyter homepage: https://jupyter.org
- The Jupyter notebook file: https://github.com/housecricket/notebooks/blob/master/coronavirus.ipynb
- Metrics to Test the Accuracy of Machine Learning Algorithms: https://medium.com/datadriveninvestor/metrics-to-test-the-accuracy-of-machine-learning-algorithms-67adf367f60