As the world works to curb the effects of COVID-19, research and innovative measures depend on insights gained from the right data. Much of the data required to aid these innovations is not available through an Application Programming Interface (API) or in downloadable file formats like ‘.csv’; it can only be accessed as part of a web page. All code snippets can be found here.

Web scraping is a term used to describe the use of a program or algorithm to extract and process large amounts of data from the web. Whether you are a data scientist, an engineer, or anyone else who analyzes large datasets, the ability to scrape data from the web is a useful skill to have.

Worldometers aggregates COVID-19 data from credible sources around the world. In this article, we will learn how to scrape the COVID-19 data depicted below from the site into a Dask dataframe using Python.

COVID-19 cases report from Worldometers

Why Dask dataframe?


Pandas has been one of the most popular data science tools in the Python programming language for data wrangling and analysis. Pandas has its own limitations when it comes to big data, due to its algorithm design and local memory constraints.

Dask, on the other hand, is a free, open-source Python library. Dask provides ways to scale Pandas, Scikit-Learn, and NumPy workflows, improving both performance and scalability. In the context of this article, the dataset is bound to keep growing, which makes Dask an ideal tool to use.

Elements of a web page

Before we delve into web scraping proper, let’s clear up the difference between a web page and a website. A web page is a single document, whereas a website is a collection of web pages served under a common domain. Both are accessed through a browser, which uses DNS to resolve the site’s domain name and HTTP to fetch the individual pages. A website’s content varies from page to page, while each web page contains more specific information.

There are four basic elements of a webpage:

  1. Structure
  2. Function
  3. Content
  4. Aesthetics

The above-listed elements are provided by programmable components: HTML, which contains the main content of the page; CSS, which adds styling to make the page look nicer; and JavaScript (JS) files, which add interactivity to web pages.

When we perform web scraping, we’re interested in the extraction of information from the main content of the web page, which makes a good understanding of HTML important.

HyperText Markup Language (HTML)

Scraping COVID-19 Data to Dask Dataframe Using Python

HTML logo from Logolynx

HyperText Markup Language (HTML) is the language that web pages are written in. HTML isn’t a programming language like Python; instead, it’s a markup language that tells a browser how to lay out content.

Let’s take a quick tour through HTML so we know enough to scrape effectively. HTML consists of elements called tags.


How the HTML code looks
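Since the original code screenshot is not reproduced here, below is a minimal sketch of the kind of markup being described; the page title and link targets are placeholders rather than anything taken from the original example.

<html>
  <head>
    <title>A simple example page</title>
  </head>
  <body>
    <p>
      Here is some text with
      <a href="https://www.worldometers.info/">one link</a> and
      <a href="https://www.python.org/">another link</a>.
    </p>
  </body>
</html>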

This is the basic syntax of an HTML webpage. Every <tag> marks out a block inside the webpage. The head tag contains metadata such as the title of the page, while the visible part of the HTML document sits between the body tags. The p tag defines a paragraph, and any text inside it is shown as a separate paragraph. Lastly, the two a tags in our snippet create hyperlinks, telling the browser to render links to other web pages; the href attribute of each a tag determines where the link goes. For a full list of tags, look here.

Also, HTML tags sometimes come with id and class attributes. The id attribute specifies a unique identifier for an HTML tag, and its value must be unique within the HTML document. The class attribute is used to apply shared styles (via CSS) to all HTML tags with the same class.

Downloading the Web page for web scraping

The first thing we need to do to scrape a web page is to download it. We can download pages using the Python requests library. Before we can use it, we have to install it. On the terminal, the following command installs the library.

pip install requests

The requests library will make a GET request to a web server, which will download the HTML contents of a given web page for us.
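The original snippet is embedded as an image in the source article; a minimal sketch of the request, assuming the Worldometers coronavirus page as the target URL, looks like this:

import requests

# URL of the Worldometers coronavirus report page (assumed target for this sketch)
url = "https://www.worldometers.info/coronavirus/"

# Make a GET request and hold on to the Response object
response = requests.get(url)
print(response.status_code)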

After running our request, we get a Response object. This object has a status_code property, which indicates whether the page was downloaded successfully. A status_code of 200 means the page downloaded successfully; in general, a status code starting with 2 indicates success. To learn more about status codes, link here.

HTML Parsing

Parsing means breaking a structure up into its components according to a set of grammar rules. So ‘HTML parsing’ means taking in HTML code and extracting relevant information from its various tags. A computer program that parses content is called a parser. In this article, we will use the BeautifulSoup library to parse the HTML document we have downloaded and aid proper extraction. Before we can use it, we have to install it. On the terminal, the following command installs the library.

pip install beautifulsoup4

Once the BeautifulSoup package is installed, you can parse the HTML document via a BeautifulSoup object.
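A minimal sketch of the parsing step, assuming the response object from the previous snippet, might look like this (note that the lxml parser itself must be installed separately, e.g. pip install lxml):

from bs4 import BeautifulSoup

# Parse the downloaded HTML into a BeautifulSoup object using the lxml parser
soup = BeautifulSoup(response.text, "lxml")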

Here, the ‘lxml’ parser is used since it copes well with broken HTML and is widely used.

Inspecting HTML Webpage

To adequately extract content from a webpage, we have to inspect the page to identify the relevant attributes and tags. This is done by right-clicking anywhere on the webpage and selecting “Inspect.” In the context of this article, we are looking for the attributes and tags related to the continually updated table of reported coronavirus cases. This is what the result looks like.

HTML inspection

Extraction of Table

<table id="main_table_countries_today" class="table table-bordered table-hover main_table_countries dataTable no-footer" style="width: 100%; margin-top: 0px !important;">

Following the inspection, the id attribute was identified and will be used to filter the HTML document down to the required table element.
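A sketch of that filtering step, using the soup object from earlier, could look like this:

# Filter the parsed document down to the reported-cases table via its id attribute
table = soup.find("table", id="main_table_countries_today")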

Getting text out of the extracted table

It can be observed that the extracted table still has HTML tags embedded in it. The goal of this article is to take a table from the webpage and convert it into a dataframe for easier manipulation in Python. To achieve this, we first gather the desired data (text) row by row into a list, and then convert that list into a dataframe.

It is worth noting that the td, tr, and th tags represent table cells, table rows, and table headers respectively.
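A sketch of the row-wise extraction, building on the table object above, might look like the following (the variable names are illustrative):

# Collect the text of every cell, one list per table row
rows = []
for tr in table.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
    if cells:
        rows.append(cells)

# The first row holds the column headers; keep only data rows that match the header width
headers = rows[0]
data = [row for row in rows[1:] if len(row) == len(headers)]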

Converting to a Dask dataframe

The next step is to convert the list into a Dask dataframe to enable data manipulation and cleaning. As pointed out earlier, since the data grows daily, it is advisable to use a Dask dataframe, which handles big data more efficiently. We will create a pandas dataframe first and then convert it to a Dask dataframe for scalability. The resulting table still requires some formatting to be readable.
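Assuming the headers and data lists from the previous sketch, the conversion could look like this (the number of partitions is arbitrary here):

import pandas as pd
import dask.dataframe as dd

# Build a pandas dataframe from the scraped rows, then convert it to a Dask dataframe
pdf = pd.DataFrame(data, columns=headers)
ddf = dd.from_pandas(pdf, npartitions=2)

print(ddf.head())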

Exporting to CSV for further use

Comma-separated values (CSV) is a widely used file format that stores tabular data (numbers and text) as plain text. Its popularity and viability come from the fact that a great many programs and applications support CSV files.
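A minimal sketch of the export, assuming the ddf dataframe from above (the file name is purely illustrative):

# Write the Dask dataframe out as a single CSV file
ddf.to_csv("worldometers_covid19_data.csv", single_file=True, index=False)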

The resulting CSV, when viewed, shows the table successfully extracted from the Worldometers coronavirus cases report webpage.

I hope this article aids research and innovation by providing a means of scraping the data needed to help curb the COVID-19 pandemic. Thanks for reading and stay safe!

Guest post: Aboze Brain John Jr.

Stay up to date with Saturn Cloud on LinkedIn and Twitter.

You may also be interested in: Advanced Analytics with Dask.