Amid the many attempts to curb the effect of COVID-19 on the world, research and innovation depend on insights gained from the right data. Much of the data required to aid these innovations may not be available via an Application Programming Interface (API) or in file formats like ‘.csv’ waiting to be downloaded; often it can only be accessed as part of a web page. All code snippets can be found here.
Web scraping is a term used to describe the use of a program or algorithm to extract and process large amounts of data from the web. Whether you are a data scientist, an engineer, or anybody who analyzes large datasets, the ability to scrape data from the web is a useful skill to have.
Worldometers provides credible COVID-19 data from around the world. In this article, we will learn how to scrape the COVID-19 data depicted below from the site into a Dask dataframe using Python.
Why Dask dataframe?
Pandas has been one of the most popular data science tools in the Python programming language for data wrangling and analysis. However, Pandas has its limitations when it comes to big data, due to its algorithms and reliance on local memory.
Dask, on the other hand, is a free, open-source Python library. Dask provides ways to scale Pandas, Scikit-Learn, and NumPy workflows in terms of performance and scalability. In the context of this article, the dataset is bound to keep growing, making Dask the ideal tool to use.
Elements of a web page
Before we delve into web scraping proper, let’s clear up the difference between a web page and a website. A web page can be considered a single document, whereas a website is a collection of web pages. Web pages are accessed through a browser, which relies on protocols such as HTTP and DNS to fetch them from the website’s server. The content of a website varies from page to page, while an individual web page contains more specific information.
There are four (4) basic elements of a web page, which are: HTML (the main content of the page), CSS (the page’s styling), JavaScript (interactive behavior), and images (the media displayed on the page).
When we perform web scraping, we’re interested in the extraction of information from the main content of the web page, which makes a good understanding of HTML important.
HyperText Markup Language (HTML)
HyperText Markup Language (HTML) is the language that web pages are created in. HTML isn’t a programming language like Python; instead, it’s a markup language that tells a browser how to lay out content.
Let’s take a quick tour through HTML so we know enough to scrape effectively. HTML consists of elements called tags.
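The snippet below is a minimal reconstruction of the basic page structure discussed here (the exact markup from the original post is not preserved, so the titles and links are illustrative):

```html
<html>
  <head>
    <title>A simple example page</title>
  </head>
  <body>
    <p>
      Here is some simple content for this page.
      <a href="https://www.worldometers.info">Worldometers</a>
      <a href="https://www.python.org">Python</a>
    </p>
  </body>
</html>
```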
This is the basic syntax of an HTML webpage. Each <tag> marks off a block inside the webpage. The head tag contains metadata such as the title of the page, while the visible part of the HTML document sits between the body tags. The p tag defines a paragraph, and any text inside it is shown as a separate paragraph. Lastly, our snippet adds two a tags, which tell the browser to render links to other web pages; the href attribute of the tag determines where the link goes. For a full list of tags, look here.
Also, HTML tags sometimes carry id and class attributes. The id attribute specifies a unique identifier for an HTML tag, and its value must be unique within the HTML document. The class attribute groups tags together so that the same CSS styles can be applied to every tag sharing that class.
Downloading the Web page for web scraping
The first thing we’ll need to do to scrape a web page is to download the page. We can download pages using the Python requests library. Before we can use it, we have to install it. On the terminal, the following command installs the library.
pip install requests
The requests library will make a GET request to a web server, which will download the HTML contents of a given web page for us.
After running our request, we get a Response object. This object has a status_code property, which indicates whether the page was downloaded successfully. A status_code of 200 means the page downloaded successfully; in general, a status code starting with 2 indicates success. To learn more about status codes, see the link here.
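As a sketch, the download step looks like this (the Worldometers URL is assumed from the article’s context, and the site may change or block automated requests):

```python
import requests

# URL of the Worldometers coronavirus page (an assumption based on the
# article's context; the site may change or throttle automated requests)
url = "https://www.worldometers.info/coronavirus/"

response = requests.get(url)

# status_code tells us whether the download succeeded; 200 means OK
print(response.status_code)

# The raw HTML of the page is available as text
page_html = response.text
```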
Parsing simply means breaking a sentence into its structural components under the direction of a grammar. So, ‘HTML parsing’ means taking in HTML code and extracting the relevant information from its various tags. A computer program that parses content is called a parser. In this article, we will be using the ‘BeautifulSoup’ library to parse the HTML document we have downloaded and aid proper extraction. Before we can use it, we have to install it. On the terminal, the following command installs the library.
pip install beautifulsoup4
Once the ‘BeautifulSoup’ package is installed, you can parse the HTML document via the BeautifulSoup object.
Here, the ‘lxml’ parser is used since it copes with broken HTML and is widely adopted.
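A minimal sketch of the parsing step, using a small inline HTML string in place of the downloaded page (the snippet falls back to Python’s built-in parser if lxml is not installed):

```python
from bs4 import BeautifulSoup

# A small HTML sample standing in for the downloaded page content
page_html = """
<html>
  <head><title>Coronavirus Update</title></head>
  <body><p>Reported cases</p></body>
</html>
"""

# 'lxml' tolerates broken HTML; fall back to the built-in parser if absent
try:
    soup = BeautifulSoup(page_html, "lxml")
except Exception:
    soup = BeautifulSoup(page_html, "html.parser")

print(soup.title.get_text())     # the page title
print(soup.find("p").get_text()) # text inside the first p tag
```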
Inspecting HTML Webpage
To adequately extract contents from a webpage, we have to inspect the webpage to identify its attributes and tags. The inspection of the webpage is done by right-clicking anywhere on the webpage and selecting “Inspect.” In the context of this article, we are looking out for attributes and tags related to the updated table on ‘reported coronavirus cases’. This is what the result looks like.
Extraction of Table
<table id="main_table_countries_today" class="table table-bordered table-hover main_table_countries dataTable no-footer" style="width: 100%; margin-top: 0px !important;">
Following the inspection, the ‘id’ attribute was identified and will be used to filter the HTML document down to the required table element.
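A sketch of filtering by that id, using a heavily cut-down stand-in for the real page (the real table has many more rows and columns):

```python
from bs4 import BeautifulSoup

# Cut-down stand-in for the Worldometers page; only the id matters here
page_html = """
<table id="main_table_countries_today" class="table table-bordered">
  <tr><th>Country</th><th>TotalCases</th></tr>
  <tr><td>USA</td><td>1000</td></tr>
</table>
"""
soup = BeautifulSoup(page_html, "html.parser")

# id values are unique within a document, so find() returns the one table
table = soup.find("table", id="main_table_countries_today")
print(table["id"])
```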
Getting text out of the extracted table
It can be observed that the extracted table still has HTML tags embedded in it. The goal of this article is to take a table from the webpage and convert it into a dataframe for easier manipulation in Python. To achieve this, we first get the desired data (text) row by row in list form, and then convert that list into a dataframe.
It is worth noting that the td, tr, and th tags represent table cells, table rows, and table headers respectively.
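Sketching the row-wise extraction on a small stand-in table (headers come from th tags, data cells from td tags, one list per tr):

```python
from bs4 import BeautifulSoup

# Small stand-in for the extracted table
page_html = """
<table id="main_table_countries_today">
  <tr><th>Country</th><th>TotalCases</th></tr>
  <tr><td>USA</td><td>1000</td></tr>
  <tr><td>India</td><td>900</td></tr>
</table>
"""
table = BeautifulSoup(page_html, "html.parser").find(
    "table", id="main_table_countries_today"
)

# th tags hold the headers, tr tags the rows, td tags the data cells
headers = [th.get_text(strip=True) for th in table.find_all("th")]
rows = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has no td cells, so it is skipped
        rows.append(cells)

print(headers)  # ['Country', 'TotalCases']
print(rows)     # [['USA', '1000'], ['India', '900']]
```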
Converting to a Dask dataframe
The next step is to convert the list into a Dask dataframe to enable data manipulation and cleaning. As pointed out earlier, since the data grows daily, it is advisable to use a Dask dataframe, which handles big data more efficiently. We create a pandas dataframe and convert it to a Dask dataframe for scalability. The resulting table requires some formatting to be lucid.
Exporting to csv for further use
Comma-separated values (CSV) files are a widely used format for storing tabular data (numbers and text) as plain text. Their popularity and viability stem from the fact that a great many programs and applications support csv files.
Viewing the resulting ‘csv’ file shows the table successfully extracted from the Worldometers coronavirus cases report webpage.
I hope this article aids in the furthering of research works and innovations with means to scrape data to curb the COVID-19 pandemic. Thanks for reading and stay safe!
Guest post: Aboze Brain John Jr.
You may also be interested in: Advanced Analytics with Dask.