Can areas vulnerable to COVID-19 be measured and mapped?
Can we infer important COVID-19 public health risk factors from outdated data? In many countries, census and other survey data may be incomplete or out of date. The objective of this article is to develop a proof-of-concept for how machine learning can help governments more accurately map COVID-19 risk in 2020 using old data, without requiring a new costly, risky, and time-consuming on-the-ground survey.
This article focuses on the 2011 census, which gives us valuable information for determining who might be most vulnerable to COVID-19 in South Africa. However, the data is nearly 10 years old, and we expect that some key indicators will have changed in that time. Building an up-to-date map showing where the most vulnerable are located will be a key step in responding to the disease. A mapping effort like this requires bringing together many different inputs and tools. For this article, we’re starting small. Can we infer important risk factors from more readily available data?
We will predict the percentage of households that fall into a particularly vulnerable bracket (large households who must leave their homes to fetch water) using 2011 South African census data. With machine learning, it is possible to use easy-to-measure statistics to identify the areas most at risk, even in years when census data is not collected.
About the data
South Africa is divided into 4,392 wards. We will aggregate the target indicators (households with 5+ members and no on-premises water) and the other predictive variables from the census across all the households within each ward to create an aggregated value of each indicator per ward.
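The aggregation described above can be sketched with pandas. This is only an illustration under stated assumptions: the household-level table and its column names (`members`, `water_on_premises`) are hypothetical, and the real census processing may differ.

```python
import pandas as pd

# Hypothetical household-level records; these column names are illustrative only.
households = pd.DataFrame({
    "ward": ["W1", "W1", "W1", "W2", "W2"],
    "members": [6, 3, 5, 2, 7],
    "water_on_premises": [False, True, False, True, True],
})

# A household counts as vulnerable if it has 5+ members and no on-premises water.
households["vulnerable"] = (households["members"] >= 5) & ~households["water_on_premises"]

# Aggregate every household in a ward into one row per ward.
ward_level = households.groupby("ward").agg(
    total_households=("members", "size"),
    total_individuals=("members", "sum"),
    target_pct_vunerable=("vulnerable", lambda s: 100 * s.mean()),
).reset_index()

print(ward_level)
```

Each output row corresponds to one ward, which is the level at which the features listed in this section are reported.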
Some wards are used for training, and the remainder form the test set. External data may not be used for this competition. The features are described below for reference:
- total_households: Total number of households in ward.
- total_individuals: Total number of individuals in ward.
- target_pct_vunerable: Percentage of large households who have to leave their premises for water.
- dw_00: Percentage of dwellings of type: House on a separate stand or yard or on a farm.
- dw_01: Percentage of dwellings of type: Traditional dwellings made of traditional materials.
- dw_02: Percentage of dwellings of type: Flat or apartment in a block of flats.
- dw_03: Percentage of dwellings of type: Cluster house in complex.
- dw_04: Percentage of dwellings of type: Townhouse (semi-detached house in a complex).
- dw_05: Percentage of dwellings of type: Semi-detached house.
- dw_06: Percentage of dwellings of type: House in backyard.
- dw_07: Percentage of dwellings of type: Informal dwelling (shack in backyard).
- dw_08: Percentage of dwellings of type: Informal dwelling (shack not in backyard e.g. in an informal/squatter settlement or on a farm).
- dw_09: Percentage of dwellings of type: Room/flatlet on a property or larger dwelling/servants quarters/granny flat.
- dw_10: Percentage of dwellings of type: Caravan/tent.
- dw_11: Percentage of dwellings of type: Other.
- dw_12: Percentage of dwellings of type: Unspecified.
- dw_13: Percentage of dwellings of type: Not applicable.
- psa_00: Percentage listing present school attendance as: Yes
- psa_01: Percentage listing present school attendance as: No
- psa_02: Percentage listing present school attendance as: Do not know
- psa_03: Percentage listing present school attendance as: Unspecified
- psa_04: Percentage listing present school attendance as: Not applicable
- stv_00: Percentage of households with Satellite TV: Yes
- stv_01: Percentage of households with Satellite TV: No
- car_00: Percentage of households with a car: Yes
- car_01: Percentage of households with a car: No
- lln_00: Percentage listing landline ownership as: Yes
- lln_01: Percentage listing landline ownership as: No
- lan_00: Percentage listing language as: Afrikaans
- lan_01: Percentage listing language as: English
- lan_02: Percentage listing language as: IsiNdebele
- lan_03: Percentage listing language as: IsiXhosa
- lan_04: Percentage listing language as: IsiZulu
- lan_05: Percentage listing language as: Sepedi
- lan_06: Percentage listing language as: Sesotho
- lan_07: Percentage listing language as: Setswana
- lan_08: Percentage listing language as: Sign language
- lan_09: Percentage listing language as: SiSwati
- lan_10: Percentage listing language as: Tshivenda
- lan_11: Percentage listing language as: Xitsonga
- lan_12: Percentage listing language as: Other
- lan_13: Percentage listing language as: Unspecified
- lan_14: Percentage listing language as: Not applicable
- pg_00: Percentage in population group: Black African
- pg_01: Percentage in population group: Coloured
- pg_02: Percentage in population group: Indian or Asian
- pg_03: Percentage in population group: White
- pg_04: Percentage in population group: Other
- lgt_00: Percentage using electricity for lighting
Pandas is one of the most popular data wrangling and analysis libraries, but it has limitations when it comes to big data because it works on data held in local memory. Dask is an open-source, freely available Python library that provides ways to scale pandas, scikit-learn, and NumPy workflows with minimal rewriting.
Which machine learning algorithm should we use?
In principle, the only way to find the best algorithm for a given problem is to try the candidates and compare them.
Trying every possible machine learning algorithm is too time-consuming for this project, so for this article we will use the CatBoost algorithm.
What is CatBoost?
From the CatBoost website:
CatBoost is an algorithm for gradient boosting on decision trees. It is developed by Yandex researchers and engineers and is used for search, recommendation systems, personal assistance, self-driving cars, weather prediction, and many other tasks at Yandex and in other companies, including CERN, Cloudflare, and Careem taxi. It is open-source and can be used by anyone.
The library is laser-focused on:
- Great quality without parameter tuning: Reduce time spent on parameter tuning, because CatBoost provides great results with default parameters.
- Improved accuracy: Reduce overfitting when constructing your models with a novel gradient-boosting scheme.
- Fast predictions: Apply your trained model quickly and efficiently even to latency-critical tasks using CatBoost’s model applier.
- Fast and scalable GPU version: Train your model on a fast implementation of the gradient-boosting algorithm for GPU. Use a multi-card configuration for large datasets.
- Categorical features support: Improve your training results with CatBoost by using non-numeric factors, instead of having to pre-process your data or spend time and effort turning it to numbers.
Conda installation: conda install catboost
PyPI installation: pip install catboost
Both versions of CatBoost have GPU support out-of-the-box.
Here, we start by importing all the Python libraries we will need for this project.
- warnings: Warning control, used here to suppress deprecation warnings from Python packages.
- numpy: Enables scientific computing
- dataframe & pandas: The Dask DataFrame and pandas libraries for data wrangling and analysis, with Dask providing scalability for pandas.
- CatBoostRegressor: The machine learning algorithm employed.
- mean_squared_error: Performance metric used to evaluate the model.
Reading and getting an overview of datasets
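A minimal sketch of this step is shown below. It uses pandas on a tiny in-memory CSV so that it runs anywhere; the rows are invented stand-ins for the real training file, and only a handful of the columns are included. With Dask, `dask.dataframe.read_csv` accepts a file path in the same way as `pandas.read_csv`.

```python
import io

import pandas as pd

# Tiny invented stand-in for the training CSV; the real file has one row per ward.
csv_text = """ward,total_households,total_individuals,target_pct_vunerable,car_00,car_01
W1,100,420,12.5,30.0,70.0
W2,250,900,3.1,55.0,45.0
W3,400,1500,27.8,10.0,90.0
"""
train = pd.read_csv(io.StringIO(csv_text))

print(train.shape)       # (rows, columns)
print(train.dtypes)      # column types
print(train.describe())  # summary statistics for the numeric columns
```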
Preprocessing data and feature engineering
We know that the ‘train’ dataset is made up of 50 features, and we have already given a comprehensive description of them. Some features are redundant, and dropping them simplifies the model building process.
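The article does not list which features were dropped, so the following is only one plausible way to do it. It assumes that constant columns and the "No" half of each Yes/No pair are redundant; the column choices and values are hypothetical.

```python
import pandas as pd

# Invented values for a few of the columns described earlier.
train = pd.DataFrame({
    "car_00": [30.0, 55.0, 10.0],
    "car_01": [70.0, 45.0, 90.0],
    "stv_00": [20.0, 40.0, 5.0],
    "stv_01": [80.0, 60.0, 95.0],
    "dw_13": [0.0, 0.0, 0.0],
})

# Columns with a single unique value carry no information for the model.
constant_cols = [c for c in train.columns if train[c].nunique() == 1]

# ASSUMPTION: in a Yes/No pair the "No" column is the complement of "Yes",
# so keeping both adds nothing. This is not necessarily the author's choice.
complement_cols = ["car_01", "stv_01"]

reduced = train.drop(columns=constant_cols + complement_cols)
print(list(reduced.columns))
```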
Feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques. Based on domain knowledge, three features were created: rich, poor, and household_size. Clustering analysis on total_households and total_individuals was used to develop the cluster feature.
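The exact definitions of these features are not given in the text, so the sketch below fills them in with guesses that are labeled as such. Treating household_size as individuals per household is a natural reading; the formulas for rich and poor, the choice of KMeans, and the number of clusters are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Invented ward-level rows using column names from the feature list.
df = pd.DataFrame({
    "total_households": [100, 250, 400, 80, 3000, 2800],
    "total_individuals": [420, 900, 1500, 310, 9000, 9500],
    "car_00": [30.0, 55.0, 10.0, 25.0, 60.0, 65.0],
    "car_01": [70.0, 45.0, 90.0, 75.0, 40.0, 35.0],
    "stv_00": [20.0, 40.0, 5.0, 15.0, 50.0, 55.0],
    "stv_01": [80.0, 60.0, 95.0, 85.0, 50.0, 45.0],
})

# Average number of people per household in each ward.
df["household_size"] = df["total_individuals"] / df["total_households"]

# ASSUMPTION: "rich" and "poor" are proxied by asset ownership and its absence.
df["rich"] = df[["car_00", "stv_00"]].mean(axis=1)
df["poor"] = df[["car_01", "stv_01"]].mean(axis=1)

# ASSUMPTION: KMeans with 2 clusters on the two size columns.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
df["cluster"] = kmeans.fit_predict(df[["total_households", "total_individuals"]])

print(df[["household_size", "rich", "poor", "cluster"]])
```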
To keep the code DRY (Don’t Repeat Yourself), the train and test data were combined so that the feature engineering steps apply to both data sets. Now that the feature engineering is complete, the combined data is split back into train and test.
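One common way to do this, sketched here with invented rows, is to remember how many training rows there are, concatenate, apply the shared steps once, and slice the result apart again. The variable names are hypothetical.

```python
import pandas as pd

# Invented rows; the test set has no target column.
train = pd.DataFrame({
    "ward": ["W1", "W2"],
    "total_households": [100, 250],
    "total_individuals": [420, 900],
    "target_pct_vunerable": [12.5, 3.1],
})
test = pd.DataFrame({
    "ward": ["W3"],
    "total_households": [400],
    "total_individuals": [1500],
})

n_train = len(train)
combined = pd.concat([train, test], ignore_index=True, sort=False)

# Any feature engineering step is written once and applied to both sets.
combined["household_size"] = combined["total_individuals"] / combined["total_households"]

# Split back: the first n_train rows are train, the rest are test.
train_fe = combined.iloc[:n_train].copy()
test_fe = (
    combined.iloc[n_train:]
    .drop(columns="target_pct_vunerable")
    .reset_index(drop=True)
)
print(train_fe.shape, test_fe.shape)
```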
Without feature scaling, many machine learning algorithms tend to give more weight to features with larger values and less weight to features with smaller values, regardless of their units. Here, we scale the features that have large values with a logarithmic transformation. Then we break the training data into a matrix of features, X, and a target, y.
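A small sketch of the scaling and the feature/target split follows. Which columns were transformed is not stated, so applying the transform to the two count columns is an assumption, as is the use of `np.log1p` (log of 1 + x, which stays defined at zero) rather than a plain logarithm.

```python
import numpy as np
import pandas as pd

# Invented rows using column names from the feature list.
train = pd.DataFrame({
    "total_households": [100, 250, 400],
    "total_individuals": [420, 900, 1500],
    "car_00": [30.0, 55.0, 10.0],
    "target_pct_vunerable": [12.5, 3.1, 27.8],
})

# ASSUMPTION: the raw counts are the "large value" features to compress.
large_cols = ["total_households", "total_individuals"]
train[large_cols] = np.log1p(train[large_cols])

# Separate the feature matrix from the target column.
target = "target_pct_vunerable"
X = train.drop(columns=[target])
y = train[target]

print(X.head())
print(y.head())
```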
The data contains no missing values.
Another transformation that may improve the model is to round all float values to 2 decimal places.
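Both the missing-value check and the rounding are one-liners in pandas. The frame below is invented purely to show the calls.

```python
import pandas as pd

# Invented feature values with more than two decimal places.
X = pd.DataFrame({
    "household_size": [4.2, 3.6, 3.75],
    "car_00": [30.123456, 55.0, 10.98765],
})

# Count missing values per column; all zeros means nothing needs imputing.
missing_per_column = X.isna().sum()
print(missing_per_column)

# Round every float column to 2 decimal places.
X = X.round(2)
print(X)
```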
The CatBoost model is instantiated with already-tuned hyperparameters. Hyperparameter tuning is a very time-consuming procedure and is not covered in this article. To get a more reliable estimate of out-of-sample error and to guard against overfitting, 15-fold cross-validation was used.
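The training loop might look roughly like the sketch below. The hyperparameters are placeholders rather than the author's tuned values, the data is synthetic, and the fallback to scikit-learn's `GradientBoostingRegressor` is an assumption added only so the example runs where CatBoost is not installed.

```python
import numpy as np
from sklearn.model_selection import KFold

try:
    from catboost import CatBoostRegressor

    def make_model():
        # Placeholder hyperparameters, NOT the tuned values used in the article.
        return CatBoostRegressor(
            iterations=50, learning_rate=0.1, depth=4,
            random_seed=0, verbose=0, allow_writing_files=False,
        )
except ImportError:
    from sklearn.ensemble import GradientBoostingRegressor

    def make_model():
        # Stand-in model used only when CatBoost is unavailable.
        return GradientBoostingRegressor(n_estimators=50, random_state=0)

# Synthetic regression data standing in for the ward-level features and target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(150, 4))
y = 0.5 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(0, 1, size=150)

# 15-fold cross-validation: every row is held out for validation exactly once.
kfold = KFold(n_splits=15, shuffle=True, random_state=0)
fold_rmse = []
for train_idx, valid_idx in kfold.split(X):
    model = make_model()
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[valid_idx])
    fold_rmse.append(float(np.sqrt(np.mean((y[valid_idx] - pred) ** 2))))

print(len(fold_rmse), round(float(np.mean(fold_rmse)), 3))
```

Averaging the per-fold RMSE gives a steadier estimate than a single train/validation split, at the cost of training the model 15 times.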
Model evaluation & prediction
The model predictions had an RMSE score of 5.800504378972998. Lower values of RMSE indicate better fit and the model is looking good! Lastly, the model is applied to data it hasn’t seen to make a submission file.
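For reference, RMSE is the square root of the mean squared error, so it is expressed in the same units as the target (percentage points here). The sketch below computes it with NumPy on made-up numbers and writes a submission table; the submission column layout is an assumption, not the competition's documented format. Taking the square root of scikit-learn's `mean_squared_error` gives the same value.

```python
import io

import numpy as np
import pandas as pd

# Made-up true and predicted target values.
y_true = np.array([10.0, 20.0, 30.0])
y_pred = np.array([12.0, 18.0, 33.0])

# RMSE: square the errors, average them, then take the square root.
rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
print(rmse)

# ASSUMPTION: the submission holds a ward identifier and the predicted target.
submission = pd.DataFrame({
    "ward": ["W3", "W4", "W5"],
    "target_pct_vunerable": y_pred,
})

# Written to an in-memory buffer here; pass a file path to save it to disk.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```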
Thanks for reading. I hope this article offers some insight into how machine learning can help curb the pandemic. Stay safe!
Guest post: Aboze Brain John Jnr.