Can areas vulnerable to COVID-19 be measured and mapped?

Can we infer important COVID-19 public health risk factors from outdated data? In many countries, census and other survey data may be incomplete or out of date. The objective of this article is to develop a proof-of-concept for how machine learning can help governments more accurately map COVID-19 risk in 2020 using old data, without requiring a new costly, risky, and time-consuming on-the-ground survey.

This article focuses on the 2011 census, which gives us valuable information for determining who might be most vulnerable to COVID-19 in South Africa. However, the data is nearly 10 years old, and we expect that some key indicators will have changed in that time. Building an up-to-date map showing where the most vulnerable people are located will be a key step in responding to the disease. A mapping effort like this requires bringing together many different inputs and tools. For this article, we’re starting small: can we infer important risk factors from more readily available data?

We will work to predict the percentage of households that fall into a particularly vulnerable bracket — large households who must leave their homes to fetch water — using 2011 South African census data. With machine learning, it is possible to use easy-to-measure stats to identify areas most at risk even in years when census data is not collected.

 

About the data

South Africa is divided into 4,392 wards. We will aggregate the target indicators (households with 5+ members and no on-premises water) and the other predictive variables from the census across all the households within each ward to create an aggregated value of each indicator per ward.
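
To make the aggregation step concrete, here is a minimal sketch in pandas. The household-level table and its column names (ward, household_size, water_on_premises) are hypothetical, since the raw census records are not shown in this article; only the output column names follow the feature list.

```python
import pandas as pd

# Hypothetical household-level records; these column names are illustrative only.
households = pd.DataFrame({
    "ward": ["W1", "W1", "W1", "W2", "W2"],
    "household_size": [6, 3, 5, 2, 7],
    "water_on_premises": [False, True, False, True, True],
})

# A household counts as vulnerable if it has 5+ members and no on-premises water.
households["vulnerable"] = (
    (households["household_size"] >= 5) & ~households["water_on_premises"]
)

# Aggregate to one row per ward.
wards = households.groupby("ward").agg(
    total_households=("vulnerable", "size"),
    total_individuals=("household_size", "sum"),
    target_pct_vunerable=("vulnerable", lambda s: 100 * s.mean()),
)
print(wards)
```

Each ward ends up with a single row of aggregated values, which is the shape of data the rest of the article works with.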

Some wards are used for training, and the remainder form the test set. External data may not be used for this competition. The features are described below for reference:

  1. total_households: Total number of households in ward.
  2. total_individuals: Total number of individuals in ward.
  3. target_pct_vunerable: Percentage of large households who have to leave their premises for water.
  4. dw_00: Percentage of dwellings of type: House on a separate stand or yard or on a farm.
  5. dw_01: Percentage of dwellings of type: Traditional dwellings made of traditional materials.
  6. dw_02: Percentage of dwellings of type: Flat or apartment in a block of flats.
  7. dw_03: Percentage of dwellings of type: Cluster house in complex.
  8. dw_04: Percentage of dwellings of type: Townhouse (semi-detached house in a complex).
  9. dw_05: Percentage of dwellings of type: Semi-detached house.
  10. dw_06: Percentage of dwellings of type: House in backyard.
  11. dw_07: Percentage of dwellings of type: Informal dwelling (shack in backyard).
  12. dw_08: Percentage of dwellings of type: Informal dwelling (shack not in backyard e.g. in an informal/squatter settlement or on a farm).
  13. dw_09: Percentage of dwellings of type: Room/flatlet on a property or larger dwelling/servants quarters/granny flat.
  14. dw_10: Percentage of dwellings of type: Caravan/tent.
  15. dw_11: Percentage of dwellings of type: Other.
  16. dw_12: Percentage of dwellings of type: Unspecified.
  17. dw_13: Percentage of dwellings of type: Not applicable.
  18. psa_00: Percentage listing present school attendance as: Yes
  19. psa_01: Percentage listing present school attendance as: No
  20. psa_02: Percentage listing present school attendance as: Do not know
  21. psa_03: Percentage listing present school attendance as: Unspecified
  22. psa_04: Percentage listing present school attendance as: Not applicable
  23. stv_00: Percentage of households with Satellite TV: Yes
  24. stv_01: Percentage of households with Satellite TV: No
  25. car_00: Percentage of households with a car: Yes
  26. car_01: Percentage of households with a car: No
  27. lln_00: Percentage listing landline ownership as: Yes
  28. lln_01: Percentage listing landline ownership as: No
  29. lan_00: Percentage listing language as: Afrikaans
  30. lan_01: Percentage listing language as: English
  31. lan_02: Percentage listing language as: IsiNdebele
  32. lan_03: Percentage listing language as: IsiXhosa
  33. lan_04: Percentage listing language as: IsiZulu
  34. lan_05: Percentage listing language as: Sepedi
  35. lan_06: Percentage listing language as: Sesotho
  36. lan_07: Percentage listing language as: Setswana
  37. lan_08: Percentage listing language as: Sign language
  38. lan_09: Percentage listing language as: SiSwati
  39. lan_10: Percentage listing language as: Tshivenda
  40. lan_11: Percentage listing language as: Xitsonga
  41. lan_12: Percentage listing language as: Other
  42. lan_13: Percentage listing language as: Unspecified
  43. lan_14: Percentage listing language as: Not applicable
  44. pg_00: Percentage in population group: Black African
  45. pg_01: Percentage in population group: Coloured
  46. pg_02: Percentage in population group: Indian or Asian
  47. pg_03: Percentage in population group: White
  48. pg_04: Percentage in population group: Other
  49. lgt_00: Percentage using electricity for lighting

Why Dask?

Efficient use of compute matters in any data workflow, and Pandas, one of the most popular data wrangling and analysis libraries, has limitations when it comes to big data because of its algorithms and local memory constraints. Dask is an open-source, freely available Python library that provides ways to scale Pandas, Scikit-Learn, and NumPy workflows more natively and with minimal rewriting.

Which machine learning algorithm should we use?

The only way to find the best algorithm for a given problem is to try and test all algorithms.

It is time-costly to try out all possible machine learning algorithms for this project, so for this article, we will be using the CatBoost algorithm.

What is CatBoost?

From the CatBoost website:

CatBoost is an algorithm for gradient boosting on decision trees. It is developed by Yandex researchers and engineers and is used for search, recommendation systems, personal assistance, self-driving cars, weather prediction, and many other tasks at Yandex and in other companies, including CERN, Cloudflare, and Careem taxi. It is open-source and can be used by anyone.

Why CatBoost?

The library is laser-focused on:

  1. Great quality without parameter tuning: Reduce time spent on parameter tuning, because CatBoost provides great results with default parameters.
  2. Improved accuracy: Reduce overfitting when constructing your models with a novel gradient-boosting scheme.
  3. Fast predictions: Apply your trained model quickly and efficiently even to latency-critical tasks using CatBoost’s model applier.
  4. Fast and scalable GPU version: Train your model on a fast implementation of the gradient-boosting algorithm for GPU. Use a multi-card configuration for large datasets.
  5. Categorical features support: Improve your training results with CatBoost by using non-numeric factors, instead of having to pre-process your data or spend time and effort turning it to numbers.

CatBoost Installation

CatBoost can only be installed on 64-bit versions of Python. It has two main dependencies: NumPy and Six. It can be installed with either of the two popular Python package managers, conda or pip.

Conda installation:

conda install catboost

PyPI installation:

pip install catboost

Both versions of CatBoost have GPU support out-of-the-box.

Importing packages

Covid-19 Vulnerability Mapping Using Machine Learning

Here, we start by importing all the Python libraries we will need for this project.

  • warnings: Warning control, used to suppress deprecation warnings from Python packages.
  • numpy: Enables scientific computing.
  • dataframe & pandas: Dask and Pandas libraries for data wrangling and analysis, with Dask providing scalability for Pandas.
  • CatBoostRegressor: The machine learning algorithm employed.
  • mean_squared_error: Performance metric used to evaluate the model.

Reading and getting an overview of datasets
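
A first look at the data usually means loading it and printing its shape, first rows, column types, and summary statistics. The sketch below does this with pandas on an inline CSV so that it runs on its own; the three rows and the ward identifiers are invented, and the competition's actual file names are not assumed here.

```python
import io

import pandas as pd

# Inline CSV standing in for the training file; the rows are invented.
csv_text = """ward,total_households,total_individuals,target_pct_vunerable
W1,1000,4200,12.5
W2,2500,9000,30.1
W3,400,1900,45.0
"""

train = pd.read_csv(io.StringIO(csv_text))

print(train.shape)       # (number of rows, number of columns)
print(train.head())      # first few rows
print(train.dtypes)      # column types
print(train.describe())  # summary statistics for numeric columns
```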

Preprocessing data and feature engineering

We know that the ‘train’ dataset is made up of 50 features, which are described above. Some features are redundant, so dropping them is ideal for the model building process.

Feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques. Based on domain knowledge, three features were created: rich, poor, and household_size. Clustering analysis on total_households and total_individuals was used to develop the cluster feature.
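
The exact definitions of these features are not given in the text, so the following is only a sketch of what such a step could look like. The formulas for rich and poor (simple sums of asset-ownership percentages), the use of k-means, and the choice of 2 clusters are all assumptions made for illustration; household_size as individuals per household is the most natural reading but is also not confirmed.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Tiny illustrative ward table using column names from the feature list.
df = pd.DataFrame({
    "total_households": [1000, 2500, 400, 3200, 900, 2800],
    "total_individuals": [4200, 9000, 1900, 11000, 3500, 9800],
    "car_00": [40.0, 10.0, 5.0, 60.0, 8.0, 55.0],
    "car_01": [60.0, 90.0, 95.0, 40.0, 92.0, 45.0],
    "stv_00": [30.0, 5.0, 2.0, 50.0, 4.0, 45.0],
    "stv_01": [70.0, 95.0, 98.0, 50.0, 96.0, 55.0],
})

# Average number of people per household in each ward.
df["household_size"] = df["total_individuals"] / df["total_households"]

# ASSUMPTION: "rich" and "poor" as simple asset-ownership proxies.
df["rich"] = df["car_00"] + df["stv_00"]
df["poor"] = df["car_01"] + df["stv_01"]

# ASSUMPTION: k-means with 2 clusters groups wards by their size.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
df["cluster"] = kmeans.fit_predict(df[["total_households", "total_individuals"]])

print(df[["household_size", "rich", "poor", "cluster"]])
```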

To keep the code DRY (Don’t Repeat Yourself), the train and test data were combined so that the feature engineering steps apply to both data sets. Now that the feature engineering is done, the combined data is split back into train and test.
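
The combine-then-split pattern can be sketched as follows. The frames are tiny invented examples, and tracking the split by row count is one common approach, not necessarily the one used in the original notebook.

```python
import pandas as pd

train = pd.DataFrame({
    "total_households": [1000, 2500, 400],
    "total_individuals": [4200, 9000, 1900],
    "target_pct_vunerable": [12.5, 30.1, 45.0],
})
test = pd.DataFrame({
    "total_households": [3200, 900],
    "total_individuals": [11000, 3500],
})

# Remember how many rows belong to train so the split can be undone later.
n_train = len(train)

# Stack train on top of test; the target is NaN for the test rows.
combined = pd.concat([train, test], ignore_index=True)

# Any feature engineering applied here affects both sets at once.
combined["household_size"] = (
    combined["total_individuals"] / combined["total_households"]
)

# Split back using the stored row count and drop the empty target from test.
train = combined.iloc[:n_train].copy()
test = (
    combined.iloc[n_train:]
    .drop(columns="target_pct_vunerable")
    .reset_index(drop=True)
)
```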

Data transformation

Without feature scaling, a machine learning algorithm tends to give more weight to features with larger values and less weight to features with smaller values, regardless of their units. Here, we scale the features with large values using logarithmic transformations. Then we split the training data into a matrix of features (x) and a target (y).
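
A minimal sketch of this step is shown below. Which columns were transformed and which logarithm was used are not stated in the text; the choice of the two count columns and of NumPy's log1p (log of 1 + x, which stays finite at zero) are assumptions.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({
    "total_households": [1000.0, 2500.0, 400.0],
    "total_individuals": [4200.0, 9000.0, 1900.0],
    "dw_00": [61.2, 12.0, 88.9],
    "target_pct_vunerable": [12.5, 30.1, 45.0],
})

# ASSUMPTION: the count columns are the "features with large values".
large_value_cols = ["total_households", "total_individuals"]

# Apply log(1 + x) to compress the range of the large-valued columns.
for col in large_value_cols:
    train[col] = np.log1p(train[col])

# Separate the feature matrix X from the target vector y.
X = train.drop(columns="target_pct_vunerable")
y = train["target_pct_vunerable"]
```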

The data contains no missing values.

Another useful transformation that can help the model is rounding all the float values to 2 decimal places.
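
Both the missing-value check and the rounding step are short in pandas. The frame below is an invented stand-in for the feature matrix.

```python
import pandas as pd

X = pd.DataFrame({
    "household_size": [4.2345, 3.6, 4.75123],
    "dw_00": [61.23789, 12.0, 88.88888],
})

# Count missing values per column; all zeros means there is nothing to impute.
missing_per_column = X.isnull().sum()
print(missing_per_column)

# Round every float column to 2 decimal places.
X = X.round(2)
print(X)
```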

Model building

The CatBoost model is instantiated with already-tuned hyperparameters. Hyperparameter tuning is a very time-costly procedure, so the tuning process itself is not covered in this article. To reduce the risk of overfitting, 15 folds were used for cross-validation.

 

 

Model evaluation & prediction

The model’s predictions had an RMSE of 5.800504378972998 (about 5.80). Lower RMSE values indicate a better fit, so the model is looking good! Lastly, the model is applied to data it hasn’t seen to make a submission file.
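
For reference, RMSE is the square root of the mean squared error, so it is expressed in the same units as the target (here, percentage points). The values below are invented purely to show the calculation.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Invented true and predicted values, for illustration only.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

# RMSE = sqrt(mean((y_true - y_pred) ** 2))
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(rmse)
```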

Thanks for reading! I hope this article offers more insight into how the pandemic can be curbed with machine learning. Stay safe!

Guest post:  Aboze Brain John Jnr.
