The world’s datasphere is growing at an exponential rate, making careful analysis essential for extracting insights that guide and drive decisions. The question of ‘where?’ is an integral part of decision making, as almost all data points have a geographic location.
To answer ‘where’ effectively, we need to carry out proper data engineering. Data engineering refers to the planning, preparation, and processing of data to make it more useful for analysis. In this article, we will perform data engineering using ArcGIS Pro, an ArcGIS notebook, and an open-source library — Dask.
Why choose Dask?
Pandas has been one of the most popular data science tools in the Python ecosystem for data wrangling and analysis. However, Pandas has its limitations when it comes to big data, owing to its single-machine algorithms and local memory constraints.
Dask is an open-source and freely available Python library. Dask provides ways to scale Pandas, Scikit-Learn, and NumPy workflows more natively, with minimal rewriting. There are three (3) main Dask user interfaces, namely:
- array,
- bag, and
- dataframe.
We’ll focus mainly on the Dask dataframe in this article. Think of Dask as an extension of Pandas in terms of performance and scalability. What’s even cooler is that you can switch between a Dask dataframe and a Pandas dataframe to do any data transformation and operation on demand. It is a flexible library for distributed parallel computing in Python.
Getting ArcGIS Pro ready
ArcGIS Pro is the latest professional desktop Geographic Information System (GIS) application from Esri. With ArcGIS Pro, you can explore, visualize, and analyze data; create 2D maps and 3D scenes; and share your work.
To commence the data engineering session, we need to perform some preliminary operations in ArcGIS Pro, including:
- Starting a new project and a Jupyter notebook in ArcGIS Pro
- Installing Dask library on ArcGIS Pro
Starting a new project and a Jupyter notebook in ArcGIS Pro: After installing ArcGIS Pro, the following steps will help you start your first project.
Installing the Dask library in ArcGIS Pro: The Python experience has been incorporated into ArcGIS via the conda package manager, which automates the installation of Python libraries and the management of working environments. Let’s clear up the meaning of some terminology.
- Environment: A folder or directory that contains a collection of conda packages.
- Packages: A compressed file that contains Python software.
- Channel: A URL that leads to a repository.
- Repository: A storage location for packages.
The steps required to install Dask in ArcGIS Pro are depicted below:
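If you prefer the Python Command Prompt to the Package Manager interface, the same steps can be sketched as the commands below. These are a sketch, not a definitive recipe: the environment name is a placeholder, and the default ArcGIS Pro environment (`arcgispro-py3`) must be cloned first because it cannot be modified directly.

```shell
# Clone the read-only default environment into an editable one
conda create --clone arcgispro-py3 --name my-dask-env

# Switch to the cloned environment
activate my-dask-env

# Install Dask into the active environment
conda install dask
```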
Data Engineering with Dask
This notebook describes the process of downloading and preparing United States presidential election data. You will address missing values, reformat data types, and restructure the format of a table. The resources of this article can be found here.
Load and prepare data
To download and prepare the election data, you will use ArcPy, the ArcGIS API for Python, and Pandas and Dask dataframes. First, you will import these modules to use them. Then, you will create a variable for the United States county election data and use this variable to read the data into a Dask dataframe.
Cleaning the data
From the preview of the dataset above, it can be observed that ‘state_po’ is simply an abbreviation of the ‘state’ feature. To keep the data clean, we will remove this redundant feature.
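Dropping the redundant column looks the same on a Dask dataframe as it does in Pandas (shown here with Pandas and made-up rows):

```python
import pandas as pd

df = pd.DataFrame({"state": ["Texas", "Alabama"],
                   "state_po": ["TX", "AL"],
                   "candidatevotes": [3400, 1800]})

# 'state_po' duplicates the information in 'state', so drop it
df = df.drop(columns=["state_po"])
print(df.columns.tolist())   # ['state', 'candidatevotes']
```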
The election data includes records that are missing data in the ‘FIPS’, ‘party’, and ‘candidatevotes’ fields. These missing entries are referred to as null values. Once missing values have been identified, there are two common ways to handle them:
- Fill them with a value
- Remove those records from the dataset
The strategy we will employ here is to replace the missing values with valid, representative values. This can be achieved on a Dask dataframe using the fillna method.
The ‘FIPS’ and ‘candidatevotes’ features are both numerical. Since the data is continuous, we could use either the mean or the median to represent the central tendency of each feature. In this case, we will fill the missing values with the mean of those features.
We are left with missing values in the ‘party’ feature. The number of missing values is quite large, so it is critical to make a good choice of what to fill them with. Let’s get an overview of the unique values in the feature. As seen below, these are the parties contesting the election. To keep the dataset unbiased, we will fill the missing values with ‘not recorded’.
Explore and handle data types
In reviewing your data, you notice that the FIPS field is considered a numeric field instead of a string. As a result, leading zeroes in the FIPS values have been removed. The resulting FIPS values only have four characters instead of five. You will determine how many records are missing leading zeroes and add, or append, the missing zero.
Also, fields like year should be integer values rather than float data types.
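Restoring the leading zeros and fixing the data types might look like this (a sketch; after the mean fill, ‘FIPS’ is a float, so it is cast to int before being zero-padded):

```python
import pandas as pd

df = pd.DataFrame({"FIPS": [1001.0, 48001.0, 6001.0],
                   "year": [2016.0, 2016.0, 2012.0]})

# Cast FIPS to int (drops the trailing .0), then to a 5-character,
# zero-padded string so codes like '01001' keep their leading zero
df["FIPS"] = df["FIPS"].astype(int).astype(str).str.zfill(5)

# year should be an integer, not a float
df["year"] = df["year"].astype(int)

print(df["FIPS"].tolist())   # ['01001', '48001', '06001']
```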
Reformat the table structure
Currently, each record in the table corresponds to a candidate and their votes in a county. You need to reformat the table so that each record corresponds to each county, with fields showing the votes for different candidates in that election year.
It is possible to do this using the Pivot Table geoprocessing tool or Excel pivot tables, but Python may make it easier to automate and share.
The animation below illustrates the steps in restructuring the table:
The following code cell performs these steps:
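The restructuring can be sketched with a pivot, as below. The column names are assumptions; note that Dask’s own `pivot_table` requires the pivot column to be categorical, so with data of this size it is often simplest to `.compute()` back to Pandas first:

```python
import pandas as pd

df = pd.DataFrame({
    "year":  [2016, 2016, 2016, 2016],
    "FIPS":  ["48001", "48001", "48003", "48003"],
    "party": ["democrat", "republican", "democrat", "republican"],
    "candidatevotes": [1200, 3400, 800, 2100],
})

# One row per county per year, one column per party
wide = df.pivot_table(index=["year", "FIPS"],
                      columns="party",
                      values="candidatevotes",
                      aggfunc="sum").reset_index()
wide.columns.name = None

print(wide)
```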
Calculate additional columns: Feature Engineering
Here, we will be using the values from the updated table to add additional columns of information, such as the number of votes for a non-major party, the percentage of voters for each party, and so on. Each column is referred to as an attribute of the dataset.
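A sketch of such derived columns; the input column names (‘votes_dem’, ‘votes_gop’, ‘totalvotes’) are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({"FIPS": ["48001", "48003"],
                   "votes_dem": [1200, 800],
                   "votes_gop": [3400, 2100],
                   "totalvotes": [4800, 3000]})

# Votes not cast for either major party
df["votes_other"] = df["totalvotes"] - df["votes_dem"] - df["votes_gop"]

# Share of the vote for each party
for party in ["dem", "gop", "other"]:
    df[f"pct_{party}"] = df[f"votes_{party}"] / df["totalvotes"]

print(df[["FIPS", "votes_other", "pct_dem"]])
```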
Geoenable the data
You will eventually use this data in a spatial analysis. This means that the data needs to include location information to determine where the data is located on a map. You will geo-enable the data, or add locations to the data, using existing geo-enabled county data.
Join the data
You have a dataframe with election data (df) and a spatially-enabled dataframe of the county geometry data (counties_df). You will merge these datasets into one.
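A sketch of the join, assuming both tables share a ‘FIPS’ key; the real counties_df is a spatially enabled dataframe whose ‘SHAPE’ column carries geometry, stubbed out here with placeholders:

```python
import pandas as pd

# Election results keyed by county FIPS code
df = pd.DataFrame({"FIPS": ["48001", "48003"],
                   "totalvotes": [4800, 3000]})

# Stand-in for the county table; 'SHAPE' is a placeholder for geometry
counties_df = pd.DataFrame({"FIPS": ["48001", "48003"],
                            "county": ["Anderson", "Andrews"],
                            "SHAPE": ["<geom>", "<geom>"]})

# Merge the election attributes onto the county records
merged = counties_df.merge(df, on="FIPS", how="left")
print(merged.columns.tolist())
```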
Query and calculate attributes
Because you have the voting-age population for 2016, you can now calculate the average voter participation (voter turnout) for 2016. The dataframe includes records from 2000–2016 but only has a voting-age population for 2016. You will need to create a subset dataframe for 2016 before calculating the voter turnout.
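That subset-then-calculate step could look like this (column names assumed):

```python
import pandas as pd

df = pd.DataFrame({"FIPS": ["48001", "48001", "48003"],
                   "year": [2012, 2016, 2016],
                   "totalvotes": [4500, 4800, 3000],
                   "voting_age_pop": [float("nan"), 40000.0, 25000.0]})

# Keep only 2016, the year with voting-age population data
df_2016 = df[df["year"] == 2016].copy()

# Voter turnout = total votes cast / voting-age population
df_2016["voter_turnout"] = df_2016["totalvotes"] / df_2016["voting_age_pop"]

print(df_2016["voter_turnout"].round(2).tolist())   # [0.12, 0.12]
```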
Validate the data
Before continuing with other data preparation, you should confirm that the output data has been successfully created.
First, you will validate the values for voter turnout. You will remove null values, and because these values represent a fraction (total votes divided by voting-age population), you will confirm that the values range between 0 and 1.
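A sketch of that validation: drop the nulls, then flag any turnout fraction outside the [0, 1] range (the rows here are made up):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"FIPS": ["48001", "48003", "08111"],
                   "voter_turnout": [0.12, np.nan, 1.35]})

# Drop records with no turnout value
df = df.dropna(subset=["voter_turnout"])

# A turnout fraction should fall between 0 and 1; flag anything outside
suspect = df[~df["voter_turnout"].between(0, 1)]
print(suspect["FIPS"].tolist())   # ['08111']
```

Any county that lands in `suspect` needs a closer look before the data is used for analysis.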
Update validated data
After reviewing the Census Bureau voting age population data for 2016, you determined that these counties have a low voting age population with a fairly high margin of error. This may be the reason why these counties have a voter turnout rate higher than 100%.
You will recalculate the voter turnout field for these counties using the upper range of their margin of error:
- San Juan County, Colorado: 574
- Harding County, New Mexico: 562
- Loving County, Texas: 86
- McMullen County, Texas: 566
This information was extracted from here.
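Using the figures above, the correction could be sketched like this. The vote and population values in the stand-in table are made up, and keying on county/state names is an assumption; only the corrected populations come from the list above:

```python
import pandas as pd

df = pd.DataFrame({
    "county": ["San Juan County", "Harding County",
               "Loving County", "McMullen County"],
    "state": ["Colorado", "New Mexico", "Texas", "Texas"],
    "totalvotes": [500, 400, 60, 450],
    "voting_age_pop": [480, 390, 55, 440],
})

# Upper end of the margin of error for each county's voting-age population
corrected_pop = {
    ("San Juan County", "Colorado"): 574,
    ("Harding County", "New Mexico"): 562,
    ("Loving County", "Texas"): 86,
    ("McMullen County", "Texas"): 566,
}

for (county, state), pop in corrected_pop.items():
    mask = (df["county"] == county) & (df["state"] == state)
    df.loc[mask, "voting_age_pop"] = pop

# Recalculate turnout with the corrected populations
df["voter_turnout"] = df["totalvotes"] / df["voting_age_pop"]
print((df["voter_turnout"] <= 1).all())   # True
```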
Convert dataframes to feature classes for Spatial Analysis
You will use the ArcGIS API for Python, imported at the beginning of this script, to export the spatially-enabled dataframe to a feature class.
Note: Executing the following cell may take a few minutes
At the top of the page, click the Data Engineering map tab. Drag the Data Engineering map tab to display as its own window. Review the feature class that was added to the Data Engineering map.
The display color of the layer may vary each time it is added to the map.
I hope this article will make you appreciate the importance of data engineering for spatial analysis in ArcGIS Pro. Thanks for reading.
Guest Post: Aboze Brain John Jr.
You may also be interested in: Deep Learning AI.