The world’s datasphere is growing at an exponential rate, making proper analysis essential for extracting the insights that guide and drive decisions. The question of ‘where?’ is an integral part of decision making, as almost all data points have a geographic location.

To analyze the ‘where’ effectively, we need to carry out proper data engineering. Data engineering refers to the planning, preparation, and processing of data to make it more useful for analysis. In this article, we will perform data engineering using ArcGIS Pro, ArcGIS Notebooks, and an open-source library — Dask.

Why choose Dask?

Pandas has been one of the most popular data science tools in the Python ecosystem for data wrangling and analysis, but it has limitations when it comes to big data, owing to its single-machine algorithms and local-memory constraints.

Dask, an open-source and freely available Python library, addresses this: it scales Pandas, Scikit-Learn, and NumPy workflows natively, with minimal rewriting. Dask has three main user interfaces, namely:

  1. array,
  2. bag, and
  3. dataframe.

We’ll focus mainly on the Dask dataframe in this article. Think of Dask as an extension of Pandas in terms of performance and scalability, and as a flexible library for distributed parallel computing in Python. What’s even cooler is that you can switch between a Dask dataframe and a Pandas dataframe on demand for any data transformation or operation.

Getting ArcGIS Pro ready

ArcGIS Pro is the latest professional desktop Geographic Information System (GIS) application from Esri. With ArcGIS Pro, you can explore, visualize, and analyze data; create 2D maps and 3D scenes; and share your work.

Before we begin the data engineering session, we need to complete some preliminary setup in ArcGIS Pro, including:

  • Starting a new project and a Jupyter notebook in ArcGIS Pro
  • Installing the Dask library in ArcGIS Pro

Starting a new project and a Jupyter notebook in ArcGIS Pro: After installing ArcGIS Pro, the following steps will help you start your first project.

ArcGIS Start page

Creating a Jupyter notebook

Opening the Jupyter notebook

Notebook created

Description of the steps

Installing the Dask library in ArcGIS Pro: The Python experience is incorporated into ArcGIS via the conda package manager, which automates the installation of Python libraries and the management of working environments. Let’s clear up some terminology.

  • Environment: a folder or directory containing a collection of conda packages.
  • Package: a compressed file containing Python software.
  • Channel: a URL pointing to a repository.
  • Repository: a storage location for packages.

The steps required to install Dask in ArcGIS Pro are depicted below:

Elements of the ArcGIS Pro Project Page: Installation Procedures

Elements of the ArcGIS Pro Project Page: Installation Procedures (continued)
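For reference, the GUI steps above correspond to a few commands in the Python Command Prompt that ships with ArcGIS Pro. The clone name below is an assumption; Pro’s default arcgispro-py3 environment is read-only, so Dask is installed into a clone:

```shell
:: Clone the default environment (arcgispro-py3 cannot be modified directly)
conda create --clone arcgispro-py3 --name arcgispro-dask

:: Point ArcGIS Pro at the cloned environment
proswap arcgispro-dask

:: Install Dask from the default channels
conda install dask
```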

Data Engineering with Dask

Photo by Kaleidico on Unsplash

This notebook describes the process of downloading and preparing United States presidential election data. You will address missing values, reformat data types, and restructure the format of a table. The resources for this article can be found here.

Load and prepare data

To download and prepare the election data, you will use ArcPy, the ArcGIS API for Python, and Pandas and Dask dataframes. First, you will import these modules. Then, you will create a variable for the United States county election data and use it to read the data into a Dask dataframe.

https://gist.github.com/codebrain001/4c528353e3299aaa49ade3be9370c117

Cleaning the data

https://gist.github.com/codebrain001/4c528353e3299aaa49ade3be9370c117

From the preview of the dataset above, you can see that ‘state_po’ is simply an abbreviation of the ‘state’ feature. To make the data cleaner, we should remove this redundant feature.
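Dropping the redundant column is a single call. The sketch below uses a toy pandas dataframe; the identical `drop` call works on a Dask dataframe:

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["Texas", "Ohio"],
    "state_po": ["TX", "OH"],   # abbreviation that duplicates 'state'
    "candidatevotes": [50, 60],
})

# Remove the redundant abbreviation column
df = df.drop(columns=["state_po"])
print(list(df.columns))  # ['state', 'candidatevotes']
```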

https://gist.github.com/codebrain001/6acb1ca0a3b58dec1122cff6c315fd1c

The election data includes records that are missing values in the FIPS, party, and candidatevotes fields. These missing entries are referred to as null values. Once identified, there are two common ways to handle features with missing values:

  • Fill them with a value
  • Remove that instance in the dataset

The strategy we will employ here is to replace each missing value with a valid, representative value. This can be done on a Dask dataframe using the fillna method.

The ‘FIPS’ and ‘candidatevotes’ features are both numerical. Since the data is continuous, we could use either the mean or the median to represent the central tendency of each feature; in this case, we will fill the missing values with the mean.
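The mean-fill step can be sketched as follows on toy numbers; a Dask dataframe exposes the same `fillna`/`mean` API:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"candidatevotes": [50.0, np.nan, 70.0]})

# Replace the missing numeric value with the column mean (60.0 here)
df["candidatevotes"] = df["candidatevotes"].fillna(df["candidatevotes"].mean())
print(df["candidatevotes"].tolist())  # [50.0, 60.0, 70.0]
```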

https://gist.github.com/codebrain001/e1c2c976202af531755f8a73531f3d58

We are left with missing values in the party feature. There are quite a few of them, so it is important to choose the fill value carefully. Let’s get an overview of the unique values in the feature; as seen below, these are the parties that contested the election. To keep the dataset unbiased, we will fill the missing values with ‘not recorded’.
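On toy data, inspecting the unique values and filling with a neutral placeholder looks like this (the same calls work on a Dask dataframe):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"party": ["democrat", np.nan, "republican", np.nan]})

# Inspect the distinct values before choosing a fill value
print(df["party"].unique())

# Fill the missing party labels with a neutral placeholder
df["party"] = df["party"].fillna("not recorded")
print(df["party"].tolist())
```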

https://gist.github.com/codebrain001/4563c16a13bd9a3e8a6ed1f1d03ba40a

Explore and handle data types

In reviewing your data, you notice that the FIPS field is stored as a numeric field instead of a string. As a result, leading zeroes in the FIPS values have been dropped, leaving those values with four characters instead of five. You will determine how many records are missing a leading zero and prepend it.
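With the FIPS codes held as strings, counting the short codes and restoring the leading zeroes is a one-liner with `str.zfill` (toy values below):

```python
import pandas as pd

# FIPS codes stored as strings; some have lost their leading zero
df = pd.DataFrame({"FIPS": ["8001", "48301", "1001"]})

# Count the records that are short one character
print((df["FIPS"].str.len() < 5).sum())  # 2

# Left-pad every code to 5 characters with zeroes
df["FIPS"] = df["FIPS"].str.zfill(5)
print(df["FIPS"].tolist())  # ['08001', '48301', '01001']
```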

Also, fields like year should be integer values rather than float data types.

https://gist.github.com/codebrain001/258d186b7f13f3aade997ba550a91522

Reformat the table structure

Currently, each record in the table corresponds to a candidate and their votes in a county. You need to reformat the table so that each record corresponds to each county, with fields showing the votes for different candidates in that election year.
It is possible to do this using the Pivot Table geoprocessing tool or Excel pivot tables, but Python may make it easier to automate and share.
The animation below illustrates the steps in restructuring the table:

The following code cell performs these steps:

https://gist.github.com/codebrain001/77c4cdfdfe057cc661e56056eca35a71
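For readers who cannot open the gist, a minimal pandas sketch of the pivot looks like this (toy numbers; with Dask, `pivot_table` additionally requires the `party` column to be categorical):

```python
import pandas as pd

# One row per candidate per county before the pivot
df = pd.DataFrame({
    "FIPS": ["01001", "01001"],
    "party": ["democrat", "republican"],
    "candidatevotes": [5936, 18172],
})

# Pivot so each county becomes one row, with a vote column per party
wide = df.pivot_table(index="FIPS", columns="party",
                      values="candidatevotes", aggfunc="sum").reset_index()
print(wide)
```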

Calculate additional columns: Feature Engineering

Here, we will use the values from the updated table to add additional columns of information, such as the number of votes for non-major parties and the percentage of voters for each party. Each column is referred to as an attribute of the dataset.
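A sketch of two such derived attributes, on toy numbers with assumed column names:

```python
import pandas as pd

# County-per-row table after the pivot
df = pd.DataFrame({
    "democrat": [5936], "republican": [18172], "totalvotes": [24973],
})

# Votes not cast for either major party
df["other"] = df["totalvotes"] - df["democrat"] - df["republican"]

# Fraction of the total vote for each party
for col in ["democrat", "republican", "other"]:
    df[col + "_pct"] = df[col] / df["totalvotes"]

print(df.round(3))
```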

https://gist.github.com/codebrain001/16f70c7c10280ead807d1090d65909fc

Geoenable the data

You will eventually use this data in a spatial analysis. This means that the data needs to include location information to determine where the data is located on a map. You will geo-enable the data, or add locations to the data, using existing geo-enabled county data.

https://gist.github.com/codebrain001/639e6738eddcdb4d5e204d58c5bb1575

Join the data

You have a dataframe with election data (df) and a spatially-enabled dataframe of the county geometry data (counties_df). You will merge these datasets into one.
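The join itself is a standard merge on the shared FIPS key. The sketch below uses a plain dataframe as a stand-in for counties_df; the real one also carries a geometry column from the county feature class:

```python
import pandas as pd

# Election results keyed by county FIPS code (toy data)
df = pd.DataFrame({"FIPS": ["01001"], "totalvotes": [24973]})

# Stand-in for the spatially enabled county dataframe
counties_df = pd.DataFrame({"FIPS": ["01001"], "county": ["Autauga"]})

# Merge on the shared key to attach county attributes to the results
merged = df.merge(counties_df, on="FIPS", how="inner")
print(merged)
```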

https://gist.github.com/codebrain001/0baaa2a1051a9833717be9ee4a100091

Query and calculate attributes

Because you have the voting age population for 2016, you can now calculate the average voter participation (voter turnout) for 2016. The dataframe includes records from 2010–2016 but only has a voting-age population for 2016. You will need to create a subset dataframe for 2016 before calculating the voter turnout.
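Subsetting to 2016 and computing turnout can be sketched as follows (toy numbers, assumed column names):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "year": [2012, 2016, 2016],
    "totalvotes": [1000, 24973, 8000],
    "voting_age_population": [np.nan, 41000, 10000],
})

# Keep only the 2016 records, which carry a voting-age population
df_2016 = df[df["year"] == 2016].copy()

# Voter turnout = total votes / voting-age population
df_2016["voter_turnout"] = df_2016["totalvotes"] / df_2016["voting_age_population"]
print(df_2016["voter_turnout"].round(3).tolist())  # [0.609, 0.8]
```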

https://gist.github.com/codebrain001/45fcc964a24ad465645fd93e8c71b286

Validate the data

Before continuing with other data preparation, you should confirm that the output data has been successfully created.

First, you will validate the values for voter turnout. You will remove null values, and because these values represent a fraction (total votes divided by voting-age population), you will confirm that the values range between 0 and 1.
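A sketch of that validation step on toy values: drop the nulls, then flag anything outside the 0–1 range for inspection.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"voter_turnout": [0.61, np.nan, 1.2, 0.8]})

# Drop records with no turnout value
df = df.dropna(subset=["voter_turnout"])

# Flag any turnout outside the valid 0-1 range
invalid = df[(df["voter_turnout"] < 0) | (df["voter_turnout"] > 1)]
print(invalid)  # the 1.2 row needs a closer look
```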

https://gist.github.com/codebrain001/b94298a729a0e4521e2ae76de8a70e8b

Update validated data

After reviewing the Census Bureau voting age population data for 2016, you determined that these counties have a low voting age population with a fairly high margin of error. This may be the reason why these counties have a voter turnout rate higher than 100%.

You will recalculate the voter turnout field for these counties using the upper range of their margin of error:

  • San Juan County, Colorado: 574
  • Harding County, New Mexico: 562
  • Loving County, Texas: 86
  • McMullen County, Texas: 566

This information was extracted from here.
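The recalculation can be sketched as a lookup of the corrected populations followed by a re-division. The vote totals below are illustrative only; the population figures are the ones listed above:

```python
import pandas as pd

# Toy subset with the four flagged counties (vote totals illustrative)
df = pd.DataFrame({
    "county": ["San Juan", "Harding", "Loving", "McMullen"],
    "state": ["Colorado", "New Mexico", "Texas", "Texas"],
    "totalvotes": [407, 349, 65, 459],
})

# Upper range of the margin of error for voting-age population
corrected_vap = {"San Juan": 574, "Harding": 562, "Loving": 86, "McMullen": 566}

# Recalculate turnout with the corrected populations
df["voting_age_population"] = df["county"].map(corrected_vap)
df["voter_turnout"] = df["totalvotes"] / df["voting_age_population"]
print((df["voter_turnout"] <= 1).all())  # True
```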

https://gist.github.com/codebrain001/b3dab538aa5574577af81c2f79a2d2ce

Convert dataframes to feature classes for Spatial Analysis

You will use the ArcGIS API for Python, imported at the beginning of this script, to export the spatially-enabled dataframe to a feature class.

Note: Executing the following cell may take a few minutes

https://gist.github.com/codebrain001/3d31194806c182d0b7ad80fc5415e9ad

At the top of the page, click the Data Engineering map tab. Drag the Data Engineering map tab to display as its own window. Review the feature class that was added to the Data Engineering map.


The color of the data will vary every time it is added to the map.

I hope this article will make you appreciate the importance of data engineering for spatial analysis in ArcGIS Pro. Thanks for reading.

Guest Post: Aboze Brain John Jr.

Stay up to date with Saturn Cloud on LinkedIn and Twitter.
