Analyzing big data with Dask? It’s a solution many have found. Read on for an introductory walkthrough of Dask’s capabilities on a big data set.

Data scientists often use Pandas for working with data frames. While Pandas is perfect for small to medium-sized datasets, larger ones are problematic. Working with a large dataset on a local machine can strain the hardware even for the simplest of machine learning tasks. Although Pandas and NumPy are great libraries, they are not always computationally efficient, especially when there are terabytes of data to manipulate. So what can you do to get around this obstacle?

Dask to the Rescue!

Dask is an open-source framework that provides high-level abstractions for NumPy arrays, Pandas data frames, and regular lists, allowing us to run parallel operations that fully harness the power of a multi-core machine. At its core, Dask is a library for delayed task computation that uses directed acyclic graphs (DAGs) to orchestrate parallel computation.
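As a quick illustration of the delayed-computation model, here is a minimal sketch using dask.delayed (inc and add are toy functions made up for this example):

import dask
# Each decorated call records a node in a task graph instead of running immediately
@dask.delayed
def inc(x):
    return x + 1
@dask.delayed
def add(x, y):
    return x + y
# Build the graph lazily: nothing has executed yet
a = inc(1)
b = inc(2)
total = add(a, b)
# The DAG is walked here; the independent tasks a and b can run in parallel
print(total.compute())  # 5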

[Image: Dask’s DAG (Directed Acyclic Graph) in execution]

Dask natively scales Python. It provides high-level collections (Dask Array, Bag, and DataFrame) that scale NumPy and Pandas workflows, enabling them to operate in parallel on large datasets that don’t fit into main memory.
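For instance, a Dask Array breaks a large NumPy-style array into chunks and runs operations chunk by chunk; a minimal sketch (the array sizes below are arbitrary, chosen only for illustration):

import dask.array as da
# A 20,000 x 20,000 array split into 5,000 x 5,000 chunks;
# each chunk is an ordinary NumPy array that fits comfortably in memory
x = da.random.random((20_000, 20_000), chunks=(5_000, 5_000))
# Operations only build a task graph; compute() runs the chunks in parallel
print(x.mean().compute())  # approximately 0.5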

Dask Data Frames

Dask data frames coordinate many Pandas data frames, partitioned along the index. They scale the Pandas workflow, enabling its application to time series forecasting, business intelligence, and general data munging on big data. A Dask data frame has the same API as a Pandas data frame, except that aggregations and apply operations are evaluated lazily and need to be triggered by calling the compute() method. It is advisable to split the data frame into as many partitions as your computer has cores, or a small multiple of that number, so that each partition runs on a different thread and communication between threads remains cheap.
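For example, the partitioning can be inspected and adjusted like this (the file name, column name, and partition count below are placeholders for illustration):

import dask.dataframe as dd
df = dd.read_csv('large_dataset.csv')  # hypothetical file
print(df.npartitions)  # how many Pandas data frames back this Dask data frame
# Repartition to roughly a small multiple of the machine's core count
df = df.repartition(npartitions=8)
# Aggregations stay lazy until compute() is called
print(df['some_numeric_column'].mean().compute())  # hypothetical column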

[Image: a Dask data frame incorporating many Pandas data frames]

Getting Our Hands Dirty!

In this article, we will analyze Yelp’s business review dataset, available on Kaggle. The dataset is 4 GB in size, so analyzing it with a Pandas data frame is not practical: Pandas stores all of the data in main memory, which would crash the system. Instead, we will use a Dask data frame, which virtually divides the dataset into small chunks, making it feasible to analyze the data on a local machine.

Preparing the Data Set

Let’s grab our data for analysis:

import dask.dataframe as dd
# Read the 4 GB CSV lazily, splitting it into chunks of roughly 64 MB each
df = dd.read_csv('yelp_business.csv', blocksize=64000000)
# Keep only the columns we need for the analysis
df = df.iloc[:, [0, 1, 4, 5, 9, 10, 11]]

Note that the read_csv function is pretty similar to the Pandas one, except that here we specify the byte size per chunk. Our data will look like:

[Image: preview of the first rows of the data frame]

Now we have a lazy Dask data frame object, which builds a directed acyclic graph as we apply operations like group-by and aggregation; the graph is only executed when we call the .compute() method. Note that ordering column values with Dask isn’t that easy (after all, the data is read one chunk at a time), so we cannot use the sort_values() method as we would in Pandas. Instead, we use Dask’s nlargest() method and specify the number of top values we’d like to retrieve.
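To make the lazy evaluation concrete, here is a minimal sketch (review_count is one of the columns we kept in our data frame):

# Nothing has been computed yet: top10 is just a recipe, i.e. a task graph
top10 = df.nlargest(10, 'review_count')
print(type(top10))            # a lazy dask.dataframe.DataFrame
# compute() executes the DAG chunk by chunk and returns an ordinary Pandas data frame
print(type(top10.compute()))  # pandas.core.frame.DataFrame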

Most Reviewed Businesses

In this step, we figure out the top businesses by review count and visualize them as a bar plot. Finding the most reviewed businesses gives us an indication of a business’s online presence and popularity; it also points to a solid outreach and marketing strategy.

import matplotlib.pyplot as plt
# Select the top 10 businesses by review count (still a lazy Dask expression)
most_reviewed_business_dask = df[['name', 'review_count', 'city', 'stars']].nlargest(10, 'review_count')
# Trigger the computation, then visualize the result as a horizontal bar plot
most_reviewed_business_dask.compute().set_index('name')\
    .plot(kind='barh', stacked=False, figsize=[10, 8], legend=False)
plt.title('Most Reviewed Businesses')
plt.xlabel('Review Count')
plt.ylabel('Business Name')
plt.show()

[Image: bar plot of the most reviewed businesses]

Cities with the Most Businesses

In this step, we figure out the cities accommodating the most businesses. A high count signals a stable local government with policies favoring the ease of doing business within the city; it also indicates the infrastructure and business ecosystem the city provides for businesses to grow and thrive.

# Count the number of businesses in each city (lazy)
city_business_counts_dask = df[['city', 'business_id']].groupby(['city'])['business_id'].agg('count')
# Visualize the top 20 cities by businesses listed
city_business_counts_dask.nlargest(20).compute()\
    .plot(kind='barh', stacked=False, figsize=[10, 10], colormap='hsv')
plt.title('Top 20 cities by businesses listed')
plt.ylabel('City')
plt.xlabel('Number of Businesses')
plt.show()

[Image: bar plot of the top 20 cities by businesses listed]

Cities with the Most Reviews

In this step, we continue the above analysis: having found the cities with the most businesses, we now rank the top cities by total review count. This analysis hints at the literacy rate of a city, the online presence of its people, and the availability of supporting infrastructure for internet communication.

# Total review count and average star rating per city (lazy)
city_business_reviews_dask = df[['city', 'review_count', 'stars']].groupby(['city'])\
    .agg({'review_count': 'sum', 'stars': 'mean'})
# Visualize the top 20 cities by total review count
city_business_reviews_dask.nlargest(20, 'review_count').compute()\
    .plot(kind='barh', stacked=False, figsize=[10, 10], colormap='PiYG', legend=False)
plt.title('Top 20 cities by reviews')
plt.ylabel('City')
plt.xlabel('Number of Reviews')
plt.show()

[Image: bar plot of the top 20 cities by reviews]

Average Ratings for Most Reviewed Cities

In continuation of the above analysis, we find the average rating for the top five cities by review count. This can serve as a rough measure of the ease of doing business within a particular city.

# Average star rating for the five cities with the most reviews
city_business_reviews_dask.nlargest(5, 'review_count')['stars'].compute()\
    .plot(kind='bar', stacked=False, figsize=[10, 6], colormap='PiYG', legend=False)
plt.title('Average Rating of Cities with the Most Reviews')
plt.xlabel('City')
plt.ylabel('Rating')
plt.show()

[Image: bar plot of average ratings for the five most reviewed cities]

Rating Distribution

In this step, we figure out the rating distribution across the entire Yelp business review dataset and plot it as a histogram.

import seaborn as sns
plt.figure(figsize=(10, 7))
# Compute the stars column into a Pandas Series, then plot its histogram;
# note that distplot is deprecated in newer seaborn releases (histplot replaces it)
sns.distplot(df['stars'].compute(), kde=False, color='green', vertical=False)
plt.xlabel('Rating')
plt.ylabel('Count')
plt.title('Rating Distribution')
plt.show()

[Image: histogram of the rating distribution]

This was a short introduction to analyzing big data with Dask, using the familiar Pandas syntax while harnessing the power of multi-core machines through parallel computation. When you are working with datasets so large that Pandas says TL;DR → go for Dask!


