A Data Scientist’s Guide to Lazy Evaluation with Dask

January 26, 2021

What is Dask?

Why Parallelize?

Delaying Tasks

Example 1

def exponent(x, y):
    '''Define a basic function.'''
    return x ** y

# Function returns result immediately when called
exponent(4, 5)
import dask

@dask.delayed
def lazy_exponent(x, y):
    '''Define a lazily evaluating function.'''
    return x ** y

# Function returns a delayed object, not the computed result
lazy_exponent(4, 5)

# This will now return the computation
lazy_exponent(4, 5).compute()

# Chain delayed objects; nothing is evaluated yet
x = lazy_exponent(4, 5)
y = lazy_exponent(x, 2)
z = x * y
z

# Draw the task graph from left to right (requires graphviz)
z.visualize(rankdir="LR")

z.compute()

1073741824
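To make the idea behind Example 1 concrete, here is a minimal pure-Python sketch of lazy evaluation. It is not how Dask is implemented. The `Lazy` class and the `lazy` decorator are hypothetical names invented for this illustration, and they only mimic the general pattern of recording a task now and running it later.

```python
import operator


class Lazy:
    """Toy stand-in for a delayed object: stores a function and its inputs."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def __mul__(self, other):
        # Multiplying two lazy values records another task instead of running it
        return Lazy(operator.mul, self, other)

    def compute(self):
        # Resolve any upstream Lazy inputs first, then run this task
        resolved = [a.compute() if isinstance(a, Lazy) else a for a in self.args]
        return self.func(*resolved)


def lazy(func):
    """Hypothetical decorator that mimics the idea behind dask.delayed."""
    def wrapper(*args):
        return Lazy(func, *args)
    return wrapper


@lazy
def lazy_exponent(x, y):
    return x ** y


x = lazy_exponent(4, 5)   # nothing runs yet
y = lazy_exponent(x, 2)   # depends on x
z = x * y                 # depends on x and y
result = z.compute()      # walks the recorded tasks and evaluates them
print(result)
```

This toy version recomputes `x` each time it appears as an input. A real task scheduler such as Dask's tracks each task by key, so a shared input only needs to be evaluated once.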

Compute

Persist
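The difference between the two calls is easiest to see side by side. The sketch below reuses the `lazy_exponent` function from Example 1 and assumes the default local scheduler (no distributed cluster). The variable names are illustrative only.

```python
import dask


@dask.delayed
def lazy_exponent(x, y):
    return x ** y


z = lazy_exponent(4, 5) * lazy_exponent(2, 3)

# compute() runs the graph and hands back a plain Python value
value = z.compute()

# persist() also runs the graph, but returns another lazy Dask object
# whose result is already held in memory
persisted = z.persist()
value_from_persisted = persisted.compute()

print(type(value).__name__, value)
print(type(persisted).__name__)
```

In this small case both routes end at the same number. The practical difference shows up with larger collections, where a persisted object can be reused in later steps without repeating the upstream work.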

Distributed Data Objects

Example 3

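import dask
import dask.dataframe as dd

# Build a lazy Dask DataFrame of sample time series data
df = dask.datasets.timeseries()
df

# Filtering and aggregating only extend the task graph
df2 = df[df.y > 0]
df3 = df2.groupby('name').x.std()
df3

# compute() runs the graph and returns a pandas object
computed_df = df3.compute()
type(computed_df)
computed_df.head()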

Example 4

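import dask
import dask.dataframe as dd

df = dask.datasets.timeseries()
df.npartitions

df2 = df[df.y > 0]
df3 = df2.groupby('name').x.std()
print(type(df3))
df3.npartitions

# Spread the aggregated result across three partitions
df4 = df3.repartition(npartitions=3)
df4.npartitions
df4

%%time
# Runs the graph but keeps the result as a Dask object
df4.persist()

%%time
# Runs the graph and returns a pandas object
df4.compute().head()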

Conclusion

Image credit: Manja Vitolic on Unsplash

By Stephanie Kirmer