import pandas as pd import dask.dataframe as dd df = pd.read_csv('2015-01-01.csv') df = dd.read_csv('2015-*-*.csv') df.groupby(df.user_id).value.mean() df.groupby(df.user_id).value.mean().compute()
import numpy as np import dask.array as da f = h5py.File('myfile.hdf5') f = h5py.File('myfile.hdf5') x = np.array(f['/small-data']) x = da.from_array(f['/big-data'], chunks=(1000, 1000)) x - x.mean(axis=1) x - x.mean(axis=1).compute()
import dask.bag as db b = db.read_text('2015-*-*.json.gz').map(json.loads) b.pluck('name').frequencies().topk(10, lambda pair: pair).compute()
from dask import delayed L =  for fn in filenames: # Use for loops to build up computation data = delayed(load)(fn) # Delay execution of function L.append(delayed(process)(data)) # Build connections between variables result = delayed(summarize)(L) result.compute()
from dask.distributed import Client client = Client('scheduler:port') futures =  for fn in filenames: future = client.submit(load, fn) futures.append(future) summary = client.submit(summarize, futures) summary.result()
Dask is convenient on a laptop. It installs trivially with
pip and extends the size of convenient datasets from “fits in memory” to “fits on disk”.
Dask can scale to a cluster of 100s of machines. It is resilient, elastic, data local, and low latency. For more information, see the documentation about the distributed scheduler.
This ease of transition between single-machine to moderate cluster enables users to both start simple and grow when necessary.
Dask represents parallel computations with task graphs. These directed acyclic graphs may have arbitrary structure, which enables both developers and users the freedom to build sophisticated algorithms and to handle messy situations not easily managed by the
map/filter/groupby paradigm common in most data engineering frameworks.
We originally needed this complexity to build complex algorithms for n-dimensional arrays but have found it to be equally valuable when dealing with messy situations in everyday problems.