An Infrastructure That Supports Dask

If you’re one of the 90% of data scientists who use Python, it’s time to meet Dask. Dask is a flexible library for distributed parallel computing in Python. It provides ways to scale Pandas, Scikit-Learn, and NumPy workflows natively, with minimal rewriting.
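To give a feel for what "minimal rewriting" can look like, here is a small sketch that runs the same computation once with NumPy and once with dask.array. The array size, the chunk shape, and the variable names are arbitrary choices made for illustration; they are not recommendations.

```python
import numpy as np
import dask.array as da

# Plain NumPy: the whole array is held in memory and processed at once.
data = np.arange(1_000_000, dtype="float64").reshape(1000, 1000)
numpy_result = (data + data.T).mean(axis=0)

# Dask: the same expression, but the array is split into 250x250 chunks
# (an arbitrary size picked for this sketch) that can be processed in parallel.
# Building the expression is lazy; work happens only when .compute() is called.
lazy = da.from_array(data, chunks=(250, 250))
dask_result = (lazy + lazy.T).mean(axis=0).compute()

# Both paths should agree up to floating-point rounding.
matches = bool(np.allclose(numpy_result, dask_result))
```

The only changes from the NumPy version are the wrapping call that introduces chunks and the final .compute(). On a single machine this runs on local threads; pointing the same code at a cluster is a deployment concern rather than a rewrite.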

Spark is written in Scala with some support for Python and R.

Spark is more focused on traditional business intelligence operations like SQL and lightweight machine learning.

Spark lacks flexibility for more complex algorithms or ad-hoc systems. It is fundamentally an extension of the Map-Shuffle-Reduce paradigm.

Spark does not natively support multi-dimensional arrays (this would be challenging given its computation model).

Spark provides GraphX, a library for graph processing.
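For readers unfamiliar with the Map-Shuffle-Reduce paradigm mentioned above, the following is a toy, single-process sketch of its three phases in plain Python, using a word count. It does not use Spark and is not how Spark is implemented; the function names map_phase, shuffle_phase, and reduce_phase are made up for this illustration.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group together all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each group of values into a single result per key."""
    return {key: sum(values) for key, values in groups.items()}


lines = ["Spark and Dask", "Dask and NumPy"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
# counts == {"spark": 1, "and": 2, "dask": 2, "numpy": 1}
```

In a distributed engine the map and reduce steps run on many workers and the shuffle moves data between them, but the shape of the computation stays the same: one pass of independent work, one regrouping by key, one aggregation.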


Dask is written in Python and interoperates well with C/C++/Fortran/LLVM or other natively compiled code linked through Python.

Dask is a component of the larger Python ecosystem. It couples with and enhances other libraries like NumPy, Pandas, and Scikit-Learn.

Dask supports generic distributed task-graph evaluation: it isn’t limited to what can be done efficiently using Spark’s Map-Shuffle-Reduce paradigm.

Dask lets users implement more sophisticated algorithms and build more complex bespoke systems.

Source: Dask Comparison To Spark
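The graph evaluation attributed to Dask above refers to task graphs: computations described as named tasks with arbitrary dependencies between them. The sketch below is a minimal, sequential evaluator written for this article. It mirrors the dictionary-of-tuples layout that Dask documents for its graphs, but it is not Dask's scheduler, and the evaluate function and the key names are hypothetical.

```python
from operator import add, mul


def evaluate(graph, key, cache=None):
    """Compute `key`, evaluating each dependency at most once.

    A task is a tuple whose first element is a callable; any string
    argument that names another key in the graph is a dependency.
    Anything else in the graph is treated as a literal value.
    """
    if cache is None:
        cache = {}
    if key in cache:
        return cache[key]

    task = graph[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *args = task
        resolved = [
            evaluate(graph, arg, cache)
            if isinstance(arg, str) and arg in graph
            else arg
            for arg in args
        ]
        result = func(*resolved)
    else:
        result = task

    cache[key] = result
    return result


# "sum" feeds both "prod" and "total", so the dependencies form a
# general graph rather than a single map-then-reduce pipeline.
graph = {
    "x": 1,
    "y": 2,
    "sum": (add, "x", "y"),
    "prod": (mul, "sum", "y"),
    "total": (add, "sum", "prod"),
}
total = evaluate(graph, "total")  # (1 + 2) + ((1 + 2) * 2) = 9
```

A real scheduler would also decide which tasks can run at the same time and on which worker; this sketch only shows why a graph of tasks can express dependency patterns that do not fit one map, one shuffle, and one reduce.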

