What is Dask?
Dask is a flexible library for parallel computing in Python, built on a highly optimized distributed graph execution framework. The community has implemented the tools you love (Pandas, NumPy, and Scikit-Learn) on top of this scalable foundation, so you can scale them without having to learn anything new.
Written in Python
Dask is written in Python and interoperates well with C, C++, Fortran, LLVM, and other natively compiled code linked through Python. Spark is written in Scala, with some support for Python and R, but in practice it is a gateway to dealing with a lot of Scala and Java. Some may think this is a good thing. They would be wrong.
Dask is a component of the larger Python ecosystem. It couples with and enhances other libraries like NumPy, Pandas, and Scikit-Learn. Anything you can do from Python is fairly easy to do within Dask. Dask lets you work at scale with the tools you already use.
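To make "the tools you already use" concrete, here is a minimal sketch comparing a NumPy computation with its Dask equivalent. The array size and chunk size are arbitrary values chosen for illustration, and the example assumes `dask` and `numpy` are installed.

```python
import numpy as np
import dask.array as da

# A plain NumPy array (illustrative size, not a recommendation).
x_np = np.arange(1_000_000, dtype="float64")

# Wrap the same data as a Dask array split into 10 chunks of 100,000 elements.
x_da = da.from_array(x_np, chunks=100_000)

# The expression is written the same way for both libraries.
np_result = (x_np ** 2).mean()

# With Dask this only builds a task graph; nothing is computed yet.
lazy_result = (x_da ** 2).mean()

# compute() runs the graph, processing chunks in parallel where possible.
da_result = lazy_result.compute()
```

The two results should agree up to floating-point rounding. The Dask version differs only in how the array is created and in the explicit `compute()` call.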
Spark is more focused on traditional business intelligence operations like SQL and lightweight machine learning. Dask is applied more generally, to both business intelligence applications and a number of scientific applications, including machine learning and linear algebra. Since Dask supports generic distributed graph evaluation, it isn’t limited by what can be done efficiently using Spark’s Map-Shuffle-Reduce paradigm.
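As a rough illustration of generic graph evaluation, the sketch below uses `dask.delayed` to wire ordinary Python functions into an arbitrary dependency graph. The functions `load`, `clean`, and `combine` are hypothetical stand-ins invented for this example, not part of Dask.

```python
from dask import delayed

# Hypothetical pipeline steps: plain Python functions, nothing Dask-specific.
def load(start):
    return list(range(start, start + 3))

def clean(values):
    return [v * 2 for v in values]

def combine(left, right):
    return sum(left) + sum(right)

# Wrapping calls in delayed() records them as graph nodes instead of running them.
a = delayed(load)(0)
b = delayed(load)(10)
cleaned_a = delayed(clean)(a)
cleaned_b = delayed(clean)(b)

# This node depends on both branches; Dask tracks the dependencies for us.
total = delayed(combine)(cleaned_a, cleaned_b)

# Execute the whole graph; independent branches may run in parallel.
result = total.compute()
```

Because the graph is built from arbitrary function calls, its shape does not have to fit a fixed map, shuffle, and reduce sequence.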