Data science and new analytics are here. To kick off the new year, Saturn Cloud conducted a survey completed by over 200 data scientists across several industries (some from the usual suspects, like tech/software companies, but also others from industries less talked about in the context of data science, such as health care companies and non-profits). The goal was to learn more about how professionals in the industry are thinking about data science and analytics.
Many of the questions posed speak to how organizations are solving analytics problems rather than which problems they’re solving. As organizations become more capable of running complex analyses and reaching critical insights, the next-order question is how they can do so most efficiently and effectively.
At a high level, the survey tackled the following topics related to data science teams:
– Who is producing analyses in the organization (data scientists, DevOps, analysts, etc.)
– How much collaboration exists between data scientists and analysts in their organizations (and whether there is an appetite to increase it)
– How proficient are teams in using the tools at their disposal
– What tools are teams using (Python, Jupyter, Airflow, R, TensorFlow etc.)
– How much are teams leveraging cloud computing services (AWS, Azure, Google Cloud Platform, etc.)
– What does workflow automation currently look like? Is there an interest in enhancing automation with a tool like Airflow?
– How is data stored (relational databases, flat files, data warehouses, etc.)
– How are analyses deployed (batch files, dashboards, REST APIs, etc.)
We wanted to share some key findings from the survey results.
The People: A Drive to Collaborate
Of the survey participants, 88% said that data scientists and analysts collaborate in their organizations. Of those who collaborate, one third responded that “they collaborate, but there are pain points,” and only 12% said that “collaboration was seamless.”
When asked if there was a desire for more collaboration between data scientists and analysts, nearly 70% responded yes.
Interestingly, when you layer the results from the two questions (do you collaborate, and do you want more collaboration), the only group that didn’t answer overwhelmingly yes to more collaboration was the group that identified its collaboration as seamless (46% of seamless collaborators wanted more collaboration). The group with the highest share of respondents wanting more collaboration was the one that currently “collaborate[s], but with pain points.”
All of this is to say, groups that have the most difficulty collaborating hope that future work will involve even more collaboration.
The juice must be worth the squeeze. That raises the question of how data scientists and analysts can collaborate more easily. The interest is there, but the gap is in the execution.
The teal circle at the highest point on the visualization above (second from the left) represents the percentage of respondents who say they collaborate with pain points and want to collaborate more: 78% of data scientists who collaborate with pain points want to collaborate more with analysts.
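For readers who want to reproduce this kind of layering on their own survey export, here is a minimal sketch of the calculation. It assumes each response is a pair of (collaboration level, wants more collaboration). The `responses` list and the `share_wanting_more` helper are hypothetical, invented for illustration, and the numbers are not the survey's data.

```python
from collections import defaultdict

# Hypothetical responses, invented for illustration only (NOT the survey data).
# Each record: (how the respondent describes collaboration, wants more of it?)
responses = [
    ("seamless", True),
    ("seamless", False),
    ("pain points", True),
    ("pain points", True),
    ("pain points", False),
    ("no collaboration", True),
]


def share_wanting_more(rows):
    """Return {collaboration level: percent answering yes to more collaboration}."""
    totals = defaultdict(int)
    yes = defaultdict(int)
    for level, wants_more in rows:
        totals[level] += 1
        if wants_more:
            yes[level] += 1
    # Percent of each group that wants more collaboration, rounded to whole numbers.
    return {level: round(100 * yes[level] / totals[level]) for level in totals}


shares = share_wanting_more(responses)
for level, pct in shares.items():
    print(f"{level}: {pct}% want more collaboration")
```

Grouping by the first answer and computing the yes rate within each group is all that "layering" the two questions amounts to here; a dataframe library would do the same thing with a group-by.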
The Software: A Data Scientist’s Toolbox
Python is one thing these data scientists seem to agree on. Over 90% of participants use Python: 74% say they use it more than any other programming language, and 21% say they use it alongside other languages.
The most frequently mentioned IDEs across survey participants were Jupyter and PyCharm (66% and 40%, respectively). IDE is short for Integrated Development Environment. It’s similar to a text editor in that developers can write code there, but it also offers code validation, smart formatting, debugging tools, etc. Different IDEs support different programming languages. For example, PyCharm is an IDE built primarily for Python. Further, there has been a trend towards more robust data science environments like Jupyter Notebook. Jupyter Notebook can serve as an IDE, but it can also be used for education and presentation (and it supports more than 40 programming languages).
In terms of machine learning (ML), Scikit-learn was mentioned most frequently, with 40% of respondents including it in their list of development tools. Scikit-learn is an ML library focused on modeling tasks such as regression and clustering. The second most frequently mentioned ML tool was TensorFlow (25%). StackShare compares the two by listing the reasons developers tend to cite for choosing each tool: developers credit Scikit-learn with “scientific computing” and TensorFlow with “high performance.”
Honorable mentions in ML (based on the highest % of mentions in the survey):
– R – 17%
– Pandas – 15%
– NumPy – 14%
More than half of the respondents had 3 or more development tools or ML packages listed. In the words of one succinct response: “Jupyter. Too many packages to list.”
The Process: Workflow Automation + Complexity
The relevance of increased automation in data pipelines continues to grow. With constrained resources and fixed budgets, data scientists and analysts need to limit labor-intensive and repetitive processes, and our respondents agree: nearly 80% said they would like to add automation to their current pipeline. There may be a desire to develop Natural Language Processing, Artificial Intelligence, and Machine Learning to stay on the cutting edge of analytics, but the impact of data science in an organization is and will continue to be stunted as long as automation is neglected.
Another important consideration in the conversation around pipeline automation is the complexity of the pipeline itself. On a scale from 1 to 10, 54% of respondents rated their current pipeline or workflow, from data acquisition to development to deployment, at a complexity of 7 or higher. When these ratings of perceived pipeline complexity were compared against the desire to add automation, there wasn’t a clear correlation. In fact, the result was consistent across all levels of perceived complexity: the majority of each group still wanted to see increased automation.
That held even at the low end of the scale. Among respondents who rated their pipeline complexity as 1, half would like to add automation, and in every other complexity group half or more said the same.
When asked whether there would be interest in a workflow management product like Airflow, 49% of respondents said yes: either they were interested in it or it was already implemented at their company. Comments from our respondents indicate that the ability to manage increasingly complex workflows and to schedule or trigger tasks is what makes this kind of solution appealing.
On the other end of the spectrum, a recurring reason given for the lack of interest in a workflow automation tool as robust as Airflow was that respondents’ processes didn’t warrant a tool that powerful. (To provide some context on Airflow’s capabilities, Adobe uses Airflow to support its data infrastructure, where one requirement is the ability to run thousands of concurrent workflows.)
So there is a desire to streamline tasks and automate data pipelines for a wide array of data science teams, but now it’s more a question of what tool is the right one for the job based on the company’s analytics needs.
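To make "managing workflows and triggering tasks" a little more concrete, here is a minimal sketch of the core idea behind a workflow manager: tasks declare what they depend on, and the tool works out a valid run order. This is not Airflow and does not use its API. It relies only on Python's standard-library `graphlib` module (Python 3.9+), and the task names and the `run_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline, for illustration only.
# Each task maps to the set of tasks that must finish before it can start.
pipeline = {
    "acquire_data": set(),
    "clean_data": {"acquire_data"},
    "train_model": {"clean_data"},
    "build_dashboard": {"clean_data"},
    "deploy": {"train_model", "build_dashboard"},
}


def run_order(dag):
    """Return one valid execution order that respects every dependency."""
    return list(TopologicalSorter(dag).static_order())


order = run_order(pipeline)
for step, task in enumerate(order, start=1):
    print(f"{step}. {task}")
```

For a team with a handful of dependent steps, something this small (plus a scheduler such as cron) may be enough. Scheduling, retries, monitoring, and running many workflows concurrently are the kinds of needs that push teams toward a dedicated tool.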
Data science is at an inflection point. The insatiable thirst for advanced analytics isn’t going anywhere and data scientists have the skills and tools to tackle some of the most business-critical analytics problems their organizations face today. But now the question is how they’re going to do it.
As data scientists continue to advance analytics in their organizations, they need to tackle arguably more complex challenges than before:
1. They want to collaborate with analysts more, but currently experience friction in this collaboration.
2. They need to carefully select the tools that best suit their needs, but the market is flooded with options to choose from.
3. They want to add automation to their data pipelines, but need to consider where automation will have the biggest impact and, further, how they plan to implement it.
By: Megan Moore