Kaggle conducted a worldwide survey in October 2019 of 19,718 data professionals. Their survey included a variety of questions about data science, machine learning, education and more. Kaggle released the raw survey data and many of their members have analyzed the data. In this post, I will be exploring their survey data to answer the most frequently asked questions ranging from the most common machine learning framework to the most used IDE.
There were over 246 multiple-choice questions in the survey. Since the number of questions is so large, it would not be reasonable to expect people to have answered all of them. In other words, most of our data points are expected to be null.
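As a rough sketch of how one might check how sparse the answers are, the snippet below measures the share of missing cells with pandas. The small table and its column names (`Q1`, `Q2`, `Q3_Part_1`, `Q3_Part_2`) are made up for illustration and are not the real survey data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey responses; values and column names are hypothetical.
responses = pd.DataFrame({
    "Q1": ["25-29", "30-34", "22-24", "25-29"],
    "Q2": ["Male", "Female", "Male", np.nan],
    "Q3_Part_1": ["Python", np.nan, np.nan, np.nan],
    "Q3_Part_2": [np.nan, "R", np.nan, np.nan],
})

# Share of missing answers per column, highest first.
null_share_by_question = responses.isna().mean().sort_values(ascending=False)

# Share of missing cells across the whole table.
overall_null_share = responses.isna().to_numpy().mean()

print(null_share_by_question)
print(f"Overall share of null cells: {overall_null_share:.1%}")
```

On the real data, the same two lines would show which questions were skipped most often.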
It’s important that we understand the demographics of our respondents so that we can better interpret the results.
1 – Gender
Among our respondents, about 81.9% are male and 16.3% are female. This reflects a very acute problem in the world of Data Science, and of technology in general. The gender ratio is heavily skewed towards men.
2 – Country
Different countries tend to have different preferences and trends when it comes to things such as technology stacks, age group and education levels of their data scientists. The big picture of the nationalities of our respondents should be able to give us an idea of country biases if there are any. The data frame below displays the 20 countries with the highest number of respondents.
The United States and India account for the largest shares of respondents and together represent more than one-third of the sample. It is therefore expected that the answers contained in the dataset will be biased towards, and most relevant to, Americans and Indians.
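A ranking like the one described here can be produced with `value_counts`. The sketch below is only an illustration: the `country` column name and the eight toy rows are assumptions, and the resulting share has nothing to do with the real survey figures.

```python
import pandas as pd

# Toy stand-in for the country answers; values and the column name are hypothetical.
responses = pd.DataFrame({
    "country": ["India", "United States", "India", "Brazil",
                "United States", "India", "Japan", "Germany"],
})

# Respondents per country, sorted from most to fewest.
country_counts = responses["country"].value_counts()

# The post looks at the top 20; the toy data only needs the top 2.
top_countries = country_counts.head(2)

# Combined share of all respondents held by those top countries.
top_two_share = top_countries.sum() / len(responses)

print(top_countries)
print(f"Combined share of the top two: {top_two_share:.1%}")
```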
3 – Age
Next, let’s take a look at the distribution of ages of our respondents.
The majority of our respondents are between 22 and 34 years old, with the peak group in the 25-29 age bracket.
4 – Level of Education
What level of formal education have our respondents received? Let us have a look at the data to get our answer.
A large fraction of our respondents have a Master’s Degree, which hints at the importance of formal education in the field of Data Science. Bachelor’s Degree holders come in second. A considerable number of people also have a Doctoral Degree.
5 – Job Title
What are the different job titles that our respondents have?
The majority of our respondents have their job titles as Data Scientist, Student or Software Engineer. Data Analyst and Research Scientist also happen to be frequent occupations. That said, the number of job titles represented in our dataset is fairly diverse with people coming in as Business Analyst, Data Engineer, and Product Manager.
6 – Coding Experience
How experienced are our respondents with writing code in Data Science?
Most of our respondents have less than 5 years of experience in writing code. A considerable chunk has less than 2 years of experience writing code. This might suggest that you don’t have to be an expert coder to break into the field of Data Science.
7 – Machine Learning Experience
How experienced are our respondents using Machine Learning methods?
As you can see, the majority of our respondents have less than 2 years of experience using Machine Learning methods. This means that the survey results will be most relevant to Machine Learning newcomers.
8 – Company Size
For employed professionals, let’s take a look at the size of their employers:
The majority of respondents are employed at companies that are either very small (fewer than 50 employees) or very big (more than 10,000 employees).
9 – Data Science Team Size
For those employed professionals, approximately how many individuals are responsible for data science workloads at their organizations?
The data science teams in which the majority of respondents work are either quite big (>20) or quite small (<2).
10 – ML Maturity
This survey also asks this question: “Does your current employer incorporate machine learning methods into their business?” In other words, we can gauge the level of machine learning maturity at these organizations.
The top 2 answers are: (1) We are exploring Machine Learning methods (and may one day put a model into production) and (2) We recently started using Machine Learning methods (models in production for less than 2 years). The reality is that enterprises are still figuring out the best way to extract business value from Machine Learning.
Frequently Asked Questions
We now have a good idea of the demographics of our population. Let us now answer some of the most common questions asked by beginners in the field of Data Science.
1 – What do data scientists do at work?
Survey Question: Select any activities that make up an important part of your role at work
The 2 most important activities that data scientists do at work are: (1) Analyze and understand data to influence product or business decisions; and (2) Build prototypes to explore applying machine learning to new areas. This shows that most data scientists are actually doing work in data analysis/business intelligence, with additional responsibility of exploring machine learning. Some of the less frequent answers resemble the work of data engineer (build and/or run data infrastructure), machine learning engineer (build and/or run a machine learning service), and even research scientist (do research that advances the state-of-the-art of machine learning).
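Tallying a "select all that apply" question takes one extra step compared with a single-choice question. The sketch below assumes each answer option is stored in its own column, holding the option text when selected and a null otherwise. The column prefix `Q9_Part_` and the toy answers are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a multi-select question; names and values are hypothetical.
responses = pd.DataFrame({
    "Q9_Part_1": ["Analyze data", "Analyze data", np.nan, "Analyze data"],
    "Q9_Part_2": [np.nan, "Build prototypes", "Build prototypes", np.nan],
    "Q9_Part_3": [np.nan, np.nan, np.nan, "Do research"],
})

option_columns = [c for c in responses.columns if c.startswith("Q9_Part_")]

# Use the option text as the label; fall back to the column name
# if nobody selected that option.
labels = {
    c: responses[c].dropna().iloc[0] if responses[c].notna().any() else c
    for c in option_columns
}

# Non-null cells per option column equal the number of selections.
selection_counts = (
    responses[option_columns]
    .count()
    .rename(index=labels)
    .sort_values(ascending=False)
)

print(selection_counts)
```

Because respondents can pick several options, these counts can add up to more than the number of respondents.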
2 – Which media sources should I follow?
Survey Question: Who/what are your favorite media sources that report on data science topics?
The 3 media sources that respondents favor the most are (1) Kaggle, (2) Blogs, and (3) YouTube. Note that these results are biased towards the Kaggle community. I personally would put blogs, podcasts, and journal publications on my top 3 list.
3 – Where should I learn data science?
Survey Question: On which platforms have you begun or completed data science courses?
Coursera stands out at the top (most probably because of Andrew Ng’s popular Machine Learning course), followed by Kaggle courses, Udemy, and traditional university classes. In this list, I am quite surprised that Fast.ai was not mentioned that often, considering that Jeremy Howard and Rachel Thomas are true influencers in the field.
4 – Which IDEs should I use to write code?
Survey Question: Which of the following IDEs do you use on a regular basis?
No brainer here. Jupyter Notebook remains the favorite place where data scientists write their code. The reality is that Jupyter is a great place to run experiments and build model prototypes, but it is not so good for writing production-level code. I would recommend checking out PyCharm, Visual Studio, or Spyder for full-fledged IDEs that support good software engineering practices.
5 – Which notebook products should I use?
Survey Question: Which of the following hosted notebook products do you use on a regular basis?
The majority of people do not use any specific notebook products. For those who do, Kaggle Kernels and Google Colab are the most popular, probably because both give users free cloud GPUs. I can vouch for the usage of Google Colab for its Jupyter-like interface/features and sharing capability.
6 – Which languages should I learn?
Survey Question: Which programming languages do you use on a regular basis?
Enough said! Learn Python (although if you want to pass interviews, most likely you will need to write SQL queries).
7 – How should I visualize data?
Survey Question: What data visualization libraries or tools do you use on a regular basis?
Not too surprising here. Matplotlib and Seaborn are the 2 most used data visualization libraries. R users have ggplot2 to brag about.
8 – Which algorithms do I need to master?
Survey Question: Which of the following ML algorithms do you use on a regular basis?
Machine learning in the industry favors simple solutions. That explains the common use of linear/logistic regression and decision trees/random forests. Slightly more advanced algorithms that are starting to be adopted include Gradient Boosting Machines and ConvNets.
9 – How about machine learning tools?
Survey Question: Which categories of ML tools do you use on a regular basis?
Most respondents do not use any dedicated machine learning tools for a specific part of the model workflow. For those who do, most use automated model selection – choosing the appropriate models given a specific dataset.
10 – Which machine learning frameworks should I familiarize myself with?
Survey Question: Which of the following ML frameworks do you use on a regular basis?
Scikit-learn remains the most popular machine learning framework. In the second and third positions are TensorFlow and Keras. I’m surprised XGBoost is not higher, given its popularity in Kaggle competitions. Personally, I’d recommend learning PyTorch if you want to get closer to the research side of machine learning.
11 – Which cloud computing platforms should I explore?
Survey Question: Which of the following cloud computing platforms do you use on a regular basis?
Most respondents either do not use any cloud computing platforms or use AWS, GCP, and Azure. This confirms the dominant role that big tech (Amazon, Google, Microsoft) plays in cloud computing services.
12 – How about specific cloud computing products?
Survey Question: Which specific cloud computing products do you use on a regular basis?
Ignoring the people who answered None, AWS EC2 and Google Compute Engine are the most used cloud computing products. AWS Lambda and Azure Virtual Machines follow suit.
13 – Which big data analytics products are popular?
Survey Question: Which specific big data/analytics products do you use on a regular basis?
An overwhelming number of respondents answered None, showing a lack of familiarity with big data and analytics products. Among those who do use such products, Google BigQuery is the most common answer, followed by Databricks, AWS Redshift, and Google Cloud Dataflow.
14 – How about machine learning products?
Survey Question: Which of the following ML products do you use on a regular basis?
Most people also do not use any specific machine learning products (you can think of these as off-the-shelf services/platforms for an enterprise that wants to do machine learning without building in-house). For those who do, the more popular options include Google Cloud Machine Learning Engine, Azure Machine Learning Studio, and Amazon SageMaker.
15 – Which automated machine learning tools are gaining traction?
Survey Question: Which automated ML tools do you use on a regular basis?
Despite the hype about AutoML in the last year, most people do not use these tools on a regular basis at work. I think this space is still green, with newcomers such as H2O, Databricks, and DataRobot providing automated ML solutions; but it will take time to see how the market responds.
16 – Which relational databases are the favorites?
Survey Question: Which of the following relational database products do you use on a regular basis?
This is a pertinent question since most companies in the industry use relational databases for data storage. The top 3 answers are MySQL, PostgreSQL, and Microsoft SQL Server. I can’t comment much on this point since I only have experience with PostgreSQL.
If you decided to skip all of that above and go to the end, here are the most important points:
– The most common task at work is analyzing and understanding data to influence product/business decisions.
– Kaggle, blogs, and YouTube are the favorite media sources for the Kaggle community.
– Most popular MOOCs are Coursera, Kaggle Courses, and Udemy.
– Learn Python.
– Visualize your data with Matplotlib and Seaborn.
– Use Jupyter to build a prototype, but try a standard IDE like PyCharm or Visual Studio to write production-level code.
– Use Kaggle Kernels or Google Colab for their free GPUs.
– Machine learning in the industry favors simple solutions such as linear models and tree-based models.
– Scikit-Learn is the most used machine learning framework.
– Amazon Web Services is the most used cloud computing platform.
– MySQL is the most popular relational database.
– AutoML has not yet been adopted widely.
– Big data/analytics products and specific machine learning tools are not commonly used.
By: James Le