Medium Articles Analysis

by lksfr

medium/Medium.ipynb

Analyzing the Success of Data Science Articles on Medium

This notebook contains the code used to create the visualizations in Lukas Frei's article "How to Write a Successful Data Science Article on Medium".

Importing dependencies

%matplotlib inline
import scipy.stats as stats
import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt
import seaborn as sns
import re
plt.style.use('seaborn')

Creating a single DataFrame

The data for this article was retrieved with a web scraping script written with Selenium. Unfortunately, the script ran into errors several times, leaving several separate CSV files that need to be combined before beginning to analyze the data.

df = pd.read_csv('medium.csv', header=None)
df1 = pd.read_csv('medium2.csv', header=None)
df2 = pd.read_csv('medium3.csv', header=None)
df3 = pd.read_csv('medium4.csv', header=None)
df4 = pd.read_csv('medium5.csv', header=None)
df5 = pd.read_csv('medium6.csv', header=None)
df6 = pd.read_csv('medium7.csv', header=None)
df7 = pd.read_csv('medium8.csv', header=None)
df8 = pd.read_csv('medium9.csv', header=None)

df = pd.concat([df, df1, df2, df3, df4, df5, df6, df7, df8])

df.columns = ['title', 'text', 'tags', 'date', 'readtime', 'claps']
print('This dataframe has {} columns and {} rows.'.format(df.shape[1], df.shape[0]))
This dataframe has 6 columns and 882 rows.
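A more compact way to assemble the parts, sketched under the assumption that every file shares the same six header-less columns. The helper name `load_parts` and the demo file names are hypothetical. Passing `ignore_index=True` is a choice made in this sketch, whereas the notebook keeps each file's own row labels until it resets the index later.

```python
import os
import tempfile

import pandas as pd

COLUMNS = ['title', 'text', 'tags', 'date', 'readtime', 'claps']


def load_parts(paths):
    """Read header-less CSV parts and stack them into one DataFrame."""
    frames = [pd.read_csv(path, header=None, names=COLUMNS) for path in paths]
    # ignore_index=True builds a fresh 0..n-1 index instead of repeating each file's labels
    return pd.concat(frames, ignore_index=True)


# Demo with two tiny throwaway files (made-up rows, not the scraped data).
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    rows = ['a,b,c,d,1 min read,0', 'e,f,g,h,2 min read,5 claps']
    for i, row in enumerate(rows):
        path = os.path.join(tmp, 'part{}.csv'.format(i))
        with open(path, 'w') as fh:
            fh.write(row + '\n')
        paths.append(path)
    combined = load_parts(paths)

print(combined.shape)
```

With the real files, the call would look like `load_parts(['medium.csv', 'medium2.csv', ...])`, listing the nine file names used in the notebook.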
df.sample(10)
title text tags date readtime claps
109 DeepXmas: AI knows if you are naughty or nice ['Who did this! I say as I look at the eggs, b... ['Artificial Intelligence', 'Christmas', 'Futu... 2018-12-25T14:22:41.334Z 4 min read 14 claps
41 Deciphering Neural Networks ['Artificial Intelligence (AI) is the eye of t... ['Artificial Intelligence', 'Machine Learning'... 2018-12-26T18:01:01.432Z 3 min read 0
198 The easy way to handle large files in pandas ['I have been using a high-performance laptop ... ['Programming', 'Data Science', 'Python Progra... 2018-12-31T10:32:44.397Z 4 min read 7 claps
9 Beneficios de la visualización de datos para l... ['Ya hemos hablado de los datos y la importanc... ['Minority Report', 'Data Visualization', 'Vis... 2018-12-19T23:31:17.819Z 5 min read 12 claps
261 How I became a Data Scientist ['My path to data science was an unusual one. ... ['Data Science', 'Data Scientist', 'Job Interv... 2018-12-30T00:58:02.590Z 2 min read 0
94 10 Christmas Movie Lessons In Analytics ['This is short Christmas article full of less... ['Christmas', 'Movies', 'Analytics', 'Data Sci... 2018-12-25T18:36:31.004Z 4 min read 361 claps
133 Intro to Neural-Networks with Tensorflow ['Intro to Deep Neural-Networks with Tensorflo... [] 2018-12-25T02:05:49.078Z 10 min read 0
0 Exploring Relationship between “Quality of Lif... ['Last year, I did a small piece of analysis o... ['Data Science', 'Machine Learning', 'Sydney',... 2018-12-27T11:28:48.190Z 5 min read 0
293 Grinch is the new black! ['xHamster Reports a 10,000% Surge in “The Gri... ['Christmas', 'Data Science', 'Data Visualizat... 2018-12-29T06:57:00.978Z 1 min read 0
7 A Quick Start of Time Series Forecasting with ... ['2. The Prophet Forecasting Model', '3. Case ... ['Data Science', 'Python3', 'Forecasting', 'Ti... 2019-01-02T22:50:02.794Z 9 min read 0
df.describe()
title text tags date readtime claps
count 839 882 882 882 882 882
unique 718 774 683 768 22 162
top Data Science ['Discretisation is the process of transformin... [] 2018-12-24T00:57:06.589Z 3 min read 0
freq 9 6 56 6 138 352

Data Preparation

df.claps = df.claps.str.replace('claps', '')
df.claps = df.claps.str.replace('clap', '')
all_ = list(df.claps)
for el in all_:
    if re.search(r'\.', el):
        print(el)
4.6K 
1.96K 
3.1K 
3.1K 
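#note: el is only a loop variable, so the replacements below are never written
#back to all_ and the claps values are unchanged after this cell
all_ = list(df.claps)
for el in all_:
    if re.search(r'\.', el):
        if len(el) == 4:
            el = el.replace('.', '')
            el = el.replace('K', '00')
            print(el)
        elif len(el) == 5:
            el = el.replace('.', '')
            el = el.replace('K', '0')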

        
series = pd.Series(all_)
df.reset_index(drop=True, inplace=True)
df.drop('claps', axis=1, inplace=True)
df = pd.concat([df, series.rename('claps')], axis=1)
df
title text tags date readtime claps
0 NaN ['This is incredibly moving, thanks for sharin... ['Data Science', 'Inspiration'] 2019-01-03T01:49:13.884Z 1 min read 1
1 A Lesson on Modern Classification Models ['In machine learning, classification problems... ['Machine Learning', 'Deep Learning', 'Artific... 2019-01-03T01:46:42.705Z 5 min read 5
2 Create reports quickly and increase your busin... ['Sometimes we need to create reports about ou... ['Data', 'Data Science', 'Database', 'Report',... 2019-01-03T01:39:32.072Z 2 min read 0
3 Where To Find Publically Available Genomics Da... ['Gene Expression Omnibus', 'GEO is a public f... ['Deep Learning', 'Genomics', 'Data Science'] 2019-01-03T01:11:44.234Z 3 min read 0
4 3. The Google File System ['The paper we will be studying today is impor... ['Big Data', 'Research', 'Data Science', 'Goog... 2019-01-03T00:23:11.086Z 2 min read 0
5 Doing Data the Right Way ['The notion of ethics in Data Science first o... ['Towards Data Science', 'Data Science', 'Ethi... 2019-01-02T23:11:45.150Z 10 min read 105
6 The final step of a Data Pipeline ['While the Bard may have been right when it c... ['Data Science', 'Data', 'Organizational Cultu... 2019-01-02T23:01:00.845Z 9 min read 0
7 A Quick Start of Time Series Forecasting with ... ['2. The Prophet Forecasting Model', '3. Case ... ['Data Science', 'Python3', 'Forecasting', 'Ti... 2019-01-02T22:50:02.794Z 9 min read 0
8 Review Rating Prediction: A Combined Approach ['The rise in E\u200a—\u200acommerce, has brou... ['Data Science', 'NLP', 'Recommender Systems'] 2019-01-02T22:48:28.648Z 8 min read 6
9 Data Science for my Grandmas ['It’s the beginning of the year and like mill... ['Data Science'] 2019-01-02T22:41:59.881Z 1 min read 0
10 Having confidence in confidence intervals ['When epidemiologists teach precision & accur... ['Data Science', 'Epidemiology', 'Causal Infer... 2019-01-02T22:18:53.288Z 2 min read 0
11 Guarantee Yourself a Data Science Job with thi... ['The data scientist career path is on the ris... ['Data Science', 'Bootcamp', 'Data Scientist',... 2019-01-02T21:55:14.559Z 2 min read 0
12 5 Must-have skills in Python for every Data Sc... ['If you are a data scientist or want to learn... ['Data Science', 'Python', 'Data Scientist', '... 2019-01-02T21:41:20.349Z 2 min read 0
13 Introduction To Neural Networks ['Hi, Everybody. So this post is about one of ... ['Machine Learning', 'Data Science', 'Neural N... 2019-01-02T20:33:29.794Z 5 min read 0
14 Could data costs kill your AI startup? ['This article originally appeared in Ventureb... ['Startup', 'Cost', 'Data', 'Data Science'] 2019-01-02T20:11:40.417Z 7 min read 0
15 Performance Measurement For Classification ['Hi, Everybody. So this is my second post on ... ['Machine Learning', 'Classification', 'Perfor... 2019-01-02T20:00:32.553Z 6 min read 0
16 Kubeflow in 2018: A year in perspective ['The Kubeflow Product Management Team', 'Just... ['Machine Learning', 'Kubernetes', 'Data Scien... 2019-01-02T19:55:38.074Z 4 min read 2
17 Analyze your Gmail Data ['I have been bombrded with high number of pro... ['Email Marketing', 'Data Analysis', 'Data Vis... 2019-01-02T19:51:24.471Z 3 min read 0
18 Web Scraping using Python-Part 2: Writing the ... ['In the previous blog, we learnt about the ba... ['Data Science', 'Web Scraping', 'Python', 'Be... 2019-01-02T19:45:04.588Z 6 min read 0
19 Easily visualize the correlation of your portf... ['As a trader, I’m really stoked about Alpaca.... ['Data Science', 'Open Source', 'Stock Market'... 2019-01-02T19:43:57.772Z 6 min read 62
20 5 storylines that shaped urban innovation in 2018 ['For city hall changemakers, 2018 was a big y... ['Cities', 'Innovation', 'Leadership', 'Data S... 2019-01-02T19:30:38.480Z 7 min read 0
21 Titanic: Machine Learning From Disaster ['We all know about the Titanic Shipwreck, the... ['Data Science', 'Classification', 'Data Explo... 2019-01-02T19:17:00.341Z 6 min read 0
22 Detecting academics’ major from facial images ['A few months ago I read a paper with the tit... ['Machine Learning', 'Deep Learning', 'Python'... 2019-01-02T19:02:50.602Z 11 min read 20
23 New Approaches Apply Deep Learning to Recommen... ['Oliver Gindele is Head of Machine Learning a... ['Machine Learning', 'Deep Learning', 'Recomme... 2019-01-02T19:01:00.734Z 3 min read 1
24 From summer fellow to full-time employee ['Department of Innovation and Technology (DoI... ['Government', 'Data', 'Data Science', 'Analyt... 2019-01-02T18:45:58.926Z 3 min read 0
25 How to plot percentage-filled text using Python ['Several weeks ago, a colleague of mine showe... ['Python', 'Data Visualization', 'Data Science'] 2019-01-02T18:21:36.440Z 2 min read 0
26 From the Deli to Data Science ['When you grow up you tend to think that anyt... ['Mental Health', 'Data Science', 'Machine Lea... 2019-01-02T18:11:48.017Z 5 min read 4
27 Highlight Action Area in Soccer using Tensorflow ['How cool would it be if cameras could be int... ['TensorFlow', 'Deep Learning', 'Artificial In... 2019-01-02T18:09:09.836Z 4 min read 70
28 Data Science Fundamentals — Technical Communic... ['Often times, first‐rate technical work can b... ['Data Science', 'Communication', 'Documentati... 2019-01-02T18:01:00.835Z 8 min read 1
29 Custom TensorFlow Loss Functions for Advanced ... ['In this article, we’ll look at:', 'Links to ... ['Machine Learning', 'TensorFlow', 'Data Scien... 2019-01-02T17:45:33.460Z 6 min read 2
... ... ... ... ... ... ...
852 AutoML | Una biblioteca simple para hacer mode... ['Como científico de datos, siempre intentas o... ['Data Science', 'Machine Learning', 'Data', '... 2018-12-19T17:31:00.724Z 6 min read 0
853 Instalando o Jupyter usando o Anaconda ['O Anaconda Distribution é um gerenciador de ... ['Jupyter Notebook', 'Data Science', 'Anaconda... 2018-12-19T17:16:31.354Z 3 min read 0
854 Andrew Ng’s Machine Learning Course in Python ... ['Continuing from programming assignment 2 (Lo... ['Machine Learning', 'Data Science', 'Andrew N... 2018-12-19T17:10:50.437Z 7 min read 67
855 Reinforcement Learning from Scratch: Designing... ['In this article, I will introduce a new proj... ['Machine Learning', 'Data Science', 'Reinforc... 2018-12-19T17:08:07.797Z 12 min read 39
856 How to Web scrap with RStudio ['Web Scraping (also termed Screen Scraping, W... ['Web Scraping', 'Data Science', 'Data Analysi... 2018-12-19T17:01:53.587Z 3 min read 2
857 Predictive Power (Dictate the Future) ['It’s a new era, whoever possesses the best m... ['Machine Learning', 'Data Science', 'Artifici... 2018-12-19T17:01:00.891Z 5 min read 81
858 Plunging into the Golem (GNT) Decentralized Su... ['The Golem ecosystem is a decentralized super... ['Blockchain', 'Cryptocurrency', 'Data Science... 2018-12-19T16:59:18.195Z 6 min read 0
859 How to calculate a Binary Tree’s height -Part ... ['Data structures and algorithms are the heart... ['Programming', 'Algorithms', 'Ruby', 'Tech', ... 2018-12-19T16:58:45.611Z 10 min read 138
860 A different kind of (deep) learning: part 2 ['In the previous post, we’ve discussed some s... ['Machine Learning', 'Artificial Intelligence'... 2018-12-19T16:51:00.950Z 9 min read 91
861 Understanding Calculus ['Hi My Lovely Readers,', 'Good evening. How a... ['Data Science', 'Calculus', 'Differntiation',... 2018-12-19T16:49:48.247Z 2 min read 0
862 O que é ciência de dados? Conceitos pra quem e... ['Caso você faça parte do universo de TI, ou s... ['Data Science', 'Machine Learning', 'Ciencia ... 2018-12-19T16:42:56.903Z 5 min read 17
863 Ethics of AI: A data scientist’s perspective ['By Stavros Tsalides\u200a—\u200aSenior Data ... ['Data Science'] 2018-12-19T16:42:50.269Z 5 min read 9
864 #3 Walking down the right lane | A Perfect Dat... ['I have been wandering all over the internet ... ['Data Science', 'Dataquest'] 2018-12-19T16:37:19.215Z 4 min read 0
865 Top 10 2018 ['1. GDPR and its influence on digital marketi... ['Big Data', 'Marketing', 'Digital', 'Data Sci... 2018-12-19T16:33:46.033Z 3 min read 1
866 From Content-Based Recommendations to Personal... ['A primary goal of the data science team at U... ['Data Science', 'Machine Learning', 'Recommen... 2018-12-19T16:22:14.688Z 7 min read 141
867 Advantages of using AI tech and industries tha... ['Necessity has most always been the mother of... ['Artificial Intelligence', 'Machine Learning'... 2018-12-19T16:03:40.392Z 4 min read 0
868 RDKit: Utilizing a Custom Library ['Using a custom library to create Molfiles fo... ['Data Science', 'Pharmaceutical', 'Cheminform... 2018-12-19T16:02:15.035Z 6 min read 1
869 A Deep Dive into A/B Testing ['This article is the first in a series which ... ['Data Science', 'Testing', 'Statistics'] 2018-12-19T15:52:18.007Z 6 min read 5
870 Logistic Regression For Facial Recognition ['Facial recognition algorithms have always fa... ['Machine Learning', 'Data Science', 'Ethics',... 2018-12-19T15:51:28.303Z 8 min read 176
871 Classification with KNN Method to Determine Jo... ['The data is below, Check it out…', 'Berikut ... ['Classification', 'Data Science', 'Big Data'] 2018-12-19T15:51:08.661Z 3 min read 0
872 Procesando Datos con Spark (I) — Configurando ... ['Hola a todos y gracias por tomarse un tiempo... ['Apache Spark', 'Apache Zeppelin', 'Español',... 2018-12-19T15:30:58.434Z 6 min read 1
873 Cluster Analysis with Iris Dataset ['In Supervised Learning, we specify the possi... ['Machine Learning', 'Data Science', 'Data Vis... 2018-12-19T15:15:55.596Z 2 min read 0
874 Planet Beehive ['Close your eyes for a second, ignore the rai... ['Data Science', 'Travel', 'Exploring', 'Data ... 2018-12-19T15:06:00.778Z 9 min read 15
875 Machine Learning and Music Classification: A C... ['In my previous blog post, Introduction to Mu... ['Machine Learning', 'Music', 'Data Science', ... 2018-12-19T15:03:46.813Z 7 min read 502
876 Testing for the Data Scientist ['Data collection, modeling, and analysis. The... ['Testing', 'Python', 'Data Science', 'TensorF... 2018-12-19T14:59:50.219Z 5 min read 0
877 A Year in Review: Happy Holidays from Kaiko ['Learn more about our subscription data servi... ['Bitcoin', 'Cryptocurrency', 'Data Science', ... 2018-12-19T14:50:33.925Z 4 min read 61
878 A Year in Review: Happy Holidays from Kaiko ['Learn more about our subscription data servi... ['Bitcoin', 'Cryptocurrency', 'Data Science', ... 2018-12-19T14:50:33.925Z 4 min read 61
879 Get Smarter with Data Science — Tackling Real ... ['The ‘Data Science Strategic Guide\u200a—\u20... ['Data Science', 'Artificial Intelligence', 'T... 2018-12-19T14:41:49.230Z 17 min read 1K
880 Synthetic data generation — a must-have skill ... ['Data is the new oil and truth be told only a... ['Machine Learning', 'Data Science', 'Programm... 2018-12-19T14:41:41.869Z 11 min read 698
881 Synthetic data generation — a must-have skill ... ['Data is the new oil and truth be told only a... ['Machine Learning', 'Data Science', 'Programm... 2018-12-19T14:41:41.869Z 11 min read 698

882 rows × 6 columns

#claps column
df.claps = df.claps.str.replace('claps', '', regex=False)
df.claps = df.claps.str.replace('clap', '', regex=False)
df.claps = df.claps.str.replace('.', '', regex=False)
df.claps = df.claps.str.replace('K', '000', regex=False)
df.claps = df.claps.str.strip()
df.claps = df.claps.astype(float)
#dropping the decimal point inflates '1.96K', '4.6K' and '3.1K', so those values are corrected by hand
#.loc assigns on df itself and avoids the SettingWithCopyWarning raised by chained indexing
df.loc[df.claps == 196000, 'claps'] = 1960.0
df.loc[df.claps == 46000, 'claps'] = 4600.0
df.loc[df.claps == 31000, 'claps'] = 3100.0
df.claps.describe()
count     882.000000
mean       64.182540
std       249.198342
min         0.000000
25%         0.000000
50%         2.000000
75%        51.000000
max      4600.000000
Name: claps, dtype: float64
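The clap strings can also be converted in a single pass, without removing the decimal point first. The following is a minimal sketch, assuming the raw values look like the ones shown in the notebook ('14 claps', '0', '4.6K claps'). The function name `parse_claps` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def parse_claps(raw):
    """Convert strings such as '14 claps', '1 clap', '0' or '4.6K claps' to a float."""
    token = raw.replace('claps', '').replace('clap', '').strip()
    if token.endswith('K'):
        # 'K' abbreviates thousands, so '1.96K' means 1960;
        # round() removes floating-point noise from the multiplication
        return float(round(float(token[:-1]) * 1000))
    return float(token)


examples = pd.Series(['14 claps', '1 clap', '0', '4.6K claps', '1.96K claps', '1K claps'])
parsed = examples.apply(parse_claps)
print(parsed.tolist())
```

Because the decimal point is interpreted rather than deleted, no manual correction of individual values is needed afterwards.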
#readtime
df.readtime = df.readtime.str.replace('min read', '')
df.readtime = df.readtime.str.strip()
df.readtime = df.readtime.astype(float)
df.readtime.describe()
count    882.000000
mean       5.292517
std        3.324748
min        1.000000
25%        3.000000
50%        5.000000
75%        7.000000
max       27.000000
Name: readtime, dtype: float64
#date 
df.date = pd.to_datetime(df.date, format='%Y-%m-%dT%H:%M:%S.%fZ')
df.date.describe()
count                            882
unique                           768
top       2018-12-24 00:57:06.589000
freq                               6
first     2018-12-19 14:41:41.869000
last      2019-01-04 02:25:16.766000
Name: date, dtype: object
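To illustrate what the format string does, here is a small sketch that parses two of the timestamps shown in the sample rows. The variable names are chosen for this example only. The trailing `Z` in the raw strings marks UTC, so the parsed hours are UTC hours rather than the authors' local time.

```python
import pandas as pd

raw = pd.Series(['2018-12-25T14:22:41.334Z', '2019-01-02T22:50:02.794Z'])

# %f consumes the fractional seconds; the trailing Z is matched as a literal character
parsed = pd.to_datetime(raw, format='%Y-%m-%dT%H:%M:%S.%fZ')

hours = parsed.dt.hour.tolist()
span_days = (parsed.max() - parsed.min()).days
print(hours, span_days)
```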
#setting date column as new index
df.index = df.date
#taking a look at data 
df.sample(5)
title text tags date readtime claps
date
2018-12-27 10:43:15.039 Andrew Ng’s Machine Learning Course in Python ... ['This article will look at both programming a... ['Machine Learning', 'Andrew Ng', 'Python', 'D... 2018-12-27 10:43:15.039 8.0 107.0
2019-01-01 14:09:45.373 Supervised Learning — Classification ['“If you’re a lazy and not-too-bright compute... ['Machine Learning', 'Support Ve', 'AI', 'Data... 2019-01-01 14:09:45.373 7.0 1.0
2018-12-27 06:24:40.913 How not to do Fast.ai (or any ML MOOC) ['This post serves as a little guide to the ne... ['Machine Learning', 'Deep Learning', 'Artific... 2018-12-27 06:24:40.913 10.0 606.0
2018-12-23 12:10:56.284 NaN ['Over the last few years, podcasts have grown... ['Artificial Intelligence', 'Big Data', 'Blog'... 2018-12-23 12:10:56.284 2.0 0.0
2018-12-24 10:47:05.928 !!! What‘s that Datathon by Yesbank that is cr... ['The above quote shows exactly the same passi... ['Data Science', 'Machine Learning', 'Hackatho... 2018-12-24 10:47:05.928 7.0 205.0
#dropping all rows with nans
len_before = df.shape[0]
df.dropna(axis=0, how='any', inplace=True)
len_after = df.shape[0]
print('Number of rows before dropping NaNs: {}\nNumber of rows after dropping NaNs: {}'.format(len_before, len_after))
Number of rows before dropping NaNs: 882
Number of rows after dropping NaNs: 839
#dropping duplicate rows
len_before2 = df.shape[0]
df.drop_duplicates(inplace=True)
len_after2 = df.shape[0]
print('Number of rows before dropping duplicates: {}\nNumber of rows after dropping duplicates: {}'.format(len_before2, len_after2))
Number of rows before dropping duplicates: 839
Number of rows after dropping duplicates: 736
#transforming all text to lowercase
df.title = df.title.str.lower()
df.text = df.text.str.lower()
df.tags = df.tags.str.lower()
#taking a look at data 
df.sample(5)
title text tags date readtime claps
date
2018-12-22 09:09:02.860 top analytical and data science blog posts of ... ['from acheron analytics and sdg', 'over the p... ['data science', 'analytics', 'consulting', 'b... 2018-12-22 09:09:02.860 5.0 7.0
2019-01-01 05:01:00.814 my plans for 2019 ['2019 will be my busiest year, or so i fear.'... ['education', 'computer science', 'programming... 2019-01-01 05:01:00.814 4.0 0.0
2019-01-02 05:12:39.743 5 tips for more secure data ['we all handle data in one way or another. wh... ['security', 'data', 'hacking', 'hacks', 'data... 2019-01-02 05:12:39.743 3.0 0.0
2019-01-02 11:59:17.294 a machine learning approach to ibm employee at... ['in an it firm, there are many employee archi... ['data analysis', 'data visualization', 'featu... 2019-01-02 11:59:17.294 6.0 0.0
2018-12-28 00:36:41.280 resistance to becoming a pythonista ['so for more than 1.5 decades, my ‘mother ton... ['python', 'programming', 'csharp', 'machine l... 2018-12-28 00:36:41.280 4.0 123.0
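#extracting more information from tags column
df['tags'] = df['tags'].str.replace('[', '', regex=False)
df['tags'] = df['tags'].str.replace('\'', '', regex=False)
df['tags'] = df['tags'].str.replace(']', '', regex=False)
#note: the brackets are already gone at this point, so this pattern never matches
df['tags'] = df['tags'].str.replace(r'\[]', 'no_tag', regex=True)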
temp = pd.DataFrame(df['tags'].str.split(',').values.tolist())
temp.columns = ['tag1', 'tag2', 'tag3', 'tag4', 'tag5']
df.tags = df.tags.str.split(',')
df['NoOfTags'] = df.tags.apply(len)
df.fillna(value='no_tag', inplace=True)
len(temp)
736
len(df)
736
df.reset_index(inplace=True, drop=True)
df = pd.concat([df, temp], axis=1, ignore_index=True)
df.columns = ['title', 'text', 'tags', 'date', 'readtime', 'claps', 'n_tags', 'tag1', 'tag2', 'tag3', 'tag4', 'tag5']
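An alternative to stripping brackets and quotes by hand is to parse each stored string back into a Python list. This is a sketch under the assumption that the tags column holds valid Python list literals, as the sample rows suggest; it is not the method used in the notebook. One difference to be aware of: an empty list counts as zero tags here, while splitting an empty string on commas yields one empty tag.

```python
import ast

import pandas as pd

raw_tags = pd.Series([
    "['Data Science', 'Inspiration']",
    "[]",
    "['Machine Learning', 'Deep Learning', 'Python']",
])

# literal_eval turns the stored string back into a real Python list
tag_lists = raw_tags.apply(ast.literal_eval)
n_tags = tag_lists.apply(len)

# one column per tag position; reindex pads to five columns for articles with fewer tags
tag_cols = pd.DataFrame(tag_lists.tolist()).reindex(columns=range(5))
tag_cols.columns = ['tag1', 'tag2', 'tag3', 'tag4', 'tag5']

print(n_tags.tolist())
print(tag_cols)
```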
#adding character length columns for title and text
df['text'] = df['text'].str.replace('[', '', regex=False)
df['text'] = df['text'].str.replace('\'', '', regex=False)
df['text'] = df['text'].str.replace(']', '', regex=False)
df['text_len'] = df.text.apply(len)
df['title_len'] = df.title.apply(len)
#adding clap percentile column
df['clap_precentile'] = df.claps.rank(pct=True)
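A small worked example of what `rank(pct=True)` returns, using made-up clap counts. With the default ranking method, tied values share the average of their ranks, which is why every article with zero claps ends up with the same percentile value in the notebook's output.

```python
import pandas as pd

claps = pd.Series([0.0, 0.0, 2.0, 51.0, 4600.0])

# ranks are 1.5, 1.5, 3, 4, 5 (the two zeros share the average of ranks 1 and 2),
# and pct=True divides each rank by the number of values
percentile = claps.rank(pct=True)
rounded = [round(p, 6) for p in percentile]
print(rounded)  # [0.3, 0.3, 0.6, 0.8, 1.0]
```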
#looking at final df
df.sample(10)
title text tags date readtime claps n_tags tag1 tag2 tag3 tag4 tag5 text_len title_len clap_precentile
637 logistic regression i hope that you have enjoyed my last blog abou... [machine learning, data science, data, logi... 2018-12-21 19:56:23.061 5.0 10.0 5 machine learning data science data logistic regression regression 5151 19 0.612092
646 logistic regression in the last blog we talked bout linear regress... [machine learning, logistic regression, data... 2018-12-21 17:42:05.605 4.0 26.0 3 machine learning logistic regression data science None None 3260 19 0.703804
136 given data what should we prefer- fit a model ... the above relates to two cultures in analytics... [machine learning, data science, ai] 2019-01-01 06:17:47.050 3.0 4.0 3 machine learning data science ai None None 3181 88 0.535326
550 amo labs ceo沈博士在韩国浦项工科大学发表演讲 amo粉丝们,大家好!, 全球首个基于车辆数据的逆向ico区块链项目amo labs上周参与... [amo, blockchain, it seminar, data science,... 2018-12-24 01:57:43.671 3.0 146.0 5 amo blockchain it seminar data science smart cars 1025 28 0.901495
326 an implication of mongodb in data science field nosql opens a wild world of schema-less possib... [] 2018-12-28 04:30:20.398 5.0 0.0 1 None None None None 635 47 0.200408
639 neural network from scratch using python i have come across many tutorials where the ou... [machine learning, data science, neural netw... 2018-12-21 19:38:36.143 3.0 1.0 3 machine learning data science neural networks None None 915 40 0.441576
636 accelerating cross filtering with cudf rapids is all about enabling data scientists w... [data science, rapids, cudf, visualization,... 2018-12-21 20:01:00.667 3.0 76.0 5 data science rapids cudf visualization data visualization 3477 38 0.831522
308 why great stories fail on medium and what you ... what makes an article go viral on medium? whic... [data science, technology, artificial intell... 2018-12-28 14:16:02.247 8.0 274.0 5 data science technology artificial intelligence writing tips towards data science 8863 62 0.959239
198 make your own model to predict house prices in... machine learning has now become an integral pa... [machine learning, artificial intelligence, ... 2018-12-31 03:22:00.754 5.0 18.0 5 machine learning artificial intelligence data science python deep learning 1941 53 0.669158
714 plunging into the golem (gnt) decentralized su... the golem ecosystem is a decentralized superco... [blockchain, cryptocurrency, data science, ... 2018-12-19 16:59:18.195 6.0 0.0 5 blockchain cryptocurrency data science eos investing 8172 57 0.200408

EDA

#analyzing claps
plt.figure(figsize=(10,8))
sns.distplot(df.claps)
#plt.ylim('density')
plt.title('Density of Claps')
Text(0.5,1,'Density of Claps')
df.claps.describe()
count     736.000000
mean       58.304348
std       239.424673
min         0.000000
25%         0.000000
50%         2.000000
75%        49.000000
max      4600.000000
Name: claps, dtype: float64
#readtime
plt.figure(figsize=(8,6))
sns.distplot(df.readtime)
plt.title('Density of Time It Takes to Read Article')
Text(0.5,1,'Density of Time It Takes to Read Article')
#n_tags
plt.figure(figsize=(8,6))
sns.distplot(df.n_tags)
plt.title('Number of Tags')
plt.xlabel('Tags')
Text(0.5,0,'Tags')
#ratio of articles with fewer than five tags to articles with exactly five tags
temp = df.n_tags.value_counts()
(temp[1]+temp[2]+temp[3]+temp[4])/temp[5]
0.51440329218107
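The ratio above compares articles with fewer than five tags to articles with exactly five. A ratio r corresponds to a five-tag share of 1 / (1 + r), so 0.514 means roughly 66% of the articles use all five tags. The sketch below reproduces the calculation on made-up tag counts and guards against a tag count that never occurs, which would otherwise raise a KeyError.

```python
import pandas as pd

n_tags = pd.Series([5, 5, 3, 1, 5, 4, 5, 2])
counts = n_tags.value_counts()

# reindex fills in any tag count from 1 to 5 that is missing from the data
counts = counts.reindex(range(1, 6), fill_value=0)

# .loc slicing by label is inclusive, so 1:4 covers one to four tags
ratio = counts.loc[1:4].sum() / counts.loc[5]
share_five = counts.loc[5] / counts.sum()
print(ratio, share_five)
```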
#title_len
plt.figure(figsize=(8,6))
sns.distplot(df.title_len)
plt.title('Density of Title Length')
plt.xlabel('Number of Characters')
Text(0.5,0,'Number of Characters')
df.title_len.describe()
count    736.000000
mean      46.934783
std       21.693877
min        8.000000
25%       31.000000
50%       44.000000
75%       60.000000
max      170.000000
Name: title_len, dtype: float64
#text_len
sns.distplot(df.text_len)
plt.title('Density of Text Length')
plt.xlabel('Number of Characters')
Text(0.5,0,'Number of Characters')
#tags
temp = pd.concat([df.tag1, df.tag2, df.tag3, df.tag4, df.tag5])
temp = temp.str.strip()
temp2 = temp.value_counts()[:5]
temp2.plot(kind='bar')
plt.title('Top 5 Used Tags')
Text(0.5,1,'Top 5 Used Tags')

Exploring Relationships

#readtime and claps
temp = df[df.clap_precentile < 0.7]
sns.jointplot(x=temp.readtime, y=temp.claps, data=temp)
<seaborn.axisgrid.JointGrid at 0x7fc1bd6f8400>
#text length and claps
temp = df[df.clap_precentile < 0.7]
sns.jointplot(x=temp.text_len, y=temp.claps, data=temp)
<seaborn.axisgrid.JointGrid at 0x7fc1bd60c5f8>
#title length and claps
temp = df[df.clap_precentile < 0.7]
sns.jointplot(x=temp.title_len, y=temp.claps, data=temp)
<seaborn.axisgrid.JointGrid at 0x7fc1bd3bf080>
#correlation
temp = df.select_dtypes(include=['float', 'int'])
def color_corr_red(value):
    """
    Colors elements in a DataFrame green if the absolute
    correlation is less than 0.6 and red if it is 0.6 or
    greater. NaN values are colored black.
    """
    if pd.isna(value):
        color = 'black'
    elif abs(value) < 0.6:
        color = 'green'
    else:
        color = 'red'

    return 'color: %s' % color
#correlation
temp = temp.drop('clap_precentile', axis=1)
corr = temp.corr()
corr.style.applymap(color_corr_red)
readtime claps n_tags text_len title_len
readtime 1 0.114466 0.0881561 0.705211 0.221243
claps 0.114466 1 0.096069 0.102251 0.0179233
n_tags 0.0881561 0.096069 1 0.100174 0.0912831
text_len 0.705211 0.102251 0.100174 1 0.0900903
title_len 0.221243 0.0179233 0.0912831 0.0900903 1
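Besides coloring the matrix, the strong relationships can be listed directly. The sketch below uses a hypothetical helper, `strong_pairs`, and the same 0.6 cutoff as the styling function. The demo matrix copies three of the variables from the output above.

```python
import pandas as pd


def strong_pairs(corr, threshold=0.6):
    """Return (row, column, r) for each off-diagonal pair with |r| >= threshold."""
    pairs = []
    cols = list(corr.columns)
    for i, a in enumerate(cols):
        # only look at columns after a, so each pair is reported once
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if abs(r) >= threshold:
                pairs.append((a, b, float(r)))
    return pairs


names = ['readtime', 'claps', 'text_len']
corr = pd.DataFrame(
    [[1.0, 0.114466, 0.705211],
     [0.114466, 1.0, 0.102251],
     [0.705211, 0.102251, 1.0]],
    index=names,
    columns=names,
)
result = strong_pairs(corr)
print(result)
```

In this subset, the only pair above the cutoff is reading time and text length, which is expected since reading time is presumably estimated from the length of the text. Claps show only weak linear correlation with every other variable.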

80th Percentile and Above Posts

ninety = df[df.clap_precentile >= 0.8]
ninety.describe()
readtime claps n_tags text_len title_len clap_precentile
count 150.000000 150.000000 150.000000 150.000000 150.000000 150.000000
mean 6.480000 254.826667 4.726667 6756.333333 47.220000 0.898777
std 3.914729 482.796923 0.684406 5087.029074 19.520036 0.059023
min 1.000000 61.000000 2.000000 440.000000 14.000000 0.800951
25% 4.000000 100.000000 5.000000 3478.500000 31.250000 0.850543
50% 6.000000 143.000000 5.000000 5626.000000 43.000000 0.898777
75% 8.000000 230.750000 5.000000 8372.000000 61.000000 0.949389
max 27.000000 4600.000000 5.000000 37027.000000 110.000000 1.000000
len('How To Write A Viral Data Science Post On Medium')
48
#plotting claps against the time of day each article was published
post_times = ninety.date.dt.time
plt.plot(post_times, ninety.claps)
[<matplotlib.lines.Line2D at 0x7fc1bce165c0>]
#extracting the hour of publication; .copy() avoids assigning to a slice of df
ninety = ninety.copy()
ninety['new_col'] = ninety.date.dt.hour
sns.distplot(ninety.new_col)
plt.title('When To Post?')
plt.xlabel('Time')
Text(0.5,0,'Time')
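The hour of publication can also be tabulated instead of plotted as a density. The sketch below uses five timestamps copied from the top-percentile rows shown in this notebook. Note that the scraped timestamps carry a `Z` suffix, so the hours are in UTC and do not necessarily reflect the authors' or readers' local time.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series([
    '2019-01-02 23:11:45.150',
    '2019-01-02 19:43:57.772',
    '2019-01-02 13:45:23.716',
    '2019-01-02 13:03:28.448',
    '2019-01-02 01:08:04.219',
]))

# dt.hour returns the hour as an integer from 0 to 23
hours = dates.dt.hour

# count posts per hour and order the result by hour rather than by frequency
posts_per_hour = hours.value_counts().sort_index()
print(posts_per_hour)
```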
temp = pd.concat([ninety.tag1, ninety.tag2, ninety.tag3, ninety.tag4, ninety.tag5])
temp = temp.str.strip()
temp2 = temp.value_counts()[:5]
temp2 = pd.DataFrame(temp2)
temp2 = temp2.reset_index()
temp2.columns = ['tag', 'count']
sns.barplot(x='count', y='tag', data=temp2)
plt.title('Top 5 Tags')
Text(0.5,1,'Top 5 Tags')
temp2.plot(kind='bar')
plt.title('Top 5 Tags Used >= 80th Percentile')
Text(0.5,1,'Top 5 Tags Used >= 80th Percentile')
ninety
title text tags date readtime claps n_tags tag1 tag2 tag3 tag4 tag5 text_len title_len clap_precentile new_col
4 doing data the right way the notion of ethics in data science first occ... [towards data science, data science, ethics,... 2019-01-02 23:11:45.150 10.0 105.0 5 towards data science data science ethics coding data scientist 12371 24 0.860054 23
18 easily visualize the correlation of your portf... as a trader, i’m really stoked about alpaca., ... [data science, open source, stock market, p... 2019-01-02 19:43:57.772 6.0 62.0 4 data science open source stock market python None 4958 60 0.807065 19
26 highlight action area in soccer using tensorflow how cool would it be if cameras could be intel... [tensorflow, deep learning, artificial intel... 2019-01-02 18:09:09.836 4.0 70.0 5 tensorflow deep learning artificial intelligence data science soccer 3729 48 0.820652 18
44 the tradecraft of deep analytics deep analytics is a broad category of analytic... [data science, tradecraft, data trader, ana... 2019-01-02 15:04:17.802 4.0 349.0 5 data science tradecraft data trader analytics intelligence 4564 32 0.972826 15
52 what is the chinese room argument in artificia... i talked about “what was hidden in the hidden ... [artificial intelligence, data science, mach... 2019-01-02 13:45:23.716 4.0 149.0 5 artificial intelligence data science machine learning chatbots data 4071 61 0.906250 13
55 10 pragmatic expectations for machine learning... every new year brings new expectations and hop... [machine learning, deep learning, data scien... 2019-01-02 13:03:28.448 5.0 161.0 5 machine learning deep learning data science artificial intelligence invector labs 7338 67 0.915761 13
80 iq is largely a pseudoscientific swindle background\xa0: “iq” is a stale test meant to ... [psychology, data science, black swan, raci... 2019-01-02 01:08:04.219 7.0 4600.0 5 psychology data science black swan racism alt right 6462 40 1.000000 1
90 the ai’s impact on the vc investment market statistics is brutal\u200a—\u200a. why is that... [data science, startup, artificial intellige... 2019-01-01 22:22:13.465 5.0 391.0 5 data science startup artificial intelligence ai venture capital 3331 43 0.980978 22
93 data science & machine learning data science is a broad term that blend of var... [machine learning, data science, data mining... 2019-01-01 21:25:48.545 4.0 111.0 5 machine learning data science data mining supervised learning unsupervised learning 2312 31 0.872283 21
99 preventing deaths from heart disease using mac... cancer?, no., injury/accidents?, no. keep tryi... [machine learning, heart disease, data, dat... 2019-01-01 17:49:28.012 6.0 70.0 5 machine learning heart disease data data science healthcare 5592 59 0.820652 17
108 diagnose bias and variance… in my other post we have seen a brief of bias-... [machine learning, towards data science, dat... 2019-01-01 16:03:11.084 6.0 100.0 3 machine learning towards data science data science None None 5138 27 0.850543 16
117 getting started with google colab just let me code, already!, you know it’s out ... [data science, machine learning, ai, coding... 2019-01-01 13:54:35.035 7.0 302.0 5 data science machine learning ai coding tutorial 7002 33 0.966033 13
135 değişkenler, veri tipleri ve aritmetik operatö... herkese merhaba,, bu yazıda sizlere bahsettiği... [python, data science, veri bilimi, algorit... 2019-01-01 06:53:18.707 6.0 110.0 5 python data science veri bilimi algorithms algoritma 7048 50 0.870245 6
140 data science with medium story stats in python medium is a great place to write: no distracti... [data science, towards data science, python,... 2018-12-31 23:52:57.981 12.0 311.0 5 data science towards data science python education data analysis 14306 46 0.968071 23
141 data science for algorithmic trading in this article i plan to give you a glimpse i... [machine learning, data science, stock marke... 2018-12-31 23:29:49.397 16.0 215.0 5 machine learning data science stock market artificial intelligence towards data science 18987 36 0.942935 23
144 10 data science tools i explored in 2018 in 2018, i invested a good amount of time in l... [data science, deep learning, machine learni... 2018-12-31 22:33:53.430 6.0 467.0 4 data science deep learning machine learning python None 6024 40 0.983696 22
151 predicting crash severity for nz road accidents "recently, i finished udacitys machine learnin... [machine learning, new zealand, analytics, ... 2018-12-31 19:47:18.955 15.0 61.0 5 machine learning new zealand analytics data science data analysis 19758 47 0.800951 19
159 data scientists: grow your expertise with thes... the key to moving from a good data scientist t... [artificial intelligence, data scientist, da... 2018-12-31 17:35:07.059 12.0 62.0 5 artificial intelligence data scientist data science career advice machine learning 12590 68 0.807065 17
162 2018’s top 7 libraries and packages for data s... if you follow me, you know that this year i st... [data science, programming, machine learning... 2018-12-31 16:46:00.807 26.0 379.0 5 data science programming machine learning tools heartbeat artificial intelligence 18534 71 0.978261 16
166 madana’s year recap of 2018 the year 2018 is slowly coming to an end. we a... [blockchain, 2018, madana, data science, c... 2018-12-31 15:36:33.909 5.0 159.0 5 blockchain 2018 madana data science cryptocurrency 5635 27 0.913043 15
171 setting up kaggle in google colab i want all the data and i want it now!, you kn... [data science, machine learning, kaggle, tu... 2018-12-31 14:00:15.326 5.0 91.0 5 data science machine learning kaggle tutorial data 4082 33 0.844429 14
174 ai predictions for 2019 artificial intelligence, specifically, machine... [artificial intelligence, machine learning, ... 2018-12-31 12:52:36.168 8.0 69.0 5 artificial intelligence machine learning data science ecommerce economics 10439 23 0.815897 12
184 machine learning from scratch: logistic regres... introduction, after having discussed linear re... [machine learning, data science, logistic re... 2018-12-31 07:45:01.296 5.0 61.0 5 machine learning data science logistic regression python artificial intelligence 5937 50 0.800951 7
185 an nlp view on holiday movies — part ii: text ... continuing on the first part of this blog post... [machine learning, data science, naturallang... 2018-12-31 07:44:05.075 8.0 67.0 5 machine learning data science naturallanguageprocessing nlp keras 4576 78 0.813179 7
190 becoming a data scientist hello, world! here i am, finally fulfilling my... [data science, machine learning, data scient... 2018-12-31 06:07:03.009 4.0 122.0 3 data science machine learning data scientist None None 4717 25 0.880435 6
205 [week #5 — rock or not? ♫] we are defne tunçer & kutay barçin and this is... [machine learning, music, data science] 2018-12-30 20:57:02.662 4.0 250.0 3 machine learning music data science None None 3093 26 0.955163 20
210 running fast.ai course notebooks on kaggle kernel kaggle kernels offer ml optimized docker envir... [data science, machine learning, artificial ... 2018-12-30 19:28:01.298 2.0 84.0 4 data science machine learning artificial intelligence kaggle None 1244 49 0.839674 19
216 deep learning for classical japanese literature this is a paper summary of the paper:deep lear... [machine learning, deep learning, artificial... 2018-12-30 18:04:06.862 5.0 147.0 5 machine learning deep learning artificial intelligence data science dataset 5473 47 0.903533 18
233 how to create snapchat lenses using pix2pix yes, i know subtitle is from my previous artic... [machine learning, snapchat, data science, ... 2018-12-30 04:21:35.972 4.0 230.0 5 machine learning snapchat data science artificial intelligence deep learning 3783 43 0.948370 4
243 understanding compositional pattern producing ... for the last two years, i have been researchin... [data science, deep learning, genetic algori... 2018-12-29 17:46:48.041 15.0 180.0 5 data science deep learning genetic algorithm neural networks towards data science 20653 65 0.927310 17
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
640 chinese food on christmas: an american tradition it may not appear in christmas movies as often... [food, chinese food, christmas, yelp, data... 2018-12-21 19:27:50.300 5.0 86.0 5 food chinese food christmas yelp data science 5401 48 0.841033 19
643 how does the bay area commute? for this project, i wanted to answer a questio... [machine learning, data science, transit, b... 2018-12-21 17:58:28.010 11.0 239.0 5 machine learning data science transit bay area towards data science 10058 30 0.951087 17
644 predictive modeling and its social issues there is no one sided answer regarding the use... [artificial intelligence, predictive analytic... 2018-12-21 17:57:32.335 6.0 107.0 5 artificial intelligence predictive analytics data science data analysis machine learning 8191 41 0.863451 17
645 sentiment analysis on the texts of harry potter i’m greg rafferty, a data scientist in the bay... [machine learning, data science, nlp, harry... 2018-12-21 17:45:40.059 9.0 241.0 5 machine learning data science nlp harry potter python 8416 47 0.952446 17
647 stemming? lemmatization? what? in natural language processing, there may come... [machine learning, data science, nlp, data,... 2018-12-21 17:13:02.004 7.0 78.0 5 machine learning data science nlp data naturallanguageprocessing 6947 30 0.834918 17
649 breaking the curse of small datasets in machin... this is part 1 of breaking the curse of small ... [machine learning, data science, deep learni... 2018-12-21 16:55:36.067 12.0 133.0 4 machine learning data science deep learning python None 15064 64 0.891984 16
652 speaker differentiation using deep learning last week, i presented a conference paper at i... [machine learning, deep learning, data scien... 2018-12-21 16:53:03.502 7.0 117.0 5 machine learning deep learning data science artificial intelligence towards data science 8061 43 0.879076 16
657 zero-copy: ликбез примечание автора: данная статья ставит целью ... [cpu, gpu, hpc, big data, data science] 2018-12-21 15:31:15.509 6.0 100.0 5 cpu gpu hpc big data data science 7422 17 0.850543 15
659 profiles in analytics from ready player one — ... ready player one is a veritable feast of 80’s ... [analytics, data science, videogames, 80s, ... 2018-12-21 15:08:40.306 2.0 348.0 5 analytics data science videogames 80s ready player one 1891 58 0.971467 15
665 dr. alberto pace joins bluenote as technical a... bluenote\u200a—\u200athe energy efficiency pro... [science, blockchain, energy efficiency, da... 2018-12-21 13:31:36.804 2.0 1000.0 5 science blockchain energy efficiency data science energy 1473 52 0.995245 13
667 introducing tapas forecasting the performance of a deep neural n... [machine learning, deep learning, artificial... 2018-12-21 13:15:25.375 4.0 286.0 4 machine learning deep learning artificial intelligence data science None 5607 17 0.960598 13
683 a gentle introduction to pandas i hope you had an exciting intro to numpy beca... [data science, pandas, python, codewars, g... 2018-12-19 22:47:36.676 5.0 100.0 5 data science pandas python codewars ghana 3826 31 0.850543 22
684 building a movement through emerging tech with... in this episode trent lapinski and tor bair di... [podcast, privacy, blockchain, data science... 2018-12-19 22:44:10.296 2.0 99.0 5 podcast privacy blockchain data science hackernoon podcast 440 64 0.846467 22
685 ai for business this is going to be a short, to-the-point arti... [machine learning, data science, technology,... 2018-12-19 22:08:57.641 6.0 130.0 4 machine learning data science technology programming None 3598 15 0.888587 22
690 what is a cause? if you’re interested in causal inference, you’... [data science, causal inference, philosophy,... 2018-12-19 20:54:05.379 9.0 75.0 5 data science causal inference philosophy philosophy of science book review 10304 16 0.829484 20
694 implementation of uni-variate polynomial regre... regression is an example of continuous classif... [machine learning, calculus, data visualizat... 2018-12-19 19:47:46.044 13.0 103.0 5 machine learning calculus data visualization python3 data science 12919 110 0.858016 19
696 progress bars in python just like a watched pot never boils, a watched... [python, jupyter notebook, data science] 2018-12-19 19:26:15.113 2.0 271.0 3 python jupyter notebook data science None None 1463 23 0.957880 19
697 review: deepmask (instance segmentation) this time, deepmask, by facebook ai research (... [machine learning, deep learning, artificial... 2018-12-19 19:25:31.210 7.0 109.0 5 machine learning deep learning artificial intelligence data science convolutional network 5784 40 0.867527 19
698 net upvote prediction and subreddit-based sent... models built from and for the front page of th... [machine learning, data science, reddit, nl... 2018-12-19 19:24:56.304 27.0 135.0 5 machine learning data science reddit nlp neural networks 37027 82 0.894022 19
702 como processar 4 bilhões de eventos em 3 horas... todas (ou quase todas) as quintas-feiras na re... [data, data science, data analysis, big data] 2018-12-19 18:57:12.638 3.0 61.0 4 data data science data analysis big data None 3837 64 0.800951 18
710 andrew ng’s machine learning course in python ... continuing from programming assignment 2 (logi... [machine learning, data science, andrew ng, ... 2018-12-19 17:10:50.437 7.0 67.0 5 machine learning data science andrew ng logistic regression python 6091 98 0.813179 17
713 predictive power (dictate the future) it’s a new era, whoever possesses the best mod... [machine learning, data science, artificial ... 2018-12-19 17:01:00.891 5.0 81.0 5 machine learning data science artificial intelligence ai deep learning 4993 37 0.838315 17
715 how to calculate a binary tree’s height -part ... data structures and algorithms are the heart a... [programming, algorithms, ruby, tech, data... 2018-12-19 16:58:45.611 10.0 138.0 5 programming algorithms ruby tech data science 10133 78 0.895380 16
716 a different kind of (deep) learning: part 2 in the previous post, we’ve discussed some sel... [machine learning, artificial intelligence, ... 2018-12-19 16:51:00.950 9.0 91.0 3 machine learning artificial intelligence data science None None 10129 43 0.844429 16
722 from content-based recommendations to personal... a primary goal of the data science team at ups... [data science, machine learning, recommendat... 2018-12-19 16:22:14.688 7.0 141.0 5 data science machine learning recommendation system personalization travel 6884 65 0.896739 16
726 logistic regression for facial recognition facial recognition algorithms have always fasc... [machine learning, data science, ethics, ar... 2018-12-19 15:51:28.303 8.0 176.0 5 machine learning data science ethics artificial intelligence facial recognition 8462 42 0.925272 15
731 machine learning and music classification: a c... in my previous blog post, introduction to musi... [machine learning, music, data science, tow... 2018-12-19 15:03:46.813 7.0 502.0 4 machine learning music data science towards data science None 8173 77 0.985054 15
733 a year in review: happy holidays from kaiko learn more about our subscription data service... [bitcoin, cryptocurrency, data science, blo... 2018-12-19 14:50:33.925 4.0 61.0 5 bitcoin cryptocurrency data science blockchain startup 4467 43 0.800951 14
734 get smarter with data science — tackling real ... the ‘data science strategic guide\u200a—\u200a... [data science, artificial intelligence, tech... 2018-12-19 14:41:49.230 17.0 1000.0 5 data science artificial intelligence technology towards data science business 22498 67 0.995245 14
735 synthetic data generation — a must-have skill ... data is the new oil and truth be told only a f... [machine learning, data science, programming... 2018-12-19 14:41:41.869 11.0 698.0 5 machine learning data science programming artificial intelligence towards data science 12588 69 0.990489 14

150 rows × 16 columns

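# Word cloud of the text of the most-clapped articles
together = " ".join(list(ninety.text))
# The scraped text stores some escape sequences as literal characters
# (for example the six characters \u200a), so strip them before tokenizing.
together = together.replace('\\u200a', '').replace('\\xa0', ' ')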
from nltk import download
from nltk.corpus import stopwords
download('stopwords')
# A set makes the membership test in the filter below O(1) per word.
cachedStopWords = set(stopwords.words("english"))
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
together = ' '.join([word for word in together.split() if word not in cachedStopWords])
from wordcloud import WordCloud, STOPWORDS
from PIL import Image
import requests
# numpy (np) and matplotlib.pyplot (plt) are already imported at the top of the notebook.
# Download the Medium logo and use it as the mask that shapes the word cloud.
mask_url = 'https://pageflows.imgix.net/media/logos/medium.jpg?auto=compress&ixlib=python-1.1.2&s=c57a812322117d896d93a63af04b2cbd'
mask = np.array(Image.open(requests.get(mask_url, stream=True).raw))
# Extend wordcloud's built-in stop words with frequent but uninformative terms.
STOPWORDS.update([
    'one', 'need', 'first', 'two', 'see', 'make', 'find', 'help', 'based', 'using',
    'great', 'example', 'new', 'many', 'good', 'well', 'look', 'way', 'take', 'want',
    'article', 'part', 'us', 'used', 'thing', 'work', 'important', 'know', 'use', 'let',
])
def generate_wordcloud(words, mask):
    """Draw a word cloud of `words` in the shape given by `mask`."""
    # When a mask is passed, WordCloud takes its canvas size from the mask
    # and ignores width and height.
    word_cloud = WordCloud(width=8000, height=4000, background_color='white',
                           stopwords=STOPWORDS, mask=mask).generate(words)
    plt.figure(figsize=(20, 12), facecolor='white', edgecolor='blue')
    plt.imshow(word_cloud, interpolation="bilinear")
    plt.axis('off')
    plt.tight_layout(pad=0)
    plt.show()

generate_wordcloud(together, mask)
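The text preparation that feeds the word cloud (join the articles, strip literal escape sequences, drop stop words) can be tried out in isolation. The following is only a minimal sketch, not the code used for the article: `clean_corpus`, `DEMO_STOPWORDS`, and `LITERAL_ESCAPES` are hypothetical names, and the tiny stop-word set stands in for the nltk list so that nothing has to be downloaded.

```python
import re

# Hypothetical stand-in for nltk's English stop-word list (assumption:
# kept tiny so the sketch runs without downloading a corpus).
DEMO_STOPWORDS = {"the", "is", "a", "of", "and", "to", "in"}

# Matches escape sequences that survived as literal text in the scraped
# strings, e.g. the six characters \u200a or the four characters \xa0.
LITERAL_ESCAPES = re.compile(r"\\u[0-9a-fA-F]{4}|\\x[0-9a-fA-F]{2}")

def clean_corpus(texts, stopwords=DEMO_STOPWORDS):
    """Join article texts, blank out literal escapes, and drop stop words."""
    joined = " ".join(texts)
    joined = LITERAL_ESCAPES.sub(" ", joined)
    return " ".join(word for word in joined.split() if word not in stopwords)

# Made-up sample strings that mimic the escapes seen in the scraped text.
sample = [
    "data science is a broad term\\u200athat blends",
    "the year 2018\\xa0is over",
]
print(clean_corpus(sample))
```

Unlike the notebook cell, this sketch replaces each escape with a space rather than an empty string, so that neighbouring words are not glued together; either choice yields the same tokens once WordCloud splits on non-word characters.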