Twitter Sentiment Analysis in Python

May 19, 2020

Sentiment analysis is the automated process of labeling text as negative, positive or neutral in sentiment. Social media platforms like Twitter, Facebook and Instagram naturally lend themselves to sentiment analysis. Specifically, businesses can use the sentiment scores associated with their products to reach a broad audience and connect with customers without intermediaries. In this post, we will use the Python Twitter API wrapper, Tweepy, to pull tweets related to keywords, topics or categories of our choosing. After pulling tweets related to the movie Uncut Gems, we will perform sentiment analysis on them using another Python library called textblob.

Let’s get started!

First, you’ll need to apply for a Twitter developer account here:

 

After your developer account has been approved, you need to create a Twitter application:

 

The steps for applying for a Twitter developer account and creating a Twitter application are outlined here.

We will be using the free python library tweepy in order to access the Twitter API. Documentation for tweepy can be found here.

  1. INSTALLATION

First, make sure you have tweepy installed. Open up a command line and type:

pip install tweepy

  2. IMPORT LIBRARIES

Next, open up your favorite editor and import the tweepy and pandas libraries:

import tweepy
import pandas as pd

  3. AUTHENTICATION

Next, we need our consumer key and access token:

 

Notice that the site suggests that you keep your key and token private! Here we define a fake key and token, but you should use the real key and token generated when you created the Twitter application as shown above:

consumer_key = '5GBi0dCerYpy2jJtkkU3UwqYtgJpRd' 
consumer_secret = 'Q88B4BDDAX0dCerYy2jJtkkU3UpwqY'
access_token = 'X0dCerYpwi0dCerYpwy2jJtkkU3U'
access_token_secret = 'kly2pwi0dCerYpjJtdCerYkkU3Um'
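
Since the key and token should stay private, one option is to read them from environment variables instead of typing them into the script. The sketch below is only an illustration: the four variable names and the `load_twitter_credentials` helper are hypothetical and are not part of Tweepy or the Twitter developer site.

```python
import os

# Hypothetical environment variable names; pick whatever fits your setup.
REQUIRED_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_twitter_credentials(env=None):
    """Return the four credentials as a dict, or raise if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

The returned values could then stand in for the consumer_key, consumer_secret, access_token and access_token_secret variables defined above.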

The next step is creating an OAuthHandler instance. We pass in our consumer key and consumer secret, then set the access token and access token secret which we defined above:

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

Next, we pass the OAuthHandler instance into the API class:

api = tweepy.API(auth)

  4. TWITTER API REQUESTS

Next, we initialize lists for the fields we are interested in analyzing. For now, we can look at the tweet strings, the users, and the time of each tweet. We then write a for loop over a tweepy ‘Cursor’ object. Within the ‘Cursor’ object we pass the ‘api.search’ method, set the query string for what we would like to search for, and set ‘count’ = 100, which is the largest page size the standard search API returns. Here we will search for tweets about ‘Uncut Gems’. We also call the ‘items()’ method with a limit of 1000 so that the ‘Cursor’ yields at most 1000 tweets, which helps us stay within the Twitter rate limit.

In order to simplify the results, we can remove retweets and only include tweets in English. To get a sense of what this request returns, we can print the values being appended to each list as well:

twitter_users = []
tweet_time = []
tweet_string = []
# Note: in Tweepy 4.x, api.search was renamed to api.search_tweets.
for tweet in tweepy.Cursor(api.search, q='Uncut Gems', count=100).items(1000):
    # Skip retweets and keep only tweets tagged as English.
    if (not tweet.retweeted) and ('RT @' not in tweet.text):
        if tweet.lang == "en":
            twitter_users.append(tweet.user.name)
            tweet_time.append(tweet.created_at)
            tweet_string.append(tweet.text)
            print([tweet.user.name, tweet.created_at, tweet.text])
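
The filtering conditions in the loop above can be tried out without calling the Twitter API. The sketch below is a hedged illustration: `is_original_english` is a hypothetical helper name, and the `SimpleNamespace` objects are stand-ins that carry only the three attributes the filter reads, not real Tweepy status objects.

```python
from types import SimpleNamespace


def is_original_english(tweet):
    """True for tweets that are not retweets and are tagged as English."""
    is_retweet = tweet.retweeted or "RT @" in tweet.text
    return (not is_retweet) and tweet.lang == "en"


# Stand-in objects with only the attributes the filter needs.
samples = [
    SimpleNamespace(text="Uncut Gems was intense", lang="en", retweeted=False),
    SimpleNamespace(text="RT @someone: Uncut Gems!", lang="en", retweeted=False),
    SimpleNamespace(text="Uncut Gems es genial", lang="es", retweeted=False),
]
kept = [t.text for t in samples if is_original_english(t)]
print(kept)  # ['Uncut Gems was intense']
```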

For reusability, we can wrap it all up in a function that takes the keyword as input. We can also store the results in a dataframe and return the value:

def get_related_tweets(key_word):
    twitter_users = []
    tweet_time = []
    tweet_string = []
    # Note: in Tweepy 4.x, api.search was renamed to api.search_tweets.
    for tweet in tweepy.Cursor(api.search, q=key_word, count=100).items(1000):
        # Skip retweets and keep only tweets tagged as English.
        if (not tweet.retweeted) and ('RT @' not in tweet.text):
            if tweet.lang == "en":
                twitter_users.append(tweet.user.name)
                tweet_time.append(tweet.created_at)
                tweet_string.append(tweet.text)
                print([tweet.user.name, tweet.created_at, tweet.text])
    df = pd.DataFrame({'name': twitter_users, 'time': tweet_time, 'tweet': tweet_string})

    return df

We can now call the function with the keyword ‘Uncut Gems’:

get_related_tweets('Uncut Gems')

We see usernames, dates, and tweets corresponding to the keyword input “Uncut Gems”.

We can also pass in the keyword “Adam Sandler”:

get_related_tweets('Adam Sandler')

And the keyword “Julia Fox”:

get_related_tweets('Julia Fox')

And “Safdie Brothers”:

get_related_tweets('Safdie Brothers')

In order to get sentiment scores, we need a Python package called textblob. The documentation for textblob can be found here. In order to install textblob, open a command line and type:

pip install textblob

Next, import the TextBlob class:

from textblob import TextBlob

We will use the polarity score as our measure for positive or negative sentiment. The polarity score is a float with values from -1 to +1.

For example, if we define a textblob object and pass in the sentence “Uncut Gems is the best!”:

sentiment_score = TextBlob("Uncut Gems is the best!").sentiment.polarity
print("Sentiment Polarity Score:", sentiment_score)

We can also try “Adam Sandler is amazing!”:

sentiment_score = TextBlob("Adam Sandler is amazing!").sentiment.polarity
print("Sentiment Polarity Score:", sentiment_score)



A flaw I’ve noticed in using textblob is that it puts a heavier weight on the presence of negative words, despite the presence of positive adjectives, which can produce false negatives. The presence of the word ‘Uncut’ in the movie title significantly reduces the sentiment value. For example, consider the sentiment scores of “This movie is amazing” vs “Uncut Gems is amazing!”:

sentiment_score = TextBlob("This movie is amazing").sentiment.polarity
print("Sentiment Polarity Score:", sentiment_score)



sentiment_score = TextBlob("Uncut Gems is amazing!").sentiment.polarity
print("Sentiment Polarity Score:", sentiment_score)



We can see that for “Uncut Gems is amazing!”, while the sentiment is still positive, the score is significantly lower than for “This movie is amazing”, when the two should be close or equal in value. As a quick fix, we will remove the word “Uncut” from each tweet and generate sentiment scores from the result.

Let’s get sentiment polarity scores for tweets about “Uncut Gems” and store them in a data frame (before removing the word “Uncut”):

df = get_related_tweets("Uncut Gems")
df['sentiment'] = df['tweet'].apply(lambda tweet: TextBlob(tweet).sentiment.polarity)
print(df.head())

We can also count the number of positive and negative sentiments. Let’s do this by defining two dataframes, one with only positive sentiment tweets and one with only negative sentiment tweets. We can then use the ‘len()’ function to count the number of positive and negative tweets in each corresponding dataframe:

df_pos = df[df['sentiment'] > 0.0]
df_neg = df[df['sentiment'] < 0.0]
print("Number of Positive Tweets", len(df_pos))
print("Number of Negative Tweets", len(df_neg))
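
The two filters above leave out any tweet whose score is exactly 0.0. As a rough sketch of how those neutral tweets could be tallied too, the snippet below maps each score to a label using the same zero thresholds. The `label_polarity` helper is a hypothetical name, and the scores are made-up values for illustration rather than results from real tweets.

```python
from collections import Counter


def label_polarity(score):
    """Map a polarity score in [-1, 1] to a sentiment label."""
    if score > 0.0:
        return "positive"
    if score < 0.0:
        return "negative"
    return "neutral"


# Made-up scores for illustration; real values would come from df['sentiment'].
scores = [0.6, -0.25, 0.0, 0.8, 0.0]
counts = Counter(label_polarity(s) for s in scores)
print(counts["positive"], counts["negative"], counts["neutral"])  # 2 1 2
```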

As we can see there are significantly more negative tweets about “Uncut Gems” than positive tweets, but again this may be due to the presence of the word “Uncut” in the movie title which may be giving us false negatives.

Let’s modify the dataframe by removing the word “Uncut” from tweets:

df['tweet'] = df['tweet'].str.replace('Uncut', '')
df['tweet'] = df['tweet'].str.replace('uncut', '')
df['tweet'] = df['tweet'].str.replace('UNCUT', '')
df['sentiment'] = df['tweet'].apply(lambda tweet: TextBlob(tweet).sentiment.polarity)
print(df.head())
df_pos = df[df['sentiment'] > 0.0]
df_neg = df[df['sentiment'] < 0.0]
print("Number of Positive Tweets", len(df_pos))
print("Number of Negative Tweets", len(df_neg))

We can see that there are significantly more positive tweets when we remove the word “Uncut”.
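
The three replace calls above cover “Uncut”, “uncut” and “UNCUT” but would miss other casings such as “UnCut”. One possible alternative, sketched here with Python’s standard `re` module, is a single case-insensitive pattern. The `strip_uncut` helper is a hypothetical name, and note that it also collapses runs of whitespace (including line breaks) into single spaces, which the original calls do not do.

```python
import re

# \b restricts the match to the whole word, so "uncutting" would be left alone.
UNCUT = re.compile(r"\buncut\b", flags=re.IGNORECASE)


def strip_uncut(text):
    """Remove the word 'uncut' in any casing and tidy the leftover spaces."""
    without = UNCUT.sub("", text)
    return re.sub(r"\s+", " ", without).strip()


print(strip_uncut("UnCut Gems is amazing!"))  # Gems is amazing!
```

With the dataframe from this post, the helper could be applied as df['tweet'].apply(strip_uncut).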

For code reuse we can wrap it all up in a function:

def get_sentiment(key_word):
    df = get_related_tweets(key_word)
    df['tweet'] = df['tweet'].str.replace('Uncut', '')
    df['tweet'] = df['tweet'].str.replace('uncut', '')
    df['tweet'] = df['tweet'].str.replace('UNCUT', '')
    df['sentiment'] = df['tweet'].apply(lambda tweet: TextBlob(tweet).sentiment.polarity)
    df_pos = df[df['sentiment'] > 0.0]
    df_neg = df[df['sentiment'] < 0.0]
    print("Number of Positive Tweets about {}".format(key_word), len(df_pos))
    print("Number of Negative Tweets about {}".format(key_word), len(df_neg))

If we call this function with “Uncut Gems”, we get:

get_sentiment("Uncut Gems")

It would be convenient if we could visualize these results programmatically. Let’s import seaborn and matplotlib and modify our get_sentiment function:

import seaborn as sns
import matplotlib.pyplot as plt

def get_sentiment(key_word):
    df = get_related_tweets(key_word)
    df['tweet'] = df['tweet'].str.replace('Uncut', '')
    df['tweet'] = df['tweet'].str.replace('uncut', '')
    df['tweet'] = df['tweet'].str.replace('UNCUT', '')
    df['sentiment'] = df['tweet'].apply(lambda tweet: TextBlob(tweet).sentiment.polarity)
    df_pos = df[df['sentiment'] > 0.0]
    df_neg = df[df['sentiment'] < 0.0]
    print("Number of Positive Tweets about {}".format(key_word), len(df_pos))
    print("Number of Negative Tweets about {}".format(key_word), len(df_neg))
    sns.set()
    labels = ['Positive', 'Negative']
    heights = [len(df_pos), len(df_neg)]
    plt.bar(labels, heights, color = 'navy')
    plt.title(key_word)
    # Render the chart; needed when running as a script rather than a notebook.
    plt.show()
    
get_sentiment("Uncut Gems")

We can also call the function with “Adam Sandler”:

get_sentiment("Adam Sandler")

And “Julia Fox”:

get_sentiment("Julia Fox")

And “Kevin Garnett”:

get_sentiment("Kevin Garnett")

And “Lakeith Stanfield”:

get_sentiment("Lakeith Stanfield")

As you can see, tweets about Uncut Gems and its starring actors have more positive sentiment than negative sentiment.

To recap, in this post we went over how to pull tweets from Twitter using the Python Twitter API wrapper (Tweepy). We also reviewed the Python sentiment analysis package textblob and how we can use it to generate sentiment scores from tweets. Finally, we showed how we can modify tweets by removing the word “Uncut”, whose presence artificially deflated sentiment scores. It would be interesting to collect a few days of data to see how sentiment changes with time. Maybe I will save that for a future post!

Thank you for reading. The code from this post is available on GitHub.

Good luck and Happy Machine Learning!

 
