Sentiment analysis of Thameslink Tweets using Python 3

python Jun 15, 2018

This post is based on the awesome workshop and tutorial from Rodolfo Ferro, which you can find here.

I have wanted to undertake Twitter Natural Language Processing (NLP) for a while, and the recent Thameslink debacle (see here and here) presented a great opportunity to explore the Twitter API and NLP. This post is designed as a tutorial on how to extract data from Twitter and perform basic sentiment analysis of the Tweets. You can see the Gist here.

To get started, you need to ensure you have Python 3 installed, along with the following packages:

  • Numpy: This is a library for scientific computing with Python;
  • Pandas: This is a library used for high-performance, easy-to-use data structures and data analysis tools;
  • Tweepy: This is a library for accessing the Twitter API;
  • Matplotlib: This is a library which produces publication-quality figures;
  • Seaborn: This is a library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics;
  • Textblob: This is a library for processing textual data.
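
If any of these are missing from your environment, they can all be installed in one go via pip:

pip install numpy pandas tweepy matplotlib seaborn textblob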

With that done, let's get on with the fun!

Extracting Twitter data using Tweepy

The first stage is to import the required libraries; these can all be installed via pip.

# Import core libraries
import tweepy           # Obtain Tweets
import pandas as pd     # Store and manage Tweets
import numpy as np      # Number processing

# Setup plotting and visualisation
from IPython.display import display
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

print('Libraries imported')

To extract Twitter Tweets, you need a Twitter account, and a Twitter Developer account. You will need to extract the following details from the Twitter Developer pages for the specific app:

  • Consumer Key (API Key)
  • Consumer Secret (API Secret)
  • Access Token
  • Access Token Secret

Save these in a file called credentials.py, for example as follows:

# Twitter credentials.py

# Consumer (your) keys
CONSUMER_KEY    = ''
CONSUMER_SECRET = ''

# Access tokens
ACCESS_TOKEN  = ''
ACCESS_SECRET = ''

Why create this extra file, I hear you ask? It is important to keep the confidential access tokens and secret keys hidden, and you should never include this information in the body of your code. Best practice is to abstract these credentials away into a separate file like this, or to store them as environment variables.
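
As a minimal sketch of the environment-variable alternative (the TWITTER_* variable names below are my own choice, not a required convention):

# Sketch: read credentials from environment variables instead of a file.
# Assumes you have exported these (hypothetical) names in your shell first.
import os

CONSUMER_KEY    = os.environ['TWITTER_CONSUMER_KEY']
CONSUMER_SECRET = os.environ['TWITTER_CONSUMER_SECRET']
ACCESS_TOKEN    = os.environ['TWITTER_ACCESS_TOKEN']
ACCESS_SECRET   = os.environ['TWITTER_ACCESS_SECRET']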

With the access tokens stored in credentials.py, we can define a function that imports them (via from credentials import *) and authenticates our requests with the Twitter API.

# Import Twitter access tokens
from credentials import * 

# Authenticate with the API by defining a function
def twitter_setup():

    # Authentication and access using keys
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

    # Obtain authenticated API
    api = tweepy.API(auth)
    return api

So far, we have set up the environment by importing the core libraries and defining the authentication function. It is now time to extract some Tweets!

We do this by defining the function get_user_tweets, which pages through the user's timeline with Tweepy's Cursor object. It is important to remember that each element in the returned list is a tweet object from Tweepy.

Be aware that the Twitter API limits you to roughly the 3,200 most recent Tweets!

# We create an extractor object (holding the API data) by calling our twitter_setup() function
extractor = twitter_setup()

def get_user_tweets(api, username):
    """Return a list of Tweets from the given user's timeline."""
    tweets = []
    for status in tweepy.Cursor(api.user_timeline, screen_name=username).items():
        tweets.append(status)
    return tweets

alltweets = get_user_tweets(extractor, 'TLRailUK')

print("Number of tweets extracted: {}.\n".format(len(alltweets)))

Let's look at the 5 most recent Tweets.

Tip: It is always wise to explore your data!

# We print the most recent 5 tweets for reference
print("5 recent tweets:\n")
for tweet in alltweets[:5]:
    print(tweet.text)
    print()
5 recent tweets:

@turnipshire @SouthernRailUK @GatwickExpress Hi. Was this a delay repay? ^Neil

@kidgorge0us Hi, there are some running now but the ones that start/terminate at Blackfriars will be extended up th… https://t.co/0NVReOpydB

@richardafraser Sorry about this Richard, where are you trying to get to? ^Nat

@dancechik2 @SouthernRailUK Hi Rachael. Apologies for the ongoing disruption. We are working our hardest to try and… https://t.co/s89ngwbJ4b

@RobYoung_82 Hi Rob, it'll go as far as Kings Cross - my apologies for this. It'll be for operation reasons - I exp… https://t.co/YWLFqVRirm

Create a pandas DataFrame

We now have the Tweet data from @TLRailUK stored in a list. While we could perform analysis on it directly, it is much easier to manipulate once loaded into a pandas DataFrame.

IPython's display function renders output in a friendly, easily interpretable way, and the head method of a DataFrame lets us view a select number of its elements, in this case the first 10.

# We create a pandas DataFrame as follows
# Note: We loop through each element and add it to the DataFrame
data = pd.DataFrame(data=[tweet.text for tweet in alltweets], columns=['Tweets'])

# We display the first 10 elements of the DataFrame:
display(data.head(10))
Tweets
0 @turnipshire @SouthernRailUK @GatwickExpress H...
1 @kidgorge0us Hi, there are some running now bu...
2 @richardafraser Sorry about this Richard, wher...
3 @dancechik2 @SouthernRailUK Hi Rachael. Apolog...
4 @RobYoung_82 Hi Rob, it'll go as far as Kings ...
5 @skindada Hi. This should be included with you...
6 @worldwidewil80 we are very sorry for the dela...
7 @Jasontart @hselftax @tlupdates we apologise f...
8 @Freewheel1ng @hitchincommuter @GNRailUK we ar...
9 @LiamConnors2 Hi Liam, we have an amended serv...

The DataFrame not only makes the data easier to read, but also ensures it is ordered and manageable.

Side note: As Rodolfo Ferro points out, the Tweet objects (in this case the elements of the alltweets list) have a number of internal attributes and methods we can call. To find them, do the following:

print(dir(alltweets[0]))
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_api', '_json', 'author', 'contributors', 'coordinates', 'created_at', 'destroy', 'entities', 'favorite', 'favorite_count', 'favorited', 'geo', 'id', 'id_str', 'in_reply_to_screen_name', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'is_quote_status', 'lang', 'parse', 'parse_list', 'place', 'retweet', 'retweet_count', 'retweeted', 'retweets', 'source', 'source_url', 'text', 'truncated', 'user']

You can see from the output the metadata contained in a single Tweet. We can extract any of the attributes listed above by doing the following:

# Print a selection of attributes from the first Tweet
print(alltweets[0].id)
print(alltweets[0].created_at)
print(alltweets[0].source)
print(alltweets[0].favorite_count)
print(alltweets[0].retweet_count)
print(alltweets[0].geo)
print(alltweets[0].coordinates)
print(alltweets[0].entities)
1007646739921293319
2018-06-15 15:31:18
CX Social
0
0
None
None
{'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'turnipshire', 'name': 'Sarah Turner', 'id': 18056977, 'id_str': '18056977', 'indices': [0, 12]}, {'screen_name': 'SouthernRailUK', 'name': 'Southern', 'id': 171114552, 'id_str': '171114552', 'indices': [13, 28]}, {'screen_name': 'GatwickExpress', 'name': 'Gatwick Express', 'id': 163816182, 'id_str': '163816182', 'indices': [29, 44]}], 'urls': []}

To keep the DataFrame manageable, you should only import the attributes needed for the task at hand. In this case, we are only looking to analyse the sentiment and popularity of Tweets, so we only need the attributes that hold this information.

Data minimisation in the workspace is key here...

# Add attributes of interest
data['len']  = np.array([len(tweet.text) for tweet in alltweets])
data['ID']   = np.array([tweet.id for tweet in alltweets])
data['Date'] = np.array([tweet.created_at for tweet in alltweets])
data['Source'] = np.array([tweet.source for tweet in alltweets])
data['Likes']  = np.array([tweet.favorite_count for tweet in alltweets])
data['RTs']    = np.array([tweet.retweet_count for tweet in alltweets])

# Display the first 10 elements of the DataFrame
display(data.head(10))
Tweets len ID Date Source Likes RTs
0 @turnipshire @SouthernRailUK @GatwickExpress H... 78 1007646739921293319 2018-06-15 15:31:18 CX Social 0 0
1 @kidgorge0us Hi, there are some running now bu... 140 1007646348487913472 2018-06-15 15:29:45 CX Social 0 0
2 @richardafraser Sorry about this Richard, wher... 78 1007645784878264320 2018-06-15 15:27:30 CX Social 0 0
3 @dancechik2 @SouthernRailUK Hi Rachael. Apolog... 140 1007645725633675264 2018-06-15 15:27:16 CX Social 0 0
4 @RobYoung_82 Hi Rob, it'll go as far as Kings ... 140 1007645479000330240 2018-06-15 15:26:17 CX Social 0 0
5 @skindada Hi. This should be included with you... 92 1007644426020315142 2018-06-15 15:22:06 CX Social 0 0
6 @worldwidewil80 we are very sorry for the dela... 106 1007643727878348801 2018-06-15 15:19:20 CX Social 0 0
7 @Jasontart @hselftax @tlupdates we apologise f... 124 1007643354199404544 2018-06-15 15:17:51 CX Social 0 0
8 @Freewheel1ng @hitchincommuter @GNRailUK we ar... 89 1007637878602682368 2018-06-15 14:56:05 CX Social 0 0
9 @LiamConnors2 Hi Liam, we have an amended serv... 140 1007637363168858112 2018-06-15 14:54:03 CX Social 0 0
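
As an aside, the same DataFrame could be built in a single pass rather than column by column; a sketch (column order follows dict insertion order in modern pandas):

# Sketch: construct the full DataFrame in one call
data = pd.DataFrame({
    'Tweets': [tweet.text for tweet in alltweets],
    'len':    [len(tweet.text) for tweet in alltweets],
    'ID':     [tweet.id for tweet in alltweets],
    'Date':   [tweet.created_at for tweet in alltweets],
    'Source': [tweet.source for tweet in alltweets],
    'Likes':  [tweet.favorite_count for tweet in alltweets],
    'RTs':    [tweet.retweet_count for tweet in alltweets],
})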

So far, we have obtained the Tweets from Twitter and imported them into a DataFrame. Now it is time to perform some analysis!

Visualisation and basic statistics

Averages and popularity

We first want to calculate some basic statistics, such as the mean Tweet length in characters and the Tweets with the most likes and retweets. This gives us a top-level view of the structure of the data and the popularity of the account's Tweets.

You can obtain the mean using the numpy package, which works well with pandas DataFrames.

# We extract the mean of the Tweet lengths
mean = np.mean(data['len'])

print("The average length of tweets: {}".format(mean))

This gives us an output of: The average length of tweets: 110.5602465331279

To gain further insights, we can use functions built into pandas, such as:

# We extract the tweets which were the most favourited and retweeted
fav_max = np.max(data['Likes'])
rt_max  = np.max(data['RTs'])

fav = data[data.Likes == fav_max].index[0]
rt  = data[data.RTs == rt_max].index[0]

# Max likes
print("The tweet with the most likes is: \n{}".format(data['Tweets'][fav]))
print("Number of likes: {}".format(fav_max))
print("{} characters.\n".format(data['len'][fav]))

# Max retweets
print("The tweet with the most retweets is: \n{}".format(data['Tweets'][rt]))
print("Number of retweets: {}".format(rt_max))
print("{} characters.\n".format(data['len'][rt]))
The tweet with the most likes is: 
@catherinerusse2 Hi Catherine. The 06.18 is expected to run - https://t.co/bmg0Twjqwt

Appreciate you are frustrate… https://t.co/o8HUJTgLBx
Number of likes: 38
140 characters.

The tweet with the most retweets is: 
⚠️ An amended timetable is in operation today. Only services we plan to run will be showing in online journey plann… https://t.co/p0RsxTPi6b
Number of retweets: 9
140 characters.
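
As an aside, pandas can find these rows more directly with idxmax (the index label of the maximum value); a minimal equivalent of the lookups above:

# Sketch: equivalent lookups using idxmax
fav = data['Likes'].idxmax()
rt  = data['RTs'].idxmax()
print(data['Tweets'][fav], data['Likes'][fav])
print(data['Tweets'][rt], data['RTs'][rt])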

With that done, we can now begin to visualise some of the data.

Time series

Pandas has its own objects for handling time series data, and since Tweets are timestamped (the created_at attribute), we naturally have a time series. We can construct plots of Tweet lengths, likes and retweets over time.

# We create time series by using length, likes and retweets 
tlen = pd.Series(data=data['len'].values, index=data['Date'])
tfav = pd.Series(data=data['Likes'].values, index=data['Date'])
tret = pd.Series(data=data['RTs'].values, index=data['Date'])

And if we want to plot the time series, pandas again provides a built-in plot method on the object.

# Lengths over time
tlen.plot(figsize=(16,4), color='r');

Time Series of Tweet Lengths

And to plot the likes versus the retweets in the same chart:

# Likes vs retweets plot
tfav.plot(figsize=(16,4), label="Likes", legend=True)
tret.plot(figsize=(16,4), label="Retweets", legend=True);

Time series of likes compared to retweets
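
The per-Tweet series is quite noisy. Because each series has a datetime index, one option is to resample it into fixed buckets; a sketch counting Tweets per hour (the hourly bucket size is my own choice):

# Sketch: resample the length series into hourly buckets and
# count how many Tweets fall in each one
tweets_per_hour = tlen.resample('1H').count()
tweets_per_hour.plot(figsize=(16,4), color='b');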

Pie charts of sources

The next part is to explore how Tweets are being made, as not every Tweet is sent from the same source. We do this by using the source variable.

# We obtain all possible sources from the data
sources = []
for source in data['Source']:
    if source not in sources:
        sources.append(source)

# We print the source list
print("Creation of content sources:")
for source in sources:
    print("* {}".format(source))
Creation of content sources:
* CX Social
* Twitter Web Client
* TweetDeck
* Twitter for iPhone

We can see from the output that @TLRailUK Tweets from four platforms. We can use this information to count the frequency of occurrence and plot it as a pie chart.

# We create a numpy vector of counts, one entry per source
percent = np.zeros(len(sources))

for source in data['Source']:
    for index in range(len(sources)):
        if source == sources[index]:
            percent[index] += 1

# Convert the counts into fractions of the total
percent /= len(data['Source'])

# Render the pie chart:
pie_chart = pd.Series(percent, index=sources, name='Sources')
pie_chart.plot.pie(fontsize=11, autopct='%.2f', figsize=(6, 6));

Chart of sources
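
For what it's worth, pandas can do the counting in one line with value_counts; a sketch equivalent to the loop above:

# Sketch: value_counts() tallies each source; autopct converts to percentages
data['Source'].value_counts().plot.pie(fontsize=11, autopct='%.2f', figsize=(6, 6));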

Sentiment analysis

We can use the TextBlob library to perform sentiment analysis, together with the re module for regular expressions. This is performed in two stages:

  • Clean the Tweets, which means stripping out links, mentions and any other non-alphanumeric characters;
  • Pass each cleaned Tweet to a classifier that assesses its polarity.

This is achieved as follows.

from textblob import TextBlob      # Import NLP package
import re                          # Import re for regex

def clean_tweet(tweet):
    '''
    Utility function to clean the text in a Tweet by removing
    links and special characters using the re regex module.
    '''
    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())

def analyze_sentiment(tweet):
    '''
    Utility function to classify the polarity of a Tweet
    using TextBlob.
    '''
    analysis = TextBlob(clean_tweet(tweet))
    if analysis.sentiment.polarity > 0:
        return 1
    elif analysis.sentiment.polarity == 0:
        return 0
    else:
        return -1

TextBlob comes pre-packaged with a trained classifier, which means you do not need to train, code or label any data (though you can if you want to…). The TextBlob library has been designed to work with different machine learning algorithms (I'll be posting more about this in future), and it can be inter-linked with other data science and NLP packages.
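
To see what the classifier actually returns before we collapse it to -1/0/1, you can inspect the sentiment property directly; a quick sketch (the sample sentence is made up):

# Sketch: polarity runs from -1.0 (negative) to 1.0 (positive),
# subjectivity from 0.0 (objective) to 1.0 (subjective)
example = TextBlob(clean_tweet("We are very sorry for the delay to your journey"))
print(example.sentiment)           # Sentiment(polarity=..., subjectivity=...)
print(example.sentiment.polarity)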

When we pass each Tweet to the classifier, it assesses the sentiment as positive (1), negative (-1) or neutral (0). Once the scores have been computed, we store them in a new column of the DataFrame.

# We create a column populated with the sentiment score
data['SA'] = np.array([ analyze_sentiment(tweet) for tweet in data['Tweets'] ])

# We display the DataFrame with updated score
display(data.head(10))
Tweets len ID Date Source Likes RTs SA
0 @turnipshire @SouthernRailUK @GatwickExpress H... 78 1007646739921293319 2018-06-15 15:31:18 CX Social 0 0 0
1 @kidgorge0us Hi, there are some running now bu... 140 1007646348487913472 2018-06-15 15:29:45 CX Social 0 0 0
2 @richardafraser Sorry about this Richard, wher... 78 1007645784878264320 2018-06-15 15:27:30 CX Social 0 0 -1
3 @dancechik2 @SouthernRailUK Hi Rachael. Apolog... 140 1007645725633675264 2018-06-15 15:27:16 CX Social 0 0 0
4 @RobYoung_82 Hi Rob, it'll go as far as Kings ... 140 1007645479000330240 2018-06-15 15:26:17 CX Social 0 0 1
5 @skindada Hi. This should be included with you... 92 1007644426020315142 2018-06-15 15:22:06 CX Social 0 0 0
6 @worldwidewil80 we are very sorry for the dela... 106 1007643727878348801 2018-06-15 15:19:20 CX Social 0 0 -1
7 @Jasontart @hselftax @tlupdates we apologise f... 124 1007643354199404544 2018-06-15 15:17:51 CX Social 0 0 0
8 @Freewheel1ng @hitchincommuter @GNRailUK we ar... 89 1007637878602682368 2018-06-15 14:56:05 CX Social 0 0 0
9 @LiamConnors2 Hi Liam, we have an amended serv... 140 1007637363168858112 2018-06-15 14:54:03 CX Social 0 0 0

As we can see, the last column contains the sentiment score (SA).

Analysing the results

In this post, I will provide a top-level summary of the results (I'll provide more detail in a future post): the percentage of positive, negative and neutral Tweets.

# We split the Tweets into lists by sentiment score
pos_tweets = [ tweet for index, tweet in enumerate(data['Tweets']) if data['SA'][index] > 0]
neu_tweets = [ tweet for index, tweet in enumerate(data['Tweets']) if data['SA'][index] == 0]
neg_tweets = [ tweet for index, tweet in enumerate(data['Tweets']) if data['SA'][index] < 0]

# We print percentages
print("Percentage of positive tweets: {}%".format(len(pos_tweets)*100/len(data['Tweets'])))
print("Percentage of neutral tweets: {}%".format(len(neu_tweets)*100/len(data['Tweets'])))
print("Percentage of negative tweets: {}%".format(len(neg_tweets)*100/len(data['Tweets'])))
Percentage of positive tweets: 21.10939907550077%
Percentage of neutral tweets: 47.057010785824346%
Percentage of negative tweets: 31.833590138674886%
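
Incidentally, the same percentages can be computed in one line from the SA column; a sketch using value_counts:

# Sketch: fraction of Tweets per class (1 = positive, 0 = neutral, -1 = negative)
print(data['SA'].value_counts(normalize=True) * 100)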

As you would expect, most of the Tweets are neutral. However, there are quite a few negative Tweets, possibly related to the timetabling issues Thameslink are currently experiencing. It is important to remember that we are only dealing with roughly the 3,200 most recent Tweets; Twitter, sadly, does not allow for deeper historical analysis. This analysis was also performed pre-rush hour.

This post was intended as a basic introduction; more blog posts describing more complex analysis are in the pipeline.
