I’ve been on a kick lately of learning how to collect, process, and analyze Twitter data. I’m an MMA fan, so I set up a stream from the Twitter API on Saturday, July 11th during UFC 189. The data used throughout this post covers the end of the Rory MacDonald vs. Robbie Lawler fight (rounds 4-5) and the complete Chad Mendes vs. Conor McGregor bout. The end result of the analysis is an animation along the timeline of the fights showing tweet counts per minute, the most frequently tweeted words, and major moments during the event.
Overview of the Data
The dataset consists of 131,148 tweets, all containing the search term “UFC189”. Using the NLTK module in Python, the tweets were tokenized and cleaned to remove hashtags, URLs, usernames, and stop words, including the search term itself. Once preprocessing was complete, the dataset was processed in one-minute windows to identify the 25 most frequent words in each minute. In addition, I manually created a timeline of important events, such as the start and end of each fight and the end of each round, based on the tweets of BloodyElbow.com and MMAJunkie.com.
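The preprocessing described above can be sketched roughly as follows. This is a minimal illustration and not the code used for the post: it swaps NLTK's tokenizer and stop word list for plain regular expressions and a tiny hand-written stop list so that it runs with only the standard library. The names `clean_tokens`, `top_words_per_minute`, and `STOP_WORDS` are hypothetical.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime

# Tiny illustrative stop list (an assumption); the post used NLTK's stop words
# plus the search term itself.
STOP_WORDS = {"the", "a", "an", "and", "is", "in", "to", "of", "for", "on", "rt", "ufc189"}

URL_RE = re.compile(r"https?://\S+")
TAG_OR_MENTION_RE = re.compile(r"[#@]\w+")
WORD_RE = re.compile(r"[a-z']+")


def clean_tokens(text):
    """Lowercase a tweet and drop URLs, hashtags, usernames, and stop words."""
    # Strip URLs first, since a URL can itself contain '#' or '@'.
    text = URL_RE.sub(" ", text.lower())
    text = TAG_OR_MENTION_RE.sub(" ", text)
    tokens = (t.strip("'") for t in WORD_RE.findall(text))
    return [t for t in tokens if t and t not in STOP_WORDS]


def top_words_per_minute(tweets, n=25):
    """Return {minute: [(token, count), ...]} for (datetime, text) pairs."""
    counts = defaultdict(Counter)
    for created_at, text in tweets:
        # Truncate the timestamp to the start of its one-minute window.
        minute = created_at.replace(second=0, microsecond=0)
        counts[minute].update(clean_tokens(text))
    return {minute: c.most_common(n) for minute, c in sorted(counts.items())}


if __name__ == "__main__":
    # Made-up tweets, for illustration only.
    sample = [
        (datetime(2015, 7, 11, 23, 5, 10), "McGregor walks out! #UFC189 http://t.co/abc"),
        (datetime(2015, 7, 11, 23, 5, 40), "@ufc McGregor looks ready"),
    ]
    for minute, words in top_words_per_minute(sample).items():
        print(minute, words)
```

Truncating each timestamp to the minute gives every tweet a window key, and one `Counter` per window keeps the ranking step down to a single `most_common` call.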
Visualizing the Data
The end goal of this blog post is to create an animation. However, the animation is built from static pictures, which I will create using the ggplot2, wordcloud, grid, and gridBase packages. To start things off, I’m going to import my data.
Below is a small description of the dataframes which were imported:
- ds: This dataframe contains 3 columns: token, count, and time. These are the top 25 tokens which occur during each minute along with the frequency with which the token occurred in that window.
- tweets: The dataframe is the original dataset containing all tweets. This is only used to provide a tweet frequency per minute value.
- event.timeline: The timeline of major events previously discussed.
The tweets and event.timeline dataframes are joined to provide major events along with the frequency of tweets at each minute.
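To make the join concrete, here is a rough sketch of the same idea in Python rather than the R dataframes used for plotting. It assumes a left join that keeps every minute and attaches an event only where one exists; the post does not say which kind of join was used. The function names and the sample event text are hypothetical.

```python
from collections import Counter
from datetime import datetime


def tweets_per_minute(timestamps):
    """Count tweets in each one-minute window."""
    return Counter(ts.replace(second=0, microsecond=0) for ts in timestamps)


def join_events(counts, events):
    """Left-join per-minute counts with an event lookup keyed by minute.

    Minutes without an event get None, so every minute keeps its tweet count.
    """
    return [
        {"time": minute, "tweets": n, "event": events.get(minute)}
        for minute, n in sorted(counts.items())
    ]


if __name__ == "__main__":
    # Made-up timestamps and event, for illustration only.
    stamps = [
        datetime(2015, 7, 11, 22, 0, 5),
        datetime(2015, 7, 11, 22, 0, 50),
        datetime(2015, 7, 11, 22, 1, 1),
    ]
    events = {datetime(2015, 7, 11, 22, 1): "Round 5 starts"}
    for row in join_events(tweets_per_minute(stamps), events):
        print(row)
```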
The timeline of tweet counts per minute, annotated with major events, is created using ggplot2. The word cloud is created with the wordcloud package. Stacking the plots over one another in a single window requires the grid and gridBase packages. Below is the print_stacked_plots function, which does the following:
- Create the base timeline plot, with the hour and minute on the x-axis and the number of tweets during that minute on the y-axis.
- For each unique minute in the data:
  - Create a word cloud.
  - Create a grid to hold the word cloud and the timeline.
  - Highlight the current time on the timeline and annotate it if a major event is occurring.
  - Save the plot to a file.
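The bookkeeping behind that per-minute loop can be sketched without any drawing code. The Python below is only an outline of the control flow, not the R function itself: it decides, for each unique minute, which words go in the cloud, which time is highlighted, whether there is an event annotation, and where the frame is saved. The function name `plan_frames`, the `frames` directory, and the file naming scheme are assumptions.

```python
def plan_frames(top_words, events, max_words=15, out_dir="frames"):
    """Yield one frame description per unique minute, in time order.

    top_words: {minute_label: [(token, count), ...]}, counts in descending order
    events:    {minute_label: event description}
    """
    for index, minute in enumerate(sorted(top_words), start=1):
        yield {
            "highlight": minute,                      # time to mark on the timeline
            "cloud": top_words[minute][:max_words],   # words shown in the word cloud
            "annotation": events.get(minute),         # None when nothing major happens
            "path": f"{out_dir}/frame_{index:03d}.png",
        }


if __name__ == "__main__":
    # Made-up counts and event, for illustration only.
    top = {"23:06": [("mendes", 3)], "23:05": [("mcgregor", 5), ("walkout", 2)]}
    for frame in plan_frames(top, {"23:05": "Fight starts"}):
        print(frame)
```

Zero-padding the frame number keeps the files in chronological order when they are sorted by name, which matters once they are stitched into an animation.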
The functions for drawing the word cloud and base timeline tweet frequency plot can be found below.
The word cloud was set up to show only the 15 most frequent words, with all words set to the same orientation. These settings keep the plots consistent from one minute to the next. The event timeline is a basic ggplot2 line plot. I also use the fte_theme to help style the plot. Below is the plot at the beginning of the Mendes/McGregor fight.
Animating the Timeline
The end result of running the print_stacked_plot function is 83 individual image files, which are then converted into a single GIF using the animation package. In my example I will also be using the im.convert function, which is a wrapper for ImageMagick. You will need to download and install ImageMagick to use the im.convert function.
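For readers curious what such a wrapper hands to ImageMagick, here is a hedged sketch that builds a comparable command line from Python. It is not what im.convert runs internally; it only uses ImageMagick's `-delay` and `-loop` options. The frame file names, the output name, and the one-second delay are assumptions, and newer ImageMagick installs may expect `magick` instead of `convert` as the binary.

```python
import subprocess


def build_gif_command(frame_paths, output="animation.gif", delay=100, binary="convert"):
    """Build an ImageMagick command that stitches frames into a looping GIF.

    delay is in ImageMagick ticks (1/100 s by default), so 100 is one second per frame.
    -loop 0 makes the GIF repeat forever.
    """
    return [binary, "-delay", str(delay), "-loop", "0", *sorted(frame_paths), output]


def make_gif(frame_paths, **kwargs):
    """Run the command; requires ImageMagick to be installed and on the PATH."""
    subprocess.run(build_gif_command(frame_paths, **kwargs), check=True)


if __name__ == "__main__":
    # Hypothetical frame names; print the command instead of running it.
    print(build_gif_command(["frame_002.png", "frame_001.png"], output="ufc189.gif"))
```

Sorting the paths relies on zero-padded file names so that the frames play back in chronological order.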
The resulting animation is shown below. In the animation you will see Lawler’s and MacDonald’s names as the most prominent in the beginning. Once the walkouts start for the McGregor/Mendes fight, “Sinead” comes up as a frequent word, since Sinead O’Connor performed “Foggy Dew” for the McGregor walkout. Once the McGregor/Mendes fight starts, their names are the most prominent. After the fight has finished, words such as “featherweight”, “McGregor”, and “Champion” become the most prominent.