A couple of weeks ago I wanted to collect data some data from Twitter. In the past I’ve used different methods including the TwitteR package in R and IFTTT. However I wanted a more streamlined approach to aggregate data based on a search term. I found a great tutorial on using the Python library Tweepy to set up a Twitter stream listener and import the data into a MySQL database with the MySQLdb library. However I ran into two major problems:
1. The MySQLdb library does not support Python-3.
2. Storing data from Twitter, specifically special characters such as emoji, can cause issues when inserting it into a MySQL database.
The primary goal of this post is to address the above issues. I highly recommend reviewing the PythonProgramming.net tutorials on getting started with MySQL and Tweepy as I am using the code from these tutorials as a template.
MySQL and Python-3
There is a simple solution solution for the compatibility issue between MySQLdb and Python 3. MySQL provides a connector for Python 3 which is similar to the connector used in the MySQLdb library. You can download the connector here. I’ll come back to using the connector later in the post.
The Emoji Problem
When collecting Twitter data you are bound to run into Emoji. MySQL’s UTF8 only allocates 3 bytes per character, but the whole range of of UTF8 characters, including Emoji, requires 4 bytes. MySQL defaults to UTF8 when you create a table, therefore you must specify the character set. The character set we want is utf8mb4. This character set provides 4 bit UTF8 functionality. Below I create a table to allow for storing the time, username, and tweet text information.
Usernames are limited to 15 characters and tweets to 140 characters, but I have had issues with tweet text returned by the listener exceeding these limits. Since memory optimization is not mission critical for this application I allow 255 characters for the username and the text of the tweet just to be safe.
Now we are ready to run our python script. The functions provided in the MySQL connector are nearly identical to the functions in the MySQLdb library. When setting up the connector I make sure to specify the character set as utf8mb4.
Now that the connector is set up we are ready to access twitter. You’ll need to enter your API access information and set up the stream listener. I have it set up to find tweets referencing Philly in this example. In the past I’ve received errors that the field was not present in the data streaming in. To circumvent this issue I added a couple of lines of code to ensure that the text field is present in the data returned by the listener.
Using the MySQL connector in place of the MySQLdb library provides functionality regardless of the version of Python installed. Altering the character set allows us to capture tweets containing emoji and other uncommon characters. Implementing these two changes will allow you to stream tweets directly to your MySQL database.