Okay, it’s been a long time since I worked on this blog, but I’m ready to revisit it. When I left off, I was trying to do some topic modeling on Elon Musk’s Tweets, but the results were decidedly not stellar. I’d been working from a Kaggle dataset, and it just didn’t contain enough data.

In order to gather more data, I decided to scrape Twitter myself instead of trying to find existing datasets. This would have been a short exercise if I’d just gone with an existing library like tweepy, but I wanted to challenge myself and create my own library for scraping Twitter data. I also didn’t need anywhere near the full functionality of the API, so I wanted a lightweight library that contained only the features I actually use. The TL;DR is that the library is available on GitHub.

The first thing I had to do was create an app on the Twitter Developers page. This gave me the credentials I would need to access the public API and actually retrieve the data.
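For reference, the handshake looks roughly like this. This is a minimal sketch of Twitter’s application-only OAuth 2.0 flow using `requests`, not the exact code in my library, and the credential values are placeholders you’d copy from your app’s page:

```python
import base64
from urllib.parse import quote

import requests

# Placeholder credentials copied from the app's page on the
# Twitter Developers site.
CONSUMER_KEY = "your-consumer-key"
CONSUMER_SECRET = "your-consumer-secret"

def get_bearer_token(key: str, secret: str) -> str:
    """Exchange the app's consumer key/secret for a bearer token via
    Twitter's application-only OAuth 2.0 flow."""
    # Twitter expects the key and secret URL-encoded, joined with a
    # colon, then base64-encoded for HTTP Basic auth.
    creds = base64.b64encode(f"{quote(key)}:{quote(secret)}".encode()).decode()
    resp = requests.post(
        "https://api.twitter.com/oauth2/token",
        headers={"Authorization": f"Basic {creds}"},
        data={"grant_type": "client_credentials"},
    )
    resp.raise_for_status()
    return resp.json()["access_token"]

# Every subsequent API call carries the token in an Authorization header.
token = get_bearer_token(CONSUMER_KEY, CONSUMER_SECRET)
headers = {"Authorization": f"Bearer {token}"}
```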

One of the more interesting things about this API compared to other (private, internal) APIs I’ve used in the past is that it is rate limited. A substantial amount of the effort that went into this library went toward respecting the assigned rate limits so that I wouldn’t get suspended for slamming the API. Considering each request returns a couple hundred Tweets at most, it would be very easy to let the app run away and exceed those limits, especially on a search that matches a large number of Tweets.

An important consideration is that the rate limits are defined on a per-endpoint basis: the search endpoint, for example, has a separate limit from the timeline endpoint. With this in mind, I decided to retrieve the current limits when the library first authenticates, and then cache the numbers in memory for each endpoint. The library routes all requests through a single function, which checks the cached rate limit for the endpoint in question and pauses for 15 minutes (one full rate-limit window) if the limit would be exceeded. It’s not perfect by any stretch of the imagination, and it could be improved to wait only until the current window actually resets, but for my purposes it’s fine.
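In sketch form, that single chokepoint looks something like the following. This is a simplified illustration rather than the library’s actual code: it assumes a `rate_limits` dict populated at authentication time from the (real) `application/rate_limit_status.json` endpoint, and it leans on the `x-rate-limit-remaining` header Twitter returns on every response:

```python
import time

import requests

API_ROOT = "https://api.twitter.com/1.1"
WINDOW_SECONDS = 15 * 60  # Twitter's rate-limit windows are 15 minutes

# Populated at authentication time from /application/rate_limit_status.json;
# a flat {endpoint path: remaining calls} dict is a simplification here.
rate_limits = {}

def api_get(endpoint: str, params: dict, headers: dict) -> dict:
    """Single chokepoint for all requests: check the cached limit for
    this endpoint and sleep out a full window if it's exhausted."""
    if rate_limits.get(endpoint, 1) <= 0:
        # Crude but safe: after a full window the limit has reset.
        time.sleep(WINDOW_SECONDS)
        rate_limits.pop(endpoint, None)

    resp = requests.get(f"{API_ROOT}{endpoint}", params=params, headers=headers)
    resp.raise_for_status()

    # Twitter echoes the live count back on every response, so the
    # cache stays fresh without extra status calls.
    remaining = resp.headers.get("x-rate-limit-remaining")
    if remaining is not None:
        rate_limits[endpoint] = int(remaining)
    return resp.json()
```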

I was originally going to use just the search endpoint to simplify everything, but on the free public API Twitter samples search results pretty heavily (and standard search only goes back roughly a week), so just doing a search for Tweets from a single account will not give you back anywhere close to that user’s full history. The timeline endpoint has its own limitation of only returning a user’s most recent ~3,200 Tweets, but within that window it’s at (or near) full fidelity, so it’s far preferable to searching if your goal is to get Tweets from a specific account.
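To make the timeline mechanics concrete, here’s a rough sketch of paging backwards through `statuses/user_timeline` using the `api_get` helper above. The endpoint returns at most 200 Tweets per call, newest first, and `max_id` is inclusive, so each page asks for IDs strictly below the oldest one already seen; Twitter stops returning results around the 3,200-Tweet mark:

```python
def fetch_timeline(screen_name: str, headers: dict) -> list:
    """Page backwards through a user's timeline until Twitter stops
    returning Tweets (at roughly the 3,200 most recent)."""
    tweets = []
    max_id = None
    while True:
        params = {
            "screen_name": screen_name,
            "count": 200,              # the per-request maximum
            "tweet_mode": "extended",  # full, untruncated Tweet text
        }
        if max_id is not None:
            params["max_id"] = max_id
        batch = api_get("/statuses/user_timeline.json", params, headers)
        if not batch:
            break
        tweets.extend(batch)
        # max_id is inclusive, so step just below the oldest ID seen.
        max_id = batch[-1]["id"] - 1
    return tweets

timeline = fetch_timeline("elonmusk", headers)
```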

With the ability to retrieve a fresh timeline, I can combine it with the existing dataset to roughly double the data at my disposal, and I can keep adding to the dataset by rerunning the script periodically. That still probably won’t be enough, so I will need to finish the search endpoint option to be able to retrieve Tweets by hashtag or user mention as well.
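The combine step itself is simple because every Tweet carries a globally unique ID, which makes a safe dedup key where the two sources overlap. A minimal sketch, assuming the Kaggle data and the freshly scraped Tweets are in CSVs (hypothetical file names) that share at least an `id` column:

```python
import pandas as pd

# Hypothetical file names; both sources are assumed to share
# at least "id" and "text" columns.
kaggle = pd.read_csv("kaggle_musk_tweets.csv")
scraped = pd.read_csv("scraped_musk_tweets.csv")

# Tweet IDs are globally unique, so dropping duplicate IDs cleanly
# handles any overlap between the two sources.
combined = (
    pd.concat([kaggle, scraped], ignore_index=True)
    .drop_duplicates(subset="id")
)
combined.to_csv("musk_tweets_combined.csv", index=False)
```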