I like Elon Musk. Or at least, I like the idea of Elon Musk. I don’t know much about the guy himself, but if I had billions of dollars I would also try to send myself to Mars.

I wanted to do something with Elon’s Twitter data, and serendipitously came across a dataset on Kaggle that has his tweets already extracted in CSV format. Success! Scraping Twitter is not what I wanted to spend my time on. This is an NLP blog, after all, not an adventure in the Twitter public API.

So anyway, here is what I want to do in this step of my project:

  1. Download the data
  2. Read it into Python
  3. Tokenize the tweets
  4. Remove stopwords
  5. Count token frequencies
  6. Graph the top N most frequent tokens



Downloading the Data

Just go to the Kaggle dataset, log in, and download the data. We could spend a couple hours writing python code to log in and retrieve it for us, but I’m pretty sure they don’t want me to do that and I’m really sure I don’t want me to do that. Step 1 complete!


Reading the Data into Python

I’m using pandas for this because I like to practice pandas, and because reading tabular data into a dataframe is a really good idea if you think you might want to do column-based manipulations later (I do).

import pandas as pd
df = pd.read_csv("./data_elonmusk.csv")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 96: invalid start byte

Um…what? Why can’t the utf-8 codec read this csv file? Apparently it isn’t actually utf-8 encoded, and there’s no mention of the encoding on the Kaggle page or in the csv file itself. I will throw the chardet hammer at it.

import chardet
chardet.detect(open("./data_elonmusk.csv", 'rb').read())
{'confidence': 0.73, 'encoding': 'ISO-8859-1', 'language': ''}

Cool. 73% isn’t overwhelming confidence, but it’s plenty to go on.

import pandas as pd

df = pd.read_csv("./data_elonmusk.csv", encoding = "ISO-8859-1")
print(df.shape)
df.head()
(3218, 5)
row ID	Tweet	Time	Retweet from	User
0	Row0	@MeltingIce Assuming max acceleration of 2 to ...	2017-09-29 17:39:19	NaN	elonmusk
1	Row1	RT @SpaceX: BFR is capable of transporting sat...	2017-09-29 10:44:54	SpaceX	elonmusk
2	Row2	@bigajm Yup :)	2017-09-29 10:39:57	NaN	elonmusk
3	Row3	Part 2 https://t.co/8Fvu57muhM	2017-09-29 09:56:12	NaN	elonmusk
4	Row4	Fly to most places on Earth in under 30 mins a...	2017-09-29 09:19:21	NaN	elonmusk

Okay! That took longer than I’d like, but that should be the official slogan of NLP. Now I just want to remove retweets so that we focus only on the things that Elon himself said.

elon_data = df[pd.isnull(df["Retweet from"])]
elon_data.shape
(2693, 5)

Looks like about 16% of our data is retweets. Gross.
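
For the record, here’s the quick arithmetic behind that number (using the df and elon_data frames from above):

n_retweets = len(df) - len(elon_data)
print(f"{n_retweets} retweets, {n_retweets / len(df):.1%} of the data")
# -> 525 retweets, 16.3% of the data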


Tokenizing the Data

I’m going to use NLTK for this because it’s simple and it works. It also comes with a Twitter-aware tokenizer, which is perfect for this use case because Twitter artifacts like @mentions and #hashtags are probably important.
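
To see why that matters, here’s a quick side-by-side with NLTK’s default word_tokenize on a toy sentence (the hashtag is mine, not Elon’s):

from nltk import download as nltk_download
from nltk.tokenize import TweetTokenizer, word_tokenize

nltk_download("punkt")  # word_tokenize needs this model (newer NLTK versions may ask for "punkt_tab")

sample = "@MeltingIce Assuming max acceleration of 2 to 3 g's #BFR"
print(word_tokenize(sample))              # splits the markers off: '@', 'MeltingIce', ..., '#', 'BFR'
print(TweetTokenizer().tokenize(sample))  # keeps '@MeltingIce', "g's", and '#BFR' intact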

I like tqdm as a progress bar on long-running tasks.

from tqdm import tqdm

And now we can initialize the tokenizer and tokenize the text. Note that if I had a larger dataset, I would stream this with a generator instead of building a list (there’s a sketch of that below, after the output). But this corpus is tiny, and the eager version lets tqdm show a total on the progress bar because it knows how many tweets there are.

from nltk.tokenize import TweetTokenizer
tokenizer = TweetTokenizer()
tokenized_data = [tokenizer.tokenize(tweet) for tweet in tqdm(elon_data.Tweet)]
print("\n".join([str(x) for x in tokenized_data[:5]]))
100%|██████████| 2693/2693 [00:00<00:00, 19885.74it/s]
['@MeltingIce', 'Assuming', 'max', 'acceleration', 'of', '2', 'to', '3', "g's", ',', 'but', 'in', 'a', 'comfortable', 'direction', '.', 'Will', 'feel', 'like', 'a', 'mild', 'to', 'moder', '?', 'https://t.co/fpjmEgrHfC']
['@bigajm', 'Yup', ':)']
['Part', '2', 'https://t.co/8Fvu57muhM']
['Fly', 'to', 'most', 'places', 'on', 'Earth', 'in', 'under', '30', 'mins', 'and', 'anywhere', 'in', 'under', '60', '.', 'Cost', 'per', 'seat', 'should', 'be', '?', 'https://t.co/dGYDdGttYd']
['BFR', 'will', 'take', 'you', 'anywhere', 'on', 'Earth', 'in', 'less', 'than', '60', 'mins', 'https://t.co/HWt9BZ1FI9']
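
As an aside, here’s a rough sketch of the generator version I mentioned. Since a generator has no length, tqdm needs to be told the total explicitly:

def tokenize_lazily(tweets, tokenizer):
    # lazy variant for a corpus too big to comfortably hold in memory
    for tweet in tweets:
        yield tokenizer.tokenize(tweet)

# a generator has no len(), so tell tqdm the total up front
for tokens in tqdm(tokenize_lazily(elon_data.Tweet, tokenizer), total=len(elon_data)):
    pass  # stream each token list into whatever the next step needs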

Removing Stopwords

NLTK’s stopword lists are language-dependent (duh), so we need to download the stopwords corpus and grab the English list. There’s a command-line option (noted below), but we can also just do it inline.

from nltk import download as nltk_download
nltk_download("stopwords") #alternatively use the command line tool
[nltk_data] Downloading package stopwords to /home/chris/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
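
For reference, the command-line route I mentioned uses NLTK’s downloader module; run it from a shell where the same Python environment is active:

python -m nltk.downloader stopwords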

And now we can remove the stopwords from our text. I know the naming here is ugly but I don’t know what to do about it.

from nltk.corpus import stopwords as nltk_stopwords
stopwords = set(nltk_stopwords.words('english'))
unstopped = [[t for t in tokenized_datum if t not in stopwords] for tokenized_datum in tqdm(tokenized_data)]
print("\n".join([str(x) for x in unstopped[:5]]))
100%|██████████| 2693/2693 [00:00<00:00, 209851.57it/s]
['@MeltingIce', 'Assuming', 'max', 'acceleration', '2', '3', "g's", ',', 'comfortable', 'direction', '.', 'Will', 'feel', 'like', 'mild', 'moder', '?', 'https://t.co/fpjmEgrHfC']
['@bigajm', 'Yup', ':)']
['Part', '2', 'https://t.co/8Fvu57muhM']
['Fly', 'places', 'Earth', '30', 'mins', 'anywhere', '60', '.', 'Cost', 'per', 'seat', '?', 'https://t.co/dGYDdGttYd']
['BFR', 'take', 'anywhere', 'Earth', 'less', '60', 'mins', 'https://t.co/HWt9BZ1FI9']

A quick spot check shows that the unstopped data really is different from the tokenized data, so the filter is doing its job. Also of note: stopword removal is faster than tokenization by an order of magnitude.


Counting Token Frequencies

I want to see what words Elon uses most often, so I need to count their frequencies. Python’s collections module has a Counter that will work perfectly for this. I’m also going to use nested for loops because I have already flexed my list comprehension muscles and I am NOT AFRAID TO BE VERBOSE AND EXPLICIT IN MY TIRELESS PURSUIT OF CODE CLARITY HECK YEAH

from collections import Counter
freqs = Counter()
for tweet in unstopped:
    for token in tweet:
        freqs[token] += 1
for entry in list(freqs.items())[:10]:
    print(entry)
('@MeltingIce', 1)
('Assuming', 1)
('max', 16)
('acceleration', 7)
('2', 45)
('3', 89)
("g's", 1)
(',', 1091)
('comfortable', 2)
('direction', 2)

Cool, but I want to see the top N things. It took me longer than I’d like to admit to figure out how to print the “first” 10 things out of that freqs dictionary, so it would be nice if there were some clever way to OH LOOK COUNTERS HAVE A WAY TO DO THIS AUTOMAGICALLY so we’re done

for item, count in freqs.most_common(20):
    print(item + "\t" + str(count))
.	2116
,	1091
?	348
Tesla	266
!	246
I	220
&	176
Model	164
(	162
)	154
like	113
S	104
...	101
:	100
good	99
rocket	94
Will	92
3	89
car	84
/	80

Huh. Maybe I should strip punctuation as well…
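
If I go that route, a minimal sketch might look like this, reusing the unstopped lists from above. string.punctuation only lists single ASCII characters, so I check whether a token consists entirely of them, which also catches things like '...':

import string

punct = set(string.punctuation)
# drop tokens made up entirely of punctuation characters
depunct = [[t for t in tweet if not all(c in punct for c in t)]
           for tweet in unstopped]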

NEXT: n-grams and graphs