Last time we did some basic data exploration of tweets from Elon Musk. This time we’ll do something slightly more powerful, and we’ll make some pretty(ish) graphs of the results along the way.

After removing stopwords (low-content common words like a, the, and, etc.):

from tqdm import tqdm
from nltk.corpus import stopwords as nltk_stopwords

# NLTK's English stopword list, as a set for fast membership tests
stopwords = set(nltk_stopwords.words('english'))
# tokenized_data is the list of tokenized tweets from last time
unstopped = [[t for t in tokenized_datum if t not in stopwords] for tokenized_datum in tqdm(tokenized_data)]
print("\n".join([str(x) for x in unstopped[:5]]))

Our data looked like this:

['@MeltingIce', 'Assuming', 'max', 'acceleration', '2', '3', "g's", ',', 'comfortable', 'direction', '.', 'Will', 'feel', 'like', 'mild', 'moder', '?', 'https://t.co/fpjmEgrHfC']
['@bigajm', 'Yup', ':)']
['Part', '2', 'https://t.co/8Fvu57muhM']
['Fly', 'places', 'Earth', '30', 'mins', 'anywhere', '60', '.', 'Cost', 'per', 'seat', '?', 'https://t.co/dGYDdGttYd']
['BFR', 'take', 'anywhere', 'Earth', 'less', '60', 'mins', 'https://t.co/HWt9BZ1FI9']
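
The top-20 list below is just a token frequency count over these de-stopworded tweets. The exact counting code carried over from last time isn’t shown here, so this is only a minimal sketch (word_freqs is an illustrative name, not from the original):

from collections import Counter

# count every token across all tweets and show the 20 most common
word_freqs = Counter(token for tweet in unstopped for token in tweet)
for word, count in word_freqs.most_common(20):
    print(word + "\t" + str(count))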

And our list of most common words looked like this:

.	2116
,	1091
?	348
Tesla	266
!	246
I	220
&	176
Model	164
(	162
)	154
like	113
S	104
...	101
:	100
good	99
rocket	94
Will	92
3	89
car	84
/	80

There are some things we expect there, like Tesla and rocket and other words that obviously relate to Elon’s areas of interest/ventures. But the punctuation is pretty useless as far as I’m concerned, so let’s deal with that.

Python has a string of punctuation characters in the string module (string.punctuation), so let’s just use that instead of reinventing the wheel.

import string

# drop tokens that are bare punctuation marks (note that '...' and ':)' survive the substring check)
unstopped_and_depunct = [[t for t in unstopped_datum if t not in string.punctuation] for unstopped_datum in tqdm(unstopped)]
print("\n".join([str(x) for x in unstopped_and_depunct[:5]]))
['@MeltingIce', 'Assuming', 'max', 'acceleration', '2', '3', "g's", 'comfortable', 'direction', 'Will', 'feel', 'like', 'mild', 'moder', 'https://t.co/fpjmEgrHfC']
['@bigajm', 'Yup', ':)']
['Part', '2', 'https://t.co/8Fvu57muhM']
['Fly', 'places', 'Earth', '30', 'mins', 'anywhere', '60', 'Cost', 'per', 'seat', 'https://t.co/dGYDdGttYd']
['BFR', 'take', 'anywhere', 'Earth', 'less', '60', 'mins', 'https://t.co/HWt9BZ1FI9']

Okay, our data now just has words (and numbers, which are just as important, as we’ll see later). So now our list of most common words looks like this:

Tesla	266
I	220
Model	164
like	113
S	104
...	101
good	99
rocket	94
Will	92
3	89
car	84
Falcon	80
The	80
launch	79
We	78
next	75
:)	70
Just	67
time	67
would	66

That looks way better. I’m more interested in the content words (nouns, verbs) than the stopwords or the punctuation. Maybe later I’ll add punctuation back into the mix because it has some pragmatic value to indicate the “tone” of the post or something like that, but right now I just want to know what he’s posting about.

One other thing I notice is that something like “S” is a very common word by itself, probably because Elon is talking about the Tesla Model S. To check that, I’m going to run the same analysis again, but this time using n-grams instead of just single words.

NLTK has an ngrams module, so let’s use that. Yes, it’s mostly trivial to get these ourselves, but I always try to use things other people have built whenever possible.

from collections import Counter
from nltk.util import ngrams

ns = [1,2,3]
nGramFreqs = {}
for n in ns:
    nGramFreqs[n] = Counter()

This is just initializing our ngram counters and defining which values of N we are interested in. In my experience, trigrams are usually the best balance of utility and calculation complexity. 4-grams are rarely worth the extra computation they take, and 5-grams are just silly. (please contradict me - I love being wrong)

for tweet in unstopped_and_depunct:
    for n in ns:
        grams = ngrams(tweet, n)
        for gram in grams:
            nGramFreqs[n][gram] += 1

Above is where we actually count our n-gram frequencies.

And finally let’s see what we’ve got.

for n in ns:
    print("Most common %d-grams:" % n)
    for entry, count in nGramFreqs[n].most_common(20):
        print(" ".join(entry) + "\t" + str(count))
Most common 1-grams:
Tesla	266
I	220
Model	164
like	113
S	104
...	101
good	99
rocket	94
Will	92
3	89
car	84
Falcon	80
The	80
launch	79
We	78
next	75
:)	70
Just	67
time	67
would	66

Most common 2-grams:
Model S	83
Model 3	53
Falcon 9	41
Model X	21
next week	20
I love	18
Cape Canaveral	18
I think	18
Tesla Model	18
upper stage	17
Falcon Heavy	16
coming soon	15
Space Station	14
Looks like	14
climate change	14
rocket landing	12
next year	12
much appreciated	10
next month	10
I like	10

Most common 3-grams:
Tesla Model S	16
The Boring Company	8
Falcon 9 rocket	5
Model 3 unveil	5
Just Read Instructions	4
Of Course I	4
Course I Still	4
I Still Love	4
Still Love You	4
Model 3 order	4
Falcon 9 R	4
Need make sure	3
high velocity missions	3
I used one	3
Model X review	3
Vandenberg Air Force	3
Tesla SolarCity merger	3
Will post video	3
Would also like	3
paid 55 hr	3

As I hoped, bigrams and trigrams give us a lot more insight into the specific entities that show up in Elon’s posts. We have Model S and Model 3, as well as The Boring Company and Vandenberg Air Force (one of the rare instances where 4-grams would be more informative - Vandenberg Air Force Base is the location in California where some of the SpaceX launches happen).

Okay okay - let’s look at 4-grams too.
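
We only set up counters for n = 1, 2, 3 above, so 4-grams need to be counted before we can print them. The original code for this step isn’t shown, so here’s a sketch that reuses the same counters dict and loop:

# add 4 to our list of n values and count (and print) 4-grams
ns.append(4)
nGramFreqs[4] = Counter()
for tweet in unstopped_and_depunct:
    for gram in ngrams(tweet, 4):
        nGramFreqs[4][gram] += 1

print("Most common 4-grams:")
for entry, count in nGramFreqs[4].most_common(20):
    print(" ".join(entry) + "\t" + str(count))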

Most common 4-grams:
Of Course I Still	4
Course I Still Love	4
I Still Love You	4
Interplanetary Spaceship Rocket design	2
Tesla Semi truck unveil	2
@beboutside @Hyperloop @SpaceX @paradigm_hype	2
@Hyperloop @SpaceX @paradigm_hype @swissloop	2
Falcon 9 rocket booster	2
Falcon 9 boost stage	2
First draft animation Falcon	2
draft animation Falcon Heavy	2
animation Falcon Heavy three	2
Falcon Heavy three core	2
Heavy three core launch	2
three core launch FH	2
core launch FH twice	2
launch FH twice thrust	2
FH twice thrust next	2
Our love never die	2
cars built since Oct	2

Wait, what happened to Vandenberg? This is kind of weird. Looking through this, I realized I forgot the cardinal rule of Twitter data: always lowercase everything. Elon is better than most about using proper capitalization and punctuation, but usually in topic modeling we don’t care about capital letters. Let’s see what this looks like with everything lowercased.

unstopped_and_depunct_and_lower = [[t.lower() for t in unstopped_and_depunct_datum] for unstopped_and_depunct_datum in tqdm(unstopped_and_depunct)]
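
The recount over the lowercased tweets isn’t shown in the original either; a minimal sketch, assuming we simply reset the counters and rerun the same counting loop:

# reset the counters and recount n-grams over the lowercased tweets
for n in ns:
    nGramFreqs[n] = Counter()
for tweet in unstopped_and_depunct_and_lower:
    for n in ns:
        for gram in ngrams(tweet, n):
            nGramFreqs[n][gram] += 1

print("Most common 4-grams:")
for entry, count in nGramFreqs[4].most_common(20):
    print(" ".join(entry) + "\t" + str(count))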
Most common 4-grams:
of course i still	4
course i still love	4
i still love you	4
droneship of course i	3
johnny cash hover slam	3
interplanetary spaceship rocket design	2
tesla semi truck unveil	2
@beboutside @hyperloop @spacex @paradigm_hype	2
@hyperloop @spacex @paradigm_hype @swissloop	2
falcon 9 rocket booster	2
falcon 9 boost stage	2
droneship just read instructions	2
first draft animation falcon	2
draft animation falcon heavy	2
animation falcon heavy three	2
falcon heavy three core	2
heavy three core launch	2
three core launch fh	2
core launch fh twice	2
launch fh twice thrust	2

Fine, whatever. I didn’t want to know about vandenberg anyway.


Okay, now that we’ve got all of our n-gram data, let’s plot it. I’m using bar graphs because I am boring, but also because bar graphs are one of like 4 actually useful graph types in the universe (the other three being scatter, pie, and line).

I’m very sensitive to colors in graphics, something I think a lot of people overlook. There’s a site I particularly like called ColorBrewer, created by Cynthia Brewer. She’s done a lot of research on color perception, color blindness, and other things that are easy to overlook when making a good visualization. I’m picking one of the palettes from colorbrewer2 that I think looks nice.

import palettable
from palettable.colorbrewer.qualitative import Set3_12 as palette

And now to plot each of our n-gram top lists:

import pandas as pd

# one bar chart of the top 20 entries for each value of n
for n in ns:
    data = pd.DataFrame([{"entry": " ".join(entry), "count": count} for entry, count in nGramFreqs[n].most_common(20)])
    data.plot(kind="bar", x="entry", figsize=(15,10), fontsize=20, color=[palette.mpl_colors,])

Note that bar plots are currently slightly broken in pandas, in that the color parameter doesn’t behave properly. To get around that, we wrap the color palette in a seemingly useless outer list. Thanks, GitHub issue tracker!
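
One usage note: these DataFrame.plot calls create matplotlib figures under the hood. In a Jupyter notebook they render inline, but if you’re running this as a plain script you’d need something like the following (assuming matplotlib is the plotting backend, which it is by default):

import matplotlib.pyplot as plt

plt.show()  # pop up the figure windows when running outside a notebook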

[Bar charts of the top 20 1-grams, 2-grams, 3-grams, and 4-grams]

NEXT: Topic Modeling