Today I’d like to do something slightly more interesting with Elon’s tweets. I know his main endeavors are Tesla and SpaceX, so I thought it would be neat to try and find a way to separate out the tweets about each company.

The first thing I’m going to try is topic modeling via Latent Dirichlet Allocation, a generative model that assumes each document is created by stochastically choosing a mixture of topics, then, for each word, stochastically picking a topic from that mixture and a word (or a phrase, I guess, but mostly I’ve seen words) from that topic.

Based on this assumption, which I hope is not actually correct, the algorithm induces these generative probabilities from the documents given to it. The linguist in me finds probabilistic models distasteful, but the computational linguist in me can’t really argue with the results or the relative computational simplicity.
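
Just to make that generative story concrete, here’s a toy sketch of how LDA imagines a document getting written. The topics and probabilities below are completely made up by me for illustration - nothing here comes from the actual tweet data.

import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical topics, each a probability distribution over words.
topics = {
    0: {"rocket": 0.5, "launch": 0.3, "falcon": 0.2},   # a "SpaceX-ish" topic
    1: {"car": 0.5, "battery": 0.3, "model": 0.2},      # a "Tesla-ish" topic
}

def generate_document(num_words, alpha=(0.5, 0.5)):
    # 1. Draw this document's mixture over topics from a Dirichlet prior.
    topic_mixture = rng.dirichlet(alpha)
    words = []
    for _ in range(num_words):
        # 2. For each word slot, pick a topic according to the mixture...
        topic = rng.choice(len(topics), p=topic_mixture)
        # 3. ...then pick a word from that topic's word distribution.
        vocab = list(topics[topic])
        probs = list(topics[topic].values())
        words.append(rng.choice(vocab, p=probs))
    return words

print(generate_document(8))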

The same data ingestion we’ve already seen

import pandas as pd
df = pd.read_csv("./data_elonmusk.csv", encoding="latin1")
print(df.shape)
df.head()
row ID	Tweet	Time	Retweet from	User
0	Row0	@MeltingIce Assuming max acceleration of 2 to ...	2017-09-29 17:39:19	NaN	elonmusk
1	Row1	RT @SpaceX: BFR is capable of transporting sat...	2017-09-29 10:44:54	SpaceX	elonmusk
2	Row2	@bigajm Yup :)	2017-09-29 10:39:57	NaN	elonmusk
3	Row3	Part 2 https://t.co/8Fvu57muhM	2017-09-29 09:56:12	NaN	elonmusk
4	Row4	Fly to most places on Earth in under 30 mins a...	2017-09-29 09:19:21	NaN	elonmusk

Tokenizing and removing stopwords

import spacy
from tqdm import tqdm_notebook as tqdm

nlp = None
try:
    nlp = spacy.load("en", disable=["ner", "parser", "tagger"])
except OSError:
    import sys
    !{sys.executable} -m spacy download en
    nlp = spacy.load("en", disable=["ner", "parser", "tagger"])

def preprocess(text):
    result = []
    for token in nlp(text):
        if token.is_stop:            # drop stopwords
            continue
        if len(token.text) < 3:      # drop very short tokens ("rt", stray punctuation, etc.)
            continue
        result.append(token.lower_)  # keep the lowercased form
    return result

preprocessed_documents = [preprocess(tweet) for tweet in tqdm(df.Tweet)]
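
As a quick sanity check, we can run a made-up sentence through preprocess. The exact output depends on spaCy’s tokenizer and stopword list, but it should look something like this:

print(preprocess("The Falcon rocket landed on the drone ship"))
# expect something like: ['falcon', 'rocket', 'landed', 'drone', 'ship']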

Configuring Gensim to play nicer with Jupyter

Gensim reports a lot of useful information through Python’s logging module rather than printing it to stdout, so by default you won’t see any of it in a Jupyter notebook. So first we’ll fix that.

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Building the model piece by piece in Gensim

Gensim first wants to build a dictionary of the words available in the corpus, so it can use indices into this dictionary instead of manipulating word forms directly.

from gensim import corpora
dictionary = corpora.Dictionary(preprocessed_documents)
print(dictionary)
Dictionary(8989 unique tokens: ['@meltingice', 'acceleration', 'assuming', 'comfortable', 'direction']...)

9000 unique tokens is not a lot - the relatively small number of tweets will cause us some problems later on. But we can continue in blissful ignorance.
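
If you want to peek at the mapping itself, the Dictionary object exposes token2id (word to index) and dfs (index to document frequency):

# Look up the integer id Gensim assigned to a couple of tokens,
# and how many tweets each one appears in.
for token in ["tesla", "rocket"]:
    if token in dictionary.token2id:
        token_id = dictionary.token2id[token]
        print(token, token_id, dictionary.dfs[token_id])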

Converting preprocessed documents into a corpus

The next step is to transform the preprocessed tweets into a corpus of these dictionary indices.

corpus = [dictionary.doc2bow(document) for document in preprocessed_documents]
print(corpus[:3])
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1)], [(12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1)], [(24, 1), (25, 1)]]

The critical piece here is the function doc2bow, which turns our documents into bag-of-words representations. We are already making Important Decisions about what we think is important - namely, that we will be ignoring word order. Each tuple in this corpus is a dictionary index paired with the count of that word in the document.
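
To double-check that nothing went sideways, we can translate a bag-of-words vector back into readable tokens - indexing the dictionary with an id gives back the word:

# Turn the first tweet's (id, count) pairs back into (word, count) pairs.
print([(dictionary[token_id], count) for token_id, count in corpus[0]])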

TF-IDF Weights

Term Frequency - Inverse Document Frequency is a common metric used to weight the importance of a word. The Term Frequency is the number of times a word appears in the document in question, and the Document Frequency is the number of documents in the corpus in which the word appears; the TF-IDF weight scales the term frequency by the (log of the) inverse of the document frequency.

This makes common words much less “important” as they will appear in many documents. Conversely, a word that appears in only a few documents but appears very frequently in the one you’re looking at must be very important to that particular document. If our corpus were large enough, this weighting may obviate the need to do stopword removal up front.
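
As a rough sketch of the arithmetic (Gensim’s default IDF is log base 2 of total documents over document frequency, and it also normalizes each document vector, so this is illustrative rather than exactly what TfidfModel computes):

from math import log2

def tfidf_weight(term_freq, doc_freq, num_docs):
    # Raw term frequency, scaled up the rarer the word is across the corpus.
    return term_freq * log2(num_docs / doc_freq)

# A word used twice in a tweet but found in only 3 of 3218 tweets gets a large weight...
print(tfidf_weight(2, 3, 3218))
# ...while a word used twice but found in half of all tweets gets a small one.
print(tfidf_weight(2, 1609, 3218))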

from gensim import models
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
INFO : calculating IDF weights for 3218 documents and 8988 features (27471 matrix non-zeros)

LDA At Last

One of the things I find really annoying about LDA and other topic modeling algorithms is you have to tell it how many topics to discover. In this case it’s sort of fine because I only really want to differentiate between Tesla and SpaceX, but in the Real World I probably don’t know ahead of time what the appropriate number is, which makes it a hyperparameter I need to tune somehow.
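
If I did need to tune it, one common approach is to fit a model for each candidate number of topics and compare their topic coherence scores. Here’s a rough sketch using Gensim’s CoherenceModel - I haven’t validated these settings on this data, so take it as an outline:

from gensim.models import CoherenceModel

# Fit one model per candidate topic count and score each on c_v coherence.
for k in [2, 3, 5, 10]:
    candidate = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=k)
    coherence = CoherenceModel(model=candidate, texts=preprocessed_documents,
                               dictionary=dictionary, coherence="c_v").get_coherence()
    print(k, coherence)

For now, though, two topics is what I actually want, so let’s just fit that: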

lda = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=2)
corpus_lda = lda[corpus_tfidf]
lda.print_topics()
[(0,
  '0.003*"tesla" + 0.002*"thanks" + 0.002*"good" + 0.002*"model" + 0.002*"@teslamotors" + 0.002*"..." + 0.002*"test" + 0.002*"rocket" + 0.002*"time" + 0.001*"yes"'),
 (1,
  '0.003*"@spacex" + 0.003*"launch" + 0.002*"model" + 0.002*"n\'t" + 0.002*"tesla" + 0.002*"falcon" + 0.002*"rocket" + 0.002*"dragon" + 0.002*"good" + 0.002*"like"')]

Well there’s our topic model. I don’t think it did a very good job - it’s hard to tell at a glance what each of these topics is about, and there’s some crossover between Tesla and SpaceX.

Gensim did very helpfully make some suggestions, though:

WARNING : too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy

Yeah, we don’t have enough training data to get the model to converge out of the box. Let’s follow Gensim’s advice and increase the number of passes over the data:

lda = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=2, passes=20)
corpus_lda = lda[corpus_tfidf]
lda.print_topics()
[(0,
  '0.004*"yes" + 0.003*"@spacex" + 0.003*"model" + 0.003*"launch" + 0.003*"dragon" + 0.002*"falcon" + 0.002*"@teslamotors" + 0.002*"thanks" + 0.002*"tesla" + 0.002*"landing"'),
 (1,
  '0.003*"tesla" + 0.003*"good" + 0.003*"n\'t" + 0.002*"like" + 0.002*"will" + 0.002*"rocket" + 0.002*"..." + 0.002*"@elonmusk" + 0.001*"right" + 0.001*"car"')]

I’m not sure that’s really better. We at least got “car” in there together with “tesla”, but we also got “rocket.” Mr. Musk very unhelpfully decided to launch a Tesla car into space on a SpaceX rocket, which pretty well muddies the waters for this topic disambiguation task. I think we’ll need more data, or to spend a little bit more time thinking about our model.