Topic modeling is an unsupervised approach for learning the structure of a text corpus. You can think of it as a ‘soft clustering’. While generally associated with text, topic modeling is really just a specific (often interpretable) type of dimensionality reduction.
Let’s start with a very simple example to illustrate the basic idea of topic modeling. Suppose our corpus consists of the following six documents. We will build a topic model step-by-step.
txt <- list(
A = "I love the rain, I love the sun, I love the snow. I can find the joy in any type of weather.",
B = "What is the weather going to be tomorrow? Is it going to rain?",
C = "I'm tired of rain, when will we have some sun?",
D = "Who won the game yesterday? I heard Lebron played the game of his life.",
E = "I can't wait to play baseball this afternoon. I hope we win our game.",
F = "My basketball game is at noon tomorrow. I'm going to play point guard for this one."
)
The tm package is a standard package for text mining in R. We use it to create a corpus object, which simplifies the standard text-processing pipeline:
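One possible pipeline looks like this (a sketch; the exact cleaning steps are a choice, not the only correct ones):
library(tm)

corpus <- VCorpus(VectorSource(unlist(txt)))
corpus <- tm_map(corpus, content_transformer(tolower))        # lowercase everything
corpus <- tm_map(corpus, removePunctuation)                   # drop punctuation
corpus <- tm_map(corpus, removeNumbers)                       # drop digits
corpus <- tm_map(corpus, removeWords, stopwords('english'))   # drop common stopwords
corpus <- tm_map(corpus, stripWhitespace)                     # collapse extra whitespace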
Once we have finished our processing, we are ready to output a document-term matrix. There are a ton of options here; for now, our matrix will include raw counts of terms that occur in at least two documents.
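That restriction can be expressed through the control list (a sketch, assuming the corpus object built above):
dtm <- DocumentTermMatrix(corpus,
                          control = list(bounds = list(global = c(2, Inf))))
inspect(dtm)  # raw counts; only terms appearing in at least two documents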
For this model, we will use a technique called singular value decomposition (SVD). SVD is a common and extremely well-studied technique for working with large matrices. When SVD is used for topic modeling, it is often called Latent Semantic Indexing (LSI).
SVD is one of the most general and fundamental forms of matrix factorization. Every matrix has a decomposition of the following form:
\(\underset{m \times n}M = \underset{m \times r}U \times \underset{r \times r}\Sigma \times \underset{r \times n}V^T\)
where r is the rank of M, U and V have orthonormal columns, and Σ is a diagonal matrix of non-negative singular values, sorted in decreasing order.
We can think of these matrices as follows: each row of U gives a document's loadings on the latent topics, each diagonal entry of Σ gives the overall importance (power) of a topic, and each row of V gives a term's loadings on the topics.
Let’s see how it works.
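A sketch of the computation on the toy corpus (assuming the dtm built above):
X <- as.matrix(dtm)   # densify the small document-term matrix
svd_fit <- svd(X)

# Share of the total power (squared singular values) carried by each component
power <- svd_fit$d^2 / sum(svd_fit$d^2)
round(cumsum(power), 2)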
Two components account for about 85% of the power in the term-document matrix
Reconstruction
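We can reconstruct an approximation of the original matrix from just the leading components. A minimal sketch using the first two:
k <- 2
X_hat <- svd_fit$u[, 1:k] %*% diag(svd_fit$d[1:k]) %*% t(svd_fit$v[, 1:k])
round(X_hat, 1)  # compare against the original counts in X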
Now that we have a sense for how topic modeling works, let’s try to apply it to a more realistic dataset.
articles <- read.csv('times_articles.csv', stringsAsFactors = F)
How many articles are in this dataset?
How many articles are there from each section?
Topic modeling is an unsupervised approach to learning. We won’t use the section names in the modeling process, but we will use them to inspect the results and validate our intuition about what the topic model is learning.
Use tm to load the article contents into a corpus and do any necessary processing.
# Load a corpus object
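One possibility (a sketch; it assumes the article text lives in a column named content, which may need to be adjusted to the actual column name):
corpus <- VCorpus(VectorSource(articles$content))   # 'content' is an assumed column name
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords('english'))
corpus <- tm_map(corpus, stripWhitespace)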
Add a weighting item to the control list in DocumentTermMatrix().
# Doc-term matrix. Aim for about 1000 columns
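One possibility (a sketch; the lower document-frequency bound is a knob you would tune to land near 1000 terms):
dtm <- DocumentTermMatrix(corpus,
                          control = list(weighting = weightTfIdf,
                                         bounds = list(global = c(10, Inf))))
dim(dtm)  # aim for roughly 1000 columns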
# Remove documents that don't contain any of these words. (from the doc-term matrix and the original dataset)
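A sketch of the filtering step (slam is installed alongside tm):
library(slam)
keep <- row_sums(dtm) > 0     # documents with at least one remaining term
dtm <- dtm[keep, ]
articles <- articles[keep, ]  # keep the data frame aligned with the matrix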
Now, let’s perform the singular-value decomposition. In order to save computation time, we can pass the target rank as an argument to svd.
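In base R, svd() exposes this through the nu and nv arguments, which limit how many left and right singular vectors are returned (all singular values are still computed). A sketch, where the rank of 250 is an assumption:
X <- as.matrix(dtm)
k <- 250                           # target rank; adjust as needed
svd_fit <- svd(X, nu = k, nv = k)

# Cumulative share of variance explained by the leading components
variance <- svd_fit$d^2 / sum(svd_fit$d^2)
plot(cumsum(variance), type = 'l', xlab = 'Components', ylab = 'Cumulative variance')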
It looks like we would need about 250 latent topics to explain 90% of the variance in the data. This makes sense: you would expect New York Times articles to be fairly high dimensional. Luckily, the components are ordered by importance, so we can take a look at the first few:
Write a function that takes as input the term loading matrix, and the list of unique terms, and returns a data frame with three columns:
topic_words <- function(term_mat, terms){
# Your code goes here
}
# term_mat <- t(svd_fit$v)
# terms <- Terms(dtm)
# topic_words(term_mat, terms) %>%
# pander(justify = 'left')
Write a function that takes as input the document loading matrix, and the list of article titles, and returns a data frame with three columns:
topic_articles <- function(doc_mat, headlines){
# Put code here
}
# topic_articles(svd_fit$u, articles$headline) %>%
# pander(split.table=Inf, justify='left')
Distribution of each topic over the whole dataset
Scatterplots:
Topic Distributions:
Histogram of number of zero loadings per document:
NMF produces uni-directional topics. This means we don’t have to worry about positively/negatively associated words and articles. Write a function that outputs a data frame with three columns:
topic_descriptions <- function(term_mat, doc_mat, terms, headlines){
# Your code goes here
}
# topics_df <- topic_descriptions(H, W, Terms(dtm), articles$headline)
# pander(topics_df, split.table=Inf)
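The commented-out call above assumes a term matrix H (topics by terms) and a document matrix W (documents by topics) from an NMF fit. A minimal sketch of how they might be produced, assuming the NMF package and 9 topics (both assumptions):
library(NMF)

set.seed(1)
nmf_fit <- nmf(as.matrix(dtm), rank = 9)
W <- basis(nmf_fit)   # documents x topics
H <- coef(nmf_fit)    # topics x terms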
Start by building the following data frame:
Now pick five articles at random and make a visualization to see the loading, across all 9 topics, of each of the articles.
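One way to do this (a sketch, assuming W from the NMF fit above and the ggplot2/tidyr/dplyr packages):
library(ggplot2)
library(tidyr)
library(dplyr)

set.seed(2)
idx <- sample(nrow(W), 5)

loadings <- as.data.frame(W[idx, ])
colnames(loadings) <- paste0('topic_', seq_len(ncol(loadings)))
loadings$headline <- articles$headline[idx]

loadings %>%
  pivot_longer(-headline, names_to = 'topic', values_to = 'loading') %>%
  ggplot(aes(x = topic, y = loading)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ headline, ncol = 1)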
Now that we have seen a handful of topic models, a common question is: which one is best? Often this is answered simply by looking at the extracted topics and judging whether they are satisfying. Sometimes topic modeling is the first step in a larger data modeling operation; in that case, we grade topic models based on how well they get the job done.
For this exercise, the setup is that we have a large corpus of documents and only a handful of them are labeled. For now, let’s stick with a binary label: we will predict whether an article belongs in the ‘World’ section of the NY Times.
We will assume that at test time we can use a topic model that was built on the full corpus (both train and test documents).
Don’t segment the data yet; just pick a set of indices that corresponds to ‘train’ and one that corresponds to ‘test’.
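For example (a sketch; the 70/30 split is arbitrary):
set.seed(3)
n <- nrow(articles)
train_idx <- sample(n, floor(0.7 * n))
test_idx <- setdiff(seq_len(n), train_idx)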
Using all the articles, build a topic model with each of the three methodologies above.
Now construct some data frames you will use for modeling. Each data frame should have one column indicating whether or not that article appeared in the ‘World’ section of the NY Times. The remaining columns should correspond to topics, and they should be named according to the top 3 terms in that topic.
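A sketch of what one of these data frames might look like; doc_loadings (a documents-by-topics matrix from one of the models), topic_names (the pasted top-3 terms per topic), and the section column name are all assumptions:
model_df <- as.data.frame(doc_loadings)
colnames(model_df) <- topic_names                              # e.g. 'game_play_win'
model_df$is_world <- as.integer(articles$section == 'World')   # binary label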
Build a logistic regression model for each of your modeling data frames. Be sure to use the Train/Test split that you defined above.
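A sketch of fitting and scoring one of these models with the split defined above:
fit <- glm(is_world ~ ., data = model_df[train_idx, ], family = binomial)
preds <- predict(fit, newdata = model_df[test_idx, ], type = 'response')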
For each model, you may want to try the following process:
Use the following function to help you evaluate your models. Which model do you like the best? There is no ‘right’ answer to this question, different models may be appealing for different reasons.
# Requires the ROCR package
library(ROCR)

roc <- function(predicted, actual, key='None'){
# Prediction object
pred <- prediction(predicted, actual)
# ROC Curve
perf <- performance(pred, measure = 'tpr', x.measure = 'fpr')
roc <- data.frame(perf@alpha.values,
perf@x.values,
perf@y.values)
colnames(roc) <- c('Threshold', 'FPR', 'TPR')
roc$key <- key
# Area under the curve
perf <- performance(pred, measure = 'auc')
auc <- perf@y.values
list(roc=roc, auc=auc)
}
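For example, using the predictions from the logistic regression sketch above:
res <- roc(preds, model_df$is_world[test_idx], key = 'LSI')
res$auc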
In the example above, we made an unrealistic assumption that you can build the topic model using both train and test sets. In reality, one probably needs to build the topic model, set it aside, and be prepared to apply it to new data.
A proper train/test methodology applied to topic models:
1. Fit the topic model using only the training documents.
2. Project the test documents into the learned topic space.
3. Train the classifier on the training-set loadings.
4. Evaluate the classifier on the projected test-set loadings.
Use this methodology to evaluate the performance of each of the three topic models.
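For LSI, for example, a held-out document can be ‘folded in’ to the topic space learned on the training set. A sketch, where svd_train (an SVD fit on the training rows of the doc-term matrix) and the rank k are assumptions:
X_test <- as.matrix(dtm[test_idx, ])

# Fold-in: loadings = X_test %*% V %*% diag(1 / d), using V and d from the training fit
test_loadings <- X_test %*% svd_train$v %*% diag(1 / svd_train$d[1:k])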