Topic modeling is an unsupervised approach for learning the structure of a text corpus. You can think of it as a ‘soft clustering’. While generally associated with text, topic modeling is really just a specific (often interpretable) type of dimensionality reduction.

A Contrived Example

Let’s start with a very simple example to illustrate the basic idea of topic modeling. Suppose our corpus consists of the following six documents. We will build a topic model step-by-step.

txt <- list(
  A = "I love the rain, I love the sun, I love the snow. I can find the joy in any type of weather.",
  B = "What is the weather going to be tomorrow? Is it going to rain?",
  C = "I'm tired of rain, when will we have some sun?",
  D = "Who won the game yesterday? I heard Lebron played the game of his life.",
  E = "I can't wait to play baseball this afternoon. I hope we win our game.",
  F = "My basketball game is at noon tomorrow. I'm going to play point guard for this one."
)

1. Load the corpus using the tm package

tm is a standard package for text mining in R. We use it to create a corpus object, which simplifies the standard text processing pipeline (sketched in code after the list):

  • remove whitespace
  • remove punctuation
  • convert to lowercase
  • remove stopwords
  • stemming
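
A minimal sketch of this pipeline, assuming the toy corpus above (stemDocument relies on the SnowballC package):

library(tm)
library(SnowballC)

# Build a corpus from the named list of documents
corpus <- VCorpus(VectorSource(unlist(txt)))

# Apply the standard cleaning steps listed above
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords('english'))
corpus <- tm_map(corpus, stemDocument)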

2. Convert to a document-term matrix

Once we have finished our processing, we are ready to output a document-term matrix. There are many options here; for now, our matrix will include raw counts of terms that occur in at least two documents.
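
A sketch, continuing from the corpus above; the bounds option enforces the two-document minimum:

# Raw counts, keeping only terms that appear in at least two documents
dtm <- DocumentTermMatrix(corpus,
                          control = list(bounds = list(global = c(2, Inf))))
M <- as.matrix(dtm)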

3. Singular Value Decomposition

For this model, we will use a technique called singular value decomposition (SVD). SVD is a common, extremely well-studied technique for computing with large matrices. When SVD is used for topic modeling, it is often called Latent Semantic Indexing (LSI).

SVD is one of the most general and fundamental forms of matrix factorization. Every matrix can be decomposed into the following form:

\(\underset{m \times n}{M} = \underset{m \times r}{U} \times \underset{r \times r}{\Sigma} \times \underset{r \times n}{V^T}\)

where

  • \(U\) is column orthogonal: \(U^T U = I\)
  • \(V\) is column orthogonal: \(V^T V = I\)
  • \(\Sigma\) is a diagonal matrix of positive singular values, sorted in decreasing order.

We can think of these matrices as follows:

  • \(U\) maps each document to latent topic space
  • \(\Sigma\) gives the relative strength of each topic in the data
  • \(V^T\) maps from latent topic space to term space.

Let’s see how it works. Code sketches for each step follow the outline below.

a. Inspect the elements
b. Put it back together
c. Power analysis
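
A sketch of steps a, b, and c, assuming the matrix M built above:

svd_fit <- svd(M)

# a. Inspect the elements: document loadings, singular values, term loadings
dim(svd_fit$u)  # documents x components
svd_fit$d       # singular values, in decreasing order
dim(svd_fit$v)  # terms x components

# b. Put it back together: U %*% Sigma %*% t(V) recovers M
M_hat <- svd_fit$u %*% diag(svd_fit$d) %*% t(svd_fit$v)
max(abs(M - M_hat))  # should be numerically zero

# c. Power analysis: squared singular values measure each component's share
power <- svd_fit$d^2 / sum(svd_fit$d^2)
cumsum(power)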

Two components account for about 85% of the power in the document-term matrix.

d. Reduced-rank reconstruction

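A sketch of a rank-2 reconstruction, keeping only the first two components:

k <- 2
M_2 <- svd_fit$u[, 1:k] %*% diag(svd_fit$d[1:k]) %*% t(svd_fit$v[, 1:k])
round(M_2, 2)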

4. Observations

  • SVD can be used to extract two ‘topics’ from the corpus
  • remaining topics are difficult to interpret

New York Times

Now that we have a sense for how topic modeling works, let’s try to apply it to a more realistic dataset.

articles <- read.csv('times_articles.csv', stringsAsFactors = F)

Topic modeling is an unsupervised approach to learning. We won’t use the section names in the modeling process, but we will use them to inspect the results and validate our intuition about what the topic model is learning.

Load the corpus object

Use tm to load the article contents into a corpus and do any necessary processing.

# Load a corpus object
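
One possible version, assuming the article text lives in a column named content (hypothetical; adjust to match the actual CSV):

# Hypothetical column name 'content'
corpus <- VCorpus(VectorSource(articles$content))

# Same cleaning pipeline as the toy example
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords('english'))
corpus <- tm_map(corpus, stemDocument)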

Output a document-term matrix

  • Only include terms that occur in at least 20 documents, but no more than 100 documents.
  • Instead of using raw term counts, use TFIDF weighting for the columns. (Hint: Add a weighting item to the control list in DocumentTermMatrix.)
  • It’s possible (however improbable) that some of the documents in the corpus will not contain any of the terms chosen above. These documents should be removed from both the corpus and the document-term matrix.
  • How many documents are you left with? How many terms?
# Doc-term matrix. Aim for about 1000 columns


# Remove documents that don't contain any of these words. (from the doc-term matrix and the original dataset)
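
A sketch of both steps; row_sums comes from the slam package, which tm already depends on:

# TFIDF-weighted document-term matrix; terms must appear in 20-100 documents
dtm <- DocumentTermMatrix(corpus,
                          control = list(weighting = weightTfIdf,
                                         bounds = list(global = c(20, 100))))

# Drop empty documents from both the matrix and the original dataset
keep <- slam::row_sums(dtm) > 0
dtm <- dtm[keep, ]
articles <- articles[keep, ]
dim(dtm)  # documents x terms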

SVD

Now, let’s perform the singular value decomposition. To save computation time, we can limit the number of singular vectors returned by passing the target rank to svd via its nu and nv arguments.
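
For example (50 components is an arbitrary choice; svd still computes the full vector of singular values in svd_fit$d):

# nu/nv limit how many singular vectors are returned
svd_fit <- svd(as.matrix(dtm), nu = 50, nv = 50)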

Exercises:

1) Plot the power as a function of the number of components. How many dimensions are required to explain 90% of the variance in the data?
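
A sketch of the plot, using ggplot2; the dashed line marks the 90% target:

library(ggplot2)

# Cumulative share of power (squared singular values)
power_df <- data.frame(component = seq_along(svd_fit$d),
                       cumulative = cumsum(svd_fit$d^2) / sum(svd_fit$d^2))

ggplot(power_df, aes(component, cumulative)) +
  geom_line() +
  geom_hline(yintercept = 0.9, linetype = 'dashed')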

It looks like we would need about 250 latent topics to explain 90% of the variance in the data. This makes sense: you would expect New York Times articles to be fairly high dimensional. Luckily, the components are ordered by importance, so we can take a look at the first few:

2) Make some scatterplots of one topic versus another. Color each point by the section of the newspaper it originally belonged in. Try X1 vs X2 and X2 vs X3. What do you notice?
3) List the words associated with each topic (both positive and negative)

Write a function that takes as input the term loading matrix and the list of unique terms, and returns a data frame with three columns:

  1. The topic index
  2. The five words that are most positively associated with that topic
  3. The five words that are most negatively associated with that topic
topic_words <- function(term_mat, terms){
  # Your code goes here

}

# term_mat <- t(svd_fit$v)
# terms <- Terms(dtm)
# topic_words(term_mat, terms) %>% 
#   pander(justify = 'left')
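
One possible solution, sketched under the assumption (matching the commented call above) that term_mat has one row per topic:

topic_words <- function(term_mat, terms){
  do.call(rbind, lapply(seq_len(nrow(term_mat)), function(i){
    ord <- order(term_mat[i, ], decreasing = TRUE)
    data.frame(topic = i,
               positive = paste(terms[head(ord, 5)], collapse = ', '),
               negative = paste(terms[tail(ord, 5)], collapse = ', '),
               stringsAsFactors = FALSE)
  }))
}
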
4) List the articles associated with each topic

Write a function that takes as input the document loading matrix and the list of article titles, and returns a data frame with three columns:

  1. The topic index
  2. The two articles that are most positively associated with that topic
  3. The two articles that are most negatively associated with that topic
topic_articles <- function(doc_mat, headlines){
  # Put code here

}
# topic_articles(svd_fit$u, articles$headline) %>%
#   pander(split.table=Inf, justify='left')
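
A matching sketch for articles, assuming doc_mat has one column per topic, as svd_fit$u does:

topic_articles <- function(doc_mat, headlines){
  do.call(rbind, lapply(seq_len(ncol(doc_mat)), function(i){
    ord <- order(doc_mat[, i], decreasing = TRUE)
    data.frame(topic = i,
               positive = paste(headlines[head(ord, 2)], collapse = ' | '),
               negative = paste(headlines[tail(ord, 2)], collapse = ' | '),
               stringsAsFactors = FALSE)
  }))
}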

Topic Distributions

Distribution of each topic over the whole dataset

Individual article topic loadings

Observations:

  1. The first topic just points to the center of the data.
  2. SVD topics are bi-directional, which makes them hard to interpret
  3. The topic loading distribution is smooth.

Now try with NMF
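
A minimal fitting sketch using the NMF package; nine topics is an assumption, chosen to match the exercises below (note the TFIDF matrix is non-negative, as NMF requires):

library(NMF)

# Factor the DTM into W (documents x topics) and H (topics x terms)
nmf_fit <- nmf(as.matrix(dtm), rank = 9)
W <- basis(nmf_fit)
H <- coef(nmf_fit)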

Exercises:

1) Visualizing NMF
  • Make scatterplots of the topics against each other (colored by section name)
  • Plot the distributions of the topic loadings
  • What do you notice is different about these topics from those generated by SVD?

Scatterplots:

Topic Distributions:

Histogram of number of zero loadings per document:
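
A hedged sketch that could produce the three views above; section is a hypothetical name for the column holding the NYT section labels:

library(ggplot2)

# Topic 1 vs topic 2, colored by section
plot_df <- data.frame(x = W[, 1], y = W[, 2], section = articles$section)
ggplot(plot_df, aes(x, y, color = section)) + geom_point(alpha = 0.5)

# Loading distribution for the first topic (repeat for the others)
hist(W[, 1], breaks = 50)

# Number of near-zero loadings per document
hist(rowSums(W < 1e-10))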

2) Describe the topics

NMF produces uni-directional topics, so we don’t have to worry about positively/negatively associated words and articles. Write a function that outputs a data frame with three columns:

  1. Index of topic
  2. Top five terms associated with that topic
  3. Top two articles associated with that topic

Make sure you re-use the two functions you wrote above.
topic_descriptions <- function(term_mat, doc_mat, terms, headlines){
  # Your code goes here

}
# topics_df <- topic_descriptions(H, W, Terms(dtm), articles$headline)
# pander(topics_df, split.table=Inf)
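
One possible solution, re-using the two sketches from the SVD section:

topic_descriptions <- function(term_mat, doc_mat, terms, headlines){
  words <- topic_words(term_mat, terms)
  arts  <- topic_articles(doc_mat, headlines)
  data.frame(topic = words$topic,
             top_terms = words$positive,
             top_articles = arts$positive,
             stringsAsFactors = FALSE)
}
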
3) Visualize topic loadings of individual articles.

Start by building the following data frame:

  • Each row is a document from the corpus
  • There is one column for the headline of the article, the rest of the columns are the topic loadings
  • Each topic loading is named according to the three most popular terms for that topic.

Now pick five articles at random and make a visualization to see the loading, across all 9 topics, of each of the articles.
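
A sketch of the data frame and plot, using tidyr for reshaping; the headline column comes from the original dataset, while the topic-naming scheme is an assumption:

library(tidyr)
library(ggplot2)

# Name each NMF topic by its three most popular terms
topic_names <- apply(H, 1, function(row)
  paste(Terms(dtm)[order(row, decreasing = TRUE)[1:3]], collapse = '_'))

loading_df <- data.frame(headline = articles$headline, W,
                         stringsAsFactors = FALSE)
colnames(loading_df)[-1] <- topic_names

# Five random articles, reshaped to long form for plotting
long <- pivot_longer(loading_df[sample(nrow(loading_df), 5), ],
                     -headline, names_to = 'topic', values_to = 'loading')
ggplot(long, aes(topic, loading)) +
  geom_col() +
  facet_wrap(~ headline) +
  coord_flip()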

And finally, LDA
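
A minimal fitting sketch using the topicmodels package. One caveat: LDA models integer counts, so we rebuild the document-term matrix without TFIDF weighting (re-using the keep vector from earlier):

library(topicmodels)

# Count-based DTM, same term bounds as before
dtm_counts <- DocumentTermMatrix(corpus,
                                 control = list(bounds = list(global = c(20, 100))))
lda_fit <- LDA(dtm_counts[keep, ], k = 9)

# Document-topic and topic-term distributions
doc_topics  <- posterior(lda_fit)$topics  # documents x topics
topic_terms <- posterior(lda_fit)$terms   # topics x terms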

1) Visualizing LDA
  • Make scatterplots of the topics against each other (colored by section name)
  • Plot the distributions of the topic loadings
  • What do you notice is different about these topics from those generated by SVD?
2) Describe the topics

Like NMF, LDA produces uni-directional topics, so we don’t have to worry about positively/negatively associated words and articles. Write a function that outputs a data frame with three columns:

  1. Index of topic
  2. Top five terms associated with that topic
  3. Top two articles associated with that topic

Make sure you re-use the two functions you wrote above.
3) Visualize topic loadings of individual articles.

Start by building the following data frame:

  • Each row is a document from the corpus
  • There is one column for the headline of the article, the rest of the columns are the topic loadings
  • Each topic loading is named according to the three most popular terms for that topic.

Now pick five articles at random and make a visualization to see the loading, across all 9 topics, of each of the articles.

Task-based Topic Modeling

Now that we have seen a handful of topic models, a common question is: which one is best? Often, this is answered simply by looking at the extracted topics and judging whether they are satisfying. Sometimes, topic modeling is the first step in a larger data modeling operation. In that case, we grade topic models based on how well they get the job done.

Exercise: Evaluate topic models based on an evaluation task

For this exercise, the setup is that we have a large corpus of documents and only a handful of them are labeled. For now, let’s stick with a binary label: we will predict whether an article belongs in the ‘World’ section of the NY Times.

We will assume that we can

  1. Build a topic model for all the data
  2. Use the learned topics as features for classifying the unlabeled articles.
1) Keep track of train/test indices

Don’t segment the data yet; just pick a set of indices that corresponds to ‘train’ and another that corresponds to ‘test’.
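
For example (the 70/30 split is an arbitrary choice):

set.seed(42)
n <- nrow(articles)
train_idx <- sample(n, floor(0.7 * n))
test_idx  <- setdiff(seq_len(n), train_idx)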

2) Build Topic Models

Build a topic model on all the articles using each of the three methodologies above.

3) Modeling Data Frames

Now construct some data frames you will use for modeling. Each data frame should have one column indicating whether or not that article appeared in the ‘World’ section of the NY Times. The remaining columns should correspond to topics, and they should be named according to the top 3 terms in that topic.
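
A hedged sketch for the NMF frame; section is again a hypothetical column name:

# Binary label plus one column per topic
nmf_df <- data.frame(is_world = articles$section == 'World', W)
colnames(nmf_df)[-1] <- topic_names
# ...and analogously for the SVD and LDA loadings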

4) Logistic Regression

Build a logistic regression model for each of your modeling data frames. Be sure to use the Train/Test split that you defined above.

For each model, you may want to try the following process (sketched in code after the list):

  1. Build a logistic regression model using all the topics.
  2. Examine your model using summary()
  3. Decide if there is an optimal (in the sense of both modeling accuracy and interpretation) set of 2-4 topics you can use to build the final model.
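
A sketch of the first two steps for the NMF frame; the other models are analogous:

# Logistic regression on the training rows only
fit <- glm(is_world ~ ., data = nmf_df[train_idx, ], family = binomial)
summary(fit)

# Predicted probabilities on the held-out rows
pred <- predict(fit, newdata = nmf_df[test_idx, ], type = 'response')
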
5) Model Comparison

Use the following function to help you evaluate your models. Which model do you like best? There is no ‘right’ answer to this question; different models may be appealing for different reasons.

library(ROCR)  # provides prediction() and performance()

roc <- function(predicted, actual, key='None'){
  # Prediction object
  pred <- prediction(predicted, actual)

  # ROC Curve
  perf <- performance(pred, measure = 'tpr', x.measure = 'fpr')
  roc <- data.frame(perf@alpha.values,
                    perf@x.values,
                    perf@y.values)
  colnames(roc) <- c('Threshold', 'FPR', 'TPR')
  roc$key <- key

  # Area under the curve
  perf <- performance(pred, measure = 'auc')
  auc <- perf@y.values

  list(roc=roc, auc=auc)
}
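
For example, evaluating the NMF-based classifier from above might look like:

nmf_roc <- roc(pred, nmf_df$is_world[test_idx], key = 'NMF')
nmf_roc$auc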

Exercise (HW)

In the example above, we made the unrealistic assumption that you can build the topic model using both the train and test sets. In reality, one probably needs to build the topic model, set it aside, and be prepared to apply it to new data.

A proper train/test methodology applied to topic models:

  1. Split data into train/test
  2. Learn a topic model on the training data only.
  3. Learn a logistic regression model from the topics on the training data.
  4. Apply your topic model to the test data.
  5. Apply your logistic regression model to the topics in the test data.
  6. Evaluate the performance of your classifier.

Use this methodology to evaluate the performance of each of the three topic models.