Find model parameters using EM
The dataset below contains eruption data from old faithful.
Analysts note that old faithful erupts in certain patterns: Sometimes there are long eruptions, sometimes there are short. The eruptions are also followed by a delay that can vary accordingly.
Read in the data to get started.
data(faithful)
# investigate the old faithful eruption patterns using scatter plots, etc.
Q1: How many clusters do you believe exist?
Q2: How did you arrive at your conclusion?
Q3 : Use Expectation Maximization clustering
Extract cluster parameters using your chosenk
cluster count and report them below
r
r library(EMCluster) # shortemcluster(faithful, simple.init(faithful, nclass = k))
Q6 : Which clustering technique works the best here?
Topic Modeling
In this problem, you will use a topic model as part of a supervised learning pipeline. We will use the New York Times articles that we looked at in class.
articles <- read.csv('../datasets/nyt_articles.csv', stringsAsFactors = F)
Q7: Define a target. For this problem, let’s try to predict whether or not an article appears in the “Sports” section.
Q8: Split your data into three segments: Train_1, Train_2, Test
Q9: Use the train_1 dataset to build an LDA topic model of the article content.
You get to decide how many topics to find, and what other parameters you would like to play with. You may want to use some of the functions we defined during class for examining topics.
Q10: Apply your topic model to the Train_2 datset. You may have to play around with the documentation to figure out how to do this. Hint: You want to calculate posterior probabilities for a new set of documents…
Q11: Train a logistic regression model on the topics extracted from the Train_2. That is, you are trying to model the probability that a given article is from the sports section, given the loadings on the topics you found in Q10.
Q13: What are your observations?
Q14 : Final Project Time!
Write up a paragraph or two on what your final project will be. Answer these questions:
- What data are you using?
- What techniques are you using?
- Do you plan on doing any data cleaning/preparation? (If so, what?)
- Are you going to perform supervised or unsupervised learning? (And which technique(s)?)