Unsupervised Learning

Find model parameters using EM

The dataset below contains eruption data from Old Faithful.
Analysts note that Old Faithful erupts in certain patterns: some eruptions are long and some are short. Each eruption is also followed by a delay whose length varies accordingly.

Read in the data to get started.

data(faithful)
# investigate the old faithful eruption patterns using scatter plots, etc.  
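
As a possible starting point for the exploration, here is a minimal sketch using base R graphics. The `faithful` data frame ships with R and has two columns, `eruptions` (eruption length in minutes) and `waiting` (minutes until the next eruption). The choice of plots and the `breaks = 30` setting are only suggestions.

```r
data(faithful)

# Quick look at the structure and value ranges
str(faithful)
summary(faithful)

# Scatter plot of the two variables: do the points form separate groups?
plot(faithful$eruptions, faithful$waiting,
     xlab = "Eruption length (min)",
     ylab = "Waiting time (min)",
     main = "Old Faithful eruptions")

# Histograms of each variable: is there more than one peak?
hist(faithful$eruptions, breaks = 30,
     xlab = "Eruption length (min)", main = "Eruption length")
hist(faithful$waiting, breaks = 30,
     xlab = "Waiting time (min)", main = "Waiting time")
```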

Q1: How many clusters do you believe exist?

Q2: How did you arrive at your conclusion?

Q3: Use Expectation Maximization clustering

Extract cluster parameters using your chosen k cluster count and report them below.

library(EMCluster)
# shortemcluster(faithful, simple.init(faithful, nclass = k))
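
If it is unclear what "cluster parameters" refers to, the following base-R sketch may help. It runs EM by hand for a two-component Gaussian mixture on eruption length alone, and the parameters it reports are the mixing weight, mean, and standard deviation of each component. This is not the EMCluster implementation: the one-dimensional setup, the two components, the starting values, and the fixed 200 iterations are all assumptions made for illustration.

```r
data(faithful)
x <- faithful$eruptions

# Crude starting values (an assumption): split the data at its median
mu    <- c(mean(x[x <= median(x)]), mean(x[x > median(x)]))
sigma <- c(sd(x), sd(x))
p     <- c(0.5, 0.5)

for (iter in 1:200) {
  # E-step: responsibility of each component for each observation
  d1 <- p[1] * dnorm(x, mu[1], sigma[1])
  d2 <- p[2] * dnorm(x, mu[2], sigma[2])
  r1 <- d1 / (d1 + d2)
  r2 <- 1 - r1

  # M-step: responsibility-weighted updates of weight, mean and sd
  p     <- c(mean(r1), mean(r2))
  mu    <- c(sum(r1 * x) / sum(r1), sum(r2 * x) / sum(r2))
  sigma <- c(sqrt(sum(r1 * (x - mu[1])^2) / sum(r1)),
             sqrt(sum(r2 * (x - mu[2])^2) / sum(r2)))
}

em_params <- data.frame(component = 1:2, weight = p, mean = mu, sd = sigma)
print(em_params)
```

For the homework itself you would fit both columns at once with the package call shown above; the sketch only illustrates the alternating E and M steps.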

Q4: Use dbscan to perform clustering.

library(dbscan)
# dbscan(faithful, eps = eps, minPts = minPts)

Report the settings you chose for minPts and eps, and how you arrived at them (hint: histograms and distances).
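
One way to follow the hint, sketched in base R only: compute each point's distance to its k-th nearest neighbour and look at how those distances are distributed. A sharp bend in the sorted distances is a common heuristic for picking eps. The value `min_pts <- 5` and the decision to standardize both columns with `scale()` are assumptions, not recommended answers. The dbscan package also provides `kNNdistplot()` for this kind of plot.

```r
data(faithful)

# Standardize so that waiting time (tens of minutes) does not dominate
# eruption length (a few minutes) in the Euclidean distance (an assumption)
x <- scale(faithful)

min_pts <- 5        # hypothetical value, try several
k <- min_pts - 1    # minPts counts the point itself, so look at k = minPts - 1 neighbours

# Full pairwise distance matrix (fine for 272 points)
d <- as.matrix(dist(x))

# After sorting a row, position 1 is the point itself (distance 0),
# so the k-th nearest neighbour sits at position k + 1
knn_dist <- apply(d, 1, function(row) sort(row)[k + 1])

# Histogram and sorted curve of the k-nearest-neighbour distances
hist(knn_dist, breaks = 30,
     xlab = "Distance to k-th nearest neighbour", main = "kNN distances")
plot(sort(knn_dist), type = "l",
     xlab = "Points sorted by distance", ylab = "kNN distance")
```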

Q5: Use kmeans to perform clustering

Use the k you chose before

# kmeans(faithful, k) should already be available in your R environment
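
A minimal sketch of the call, assuming a placeholder `k <- 2` (substitute your own answer from Q1). Standardizing with `scale()`, fixing the seed, and using `nstart = 25` random restarts are choices made here for illustration, not requirements of the assignment.

```r
data(faithful)

k <- 2          # placeholder, use the k you chose in Q1
set.seed(123)   # k-means starts from random centers, so fix the seed

# Run k-means on standardized data with several random restarts
km <- kmeans(scale(faithful), centers = k, nstart = 25)

km$centers         # cluster centers, in standardized units
table(km$cluster)  # number of points assigned to each cluster

# Colour the original scatter plot by cluster assignment
plot(faithful, col = km$cluster, pch = 19)
```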

Q6: Which clustering technique works the best here?

Topic Modeling

In this problem, you will use a topic model as part of a supervised learning pipeline. We will use the New York Times articles that we looked at in class.

articles <- read.csv('../datasets/nyt_articles.csv', stringsAsFactors = FALSE)

Q7: Define a target. For this problem, let’s try to predict whether or not an article appears in the “Sports” section.
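
A binary target of this kind is usually just a comparison turned into 0/1. The sketch below uses a made-up toy data frame, and the column name `section` is hypothetical; check `names(articles)` for the actual column that holds the section label.

```r
# Toy stand-in for `articles`; the rows and the `section` column are invented
toy_articles <- data.frame(
  section = c("Sports", "World", "Sports", "Arts"),
  content = c("game recap", "election news", "trade rumors", "gallery review"),
  stringsAsFactors = FALSE
)

# 1 if the article is in the Sports section, 0 otherwise
toy_articles$is_sports <- as.integer(toy_articles$section == "Sports")

table(toy_articles$is_sports)
```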

Q8: Split your data into three segments: Train_1, Train_2, Test
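
One possible way to make the split is to shuffle the row indices once and cut them into three blocks. The helper name `split_three`, the 40/30/30 proportions, and the seed are all assumptions; the toy data frame stands in for `articles`.

```r
# Hypothetical helper: randomly partition the rows of df into three pieces
split_three <- function(df, props = c(0.4, 0.3, 0.3), seed = 42) {
  stopifnot(length(props) == 3, abs(sum(props) - 1) < 1e-8)
  set.seed(seed)
  n   <- nrow(df)
  idx <- sample(n)               # random permutation of the row indices
  n1  <- floor(props[1] * n)
  n2  <- floor(props[2] * n)
  pos <- seq_len(n)              # position of each entry within idx
  list(
    train_1 = df[idx[pos <= n1], , drop = FALSE],
    train_2 = df[idx[pos > n1 & pos <= n1 + n2], , drop = FALSE],
    test    = df[idx[pos > n1 + n2], , drop = FALSE]
  )
}

# Toy stand-in for `articles`
toy <- data.frame(id = 1:10, text = letters[1:10], stringsAsFactors = FALSE)
parts <- split_three(toy)
sapply(parts, nrow)
```

Every row lands in exactly one segment, which matters here because the topic model, the logistic regression, and the final evaluation should each see different articles.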

Q9: Use the Train_1 dataset to build an LDA topic model of the article content.

You get to decide how many topics to find, and what other parameters you would like to play with. You may want to use some of the functions we defined during class for examining topics.

Q10: Apply your topic model to the Train_2 dataset. You may have to dig through the documentation to figure out how to do this. Hint: You want to calculate posterior probabilities for a new set of documents…

Q11: Train a logistic regression model on the topics extracted from Train_2. That is, you are trying to model the probability that a given article is from the sports section, given the loadings on the topics you found in Q10.
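
The shape of the model can be sketched with `glm()` on simulated data. Everything below is a stand-in: the topic proportions are randomly generated, and the names `topic_1`, `topic_2`, `topic_3`, and `is_sports` are hypothetical. In your solution the predictors would be the posterior topic proportions from Q10.

```r
set.seed(1)
n <- 200

# Simulated stand-in for per-document topic proportions (each row sums to 1)
raw    <- matrix(rexp(n * 3), ncol = 3)
topics <- raw / rowSums(raw)
colnames(topics) <- paste0("topic_", 1:3)

# Simulated target: documents heavy on topic_1 are more likely to be "sports"
is_sports <- rbinom(n, 1, plogis(-2 + 4 * topics[, "topic_1"]))

train_2 <- data.frame(topics, is_sports)

# Topic proportions sum to 1, so one topic is left out to avoid
# perfect collinearity with the intercept
fit <- glm(is_sports ~ topic_1 + topic_2, family = binomial, data = train_2)
summary(fit)

# Predicted probability of the positive class for each document
pred <- predict(fit, newdata = train_2, type = "response")
head(pred)
```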

Q12: Test the performance of your model on the Test set. You will have to apply the topic model before you can apply the logistic regression model. You can use the following function to help you evaluate the results.

library(ROCR)
# predicted: numeric scores; actual: true 0/1 labels; key: label for this curve
roc <- function(predicted, actual, key='None'){
  # Prediction object
  pred <- prediction(predicted, actual)

  # ROC Curve
  perf <- performance(pred, measure = 'tpr', x.measure = 'fpr')
  roc <- data.frame(perf@alpha.values,
                    perf@x.values,
                    perf@y.values)
  colnames(roc) <- c('Threshold', 'FPR', 'TPR')
  roc$key <- key

  # Area under the curve (extracted from the list as a single number)
  perf <- performance(pred, measure = 'auc')
  auc <- perf@y.values[[1]]

  list(roc=roc, auc=auc)
}
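
As a sanity check on the AUC that `roc()` returns, the same quantity can be computed in base R from ranks: the AUC is the probability that a randomly chosen positive scores higher than a randomly chosen negative. The function name `auc_by_rank` and the toy scores and labels below are made up for illustration.

```r
# Hypothetical helper: AUC via the rank-sum (Mann-Whitney) formula
auc_by_rank <- function(predicted, actual) {
  stopifnot(length(predicted) == length(actual),
            all(actual %in% c(0, 1)))
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  r <- rank(predicted)  # tied scores receive their average rank
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: four scores with their true labels
toy_scores <- c(0.1, 0.4, 0.35, 0.8)
toy_labels <- c(0, 0, 1, 1)
auc_by_rank(toy_scores, toy_labels)
```

If this number and the `auc` element returned by `roc()` disagree for the same inputs, check that the labels are coded 0/1 and that the scores are probabilities of the positive class.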

Q13: What are your observations?

Q14: Final Project Time!

Write up a paragraph or two on what your final project will be. Answer these questions:

  1. What data are you using?
  2. What techniques are you using?
  3. Do you plan on doing any data cleaning/preparation? (If so, what?)
  4. Are you going to perform supervised or unsupervised learning? (And which technique(s)?)