Supervised Learning

Necessary packages

library(caret)
Loading required package: lattice
Loading required package: ggplot2
library(e1071)

Misc Metric Questions

Q1: Generalize the entropy function from the slides

Modify the information entropy function to accept multinomial probabilities (as a list, etc.), rather than just inferring a binary probability.

# e.g., here's the old eta function from the slides that calculates entropy assuming a binary distribution:
# eta = function(h){
#   t = 1-h
#   - ((h * log2(h)) + (t * log2(t)))
# }
# e.g. after rewriting something like this should succeed:
# eta(list(.1, .2, .4, .3))
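One possible generalization is sketched below. It assumes the input is a list or vector of probabilities that sum to 1, and it treats a zero probability as contributing nothing to the sum; both the validation and the zero handling are choices made for this sketch, not something the slides prescribe.

```r
# Entropy (in bits) of a discrete distribution given as a list or vector.
eta = function(p){
  p = unlist(p)                          # accept list(...) as well as c(...)
  stopifnot(all(p >= 0), abs(sum(p) - 1) < 1e-8)
  p = p[p > 0]                           # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

print(eta(list(.1, .2, .4, .3)))         # the multinomial call from the question
print(eta(c(.5, .5)))                    # binary case with h = 0.5: 1 bit
```

With two outcomes this reduces to the binary formula from the slides, since `c(h, 1 - h)` yields the same two terms.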

Q2: Write a function to produce an ROC curve (true positive rate and false positive rate)

roc = function(pred, dat){
  #...
}
# e.g.
# pred = c(.1, .2, .9, .8)
# dat = c(1, 0, 0, 0)   # one label per prediction
# roc(pred, dat)
# plot(roc(pred,dat))
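A minimal sketch of one way to build the curve follows. It assumes `dat` is coded 0/1 with at least one example of each class, sweeps a threshold over the distinct prediction values, and returns a data frame with `fpr` and `tpr` columns; the return shape and the toy inputs are assumptions made for illustration.

```r
# ROC points: one (fpr, tpr) pair per threshold, from strictest to loosest.
roc = function(pred, dat){
  stopifnot(length(pred) == length(dat))
  # Inf gives the (0, 0) corner; the smallest prediction gives (1, 1).
  thresholds = c(Inf, sort(unique(pred), decreasing = TRUE))
  tpr = sapply(thresholds, function(t) sum(pred >= t & dat == 1) / sum(dat == 1))
  fpr = sapply(thresholds, function(t) sum(pred >= t & dat == 0) / sum(dat == 0))
  data.frame(fpr = fpr, tpr = tpr)
}

# Toy scores and labels, made up for this example.
toy_pred = c(.1, .4, .35, .8)
toy_dat  = c(0, 0, 1, 1)
print(roc(toy_pred, toy_dat))
```

Because the result has exactly two columns, `plot(roc(toy_pred, toy_dat), type = "l")` draws the false positive rate on the x axis and the true positive rate on the y axis.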

Q3: Use the ROC curve function to calculate an AUC metric

auc = function(pred, dat){
  # roc(...)
}
# e.g.
# auc(pred, dat)
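The area can be approximated with the trapezoidal rule over the ROC points. The sketch below repeats a small `roc` helper so that it runs on its own; that helper, its `fpr`/`tpr` column names, and the toy inputs are assumptions of this example.

```r
# Helper: ROC points as a data frame (assumes 0/1 labels, both classes present).
roc = function(pred, dat){
  thresholds = c(Inf, sort(unique(pred), decreasing = TRUE))
  tpr = sapply(thresholds, function(t) sum(pred >= t & dat == 1) / sum(dat == 1))
  fpr = sapply(thresholds, function(t) sum(pred >= t & dat == 0) / sum(dat == 0))
  data.frame(fpr = fpr, tpr = tpr)
}

# AUC by the trapezoidal rule: width of each fpr step times the mean tpr over it.
auc = function(pred, dat){
  r = roc(pred, dat)
  n = nrow(r)
  widths  = diff(r$fpr)
  heights = (r$tpr[-1] + r$tpr[-n]) / 2
  sum(widths * heights)
}

print(auc(c(.1, .4, .35, .8), c(0, 0, 1, 1)))   # 0.75
```

A perfect ranking gives 1, and a ranking that is exactly reversed gives 0.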

Data Processing Questions

Read in the titanic csv and analyze it (e.g. plot interesting fields you find with boxplots, scatterplots, etc.)

Think about whether it makes sense to include a column based on what it is.

The “Titanic” dataset is a passenger manifest that also includes a “survived” field, indicating whether the individual survived the trip. We’re interested in whether we can predict whether a passenger survived, based solely on the information we knew about them before they boarded the ship.

titanic = read.csv("https://jdonaldson.github.io/uw-mlearn410/homework/titanic3.csv")
head(titanic)

Use the plots to answer the following questions:

Q4: Which fields seem to be important for predicting survival?

Q5: Which fields are leakage?

Q6: Which fields look like noise?

Q7: Extract the titles from the name field

The name field contains some useful demographic information. Use strsplit to pull the title out of each name and look at the counts of each unique title. These should be values like “Mr.”, “Mrs.”, etc. If some titles have very low counts, decide what to do with them: you can create a manual ontology and rename them, create an “Other” class, or drop those rows. Keep in mind that if you drop null rows during training, you need to tell us what to do with them while testing or running in production.

#modify titanic dataset here
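The extraction step could look something like the sketch below. It assumes names follow a "Surname, Title. Given names" layout and uses a made-up `toy` data frame rather than the real file; `get_title` and `min_count` are hypothetical names, and the cutoff of 2 is arbitrary.

```r
# Toy rows, invented for illustration; the real data would come from titanic$name.
toy = data.frame(
  name = c("Smith, Mr. John", "Brown, Mr. Alan", "Smith, Mrs. John (Mary Jones)",
           "Green, Mrs. Ann", "Doe, Miss. Jane", "Poe, Rev. Edgar"),
  stringsAsFactors = FALSE
)

# Take the text after the first ", " and keep its first word, e.g. "Mr.".
get_title = function(name){
  name = as.character(name)              # strsplit needs character input
  after_comma = vapply(strsplit(name, ", ", fixed = TRUE),
                       function(p) p[2], character(1))
  vapply(strsplit(after_comma, " ", fixed = TRUE),
         function(p) p[1], character(1))
}

toy$title = get_title(toy$name)
counts = table(toy$title)
print(counts)

# Fold titles seen fewer than min_count times into a single "Other" class.
min_count = 2
rare = names(counts)[counts < min_count]
toy$title[toy$title %in% rare] = "Other"
print(table(toy$title))
```

Mapping rare titles to "Other" has the side benefit that a title never seen in training can be handled the same way at prediction time.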

Q8: Deal with NA values

Let’s impute (fill in) the NAs and missing values in age and embarked. age is numeric, so we can replace each missing value with the mean of all the non-null ages. embarked is categorical, so let’s just replace each missing value with the most frequent port of embarkation.

# modify titanic dataset here. 
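The two rules above can be sketched as follows. The `toy` data frame is made up, `impute` is a hypothetical helper name, and treating an empty string in `embarked` as missing is an assumption about how the CSV encodes blanks.

```r
# Toy rows, invented for illustration.
toy = data.frame(
  age      = c(22, NA, 40, NA),
  embarked = c("S", "", "C", "S"),
  stringsAsFactors = FALSE
)

impute = function(d){
  # Numeric column: fill NAs with the mean of the observed values.
  d$age[is.na(d$age)] = mean(d$age, na.rm = TRUE)
  # Categorical column: fill NAs and empty strings with the most frequent value.
  missing = is.na(d$embarked) | d$embarked == ""
  counts = table(d$embarked[!missing])
  d$embarked[missing] = names(counts)[which.max(counts)]
  d
}

print(impute(toy))
```

If the fill values are meant to be reused on the test split, it is worth computing the mean and the mode on the training rows only and storing them.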

Q9: What assumptions are we implicitly making by using these methods of imputation?

Q10: Convert all the categorical variables into appropriate factors.

Example: What’s the deal with pclass? Is it categorical?

# modify titanic here
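As a rough illustration, the conversion might look like the following on a made-up `toy` data frame. Treating `pclass` as an unordered factor is only one reasonable reading (an ordered factor is another), and the "no"/"yes" labels for `survived` are an assumption chosen so the level names are valid R names.

```r
# Toy rows, invented for illustration.
toy = data.frame(
  pclass   = c(1, 3, 2, 3),
  sex      = c("male", "female", "female", "male"),
  survived = c(0, 1, 1, 0),
  stringsAsFactors = FALSE
)

# pclass is stored as a number, but the values are class labels, not quantities.
toy$pclass = factor(toy$pclass, levels = c(1, 2, 3))
toy$sex = factor(toy$sex)
# A factor outcome makes the modeling functions treat this as classification.
toy$survived = factor(toy$survived, levels = c(0, 1), labels = c("no", "yes"))

str(toy)
```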

Q11: Create a sampling function that splits the titanic dataset into a 75% train and a 25% test dataframe.

# datasplit = function(d){
#  # ... 
# }
# split = datasplit(titanic)
# e.g. should contain split$train and split$test
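One simple way to get such a split is a random sample of row indices, as in the sketch below. The `frac` argument is a hypothetical addition, the seed is arbitrary, and the demo data frame is made up so the block runs without the CSV.

```r
# Randomly assign floor(frac * n) rows to train and the rest to test.
datasplit = function(d, frac = 0.75){
  n = nrow(d)
  train_idx = sample(n, size = floor(frac * n))
  test_idx = setdiff(seq_len(n), train_idx)
  list(train = d[train_idx, , drop = FALSE],
       test  = d[test_idx, , drop = FALSE])
}

set.seed(410)                             # arbitrary seed, for repeatability
demo = data.frame(id = 1:100, x = rnorm(100))
split = datasplit(demo)
print(nrow(split$train))                  # 75
print(nrow(split$test))                   # 25
```

This split is not stratified by outcome; caret's `createDataPartition` is an alternative if the class balance should be preserved in both parts.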

Modeling Questions

Q12: Is accuracy a good metric for evaluating this model? If so, what is the “chance” level for this dataset?

Q13: Use caret/rpart to train a decision tree on the training dataset.

# e.g., use your train data from the split.  Fill in the proper fields in "?"
# tm = train(survived ~ ? , data=split$train, method="rpart")
# summary(tm)

Q14: Use caret/rf to train a random forest on the training dataset.

# e.g., use your train data:
# rfm = train(survived ~ ? , data=split$train, method="rf")
# summary(rfm)

Q15: Use caret/glm to train a logistic model on the training dataset

# e.g., use your train data:
# lmm = train(survived ~ ? , data=split$train, method="glm")
# summary(lmm)

Q16: Gather predictions from your models on the test dataset

# e.g.
# tm_eval  = predict(tm, split$test)
#...
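An ROC curve needs a score per row rather than a hard class label, so it helps to ask the model for probabilities. The sketch below illustrates the idea with base R's `glm` on simulated data, so that it runs without caret or the CSV; the data, the variable names, and the use of `glm` in place of a caret model are all assumptions of this example. With a caret classification model, `predict(tm, split$test, type = "prob")` plays the same role and returns one probability column per class.

```r
set.seed(1)                                   # arbitrary seed
# Simulated data: one predictor x and a 0/1 outcome y, invented for illustration.
sim = data.frame(x = rnorm(200))
sim$y = rbinom(200, 1, plogis(2 * sim$x))
train = sim[1:150, ]
test  = sim[151:200, ]

# Fit on the training rows only.
fit = glm(y ~ x, data = train, family = binomial)

# type = "response" returns P(y = 1) for each test row, usable as ROC scores.
scores = predict(fit, newdata = test, type = "response")
print(head(scores))
```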

Q17: Use your roc/auc functions to plot and compare your models’ performance

# e.g.
# plot(roc(tm_eval, split$test$survived))
# auc(tm_eval, split$test$survived)

Q18: Which model performed the best and why do you think it did better?

Closing Notes/Follow-up

Consider submitting your responses to Kaggle and see how you did! https://www.kaggle.com/c/titanic
