Supervised Learning
Necessary packages
library(caret)
Loading required package: lattice
Loading required package: ggplot2
library(e1071)
Misc Metric Questions
Q1: Generalize the entropy function from the slides
Modify the information entropy function to accept multinomial probabilities (as a list, etc.), rather than just inferring a binary probability.
# e.g., here's the old eta function from the slides that calculates entropy assuming a binary distribution:
# eta = function(h){
# t = 1-h
# - ((h * log2(h)) + (t * log2(t)))
# }
# e.g. after rewriting something like this should succeed:
# eta(list(.1, .2, .4, .3))
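One possible generalization is sketched below. It assumes the input is a vector or list of probabilities that sum to 1, and it treats a zero probability as contributing nothing to the sum; the tolerance check is an illustrative choice, not part of the slides.

```r
# Entropy (in bits) of a discrete distribution given as a vector or list of probabilities.
eta = function(p){
  p = unlist(p)                                   # accept list(.1, .2) as well as c(.1, .2)
  stopifnot(all(p >= 0), abs(sum(p) - 1) < 1e-8)  # must be a valid distribution
  p = p[p > 0]                                    # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

eta(list(.1, .2, .4, .3))
eta(c(.5, .5))   # the binary case with h = .5
```

With two outcomes this reduces to the binary formula, since the second probability is just 1 - h.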
Q2: Write a function to produce an ROC curve (true positive rate and false positive rate)
roc = function(pred, dat){
#...
}
# e.g.
# pred = c(.1, .2, .9, .8)
# dat = c(0, 0, 1, 1)  # one label per prediction, so both vectors have the same length
# roc(pred, dat)
# plot(roc(pred,dat))
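A minimal sketch of one way to build the curve: sweep a threshold over the predicted scores and record the two rates at each step. It assumes dat is coded 1 for positive and 0 for negative, and that the result is returned as a data frame with columns fpr and tpr (both names are illustrative choices).

```r
# ROC points from scores (pred) and 0/1 labels (dat).
roc = function(pred, dat){
  stopifnot(length(pred) == length(dat))
  # Start above every score (nothing predicted positive), then lower the threshold.
  thresholds = c(Inf, sort(unique(pred), decreasing = TRUE))
  tpr = sapply(thresholds, function(th) sum(pred >= th & dat == 1) / sum(dat == 1))
  fpr = sapply(thresholds, function(th) sum(pred >= th & dat == 0) / sum(dat == 0))
  data.frame(fpr = fpr, tpr = tpr)
}

pred = c(.1, .2, .9, .8)
dat  = c(0, 0, 1, 1)
roc(pred, dat)
plot(roc(pred, dat), type = "l")
```

Returning the false positive rate as the first column means plot() puts it on the x axis, which is the usual orientation.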
Q3: Use the ROC curve function to calculate an AUC metric
auc = function(pred, dat){
# roc(...)
}
# e.g.
# auc(pred, dat)
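One common way to get the area is the trapezoidal rule applied to the ROC points. The sketch below repeats a compact roc helper so that it runs on its own; the helper, its column names, and the 1 = positive coding are assumptions for illustration.

```r
# Compact ROC helper (repeated here so this block is self-contained).
roc = function(pred, dat){
  th = c(Inf, sort(unique(pred), decreasing = TRUE))
  data.frame(
    fpr = sapply(th, function(cut) sum(pred >= cut & dat == 0) / sum(dat == 0)),
    tpr = sapply(th, function(cut) sum(pred >= cut & dat == 1) / sum(dat == 1))
  )
}

# Area under the ROC curve: sum of trapezoids between consecutive points.
auc = function(pred, dat){
  r = roc(pred, dat)
  widths  = diff(r$fpr)
  heights = (head(r$tpr, -1) + tail(r$tpr, -1)) / 2
  sum(widths * heights)
}

auc(c(.1, .2, .9, .8), c(0, 0, 1, 1))   # every positive is scored above every negative
```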
Data Processing Questions
Read in the Titanic CSV and analyze it (e.g., plot interesting fields you find with boxplots, scatterplots, etc.)
Think about whether it makes sense to include each column, based on what it represents.
The “Titanic” dataset is a passenger manifest that also includes a “survived” field, indicating whether the individual survived the trip. We’re interested in whether we can predict whether a passenger survived, based solely on the information we knew about them before they boarded the ship.
titanic = read.csv("https://jdonaldson.github.io/uw-mlearn410/homework/titanic3.csv")
head(titanic)
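As a starting point for the exploration, the sketch below uses a tiny invented data frame in place of the real file, so the values mean nothing; only the column names (survived, sex, age, fare) are taken from the dataset. The same calls should work on titanic once it is loaded.

```r
# Toy stand-in for a few titanic columns (values invented for illustration).
toy = data.frame(
  survived = c(1, 0, 1, 0, 1, 0),
  sex      = c("female", "male", "female", "male", "female", "male"),
  age      = c(29, 40, 2, 35, 58, 21),
  fare     = c(211.3, 7.9, 151.6, 8.1, 26.6, 7.2)
)

# Survival rate per group: the mean of a 0/1 column is the fraction of 1s.
rates = tapply(toy$survived, toy$sex, mean)
rates

# Distribution of a numeric field, split by outcome.
boxplot(age ~ survived, data = toy, xlab = "survived", ylab = "age")
```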
Use the plots to answer the following questions:
Q4: Which fields seem to be important for predicting survival?
Q5: Which fields are leakage?
Q6: Which fields look like noise?
Q7: Extract the titles from the name field
The name field contains some useful demographic information. Use strsplit and look at the counts of each unique title. These should be values like “Mr.”, “Mrs.”, etc. If some titles have very low counts, decide what to do with them: you can create a manual ontology and rename them, create an “Other” class, or drop those rows. Keep in mind that if you drop null rows during training, you need to tell us what to do with them while testing or running in production.
# modify titanic dataset here
Q8: Deal with NA values
Let’s impute (fill in) the NAs and missing values in age and embarked. Since age is numeric, we can replace its missing values with the mean of all the non-null ages. Since embarked is categorical, let’s just replace its missing values with the most frequent port of embarkation.
# modify titanic dataset here.
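The two imputation rules can be sketched on a small invented data frame. This assumes missing ages appear as NA and missing ports appear as either NA or an empty string; check how the values actually look in titanic before reusing it.

```r
# Toy data with one missing value per column (values invented for illustration).
toy = data.frame(
  age      = c(22, NA, 30, 38),
  embarked = c("S", "", "C", "S"),
  stringsAsFactors = FALSE
)

# Numeric: fill NAs with the mean of the observed values.
toy$age[is.na(toy$age)] = mean(toy$age, na.rm = TRUE)

# Categorical: treat NA and "" as missing, then fill with the most frequent value.
missing   = is.na(toy$embarked) | toy$embarked == ""
mode_port = names(which.max(table(toy$embarked[!missing])))
toy$embarked[missing] = mode_port

toy
```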
Q9: What assumptions are we implicitly making by using these methods of imputation?
Q10: Convert all the categorical variables into appropriate factors.
Example: What’s the deal with pclass? Is it categorical?
# modify titanic here
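A small sketch of the conversion on invented rows. Treating pclass as an ordered factor and relabeling survived as "no"/"yes" are modeling choices made here for illustration, not requirements of the assignment.

```r
# Toy rows (values invented for illustration).
toy = data.frame(
  pclass   = c(1, 3, 2, 3),
  sex      = c("male", "female", "female", "male"),
  survived = c(1, 0, 1, 0)
)

# pclass is stored as a number but names a class of ticket, so make it a factor.
toy$pclass   = factor(toy$pclass, levels = c(1, 2, 3), ordered = TRUE)
toy$sex      = factor(toy$sex)
# A factor outcome tells modeling functions that this is classification, not regression.
toy$survived = factor(toy$survived, levels = c(0, 1), labels = c("no", "yes"))

str(toy)
```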
Q11: Create a sampling function that splits the titanic dataset into a 75% train and a 25% test data frame.
# datasplit = function(d){
# # ...
# }
# split = datasplit(titanic)
# e.g. should contain split$train and split$test
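One way to write the split is to sample row indices without replacement. In this sketch the frac argument is a hypothetical addition, and the demo uses a made-up data frame rather than titanic.

```r
# Randomly assign floor(frac * n) rows to train and the rest to test.
datasplit = function(d, frac = 0.75){
  n_train = floor(frac * nrow(d))
  idx = sample(nrow(d), n_train)          # row indices sampled without replacement
  list(train = d[idx, , drop = FALSE],
       test  = d[-idx, , drop = FALSE])
}

set.seed(410)                              # fixed seed so the split is reproducible
split = datasplit(data.frame(id = 1:100, x = rnorm(100)))
nrow(split$train)
nrow(split$test)
```

Setting a seed before splitting keeps the train and test sets the same across runs, which makes model comparisons repeatable.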
Modeling Questions
Q12: Is accuracy a good metric for evaluating this model? If so, what is the “chance” level for this dataset?
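One reading of the "chance" level is the accuracy of always guessing the majority class. The sketch below computes it on invented labels; on the real data the same calls would be applied to the survived column.

```r
survived = c(0, 0, 1, 0, 1, 0, 0, 1)        # invented labels for illustration

class_share = prop.table(table(survived))   # fraction of rows in each class
baseline    = max(class_share)              # accuracy of always guessing the majority class
baseline
```

A model is only interesting if its accuracy clearly exceeds this baseline.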
Q13: Use caret/rpart to train a decision tree on the training dataset.
# e.g., use your train data from the split. Fill in the proper fields in "?"
# tm = train(survived ~ ? , data=split$train, method="rpart")
# summary(tm)
Q14: Use caret/rf to train a random forest on the training dataset.
# e.g., use your train data:
# rfm = train(survived ~ ? , data=split$train, method="rf")
# summary(rfm)
Q15: Use caret/glm to train a logistic model on the training dataset.
# e.g., use your train data:
# lmm = train(survived ~ ? , data=split$train, method="glm")
# summary(lmm)
Q16: Gather predictions from your models on the test dataset
# e.g.
# tm_eval = predict(tm, split$test)
#...
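For ROC and AUC you need predicted probabilities rather than class labels. The sketch below illustrates the difference with base R's glm on simulated data, used as a stand-in so the example runs without caret; for a caret model the analogous call is predict with type = "prob". The data and the 0.5 cutoff are assumptions for illustration.

```r
set.seed(410)
n = 200
d = data.frame(x = rnorm(n))
d$y = rbinom(n, 1, plogis(1.5 * d$x))       # simulated 0/1 outcome

train = d[1:150, ]
test  = d[151:200, ]

fit   = glm(y ~ x, data = train, family = binomial)
prob  = predict(fit, newdata = test, type = "response")   # estimated P(y = 1)
label = as.integer(prob >= 0.5)                            # hard labels at a 0.5 cutoff

mean(label == test$y)                        # test-set accuracy
```

The probabilities in prob are what a ROC function would take as its scores, while label is what an accuracy calculation uses.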
Q17: Which model performed the best and why do you think it did better?