To begin our exploration of anomaly detection, we will look at the following dataset of credit application scoring: https://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data). The file german.data contains the raw data and the file german.doc contains the field descriptions. The fields are a mix of qualitative and numerical variables, such as Status of existing checking account (qualitative), Credit history (qualitative), Credit amount requested (numerical), and so on.
The final column of this dataset contains a manually assigned label for creditworthiness, where 1 is good and 2 is bad. In practice, we may or may not have access to hand-labeled training data, and there are several ways to deal with that.
We will start off assuming we have some labeled samples.
Load in german.data and encode categorical variables with dummy variables. Pull out the last column as a separate variable. How many data points do we have? How many are credit-worthy (inliers)?
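The loading and encoding code is not shown above, so here is a minimal sketch; the names credit, credit_label, and encoded_credit are my own choices (only encoded_credit.train and related names appear later), and it assumes german.data sits in the working directory.

# read the whitespace-separated raw file; stringsAsFactors so that model.matrix
# treats the qualitative columns as factors (must be set explicitly on R >= 4.0)
credit <- read.table("german.data", header = FALSE, stringsAsFactors = TRUE)
credit_label <- credit[, ncol(credit)]       # last column: 1 = good, 2 = bad
credit_features <- credit[, -ncol(credit)]
# expand every factor column into dummy variables (no intercept column)
encoded_credit <- model.matrix(~ . - 1, data = credit_features)
nrow(encoded_credit)        # total number of data points
sum(credit_label == 1)      # number of credit-worthy (inlier) points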
Assuming we have labeled data, we train our model only on inliers and test on both inliers and outliers. So let us split the inliers (\(y == 1\)) into a 75-25 train-test split. All outliers will go into the test split.
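A sketch of this split, building on the loading sketch above; the seed and the 75% fraction are assumptions, and y.test records which test rows are inliers (1) or outliers (2).

set.seed(42)    # arbitrary seed, for reproducibility only
inlier_idx  <- which(credit_label == 1)
outlier_idx <- which(credit_label == 2)
train_idx   <- sample(inlier_idx, size = floor(0.75 * length(inlier_idx)))
heldout_idx <- setdiff(inlier_idx, train_idx)
encoded_credit.train <- encoded_credit[train_idx, ]
encoded_credit.test  <- encoded_credit[c(heldout_idx, outlier_idx), ]
y.test <- c(rep(1, length(heldout_idx)), rep(2, length(outlier_idx)))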
Let’s try the Gaussian density approximation method. Use the functions colMeans and cov to fit the distribution to the training data. Then visualize the distribution of distances (using the mahalanobis function) with a histogram. Based on this alone, what distance value might you use as a threshold?
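One possible way to carry out this step, using the split sketched above; the histogram is drawn for the test set, which is the distribution discussed next (note that mahalanobis returns squared distances).

mu.train  <- colMeans(encoded_credit.train)
cov.train <- cov(encoded_credit.train)    # may be near-singular if a dummy column is constant in training
dist.test <- mahalanobis(encoded_credit.test, center = mu.train, cov = cov.train)
hist(dist.test, breaks = 50, xlab = "Squared Mahalanobis distance", main = "Test-set distances")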
Note: You would normally want a train-test-dev split, or cross-validation here so that you can pick the threshold optimally. We are being lazy for the purposes of instruction.
We note that the tail of the distribution (values past \(\sim 150\)) is heavier. This is expected because our test set contains the anomalies we are trying to detect, so we would hope they are further away.
We cannot just say we would pick the knee point; we have to ask whether it is better to make one type of error than the other: is it better to deny credit to good candidates than to give credit to bad ones? Typically for credit, the answer is no, but you will need to quantify the cost of each type of error (which can be hard to do) and decide what the optimal trade-off is. This is a business decision, not just a machine learning one.
Compute the centered and scaled PCA of the training set (using prcomp), and visualize the proportion of variance explained as a function of the number of components. Pick an appropriate number of components.
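A sketch of the fit and the variance-explained plot; the object name pca.train matches the reconstruction code below, and if any dummy column is constant in the training split you will need to drop it before scaling.

pca.train <- prcomp(encoded_credit.train, center = TRUE, scale. = TRUE)
var_explained <- pca.train$sdev^2 / sum(pca.train$sdev^2)
plot(cumsum(var_explained), type = "b",
     xlab = "Number of components", ylab = "Cumulative proportion of variance explained")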
pc.use <- k  # replace k with the number of components you picked (explains ~80% of variance)
# predict.prcomp projects data onto the principal components of a fitted prcomp object
project_pca <- getS3method("predict", "prcomp")
# reconstruct the training data from the first pc.use components
reconstructed_credit.train <- project_pca(pca.train, encoded_credit.train)[, 1:pc.use] %*% t(pca.train$rotation[, 1:pc.use])
# undo the scaling, then add the center back, so the reconstruction is on the original scale
reconstructed_credit.train <- scale(reconstructed_credit.train, center = FALSE, scale = 1 / pca.train$scale)
reconstructed_credit.train <- scale(reconstructed_credit.train, center = -1 * pca.train$center, scale = FALSE)
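The code above stops at the training-set reconstruction; one way (my assumption, not spelled out above) to turn this into an anomaly score is to reconstruct the test set the same way and use the per-row reconstruction error.

reconstructed_credit.test <- project_pca(pca.train, encoded_credit.test)[, 1:pc.use] %*% t(pca.train$rotation[, 1:pc.use])
reconstructed_credit.test <- scale(reconstructed_credit.test, center = FALSE, scale = 1 / pca.train$scale)
reconstructed_credit.test <- scale(reconstructed_credit.test, center = -1 * pca.train$center, scale = FALSE)
recon_error.test <- rowSums((encoded_credit.test - reconstructed_credit.test)^2)   # anomaly score per test row
hist(recon_error.test, breaks = 50, main = "PCA reconstruction error (test set)")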
Consider the scenario where we have some labeled examples of anomalies, but we cannot be sure that all anomalies were labeled. For instance, we may have some examples of credit card fraud, but the remaining transactions may also contain fraudulent transactions that we were not able to catch. This is the scenario where we would use the Minimum Covariance Determinant (MCD) estimator.
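A minimal MCD sketch using cov.rob from the MASS package (robustbase::covMcd is an alternative); fitting on encoded_credit.train is an assumption, and with many binary dummy columns the MCD subsets can be singular, in which case you may need to reduce dimensionality first (e.g. with the PCA scores).

library(MASS)
# robust location/scatter estimate that tolerates a fraction of unlabeled anomalies in the data
mcd.fit  <- cov.rob(encoded_credit.train, method = "mcd")
dist.mcd <- mahalanobis(encoded_credit.test, center = mcd.fit$center, cov = mcd.fit$cov)
hist(dist.mcd, breaks = 50, main = "Robust (MCD) squared Mahalanobis distances")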
Our dataset contains a lot of categorical variables, which become binary (0/1) dummy variables after encoding. All the techniques we have used so far make assumptions of normality. We will now try some non-parametric, classification-based methods, using the same train-test splits as in the multivariate Gaussian and PCA examples.
library(e1071)  # provides the one-class SVM
oneclass_svm.model <- svm(encoded_credit.train, y = NULL, type = 'one-classification', nu = 0.10, scale = TRUE, kernel = "radial")
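A sketch of scoring the test split with the fitted model; for type = 'one-classification', predict returns TRUE for points classified as inliers, and y.test is the assumed label vector from the split sketched earlier.

oneclass_svm.pred <- predict(oneclass_svm.model, encoded_credit.test)
table(predicted_inlier = oneclass_svm.pred, actual = y.test)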
Isolation Forests would be another candidate to try on this data. However, installing the R implementation on all platforms is not easy: https://github.com/Zelazny7/isofor . There is also a Python implementation in scikit-learn for those who want to try it.