Sid Rajaram
May-18-2017
UW ML410 - Applied Machine Learning
An anomaly or outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism. [1]
[1] D. Hawkins. Identification of Outliers, Chapman and Hall, 1980.
# Two 2-D datasets, each with two clusters: one with tight clusters (low noise)
# and one with diffuse clusters (high noise)
outlier_x <- c(rnorm(100, 0, 0.125), rnorm(100, 1, 0.125))
outlier_y <- c(rnorm(100, 0, 0.125), rnorm(100, 1, 0.125))
noise_x <- c(rnorm(100, 0, 0.25), rnorm(100, 1, 0.25))
noise_y <- c(rnorm(100, 0, 0.25), rnorm(100, 1, 0.25))

par(mfrow = c(1, 2))
plot(outlier_x, outlier_y, main = "Clearly defined outlier",
     xaxt = 'n', yaxt = 'n', xlab = "", ylab = "")
points(0.5, 0.5, pch = 23, col = "blue", bg = "red")   # candidate point midway between clusters
plot(noise_x, noise_y, main = "Outlier buried in noise",
     xaxt = 'n', yaxt = 'n', xlab = "", ylab = "")
points(0.5, 0.5, pch = 23, col = "blue", bg = "red")   # the same point, now hidden in the noise
Need to pick the right set of features so that an outlier will pop out clearly. Typically requires domain expertise.
Outliers often contain useful information about abnormal behavior of the data-generating process. This can provide useful application-specific insights.
Typically an unsupervised problem, but shares aspects and approaches with supervised learning.
We may have some (typically non-exhaustive) examples of outliers.
Even if we do have examples of anomalies, we should refrain from using supervised techniques to classify them.
Typically, we have very few examples, so a supervised classifier may have trouble learning.
We also may have anomalies that are unlabeled (e.g., fraud that has gone undetected).
Future anomalies may look completely different than ones present in the dataset. All we know is that they will be different somehow.
Intrusion Detection Systems: Using OS logs, network logs, and DB logs to detect malicious activity.
System Performance: Resource usage logs, such as CPU load and memory usage, can help identify hardware or software issues in data centers. This generalizes to finding issues in any number of signals from sensors on electrical or mechanical devices.
Credit Card Fraud: Credit cards are notoriously unsafe. Unauthorized use may show different patterns, such as a buying spree from geographically obscure locations.
Medical Diagnosis: Unusual patterns in medical scans, such as MRI or PET scans, or ECG time-series may reflect disease conditions.
Law Enforcement: Fraud in financial transactions, trading activity, taxes, or insurance claims may be detected through unusual patterns of activity.
Earth Science: Spatiotemporal data about weather or land cover patterns collected through satellites or remote sensing can provide insights about illegal activities or environmental anomalies.
Learn a distribution of what inlier or non-anomalous data looks like.
Compute a score that measures distance from this distribution. If the distribution is an actual probability distribution, then just use likelihood.
Find a threshold for this score, below (or above) which you mark the test point as an outlier.
If you have some examples of anomalies, use them to test and validate the system.
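A minimal sketch of this recipe in R; the toy data, the deliberately simple Euclidean-distance score, and the 99th-percentile cutoff are illustrative assumptions:
# Minimal sketch: score = Euclidean distance from the training mean (toy example)
set.seed(0)
train <- cbind(rnorm(500), rnorm(500))                      # assumed inlier data
center <- colMeans(train)
score <- function(X) sqrt(rowSums(sweep(X, 2, center)^2))   # distance-from-center score

threshold <- quantile(score(train), 0.99)                   # flag the top 1% as suspicious
test <- rbind(c(0.1, -0.2), c(4, 4))                        # second point is far from the bulk
score(test) > threshold                                     # TRUE marks a suspected outlier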
You will ultimately need some labeled data to measure performance. This likely means you will have to label points manually (or pay someone to do so), or look at actual customer complaints.
In general, you never want to have labeled outliers in your training set (overfitting).
Unlabeled outliers may sneak in - there are robust methods to handle that.
Stick all the labeled anomalies in the test (and dev/validation) sets.
Do an 80-20 (or whatever proportion) split of the assumed inliers into the train and test sets.
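A sketch of this split in R, assuming a data frame X of presumed inliers and a small data frame labeled_anomalies of known anomalies (both simulated here for illustration):
# Simulated stand-ins: X = presumed inliers, labeled_anomalies = the few known outliers
set.seed(42)
X <- data.frame(x1 = rnorm(1000), x2 = rnorm(1000))
labeled_anomalies <- data.frame(x1 = rnorm(10, mean = 5), x2 = rnorm(10, mean = 5))

train_idx <- sample(nrow(X), size = 0.8 * nrow(X))       # 80-20 split of the assumed inliers
train <- X[train_idx, ]                                  # training set: unlabeled inliers only
test <- rbind(X[-train_idx, ], labeled_anomalies)        # every labeled anomaly goes into test
test_labels <- c(rep(0, nrow(X) - length(train_idx)),    # 0 = inlier, 1 = anomaly
                 rep(1, nrow(labeled_anomalies)))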
Metrics: precision, recall, and ROC AUC on the held-out labeled anomalies (raw accuracy is misleading when anomalies are rare).
Distribution Approximation
PCA Reconstruction
Robust distribution approximation, such as the Minimum Covariance Determinant (MCD)
One-class Support Vector Machines
Isolation Forests
Learning algorithms on specific data types:
Assumption: The data is distributed normally.
The pdf of the multivariate normal distribution is given by:
\[ p(\vec{x};\vec{\mu},\Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}\,\exp\left(-\frac{(\vec{x}-\vec{\mu})^{T}\,\Sigma^{-1}\,(\vec{x}-\vec{\mu})}{2}\right), \]
where \( d \) is the dimensionality of \( \vec{x} \).
The notation \( |A| \) denotes the determinant of matrix \( A \).
This distribution is parametrized by the mean vector \( \vec{\mu} \) and the covariance matrix \( \Sigma \).
Given data points \( \vec{x_{1}}, \vec{x_{2}}, \ldots, \vec{x_{n}} \) the mean is given by:
\[ \vec{\mu} = \frac{1}{n}\sum_{i=1}^{n} \vec{x_{i}}. \]
And the covariance matrix is given by:
\[ \Sigma = \frac{1}{n-1} \sum_{i=1}^{n} (\vec{x_{i}} - \vec{\mu})(\vec{x_{i}} - \vec{\mu})^{T}. \]
Naive Idea 1: How about we use the difference between the test point and the mean, i.e. \( \vec{x} - \vec{\mu} \)?
Refined Idea 2: How about we just use the pdf?
Compute the distance from the mean, scaled by the standard deviation along the appropriate direction.
This is called the Mahalanobis distance.
Can be easily converted to a probability: under the normality assumption, the squared Mahalanobis distance follows a chi-squared distribution with \( d \) degrees of freedom.
\[ D_{M}(\vec{x}) = \sqrt{(\vec{x}-\vec{\mu})^{T}\Sigma^{-1}(\vec{x}-\vec{\mu})}. \]
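A sketch in R using the built-in mahalanobis() function, which returns the squared distance; the toy data and the 99.9% chi-squared cutoff are illustrative assumptions:
# Fit the Gaussian to (assumed inlier) training data
set.seed(1)
train <- cbind(rnorm(500), rnorm(500, sd = 2))
mu <- colMeans(train)         # mean vector
Sigma <- cov(train)           # sample covariance matrix

# Squared Mahalanobis distance of new points from the fitted distribution
test <- rbind(c(0.5, 1), c(6, -8))
d2 <- mahalanobis(test, center = mu, cov = Sigma)

# Under the normality assumption, d2 follows a chi-squared distribution
# with d = ncol(train) degrees of freedom
cutoff <- qchisq(0.999, df = ncol(train))
d2 > cutoff                   # TRUE flags a suspected outlier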
Assumptions: The inliers lie close to a low-dimensional linear subspace, so most of their variance is captured by the top principal components and their reconstruction error is small.
Approach: Project each point onto the top principal components, reconstruct it, and score it by how far it lies from that subspace.
Can use Mahalanobis distance as in previous case.
But we assume that the variance in the error subspace is small (by design).
The covariance estimates along those low-variance directions are thus likely to be less stable anyway, so those directions can be ignored.
Frequently, the norm of the reconstruction (projection) error, \( ||\vec{x} - \vec{x'}|| \), where \( \vec{x'} \) is the reconstruction of \( \vec{x} \) from the top principal components, is used instead.
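A sketch in R using prcomp(); the simulated data and the choice of k = 1 retained component are illustrative assumptions:
# PCA reconstruction error as an outlier score (toy data lying near a 1-D subspace)
set.seed(2)
x <- rnorm(300)
train <- cbind(x, x + rnorm(300, sd = 0.1))
pca <- prcomp(train, center = TRUE, scale. = FALSE)
k <- 1                                                       # components to keep (assumed)

reconstruction_error <- function(X, pca, k) {
  Xc <- sweep(X, 2, pca$center)                              # center with the training mean
  scores <- Xc %*% pca$rotation[, 1:k, drop = FALSE]         # project onto the top k components
  Xhat <- scores %*% t(pca$rotation[, 1:k, drop = FALSE])    # reconstruct (still centered)
  sqrt(rowSums((Xc - Xhat)^2))                               # ||x - x'||
}

threshold <- quantile(reconstruction_error(train, pca, k), 0.99)
test <- rbind(c(1, 1.05), c(1, -1))                          # second point is off the subspace
reconstruction_error(test, pca, k) > threshold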
Approach: Estimate the mean and covariance robustly, so that unlabeled outliers in the training data do not distort the fit, and then score points with the Mahalanobis distance as before.
Computing the central mode: the Minimum Covariance Determinant (MCD) estimator searches for the subset of points whose covariance matrix has the smallest determinant and takes the mean and covariance of that tightly packed subset.
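A sketch in R using cov.rob() from the MASS package (assumed available); the contaminated toy data and the cutoff are illustrative:
# Robust fit via the Minimum Covariance Determinant (MASS assumed available)
library(MASS)
set.seed(3)
train <- rbind(cbind(rnorm(280), rnorm(280)),                    # bulk of the data
               cbind(rnorm(20, mean = 6), rnorm(20, mean = 6)))  # unlabeled contamination

robust <- cov.rob(train, method = "mcd")                 # robust center and covariance
d2 <- mahalanobis(train, center = robust$center, cov = robust$cov)

cutoff <- qchisq(0.999, df = ncol(train))
which(d2 > cutoff)                                       # indices of flagged points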
Download Anomaly.Rmd
Also, download german.data
An SVM classifier tries to find a separating hyperplane that maximizes the margin between two classes.
When used with a kernel, this boundary can take on various complex shapes (depending on the kernel).
A one-class SVM instead learns a boundary around the distribution of inlier points by separating them from the origin (in feature space) with maximum margin.
This results in a tight boundary around observed inlier points.
Schölkopf, Bernhard, et al. "Estimating the support of a high-dimensional distribution." Neural Computation 13.7 (2001): 1443-1471.
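A sketch in R, assuming the e1071 package (an interface to libsvm) is installed; the nu and gamma values are illustrative choices:
# One-class SVM on assumed inlier data (e1071 assumed installed)
library(e1071)
set.seed(4)
train <- cbind(rnorm(300), rnorm(300))

# nu bounds the fraction of training points allowed outside the boundary;
# gamma controls the RBF kernel width (both values are illustrative)
oc_svm <- svm(train, type = "one-classification",
              kernel = "radial", nu = 0.05, gamma = 0.5)

test <- rbind(c(0.2, -0.1), c(5, 5))
predict(oc_svm, test)    # TRUE = inside the learned boundary, FALSE = suspected outlier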
An ensemble of randomly grown trees, in the spirit of Random Forests (as the name suggests).
Pick a random subsample of points and grow a tree as follows: at each node, choose a random feature and a random split value between that feature's minimum and maximum at the node, then recurse until every point is isolated in its own leaf (or a height limit is reached); a minimal sketch appears below.
Grow multiple such trees with random subsets of the data.
The average path length (over all trees) from root to leaf is a measure of normality and will be our decision metric.
Random partitioning produces noticeably shorter paths for anomalies.
Liu, Fei Tony, et al. "Isolation Forest." ICDM 2008: Eighth IEEE International Conference on Data Mining.
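A minimal from-scratch sketch of the tree-growing and path-length idea in base R; the subsample size, number of trees, and depth cap are simplified assumptions rather than the exact procedure from the paper:
# Path length of a point x through one randomly grown isolation tree:
# at each node pick a random feature and a random split between its min and max,
# then follow x down until it is isolated (or a depth cap is reached).
iso_path_length <- function(x, data, depth = 0) {
  if (nrow(data) <= 1 || depth >= 20) return(depth)
  f <- sample(ncol(data), 1)                       # random feature
  lo <- min(data[, f]); hi <- max(data[, f])
  if (lo == hi) return(depth)                      # cannot split this node any further
  split <- runif(1, lo, hi)                        # random split value
  if (x[f] < split) {
    keep <- data[data[, f] < split, , drop = FALSE]
  } else {
    keep <- data[data[, f] >= split, , drop = FALSE]
  }
  iso_path_length(x, keep, depth + 1)
}

set.seed(5)
data <- rbind(cbind(rnorm(200), rnorm(200)), c(8, 8))    # last row is an obvious outlier
# 50 trees, each grown on a random subsample of 64 points
subsamples <- replicate(50, data[sample(nrow(data), 64), , drop = FALSE], simplify = FALSE)
avg_depth <- function(x) mean(sapply(subsamples, function(s) iso_path_length(x, s)))
avg_depth(data[1, ])    # typical point: long average path
avg_depth(c(8, 8))      # anomaly: isolated after only a few random splits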