Applied Machine Learning 410

Sid Rajaram
May-18-2017

Anomaly Detection (or Outlier Analysis)

UW ML410 - Applied Machine Learning

Introduction

An anomaly or outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism. [1]

  • Different from noise, which is a consequence of natural variability of the generating mechanism.

[Figure: Noise vs Outliers]

[1] D. Hawkins. Identification of Outliers, Chapman and Hall, 1980.

Introduction

  • In practice, the boundary between noise and a true outlier can be challenging to define.
# Two well-separated clusters with small variance: the off-cluster point stands out.
outlier_x <- c(rnorm(100, 0, 0.125), rnorm(100, 1, 0.125))
outlier_y <- c(rnorm(100, 0, 0.125), rnorm(100, 1, 0.125))
# The same clusters with larger variance: the same point is swallowed by the noise.
noise_x <- c(rnorm(100, 0, 0.25), rnorm(100, 1, 0.25))
noise_y <- c(rnorm(100, 0, 0.25), rnorm(100, 1, 0.25))

par(mfrow = c(1, 2))
plot(outlier_x, outlier_y, main = "Clearly defined outlier",
     xaxt = 'n', yaxt = 'n', xlab = "", ylab = "")
points(0.5, 0.5, pch = 23, col = "blue", bg = "red")
plot(noise_x, noise_y, main = "Outlier buried in noise",
     xaxt = 'n', yaxt = 'n', xlab = "", ylab = "")
points(0.5, 0.5, pch = 23, col = "blue", bg = "red")

Introduction

  • The outlier (in red) is easier to identify if the variance of the features is smaller.

[Figure: "Clearly defined outlier" vs. "Outlier buried in noise" scatter plots]

Introduction

  • Need to pick the right set of features so that an outlier will pop out clearly. Typically requires domain expertise.

  • Outliers often contain useful information about the abnormal behavior of the data generating process. This can provide useful application-specific insights.

  • Typically an unsupervised problem, but shares aspects and approaches with supervised learning.

  • We may have some (typically non-exhaustive) examples of outliers.

Anomaly Detection versus Supervised Learning

  • Even if we do have examples of anomalies, we should refrain from using supervised techniques to classify them.

  • Typically, we have very few examples - a supervised classifier may have trouble learning from them.

  • We also may have anomalies that are unlabeled (e.g., fraud that has gone undetected).

  • Future anomalies may look completely different than ones present in the dataset. All we know is that they will be different somehow.

Applications

  • Intrusion Detection Systems: Using OS logs, network logs, and DB logs to detect malicious activity.

  • System Performance: Resource usage logs, such as CPU load and memory usage, can help identify hardware or software issues in data centers. This generalizes to finding issues in any number of signals from sensors on electrical or mechanical devices.

  • Credit Card Fraud: Credit cards are notoriously unsafe. Unauthorized use may show different patterns, such as a buying spree from geographically obscure locations.

Applications

  • Medical Diagnosis: Unusual patterns in medical scans, such as MRI or PET scans, or in ECG time-series, may reflect disease conditions.

  • Law Enforcement: Fraud in financial transactions, trading activity, taxes, or insurance claims may be detected through unusual patterns of activity.

  • Earth Science: Spatiotemporal data about weather or land cover patterns collected through satellites or remote sensing can provide insights about illegal activities or environmental anomalies.

General Approach


  • Learn a distribution of what inlier or non-anomalous data looks like.

  • Compute a score that measures distance from this distribution. If the distribution is an actual probability distribution, just use the likelihood.

  • Find a threshold for this score below (or above) which you mark the test point as an outlier.

  • If you have some examples of anomalies, use them to test and validate the system (a sketch of the full pipeline follows).
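A minimal sketch of this pipeline, assuming a 1-D Gaussian inlier distribution and an arbitrary 1% score threshold (the data and threshold are illustrative, not from the lecture):

train <- rnorm(1000, mean = 0, sd = 1)       # assumed inlier training data
mu    <- mean(train)
sigma <- sd(train)

# Score: likelihood of a point under the fitted distribution.
score <- function(x) dnorm(x, mean = mu, sd = sigma)

# Threshold: e.g., the 1st percentile of the training scores.
threshold <- quantile(score(train), 0.01)

# Points scoring below the threshold are flagged as outliers.
is_outlier <- function(x) score(x) < threshold
is_outlier(c(0.1, 4.2))                      # typically FALSE, TRUE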

Measuring Performance


  • You will ultimately need some labeled data to measure performance. This likely means you may have to label points manually (or pay someone to), or look at actual customer complaints.

  • In general, you never want to have labeled outliers in your training set (overfitting).

  • Unlabeled outliers may sneak in - there are robust methods to handle that.

Measuring Performance


  • Stick all the labeled anomalies in the test (and dev/validation) sets.

  • Do an 80-20 (or whatever proportion) split of the assumed inliers into the train and test sets.

  • Metrics:

    • All the usual metrics from supervised learning: TPR, FPR, FNR, precision, recall, F1, ranking-based metrics, AUC. A sketch of the split follows.
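A sketch of this split with stand-in synthetic data (the data frames and the detector are hypothetical, for illustration only):

inliers   <- data.frame(x = rnorm(1000), y = rnorm(1000))    # assumed inliers
anomalies <- data.frame(x = rnorm(20, 5), y = rnorm(20, 5))  # labeled anomalies

idx   <- sample(nrow(inliers), size = 0.8 * nrow(inliers))
train <- inliers[idx, ]                      # train on assumed inliers only
test  <- rbind(inliers[-idx, ], anomalies)   # all labeled anomalies go to test
test_labels <- c(rep(0, nrow(inliers) - length(idx)), rep(1, nrow(anomalies)))

# Given 0/1 predictions `pred` from any detector on `test`:
# precision <- sum(pred == 1 & test_labels == 1) / sum(pred == 1)
# recall    <- sum(pred == 1 & test_labels == 1) / sum(test_labels == 1)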

Algorithms

Based on Unsupervised Learning

  • Distribution Approximation

    • Gaussian
  • PCA Reconstruction

  • Robust distribution approximation, such as the minimum covariance determinant (MCD)

    • Robustness to outliers in the training data

Algorithms

Based on Supervised Learning Techniques

  • One-class Support Vector Machines

    • Based on SVMs
  • Isolation Forests

    • Based on Random Forests
  • Learning algorithms on specific data types:

    • e.g., Time series (ARIMA modeling)

Multivariate Gaussian Distribution

  • Assumption: The data is distributed normally.

  • The pdf of the multivariate normal distribution is given by:

    \[ p(\vec{x};\vec{\mu},\Sigma) = \frac{1}{(2\pi)^{n/2}\cdot|\Sigma|^{1/2}}\cdot\textrm{exp}\left(\frac{-(\vec{x}-\vec{\mu})^{T}\cdot\Sigma^{-1}\cdot(\vec{x}-\vec{\mu})}{2}\right) \]

  • The notation \( |A| \) denotes the determinant of matrix \( A \).

  • This distribution is parametrized by:

    • \( \vec{\mu} \), the mean vector, and
    • \( \Sigma \), the covariance matrix.
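A direct translation of this pdf into R (illustrative; dmvnorm() from the mvtnorm package computes the same quantity):

mvn_pdf <- function(x, mu, Sigma) {
  n <- length(mu)                            # dimensionality
  diff <- x - mu
  drop(exp(-0.5 * t(diff) %*% solve(Sigma) %*% diff)) /
    ((2 * pi)^(n / 2) * sqrt(det(Sigma)))
}

mvn_pdf(c(0, 0), mu = c(0, 0), Sigma = diag(2))   # 1 / (2 * pi) ≈ 0.159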

Estimating a Multivariate Gaussian

Given data points \( \vec{x_{1}}, \vec{x_{2}}, \ldots, \vec{x_{n}} \), the mean is estimated by:

\[ \vec{\mu} = \frac{1}{n}\sum_{i=1}^{n} \vec{x_{i}}. \]

And the covariance matrix by:

\[ \Sigma = \frac{1}{n-1} \sum_{i=1}^{n} (\vec{x_{i}} - \vec{\mu})(\vec{x_{i}} - \vec{\mu})^{T}. \]
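In R, both estimates are one-liners; the data matrix X below is an illustrative stand-in:

X <- matrix(rnorm(200), ncol = 2)   # example data: 100 points, 2 features
mu <- colMeans(X)                   # sample mean vector
Sigma <- cov(X)                     # sample covariance, uses 1/(n-1) as above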

Scoring or Distance Function

  • Naive Idea 1: How about we use the distance between the test point and the mean, i.e. \( ||\vec{x} - \vec{\mu}|| \)?

    • It has the nice property that the score is \( 0 \) at the mean and grows as we move further away.
    • Problem: absolute distance does not take into account differences in the variance of the data along different directions.

[Figure: Correlated data]

Scoring or Distance Function


  • Refined Idea 2: How about we just use the pdf?

    • Monotonically decreases as we move away from the mean.
    • Takes the directions of variance into account via the covariance matrix.
    • Nice probabilistic interpretation: the probability of seeing an inlier in a small neighborhood around the observed point.

Scoring or Distance Function

  • Problem: The scaling seems arbitrary. The value of the pdf at the mean of the data can be made arbitrarily small or large by changing the variance, so raw density values are hard to threshold consistently.

[Figure: Noise vs Outliers]

Solution: Combine the two ideas

  • Compute the distance from the mean, scaled by the standard deviation along the appropriate direction.

  • This is called the Mahalanobis distance.

  • Can be easily converted to a probability.

    • All \( n \)-dim points at Mahalanobis distance \( d \) have the same density, regardless of the shape of \( \Sigma \); under the Gaussian assumption, \( D_{M}^{2} \) follows a chi-squared distribution with \( n \) degrees of freedom.

\[ D_{M}(\vec{x}) = \sqrt{(\vec{x}-\vec{\mu})^{T}\Sigma^{-1}(\vec{x}-\vec{\mu})}. \]
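Base R computes squared Mahalanobis distances directly, and the chi-squared distribution of the squared distance gives the probability conversion mentioned above. A sketch reusing mu and Sigma from the earlier estimates; the 0.975 cutoff is an assumption, not a rule:

d2 <- mahalanobis(X, center = mu, cov = Sigma)   # squared distances
D  <- sqrt(d2)                                   # Mahalanobis distances
# Squared distances are chi-squared with ncol(X) degrees of freedom under
# the Gaussian assumption, so a quantile cutoff flags candidate outliers:
outliers <- d2 > qchisq(0.975, df = ncol(X))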

Mahalanobis Distance


  • For scalar (\( 1 \)-dim) values, this is the absolute value of the familiar z-score.

[Figure: Correlated data]

PCA Reconstruction


  • Assumptions:

    • Inliers lie near a low-dimensional subspace of the feature space.

    • Outliers show variation outside this low-dimensional subspace.

    • Inliers constitute a large majority of the data, and thus the directions with the largest variance span the inlier subspace.

PCA Reconstruction


  • Approach:

    • Run PCA and find the top \( k \) principal components that account for most of the variance. Typically \( k \ll n \).

    • Project the data point \( \vec{x} \) onto the top \( k \) components and call this projection \( \vec{x'} \).

    • Compute the norm of the projection error, \( ||\vec{x} - \vec{x'}|| \), as sketched below.
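A sketch of this procedure with prcomp(); the data, lying near a 1-D subspace plus one off-subspace point, is illustrative:

set.seed(1)
z <- rnorm(200)
X <- cbind(z, 2 * z + rnorm(200, sd = 0.1))   # points near the line y = 2x
X <- rbind(X, c(0, 3))                        # one off-subspace point

pca <- prcomp(X, center = TRUE)
k <- 1                                        # keep the top k components
proj  <- pca$x[, 1:k, drop = FALSE] %*% t(pca$rotation[, 1:k, drop = FALSE])
recon <- sweep(proj, 2, pca$center, `+`)      # map back, undoing the centering
err <- sqrt(rowSums((X - recon)^2))           # ||x - x'|| for each point
which.max(err)                                # the appended point stands out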

PCA Reconstruction


[Figure: Correlated data]

Distance Function


  • Can use the Mahalanobis distance, as in the previous case.

  • But, by design, we assume that the variance in the error subspace is small.

    • The error subspace is the subspace that remains after projecting out the top \( k \) components.

  • Thus our covariance estimates along those directions are likely unstable anyway, so those directions can be ignored.

  • Frequently, the norm of the projection error, \( ||\vec{x} - \vec{x'}|| \), is used instead.

Minimum Covariance Determinant (MCD)


  • Problem with the two previous approaches: The sample data may (and probably does) contain some outliers (anomalies).

    • We know that estimates of the mean and covariance are sensitive to outliers.

    • This affects estimation of the normal distribution's parameters, as well as of the principal components.

  • Assumption: The inliers are normally distributed and constitute at least half the data points.

Minimum Covariance Determinant (MCD)

  • Approach:

    • Find the central mode of the data and ignore all points outside it.
    • Fit a Gaussian distribution (our first approach) to only these central-mode points.
    • Construct the Mahalanobis distance using the Gaussian defined above.

[Figure: MCD]

Minimum Covariance Determinant (MCD)


  • Computing the central mode:

    • Let \( h \geq \frac{m}{2} \) be the estimated number of inlier points, where \( m \) is the total number of points in the dataset.
    • It is better to underestimate \( h \) than to overestimate it.
    • Find the subset \( D \) of \( h \) points that minimizes the determinant of the covariance matrix computed from them (see the sketch below).
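A sketch using covMcd() from the robustbase package (assumed installed; MASS::cov.rob(X, method = "mcd") is an alternative), reusing the matrix X from the earlier sketches:

library(robustbase)
fit <- covMcd(X, alpha = 0.5)    # alpha ≈ h/m, the assumed inlier fraction
d2 <- mahalanobis(X, center = fit$center, cov = fit$cov)
outliers <- d2 > qchisq(0.975, df = ncol(X))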

On to Notebook

Break?


Download Anomaly.Rmd

Also, download german.data

One-class Support Vector Machine


  • An SVM classifier tries to find a separating hyperplane that maximizes the margin between two classes.

  • When used with a kernel, this boundary can take on various complex shapes (depending on the kernel).

  • A one-class SVM instead aims to enclose the distribution of inlier points by separating them from the origin (in feature space) with maximum margin.

  • This results in a tight boundary around observed inlier points.

  • Schölkopf, Bernhard, et al. "Estimating the support of a high-dimensional distribution." Neural Computation 13.7 (2001): 1443-1471.
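A minimal sketch with the e1071 package (assumed installed), training on assumed-inlier data only; the nu value and the reuse of X are illustrative:

library(e1071)
oc <- svm(X, y = NULL, type = "one-classification",
          kernel = "radial", nu = 0.05)   # nu ≈ expected outlier fraction
pred <- predict(oc, X)                    # TRUE = inlier, FALSE = outlier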

One-class Support Vector Machine


Isolation Forest


  • Based on Random Forests (as the name suggests).

  • Pick a random subsample of points and grow a tree as follows:

    • Randomly pick a feature, pick a random split point between the min and max values of that feature, and partition the data accordingly.
    • At each resulting node, pick another random feature and another random split over the remaining points.
    • Continue until each leaf node contains a single isolated data point.

Isolation Forest


  • Grow multiple such trees with random subsets of the data.

  • The average path length (over all trees) from root to leaf is a measure of normality and will be our decision metric.

  • Random partitioning produces noticeably shorter paths for anomalies.


    Liu, Fei Tony, et al. "Isolation Forest." ICDM 2008: Eighth IEEE International Conference on Data Mining.
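A sketch using the isotree package (assumed installed; the solitude package is an alternative implementation); the 0.6 cutoff is a common heuristic, not a rule:

library(isotree)
iso <- isolation.forest(as.data.frame(X), ntrees = 100)
scores <- predict(iso, as.data.frame(X))  # in [0, 1]; higher = more anomalous
outliers <- scores > 0.6                  # scores well above 0.5 are suspect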

Isolation Forest

[Figure: Isolation Forest]