FCDD - Explainable Anomaly Detection

FCDD achieves SOTA performance on MVTec-AD and comparable performance on other standard anomaly detection datasets, while providing inbuilt explanations.

Introduction

“Anomaly detection (AD) is the task of identifying anomalies in a corpus of data.” Every time I read this, I get the Among Us image in my head. If you aren’t familiar, it is a game where an impostor (usually one, sometimes more) hides among a group of regular characters, and the regular characters have to figure out which one of them is the impostor (the anomaly).

Among Us

Prior Work

There are many real-life applications where we look for anomalies: finding cancerous tumors, detecting cracks inside metal bars, and fraud detection are a few of the most common ones. For the first two applications, the task of AD can take two forms - Anomaly Classification and Anomaly Segmentation. An anomaly detector could simply look at a CT scan image and output a label of anomalous (contains malignant growth) or normal, which is essentially a binary classification problem. The detector could further segment the malignant pixels in the CT scan image to predict a mask of the anomalous pixels, which is the task of Anomaly Segmentation. Several solutions have been developed for these problems over the years, and deep learning based approaches in particular have become popular in recent years.

Prior Work

One set of approaches uses autoencoders. These are usually trained on a nominal (no anomalies present) dataset and then used to reconstruct anomalous samples, which they are expected to reconstruct poorly. The drawback, of course, is that known ground-truth anomalies cannot be incorporated during the training process itself. The advantage is that the reconstruction error can be used as an anomaly score and the pixel-wise difference directly as an anomaly heatmap, providing a natural explanation without needing explicit attention mechanisms in the model. Some recent works have nevertheless incorporated attention into the autoencoders themselves as an explanation mechanism, providing an alternative to this natural solution.
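
To make the reconstruction-error idea concrete, here is a minimal PyTorch sketch. The tiny untrained autoencoder is only a stand-in of my own; in practice it would be a real model trained on nominal data only.

```python
import torch
import torch.nn as nn

# Illustrative stand-in only: any trained convolutional autoencoder would do.
autoencoder = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 3, 3, padding=1),
)

x = torch.rand(4, 3, 64, 64)              # a batch of test images
recon = autoencoder(x)                    # reconstruction of the input
heatmap = (x - recon).abs().mean(dim=1)   # per-pixel error acts as the explanation heatmap
score = heatmap.flatten(1).mean(dim=1)    # mean error acts as the image-level anomaly score
```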

One-class classification methods map nominal data to a concentrated region of feature space and map anomalous data elsewhere. While this training is unsupervised, these methods usually rely on a separate attention mechanism or on model-agnostic explanation methods to generate explanations, rather than having them inbuilt.

The best performing methods use self-supervision. These methods apply some transformation to nominal samples, train a model to predict which transformation was applied, and derive an anomaly score from the confidence of that prediction. The main advantage is that they provide good results while not requiring supervision. The disadvantage is that they are not inherently explainable yet, meaning we cannot use them for sensitive applications like healthcare.

In this particular paper, the authors adopt a one-class classification approach and add innovations that give it inbuilt explainability.

Explainable Deep One-class classification with FCDD

FCDD method

Fully Convolutional architecture

The paper adopts a fully convolutional network (FCN) architecture with alternating convolutional and pooling layers. The datasets used in the paper are Fashion-MNIST, CIFAR-10, ImageNet, MVTec-AD and Pascal VOC. For each dataset a slightly different (larger or smaller) network is used, but the overall fully convolutional nature of the networks remains the same.

Model Architectures for different datasets
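
As a rough illustration of what “fully convolutional” means here, this is a toy PyTorch network in that spirit. It is not one of the paper’s actual architectures; the layer sizes are made up.

```python
import torch
import torch.nn as nn

# Toy FCN: only convolution and pooling layers, so the output is a
# low-resolution, single-channel heatmap that preserves the spatial
# layout of the input.
fcn = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.LeakyReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.LeakyReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 1, 1),  # final 1x1 convolution produces the (u x v) heatmap
)

x = torch.rand(1, 3, 224, 224)
print(fcn(x).shape)  # torch.Size([1, 1, 56, 56]) -- smaller than the input
```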

Another important property of a convolutional layer is that each pixel in the output depends on only a small region of the input, because the output values are calculated by moving a small kernel (filter) across the input. This small region is known as the output pixel’s receptive field. This property allows the networks used here to preserve spatial information.
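
For readers who want to compute this themselves, the receptive field of a stack of convolution/pooling layers follows a standard recurrence. The helper below is my own illustration, and the layer configuration in the example is made up.

```python
def receptive_field(layers):
    """Compute the receptive field size of stacked conv/pool layers.
    `layers` is a list of (kernel_size, stride) pairs, ordered from input to
    output. Uses the standard recurrence: r += (k - 1) * j, then j *= s."""
    r, j = 1, 1  # receptive field size and cumulative stride ("jump")
    for k, s in layers:
        r += (k - 1) * j
        j *= s
    return r

# Hypothetical example: three 3x3 convolutions, each followed by 2x2 max pooling
print(receptive_field([(3, 1), (2, 2), (3, 1), (2, 2), (3, 1), (2, 2)]))  # 22
```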

The authors analyse the effect of the receptive field size on the final heatmap produced. In the figure below, which shows this analysis for a small set of examples from MVTec-AD, we can see that smaller receptive fields yield heatmaps that are more concentrated and closer to the shapes in the ground-truth heatmaps.

Sensitivity Analysis for receptive field size for MVTec-AD

Training objective based on a Hypersphere Classifier

With the given FCN architecture, the next task is to decide on a training objective. If X1, . . . , Xn denote a collection of samples and y1, . . . , yn are labels where yi = 1 denotes an anomaly and yi = 0 denotes a nominal sample, then the objective function shown in the image below maps nominal samples near the centre of an origin-centered hypersphere and maps anomalies away from it. Note from the expression of the objective that for an anomaly (yi = 1) the first term of the summation vanishes, and for a nominal sample (yi = 0) the second term vanishes. By optimizing over the weights W of the neural network ɸ, the objective thus maximizes the anomaly score derived from the output heatmap for anomalies and minimizes it for nominal samples. This is what produces the hypersphere mapping described above.

Formal definition of the Objective Function
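
As a rough PyTorch rendering of this objective (a sketch based on my reading of the description above, not the authors’ reference implementation):

```python
import torch

def hypersphere_loss(phi_out, y, eps=1e-9):
    """Sketch of the hypersphere-classifier style objective described above.
    phi_out: FCN output heatmap of shape (n, u, v)
    y:       labels of shape (n,), 1 = anomaly, 0 = nominal
    """
    # Per-pixel anomaly scores A(X) via a pseudo-Huber transform of the output
    a = torch.sqrt(phi_out ** 2 + 1) - 1
    # Average score per sample: the 1/(u*v) normalisation over output pixels
    score = a.flatten(1).mean(dim=1)
    # Nominal samples (y = 0): pull the score towards zero (the hypersphere centre)
    nominal_term = (1 - y) * score
    # Anomalies (y = 1): push the score away from zero
    anomaly_term = -y * torch.log(1 - torch.exp(-score) + eps)
    return (nominal_term + anomaly_term).mean()
```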

For visualisation purposes only, consider this unit hypersphere centered at the origin. The coloured patches on the surface of the sphere can be thought of as concentrated representations of data points (anomalies in our case), while all the nominal samples can be assumed to sit at the centre of the hypersphere. This is what the objective function tries to achieve. Note that in practice the distribution of anomalies and nominal samples may look different from this exact picture; the figure is only intended as a visual aid for understanding the concept.

Example Hypersphere

Gaussian Kernel based upsampling

From the formal description of the network, we can see that it maps an input image of size (h x w) to an output heatmap of size (u x v), where the output is smaller (lower resolution) than the input. While low-resolution heatmaps may suffice for benchmarks and other quantitative evaluations, a human needs an upsampled heatmap to interpret the output. For most datasets there is no access to ground-truth heatmaps (MVTec-AD is an exception), so the upsampling mapping cannot be learned in a supervised fashion. Instead, a clever mathematical manoeuvre based on the receptive-field property of the FCN is used.

For every pixel in the output heatmap A(X), there is a unique input pixel at the center of its receptive field. It is an empirical observation that the influence of an output pixel decays in a Gaussian manner as one moves away from the center of its receptive field. This fact is used in the paper: A(X) is upsampled using a strided transposed convolution with a fixed Gaussian kernel, whose μ parameter (mean) is placed at the input pixel at the center of the receptive field for a given output pixel. The 𝝈 parameter (standard deviation) is set based on empirical testing to see which value generates an explainable heatmap. As can be seen from the image below, which shows this empirical analysis for a few inputs from the MVTec-AD dataset, a 𝝈 value that is neither too high nor too low works well for most cases when compared with the ground-truth heatmaps.

Sensitivity Analysis for 𝝈 value of Gaussian Kernel for MVTec-AD

For visualisation purposes, this is what a 2D Gaussian distribution looks like when plotted.

2D Gaussian Distribution

The next figure shows how the transposed convolution operation works, using a 2 x 2 kernel (not a Gaussian one, just an arbitrary kernel for illustration purposes).

Example of Transposed Convolution operation

And this figure from the paper shows the effect of both ideas combined: a 3 x 3 convolution followed by a 3 x 3 transposed convolution with a Gaussian kernel, both using a stride of 2.

Transposed convolution based upsampling for FCDD output
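
A minimal PyTorch sketch of this upsampling step might look as follows. The kernel size, stride, and 𝝈 below are placeholder values of my own; in the paper they are tied to the FCN’s receptive field and chosen empirically.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(size, sigma):
    """Fixed 2D Gaussian kernel of shape (1, 1, size, size)."""
    coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    kernel = torch.outer(g, g)
    return (kernel / kernel.sum()).view(1, 1, size, size)

def upsample_heatmap(a, kernel_size=16, stride=8, sigma=8.0):
    """Upsample a low-resolution heatmap A(X) of shape (n, 1, u, v) with a
    strided transposed convolution using a fixed (non-learned) Gaussian kernel.
    Kernel size and stride would be matched to the FCN's receptive field and
    effective stride; the defaults here are placeholders."""
    kernel = gaussian_kernel(kernel_size, sigma).to(a.device)
    return F.conv_transpose2d(a, kernel, stride=stride)

# Example: a 28x28 heatmap becomes (28 - 1) * 8 + 16 = 232 pixels wide
a = torch.rand(1, 1, 28, 28)
print(upsample_heatmap(a).shape)  # torch.Size([1, 1, 232, 232])
```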

Semi-supervised FCDD for MVTec-AD

The most exciting thing about this entire approach is that it is unsupervised and does not require ground-truth heatmaps to produce output heatmaps. But if ground-truth heatmaps are available, they can be provided along with the input, turning training into a semi-supervised setting. To do this, the objective is slightly modified. If X1, . . . , Xn denote an input batch and Y1, . . . , Yn denote the corresponding ground-truth heatmaps, we replace the denominator of the objective function (u . v) by (h . w), replace (yi) by (Yi) in the numerator, and sum over the pixels j from 1 to (h . w) of Yi. This lets us train the model with a pixel-wise objective. Using even a few of these ground-truth explanations improves performance significantly, achieving up to 0.99 pixel-wise mean AUC for some of the classes in MVTec-AD and almost always outperforming the unsupervised scores by a few points.

Modified objective function
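
Continuing the earlier loss sketch, one possible reading of this pixel-wise variant looks roughly like the following. This is my own interpretation of the description above (including the guard for purely nominal images), not the paper’s exact formulation.

```python
import torch

def pixelwise_loss(a_upsampled, masks, eps=1e-9):
    """Pixel-wise variant of the hypersphere loss when ground-truth anomaly
    masks are available (a sketch following the text above).
    a_upsampled: upsampled heatmaps A'(X) of shape (n, h, w)
    masks:       ground-truth masks Y of shape (n, h, w), 1 = anomalous pixel
    """
    hw = a_upsampled.shape[1] * a_upsampled.shape[2]
    # Nominal pixels: push their averaged scores towards zero
    nominal_term = ((1 - masks) * a_upsampled).flatten(1).sum(dim=1) / hw
    # Anomalous pixels: push their averaged scores away from zero
    anomaly_score = (masks * a_upsampled).flatten(1).sum(dim=1) / hw
    # My own guard: images with all-zero masks contribute only the first term
    has_anomaly = masks.flatten(1).any(dim=1).float()
    anomaly_term = -has_anomaly * torch.log(1 - torch.exp(-anomaly_score) + eps)
    return (nominal_term + anomaly_term).mean()
```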

Another thing to note for datasets like MVTec-AD, where anomalies are subtle, is that synthetic anomalies are also generated using confetti-like noise, which inserts coloured blobs into images to reflect the general nature of anomalies. This helps the overall training process.

Confetti Noise
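
As a rough illustration of the confetti-noise idea (the authors’ exact generation procedure may differ, and the blob sizes and counts below are made up):

```python
import numpy as np

def add_confetti_noise(img, n_blobs=8, max_size=8, rng=None):
    """Insert a few small, randomly coloured rectangular blobs into an image,
    as a rough stand-in for confetti noise. img: (H, W, 3) uint8 array."""
    rng = rng or np.random.default_rng()
    out = img.copy()
    h, w, _ = out.shape
    for _ in range(n_blobs):
        bh, bw = rng.integers(2, max_size + 1, size=2)
        y = rng.integers(0, h - bh)
        x = rng.integers(0, w - bw)
        out[y:y + bh, x:x + bw] = rng.integers(0, 256, size=3)  # random colour
    return out

# Example usage on a synthetic grey image
noisy = add_confetti_noise(np.full((64, 64, 3), 128, dtype=np.uint8))
```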

Overall, FCDD achieves SOTA performance on MVTec-AD and comes very close to SOTA on the other datasets the paper experiments with.

Clever Hans

The FCDD approach, where heatmaps are generated directly as output explanations, has also led to an interesting by-product observation. Images sometimes contain watermarks, and it has been observed that deep one-class classifiers latch onto such artifacts instead of the true image features. For example, in the PASCAL VOC dataset, some of the images with horses have a watermark in the lower left corner, and a classifier trained on this dataset identifies the watermark as the class-identifying pattern. This is called the “Clever Hans” effect, after the horse Hans, which appeared to answer maths problems correctly but was actually reading subtle cues from its master.

To test whether FCDD is vulnerable to the Clever Hans effect, the authors invert the experimental setting by treating the “horse” class as anomalous. This, of course, implies that the anomaly heatmaps produced by FCDD should highlight the horses. Any other highlighting in the produced heatmaps would readily reveal that the model is characterizing the class based on spurious features.

The heatmaps produced by FCDD indeed reveal the vulnerabilities of the deep one-class classifier, which learns features like the watermarks in the lower left corner, as well as structures like bars and fences, instead of learning horse features.

Clever Hans effect

This particular example underlines the robustness and interpretability of the FCDD approach, as it allows a trained practitioner to look at the inbuilt explanations and understand and mitigate possible causes of failure. This makes it a favorable method, for now, for answering the question of whether something is an anomaly or not.

References

  • The image of the hypersphere is adapted from Liu, Weiyang, et al. ‘SphereFace: Deep Hypersphere Embedding for Face Recognition’. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2017, pp. 6738–46. https://doi.org/10.1109/CVPR.2017.713.
  • The image of the 2D Gaussian distribution is adapted from this website
  • The images of the transposed convolution examples are adapted from this website