The paper “Probabilistic reasoning for streaming anomaly detection” from MIT CSAIL proposed a framework for performing online anomaly detection on univariate data. Unfortunately, most real-world data are multivariate, which calls for more research into online anomaly detection on multivariate data. Inspired by their work, we have extended their framework to support multivariate data, with some optimizations to build a scalable system.
One might ask why we chose this paper [1] for our study. To the best of our knowledge, their work provides a simple framework, based on basic statistics, for real-time anomaly detection on a stream. Furthermore, I have previously used a derivative of their work successfully to detect breaking news in an aggregated news stream.
This blog post begins by introducing anomaly detection, then discusses the original paper [1], and finally describes our extensions of the existing work to handle multivariate data streams. The new formulation depends on an online version of the covariance matrix, so we provide an implementation of the online covariance matrix alongside an online inverse covariance matrix based on the Sherman–Morrison formula. We also provide the mathematical formulation and source code.
My implementation of the original paper and of our modified algorithm, which is the subject of this blog post, can be found at the following links. Accompanying source code: [withheld due to double-blind requirements]
Furthermore, we provide a set of detailed experiments on the proposed algorithms in different realistic scenarios. However, we keep the statistical framework of the original paper [1], as it is already tested. We will not fall into the trap of turning this post into a survey paper. Hence, we discuss only a few interesting developments in this space, and the post is not meant to be exhaustive. It focuses on statistical models, so neural networks and their variants fall outside its scope and are not discussed in any depth. For more information, see the paper.
Anomaly detection is the task of identifying patterns that depict abnormal behavior. Therefore, the notion of normal behavior has to be quantified objectively. The task goes by several names, such as outlier detection, novelty detection, noise detection, and deviation detection. We treat these names as equivalent and use them interchangeably for the remainder of this post. Outliers can arise from human error, equipment error, and faulty systems. Anomaly detection is well suited to unbalanced data, where the goal is to predict the behavior of the minority class. Applications include detecting loan defaults, fraud detection, and network intrusion detection, among others.
There are different types of anomaly, discussed as follows.
An anomaly detection algorithm can be aimed at identifying outliers in any one, or a combination, of the following signal changes: abrupt transient shift, abrupt distributional shift, and gradual distributional shift [1], labeled “A”, “B”, and “C” respectively.
Online algorithms are useful for real-time applications, as they operate incrementally which is ideal for analyzing the data streams. These algorithms incrementally receive input and make a decision based on an updated parameter that conveys the current state of the data stream. This philosophy contrasts with offline algorithms that assume the entire data is available in memory. The issue with an offline algorithm is that the data may not fit in memory. The online algorithm should be both time and space-efficient.
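As a small illustration of the incremental style described above, the sketch below keeps a running mean and variance in constant memory. Welford's method is used here purely as an example of an online update; it is my choice for illustration and not something the original paper prescribes, and the class name `RunningStats` is hypothetical.

```python
class RunningStats:
    """Online mean and variance (Welford's method), O(1) memory per stream."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        # Fold one new observation into the state; no history is stored.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance; 0.0 until at least one point has been seen.
        return self.m2 / self.n if self.n > 0 else 0.0


stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(x)
print(stats.mean, stats.variance)
```

An offline version would need the whole list in memory before computing either statistic; here each point is seen once and then discarded.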
Anomaly detection algorithms may work in diagnosis or accommodation mode [2]. The diagnosis method identifies the outlier so that it can be processed further, and removes it from the data sample so it does not skew the distribution. This is useful when the exact parameters of the distribution are known, because the outlier is then excluded from further estimation of those parameters [2]. The accommodation method identifies the outliers but still uses them for estimating the parameters of the statistical model. This is suitable for data streams subject to concept drift [3].
The Exponentially Weighted Moving Average (EWMA) is ideal for keeping a set of running moments of the data stream, but it has limitations that led the authors to introduce the Probabilistic Exponentially Weighted Moving Average (PEWMA). A single slide from my presentation clarifies the difference between the two algorithms (EWMA and PEWMA) in context.
The PEWMA algorithm [1] works in accommodation mode. It allows for concept drift [3] in data streams by updating the set of parameters that convey the state of the stream. PEWMA [1] is suitable for detecting anomalies under an abrupt transient shift, where EWMA fails.
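To see why a transient spike is awkward for plain EWMA, here is a minimal sketch of the standard update $\mu_{t+1} = \alpha \mu_t + (1 - \alpha) X_t$. The smoothing value `alpha=0.9` and the toy stream are assumptions chosen only to make the effect visible.

```python
def ewma(stream, alpha=0.98):
    """Plain EWMA: mu_{t+1} = alpha * mu_t + (1 - alpha) * x_t."""
    mu = stream[0]
    history = [mu]
    for x in stream[1:]:
        mu = alpha * mu + (1.0 - alpha) * x
        history.append(mu)
    return history


# A flat stream with a single transient spike at index 5.
stream = [10.0] * 5 + [100.0] + [10.0] * 5
means = ewma(stream, alpha=0.9)
print([round(m, 2) for m in means])
```

Every point receives the same weight, so the single spike drags the running mean upward and the estimate stays above the true level for several steps afterwards.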
The parameters of the anomaly detection algorithm are: $X_{t}$ the current data point, $\mu_{t}$ the mean of the data, $\hat{X_{t}}$ the current estimate of the mean, $\hat{\alpha_{t}}$ the current standard deviation, $P_{t}$ the probability density function, $\hat{X_{t+1}}$ the estimated mean for the next data point (incremental aggregate), $\hat{\alpha_{t+1}}$ the next standard deviation (incremental aggregate), $T$ the data size, and $t$ a point in $T$. Note that $\hat{\alpha_{t}}$ denotes a standard deviation and is distinct from the weighting parameter $\alpha$. Initialize the process with the first data point by setting $s_{1} = X_{1}$ and $s_{2} = X_{1}^{2}$.
The processed data is fed to the anomaly detection algorithm with the parameters $\alpha = 0.98$, $\beta = 0.98$, and $\tau = 0.0044$. The threshold $\tau$ is approximately the standard normal density at 3 standard deviations from the mean, so it flags points that lie more than 3 standard deviations away in normally distributed data. PEWMA in the original paper was designed for point anomalies.
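The following is a rough sketch of the univariate PEWMA loop, based on my reading of the update rule in [1]: the weight on history is $1 - 1/t$ during training and $\alpha (1 - \beta P_t)$ afterwards. The function name, the training length of 30, and the synthetic stream are assumptions for illustration, and returning a density of zero before any spread has been observed is my own convention.

```python
import math


def pewma_densities(stream, alpha=0.98, beta=0.98, train_len=30):
    """Return P_t for each point, scored against the estimates made before it."""
    s1, s2 = stream[0], stream[0] ** 2           # s_1 = X_1, s_2 = X_1^2
    mean, std = s1, 0.0
    densities = [None]                           # the first point only initializes
    for t, x in enumerate(stream[1:], start=2):
        if std > 0.0:
            z = (x - mean) / std
            p = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        else:
            p = 0.0                              # assumption: no spread estimate yet
        densities.append(p)
        # Training uses a plain running average; afterwards the weight on
        # history depends on how probable the new point is.
        a = 1.0 - 1.0 / t if t <= train_len else alpha * (1.0 - beta * p)
        s1 = a * s1 + (1.0 - a) * x
        s2 = a * s2 + (1.0 - a) * x * x
        mean = s1
        std = math.sqrt(max(s2 - s1 * s1, 0.0))  # guard against rounding below zero
    return densities


tau, train_len = 0.0044, 30
stream = [9.0 if i % 2 == 0 else 11.0 for i in range(60)]
stream[45] = 50.0                                # synthetic transient spike
densities = pewma_densities(stream, train_len=train_len)
flagged = [i for i, p in enumerate(densities) if i >= train_len and p < tau]
print(flagged)
```

Points are only flagged after the training phase, because the early estimates of the standard deviation are too unstable to threshold against.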
A hypothesis is a subjective intuition about the problem. This can be guided by current best practices or transferable skills from adjacent domains. These forms of educated guesses have to be empirically verified to allow your preconceived intuitions to be checked against reality. Let us look at some examples of hypotheses:
import math

def cumfunc(mean, sigma, xval):
    """
    @summary: cumulative distribution function (CDF) of a normal distribution,
              i.e. the area under the curve to the left of xval.
    """
    z = (xval - mean) / (sigma * math.sqrt(2))
    y = 0.5 * (1 + math.erf(z))
    return y

if __name__ == '__main__':
    mean = 80; sigma = 15
    x = 60
    res = cumfunc(mean, sigma, x)      # P(score < 60)
    print(round(res, 2))               # 0.09
    x = 90
    res = 1 - cumfunc(mean, sigma, x)  # P(score > 90)
    print(round(res, 2))               # 0.25
Let us provide the source code for visualizing the probability of the events described in the code snippet.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

def contour(plt, score, x_axis, y_axis, colour, lessthan=True):
    # Shade the area under the curve to the left (or right) of score.
    mask = x_axis <= score if lessthan else x_axis > score
    plt.fill_between(x_axis[mask], 0, y_axis[mask], facecolor=colour)

if __name__ == '__main__':
    mean = 80; sd = 15
    # x_axis must be defined before it is passed to norm.pdf
    x_axis = np.arange(10, 140, 0.001)
    y_axis = norm.pdf(x_axis, mean, sd)
    plt.plot(x_axis, y_axis); plt.xlabel('Scores'); plt.ylabel('PDF')
    plt.title("Gaussian Distribution with Mean: {} and STD: {}".format(mean, sd))
    colour = '#4dac26'; score = 60
    contour(plt, score, x_axis, y_axis, colour, lessthan=True)
    colour = '#f1b6da'; score = 90
    contour(plt, score, x_axis, y_axis, colour, lessthan=False)
    plt.show()
The probability of events (score < 60 and score > 90) is captured by the area of the shaded regions.
If the class average is 80 with a standard deviation of 15, the probability that a student scores less than 0 or greater than 120 is minuscule. Events that produce such outrageous scores can be regarded as anomalies. The scores used in our examples act as thresholds. The areas depicting these probabilities can be seen in our chart.
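As a quick check of how small these tail probabilities are, the sketch below evaluates the normal CDF at 0 and 120 for the same class statistics. The helper name `normal_cdf` is introduced here only for illustration.

```python
import math


def normal_cdf(x, mean, sigma):
    """P(X <= x) for X ~ Normal(mean, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))


mean, sigma = 80.0, 15.0
p_below_0 = normal_cdf(0.0, mean, sigma)
p_above_120 = 1.0 - normal_cdf(120.0, mean, sigma)
print("P(score < 0)   = {:.2e}".format(p_below_0))
print("P(score > 120) = {:.2e}".format(p_above_120))
```

A score of 120 is about 2.7 standard deviations above the mean, while a score of 0 is more than 5 standard deviations below it, which is why the second tail is so much thinner than the first.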
The rule of thumb for hypothesis testing can be summarized as follows:
Our contribution begins here. We simplify the algorithm by ignoring the details of evolutionary computation in the paper. I took the paper's premise of evolution, moving from one generation to the next, as equivalent to moving from one state to another. This is analogous to how online algorithms handle dynamic changes as new data enters the stream. Cholesky decomposition is used extensively in the algorithms. The paper provided the basis for the online covariance matrix used in this work.
The mathematical formulation can be found here.
import math
import numpy as np

def updateCovariance(alpha, beta, C_t, A_t, z_t):
    """
    @param: alpha, beta, A_t are parameters of the model
    @param: C_t is the old covariance matrix, z_t as the new data vector
    @return: C_tplus1 is updated covariance matrix
    """
    v_t = np.dot(A_t, z_t.T)  # column vector, assuming z_t has shape (1, d)
    C_tplus1 = (alpha * C_t) + (beta * np.matmul(v_t, v_t.T))
    print ("v_t: {}, C_tplus1: {}".format(v_t.shape, C_tplus1.shape))
    return C_tplus1

def updateCholeskyFactor(alpha, beta, A_t, z_t):
    """
    @param: alpha, beta, A_t are parameters of the model
    @param: z_t as new data vector
    @return: A_tplus1 is updated Cholesky factor
    """
    v_t = np.dot(A_t, z_t.T)
    # The squared norm is needed so that A_tplus1 * A_tplus1^T equals
    # alpha * C_t + beta * v_t * v_t^T whenever A_t * A_t^T equals C_t.
    sq_norm_z = np.linalg.norm(z_t) ** 2
    x = math.sqrt(alpha) * A_t
    w = beta * sq_norm_z / alpha
    y = math.sqrt(alpha) * (math.sqrt(1 + w) - 1) * np.dot(v_t, z_t) / sq_norm_z
    A_tplus1 = x + y
    print ("A_t: {}, A_tplus1: {}".format(A_t.shape, A_tplus1.shape))
    return A_tplus1
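One way to sanity-check the factor update is to confirm numerically that the updated factor reproduces the rank-one covariance update, that is $A_{t+1} A_{t+1}^T = \alpha C_t + \beta v_t v_t^T$ with $v_t = A_t z_t^T$. The sketch below does this on a small random example; the dimension, seed, and the values of `alpha` and `beta` are arbitrary assumptions, and the factor update is written with the squared norm of $z_t$, which is what the identity requires.

```python
import numpy as np

rng = np.random.default_rng(0)
d, alpha, beta = 4, 0.9, 0.1                 # illustrative values only

M = rng.normal(size=(d, d))
C_t = M @ M.T + d * np.eye(d)                # a symmetric positive definite matrix
A_t = np.linalg.cholesky(C_t)                # A_t @ A_t.T == C_t
z_t = rng.normal(size=(1, d))                # row vector of new data

v_t = A_t @ z_t.T                            # column vector, shape (d, 1)
C_next = alpha * C_t + beta * (v_t @ v_t.T)  # rank-one covariance update

sq_norm = np.sum(z_t ** 2)                   # squared Euclidean norm of z_t
scale = np.sqrt(alpha) * (np.sqrt(1.0 + beta * sq_norm / alpha) - 1.0) / sq_norm
A_next = np.sqrt(alpha) * A_t + scale * (v_t @ z_t)

print(np.allclose(A_next @ A_next.T, C_next))
```

Note that the updated factor is generally no longer lower triangular; it is a factor in the sense that its product with its own transpose gives the covariance.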
The mathematical formulation can be found here.
Let us fix $\hat{v_t} = \frac{\beta * v_t}{\alpha}$, with $v_t = A_t z_t^T$ treated as a column vector. The resulting simplification using the Sherman–Morrison formula reduces the expression to
\[\begin{equation} C_{t+1}^{-1} = \frac{1}{\alpha} * \left({{C_t}^{-1}} - \frac{{{C_t}^{-1}} * \hat{v_t} * {v_t}^T * {{C_t}^{-1}}}{1 + ({\hat{v_t}}^T * {{C_t}^{-1}} * v_t)} \right) \end{equation}\]
The implementation can be found here.
import numpy as np
from numpy.linalg import multi_dot

def updateInverseCovariance(alpha, beta, invC_t, A_t, z_t):
    """
    @param: alpha, beta, A_t are parameters of the model
    @param: invC_t is the old inverse covariance matrix, z_t as the new data vector
    @return: invC_tplus1 is updated inverse covariance matrix
    """
    print ("A_t: {}, z_t: {}".format(A_t.shape, z_t.shape))
    v_t = np.dot(A_t, z_t.T)
    hat_vt = (beta * v_t) / alpha
    print ("invC_t: {}, hat_vt: {}, v_t: {}, invC_t: {}".format(invC_t.shape, hat_vt.shape, v_t.shape, invC_t.shape))
    y = multi_dot([invC_t, hat_vt, v_t.T, invC_t]) / (1 + multi_dot([hat_vt.T, invC_t, v_t]))
    invC_tplus1 = (invC_t - y) / alpha
    print ("invC_tplus1: {}".format(invC_tplus1.shape))
    return invC_tplus1
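The rank-one inverse update computed by `updateInverseCovariance` can be checked against a direct matrix inversion. The sketch below is a small numerical check on random data; the dimension, seed, and parameter values are assumptions picked for illustration, and $v_t$ is drawn at random instead of being computed from a Cholesky factor.

```python
import numpy as np

rng = np.random.default_rng(1)
d, alpha, beta = 4, 0.9, 0.1                  # illustrative values only

M = rng.normal(size=(d, d))
C_t = M @ M.T + d * np.eye(d)                 # symmetric positive definite
inv_C_t = np.linalg.inv(C_t)
v_t = rng.normal(size=(d, 1))                 # column vector

# Sherman-Morrison applied to alpha * C_t + beta * v_t v_t^T.
hat_v = (beta / alpha) * v_t
numerator = inv_C_t @ hat_v @ v_t.T @ inv_C_t
denominator = 1.0 + (hat_v.T @ inv_C_t @ v_t).item()
inv_C_next = (inv_C_t - numerator / denominator) / alpha

# Reference: invert the updated covariance from scratch.
direct = np.linalg.inv(alpha * C_t + beta * (v_t @ v_t.T))
print(np.allclose(inv_C_next, direct))
```

The incremental form needs only matrix-vector and outer products per new data vector, whereas the direct inversion costs cubic time in the dimension on every update.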
The probability density function is used in the spirit of hypothesis testing: we choose a threshold, which acts as a confidence level, to separate the acceptance and rejection regions.
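from scipy.stats import multivariate_normal

def anomaly(x, mean, cov, threshold=0.001):
    """
    @param: x is the current data vector
    @param: mean is mean vector
    @param: cov is covariance matrix
    @param: threshold is the density cut-off (left to the caller; not applied here)
    @return: score, the Gaussian density of x
    """
    score = multivariate_normal.pdf(x, mean=mean, cov=cov)
    return score

def updateMean(mean, z, n):
    # n is the number of vectors already averaged into mean
    mean_tplus1 = ((n * mean) + z) / (n + 1)
    return mean_tplus1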
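To make the thresholding step concrete, the sketch below turns a density score into a decision on a toy two-dimensional example. The density is written out with NumPy so that the role of the inverse covariance (through the Mahalanobis term) is visible; the helper names `gaussian_density` and `is_anomaly`, the identity covariance, and the test points are all assumptions for illustration.

```python
import numpy as np


def gaussian_density(x, mean, cov):
    """Multivariate normal density, written out with NumPy only."""
    d = mean.shape[0]
    diff = x - mean
    maha_sq = diff @ np.linalg.solve(cov, diff)      # squared Mahalanobis distance
    norm_const = np.sqrt(((2.0 * np.pi) ** d) * np.linalg.det(cov))
    return np.exp(-0.5 * maha_sq) / norm_const


def is_anomaly(x, mean, cov, threshold=0.001):
    """Flag x when its density falls below the threshold."""
    return gaussian_density(x, mean, cov) < threshold


mean, cov = np.zeros(2), np.eye(2)
print(is_anomaly(np.array([0.0, 0.0]), mean, cov))   # density is 1 / (2 * pi)
print(is_anomaly(np.array([5.0, 5.0]), mean, cov))   # density is far below 0.001
```

Because density values shrink quickly as the dimension grows, a fixed cut-off such as 0.001 would probably need retuning for the 15-dimensional data used later in the post.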
We have also provided a clean object-oriented solution with a simpler API.
import numpy as np
# probabilisticMultiEWMA comes from the accompanying source code

seed = 0
np.random.seed(seed)
X = np.random.rand(1000,15)

# single case predict
anom = probabilisticMultiEWMA()
anom.init(X)
z_t0 = np.random.rand(1,15) # new data
anom.update(z_t0)
z_t1 = np.random.rand(1,15) # next new data
print ("score: {}".format(anom.predict(z_t1)))

# Bulk predict
Z = np.random.rand(1000,15)
anom = probabilisticMultiEWMA()
anom.init(X)
pred = anom.bulkPredict(Z)
print (pred)
We evaluated the usefulness of our algorithm with a simulation of 10,000,000 vectors of dimension 15. Repeated trials show that our algorithm is not sensitive to the initialization seed or the dimensions of the matrix. This requirement was a deciding factor in the choice of the evaluation metric; more information on the metric is provided in the Discussion section. The experiments aim to find the trade-off between the static window and the update window. The source code for the experiments can be found here.
The goal of this experiment is to check the effect of varying the size of the initial static window versus the update window. The experiment setup loosely follows the description.
Our matrix was flattened to a vector, which is used as input. The length of the vector is used to make the loss metric agnostic to the dimension of the matrix. The loss function used in the evaluation is Absolute Average Deviation (AAD) because it gives a tighter bound on the error in comparison to MSE or MAD. This is because we take the average of the residuals divided by the ground truth for every sample in our evaluation set. If the residual is close to zero, it contributes almost nothing to the measure. However, if the residual is large, we want to know how large it is relative to the ground truth. This scaling by the ground truth may explain why the metric tends to be conservative in regression analysis.
\(\begin{equation}
AAD = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{\hat{Y_i} - Y_i}{Y_i} \right|
\end{equation}\)
Where $\hat{Y_i}$ is the predicted value, $Y_i$ is ground truth, and $n$ is the length of the flattened matrix.
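A minimal sketch of the metric, following the prose description of averaging the residuals relative to the ground truth over the flattened matrix. The function name and the toy matrices are illustrative assumptions, and the sketch assumes no ground-truth entry is exactly zero.

```python
import numpy as np


def absolute_average_deviation(y_pred, y_true):
    """Mean of |(pred - true) / true| over the flattened matrices."""
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    y_true = np.asarray(y_true, dtype=float).ravel()
    # Assumption: no ground-truth entry is exactly zero.
    return np.mean(np.abs((y_pred - y_true) / y_true))


truth = np.array([[2.0, 4.0], [5.0, 10.0]])
estimate = np.array([[2.2, 4.0], [4.0, 11.0]])
aad = absolute_average_deviation(estimate, truth)
print(aad)
```

The four relative errors here are 0.1, 0, 0.2, and 0.1, so dividing by the number of entries keeps the score comparable across matrices of different sizes.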
From our experiments, we can see that building the model with more data in the init (static) phase tends to lead to smaller errors than using less data in the init phase and more of the data for updates. This matches our intuition: operating in online mode uses less storage space, but there is a performance trade-off in comparison to batch mode.
The error at the beginning of training is huge in both charts. This suggests that, rather than performing the expensive operation of converting a covariance matrix so that it is positive definite, it is better to simply start from random matrices that are positive definite. The estimate then converges as more data arrives.
There are many challenges in modeling the normal behavior of a system with anomaly detection methods. Abnormal behavior is a deviation from the anticipated normal behavior of the system. Many anomaly detection methods are susceptible to adversarial attacks. In a supervised setting, getting labeled data can be expensive. The definition of noise can be ambiguous.
The success of the experiments gives us confidence that the multivariate extension has characteristics similar to the univariate case described in the original paper.
There is no generic anomaly detection method that works for every possible task; it has to be tuned for your purpose. The underlying assumption in this work is that the features used capture significant information about the underlying dynamics of the system. Future work can include making the multivariate version production-ready.