<!DOCTYPE html>
<html lang="en-us">

  <head>
  <link href="http://gmpg.org/xfn/11" rel="profile">
  <meta http-equiv="content-type" content="text/html; charset=utf-8">

  <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1">

  <title>
    
      The ICLR Blog Track &middot; 
    
  </title>

  
  <link rel="canonical" href="https://iclr.iro.umontreal.ca/8e139ca5-2b60-4e3a-bed5-615d8a08a381_1639791001/blog/">
  

  <link rel="stylesheet" href="https://iclr.iro.umontreal.ca/8e139ca5-2b60-4e3a-bed5-615d8a08a381_1639791001/public/css/poole.css">
  <link rel="stylesheet" href="https://iclr.iro.umontreal.ca/8e139ca5-2b60-4e3a-bed5-615d8a08a381_1639791001/public/css/syntax.css">
  <link rel="stylesheet" href="https://iclr.iro.umontreal.ca/8e139ca5-2b60-4e3a-bed5-615d8a08a381_1639791001/public/css/lanyon.css">
  <link rel="stylesheet" href="https://iclr.iro.umontreal.ca/8e139ca5-2b60-4e3a-bed5-615d8a08a381_1639791001/public/css/custom.css">
  <link rel="stylesheet" href="https://fonts.googleapis.com/css?family=PT+Serif:400,400italic,700%7CPT+Sans:400">

  <link rel="apple-touch-icon-precomposed" sizes="144x144" href="https://iclr.iro.umontreal.ca/8e139ca5-2b60-4e3a-bed5-615d8a08a381_1639791001/public/apple-touch-icon-precomposed.png">
  <link rel="shortcut icon" href="https://iclr.iro.umontreal.ca/8e139ca5-2b60-4e3a-bed5-615d8a08a381_1639791001/public/favicon.ico">

  <link rel="alternate" type="application/rss+xml" title="RSS" href="https://iclr.iro.umontreal.ca/8e139ca5-2b60-4e3a-bed5-615d8a08a381_1639791001/atom.xml">

  

  <script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript" ></script>
 <!-- <script type="text/x-mathjax-config"> MathJax.Hub.Config({ TeX: { equationNumbers: { autoNumber: "AMS" } } }); </script> -->
  <script type="text/x-mathjax-config">
      MathJax.Hub.Config({
        tex2jax: { inlineMath: [ ['$','$'], ["\\(","\\)"] ],
         processEscapes: false
        }
      });
</script>
</head>


  <body>

    <!-- Target for toggling the sidebar `.sidebar-checkbox` is for regular
     styles, `#sidebar-checkbox` for behavior. -->
<input type="checkbox" class="sidebar-checkbox" id="sidebar-checkbox">
<!-- <input type="checkbox" class="sidebar-checkbox" id="sidebar-checkbox" > -->

<!-- Toggleable sidebar -->
<div class="sidebar" id="sidebar">
  <div class="sidebar-item">
    <p>For short-term, peer-sourced tests of time, generalizations, specializations, reproductions, etc.!</p>
  </div>

  <nav class="sidebar-nav">

    

    
    
      
        
          <a class="sidebar-nav-item" href="https://iclr.iro.umontreal.ca/8e139ca5-2b60-4e3a-bed5-615d8a08a381_1639791001/">ICLR 2022 Blog Track</a>
        
      
    
      
        
      
    
      
        
          <a class="sidebar-nav-item" href="https://iclr.iro.umontreal.ca/8e139ca5-2b60-4e3a-bed5-615d8a08a381_1639791001/about/">About</a>
        
      
    
      
    
      
        
      
    
      
        
          <a class="sidebar-nav-item" href="https://iclr.iro.umontreal.ca/8e139ca5-2b60-4e3a-bed5-615d8a08a381_1639791001/submitting/">Submitting</a>
        
      
    
      
        
          <a class="sidebar-nav-item" href="https://iclr.iro.umontreal.ca/8e139ca5-2b60-4e3a-bed5-615d8a08a381_1639791001/tags/">Tags</a>
        
      
    

    <a class="sidebar-nav-item" href="https://github.com/iclr-blog-track/iclr-blog-track.github.io">GitHub project</a>
    <span class="sidebar-nav-item">Currently vICLR Spring 2021</span>
  </nav>

  <div class="sidebar-item">
    <p>
      &copy; 2021. All rights reserved.
    </p>
  </div>
</div>


    <!-- Wrap is the content to shift when toggling the sidebar. We wrap the
         content to avoid any CSS collisions with our real content. -->
    <div class="wrap">
      <div class="masthead">
        <div class="container">
          <h3 class="masthead-title">
            <a href="/" title="Home">The ICLR Blog Track</a>
            <small></small>
          </h3>
        </div>
      </div>

      <div class="container content">
        <div class="posts">
  
  <div >
    <h2 class="post-title">
      <a href="https://iclr.iro.umontreal.ca/8e139ca5-2b60-4e3a-bed5-615d8a08a381_1639791001/2021/12/01/Real-Time-Anomaly-Detection-for-Multivariate-Data-Stream/">
        Real-Time Anomaly Detection for Multivariate Data Stream
      </a>
    </h2>

    <span class="post-date">01 Dec 2021 | 
      <a class="content-tag" href="/tags/#machine-learning"> machine-learning </a>
        
      <a class="content-tag" href="/tags/#signal-processing"> signal-processing </a>
        
    </span>

    <!-- <p>The paper titled “Probabilistic reasoning for streaming anomaly detection” from MIT CSAIL proposed a framework for performing online anomaly detection on univariate data. Unfortunately, most real-world data are multivariate, which calls for more research into online anomaly detection on multivariate data. Inspired by their work, we have extended their framework to support multivariate data, with some optimizations to build a scalable system.
<br /><br />
One would be tempted to ask why we have chosen this paper <a href="">[1]</a> for our study. One answer to this question is that to the best of our knowledge, their work provided a simple framework based on basic statistics to perform real-time anomaly detection of a stream. Furthermore, in the past, I successfully used a derivation of their work for the detection of breaking news in an aggregated news stream.
<br /><br />
The blog will begin by introducing the topic of anomaly detection, followed by a discussion of the original paper <a href="">[1]</a>, and describing extensions of the existing work to handle multivariate data streams. The new formulation that we are proposing would depend on building an online version of the covariance matrix and as such we have provided an implementation of the online covariance matrix, alongside an online inverse covariance matrix based on Sherman–Morrison formula. We have provided a set of mathematical representations and source code.
<br /><br />
My implementation of the original paper and the enhanced version of our modified algorithm, which is the subject of this blog post, can be found in the accompanying source code: [withheld due to double blind requirements]
Furthermore, we have provided a set of detailed experiments on the proposed algorithms in different realistic scenarios. However, we maintain the statistical framework provided by the original paper <a href="">[1]</a> as it is already tested. We will not fall into the trap of turning this post into a survey paper. Hence, we discuss only a few interesting developments in the space, and this manuscript is not expected to be exhaustive. This blog focuses on statistical models, so we won’t discuss neural networks and their variants in any depth, as those fall outside the scope of this blog post. For more information, see the <a href="https://arxiv.org/abs/1901.03407">paper</a>.
<br /><br />
Anomaly detection is the task of classifying patterns that depict abnormal behavior. Therefore, the notion of normal behavior has to be quantified objectively. This concept can be described by several names such as outlier detection, novelty detection, noise detection, and deviation detection. These names are equivalent and would be used interchangeably for the remainder of our monograph. Outliers can arise as a result of human error, equipment error, and faulty systems. Anomaly detection is well-suited for unbalanced data, where the ideal scenario is to predict the behavior of the minority class. There are many applications of anomaly detection in detecting default on loans, fraud detection, and network intrusion detection among others.
There are <a href="http://cucis.ece.northwestern.edu/projects/DMS/publications/AnomalyDetection.pdf">different types</a> of anomaly which are discussed as follows.</p>
<ul>
  <li>Point anomaly: This is where a single instance is classified as an anomaly concerning the entire data set. This is ideal for univariate data.</li>
  <li>Contextual Anomaly: when a data instance can be anomalous based on the context (attributes and position in the stream) of the data. This is ideal for multivariate data where for example, in the snapshot reading of a machine, an attribute of a single reading may seem anomalous but can be normal based on consideration of the entire data.</li>
  <li>Collective Anomaly: These are the collection of data that are anomalies as a group, but individually these data points exhibit normal behaviors.
<br /><br />
We have summarized many approaches for performing anomaly detection. Our categorization of existing approaches loosely follows the groupings described in this <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.109.1943&amp;rep=rep1&amp;type=pdf">paper</a>.</li>
  <li>Unsupervised: This is classifying an outlier with training on unlabelled data.</li>
  <li>Supervised: This is classifying an outlier with training on labeled data.</li>
  <li>Hybrid (mix of both): This is a mix of both schemes. These include semi-supervised learning, self-supervised learning, etc.
<br /><br />
Anomaly detection algorithms can operate in many settings. The choice of setting should be carefully thought through and is problem-specific.</li>
  <li>Static: These algorithms are designed to work in static datasets. Every item is loaded into memory at a time to perform computation.</li>
  <li>Online: These algorithms are designed to work in real-time data streams. Items are incrementally loaded into memory.</li>
  <li>Static + Online: The model can operate in two stages. The initial parameters are estimated in the static setting. Once the parameters are set, as more data arrives, these parameters are incrementally updated. Our extensions and the original work are of this type.
<br /><br />
Our discussion will be incomplete if we don’t describe how to maintain a collection of the data in the stream to be processed. Hence, we do a quick review of the windows. <a href="https://www.kdd.org/exploration_files/20-1-Article2.pdf">Windows</a> provide a way to manage data streams. There are several window techniques for streaming analytics:</li>
  <li>Fixed window: This is using a fixed window to store some past information to allow for processing.</li>
  <li>Adaptive window (ADWIN): Keep two windows and drop the former if the past distribution deviates from the current distribution.</li>
  <li>Landmark window: Keep a history of data points that is representative of the distribution of the stream.</li>
  <li>Damped window: This is using a weighting factor on the recent sample and past sample to intensify or dampen the signal. To forget the past, increase the weight of the present.
<br /><br />
We have decided to work on anomaly detection algorithms that work in an unsupervised manner. The normal behavior is represented using the PDF of a multivariate normal distribution. Thresholds are set as a way to specify the significance level. The online formulation was used in our work to help the algorithm work even when concept drift occurs. Unsupervised learning provided advantages in cases where getting data with labels can be challenging or even impossible in some contexts. This fits nicely with a data stream when you don’t know what to expect from your test distribution.
<br /><br />
We will proceed to describe the algorithm in the original paper <a href="">[1]</a> and then provide extensions in upcoming sections.
    <h3 id="background-work">Background work</h3>
    <p>An anomaly detection algorithm can be aimed at identifying outliers in any one, or a combination, of the following signal changes: abrupt transient shift, abrupt distributional shift, and gradual distributional shift <a href="">[1]</a>, labeled “A”, “B”, and “C” respectively.
<img src="https://iclr.iro.umontreal.ca/8e139ca5-2b60-4e3a-bed5-615d8a08a381_1639791001/public/images/anomaly/2021-12-01-timeseries.png" alt="Signal Changes" />
Online algorithms are useful for real-time applications, as they operate incrementally which is ideal for analyzing the data streams. These algorithms incrementally receive input and make a decision based on an updated parameter that conveys the current state of the data stream. This philosophy contrasts with offline algorithms that assume the entire data is available in memory. The issue with an offline algorithm is that the data may not fit in memory. The online algorithm should be both time and space-efficient.
<br /><br />
Anomaly detection algorithms may work in diagnosis or accommodation mode <a href="">[2]</a>. The diagnosis method identifies the outlier in the data for further processing of the outlier. The outlier is removed from the data sample so it does not skew the distribution. This is useful when the exact parameters of the distribution are known, so the outlier is excluded from the further estimation of the parameters of the distribution <a href="">[2]</a>. The accommodation method identifies the outliers and uses them for estimating the parameters of the statistical model. This is suitable for data streams that account for the effect of concept drift <a href="">[3]</a>.
<br /><br />
Exponentially Weighted Moving Average (EWMA) is ideal for keeping a set of running moments of the data stream, but it has some limitations that led the authors to introduce the Probabilistic Exponentially Weighted Moving Average (PEWMA). The slide below contrasts the two algorithms (EWMA and PEWMA) in context.
<img src="https://iclr.iro.umontreal.ca/8e139ca5-2b60-4e3a-bed5-615d8a08a381_1639791001/public/images/anomaly/2021-12-01-pewma_emwa.png" alt="PEWMA vs EWMA" />
The PEWMA <a href="">[1]</a> algorithm works in the accommodation mode. It allows for concept drift <a href="">[3]</a>, which occurs in data streams, by updating the set of parameters that convey the state of the stream. PEWMA <a href="">[1]</a> is suitable as an anomaly detection algorithm for an abrupt transient shift, where EWMA fails.
<br /><br />
The parameters of the anomaly detection algorithm consist of $X_{t}$ the current data, $\mu_{t}$ the mean of the data, $\hat{X_{t}}$ the current estimate of the mean, $\hat{\alpha_{t}}$ the current standard deviation, $P_{t}$ the probability density function, $\hat{X_{t+1}}$ the mean for the next data point (incremental aggregate), $\hat{\alpha_{t+1}}$ the next standard deviation (incremental aggregate), $T$ the data size, and $t$ a point in $T$. Initialize the process with the first data point used for training the model: $s_{1} = X_{1}$ and $s_{2} = X_{1}^{2}$.
<img src="https://iclr.iro.umontreal.ca/8e139ca5-2b60-4e3a-bed5-615d8a08a381_1639791001/public/images/anomaly/2021-12-01-original.png" alt="PEWMA" />
The processed data is fed to the anomaly detection algorithm with the parameters $\alpha = 0.98, \beta = 0.98$, and $\tau = 0.0044$. The threshold $\tau$ is chosen to flag outliers that lie more than 3 standard deviations from the mean in normally distributed data; 0.0044 is approximately the standard normal density at 3 standard deviations. PEWMA in the original paper was designed to work for point anomalies.</p>
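<p>As an illustration, here is a minimal sketch of the PEWMA recursion, based on our reading of the original paper and not on the authors' code. It uses the parameter values quoted above. The function name <code>pewma_flags</code>, the <code>training</code> length, and the guard for a zero standard deviation are assumptions made for this example.</p>

```python
import math

def pewma_flags(stream, alpha=0.98, beta=0.98, tau=0.0044, training=30):
    """Return the indices of points whose density falls below tau (sketch only)."""
    s1 = s2 = 0.0          # running first and second moments
    mean = std = 0.0       # current estimates for the next point
    flagged = []
    for t, x in enumerate(stream, start=1):
        # Standardize against the current estimates (guard: std may be 0 early on).
        z = (x - mean) / std if std > 0 else 0.0
        p = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
        # Only report anomalies once the training period is over (assumption).
        if t > training and tau > p:
            flagged.append(t - 1)
        # Plain running average while training, probability-weighted EWMA afterwards.
        a = alpha * (1 - beta * p) if t > training else 1 - 1 / t
        s1 = a * s1 + (1 - a) * x
        s2 = a * s2 + (1 - a) * x * x
        mean = s1
        std = math.sqrt(max(s2 - s1 * s1, 0.0))
    return flagged
```

<p>Because an unlikely point has a density close to zero, its weight <code>1 - a</code> shrinks to roughly <code>1 - alpha</code>, so a single spike barely moves the running moments. A stream that hovers around 10 with one reading of 25 should have that reading flagged.</p>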
    <h5 id="hypothesis-testing">Hypothesis Testing</h5>
    <p>A hypothesis is a subjective intuition about the problem. This can be guided by current best practices or transferable skills from adjacent domains. These forms of educated guesses have to be empirically verified to allow your preconceived intuitions to be checked against reality. Let us look at some examples of hypotheses:</p>
  </li>
  <li>Will this vaccine work on a new virus?</li>
  <li>Will it rain today?
<br /><br />
Let us look at the example of a certain school where the Physics teacher is generous with marks. The end-of-semester report has a class average of 80 with a standard deviation of 15. What is the probability that a student scores</li>
  <li>score &lt; 60 ?</li>
  <li>score &gt; 90 ?
<br /><br />
The following code snippet solves the problem: the probabilities that a student scores less than 60 and greater than 90 are 0.09 and 0.25 respectively.
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math
def cumfunc(mean, sigma, xval):
  """
  @summary: cumulative distribution function of a normal distribution, i.e. the area under the curve to the left of xval.
  """
  z = (xval - mean) / (sigma * math.sqrt(2))
  y = 0.5 * (1 + math.erf(z))
  return y
if __name__ == '__main__':
  mean = 80; sigma = 15
  x = 60
  res = cumfunc(mean, sigma, x)  # &lt; 60
  print (round(res, 2)) # 0.09
  x = 90
  res = 1 - cumfunc(mean, sigma, x) # &gt; 90
  print (round(res, 2)) # 0.25
</code></pre></div>    </div>
    <p>Let us provide the source code for visualizing the probability of the events described in the code snippet.</p>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
def contour(plt, score, x_axis, y_axis, colour, lessthan=True):
  x_temp = [x for x in x_axis if x &lt;= score]
  if lessthan:
      y_temp = y_axis[:len(x_temp)]
  else:
      y_temp = y_axis[len(x_temp): ]
      x_temp = [x for x in x_axis if x &gt; score]
  plt.fill_between(x_temp, 0, y_temp, facecolor=colour)
if __name__ == '__main__':
  mean = 80; sd = 15
  x_axis = np.arange(10, 140, 0.001)  # define x_axis first; it is needed to compute y_axis
  y_axis = norm.pdf(x_axis, mean, sd)
  plt.plot(x_axis, y_axis); plt.xlabel('Scores'); plt.ylabel('PDF')
  plt.title("Gaussian Distribution with Mean: {} and STD: {}".format(mean, sd))
  colour='#4dac26'; score = 60
  contour(plt, score, x_axis, y_axis, colour, lessthan=True)
  colour='#f1b6da'; score = 90
  contour(plt, score, x_axis, y_axis, colour, lessthan=False)
  plt.show()
</code></pre></div>    </div>
    <p><img src="https://iclr.iro.umontreal.ca/8e139ca5-2b60-4e3a-bed5-615d8a08a381_1639791001/public/images/anomaly/2021-12-01-chart3.png" alt="Normal Distribution" />
The probability of each event (score &lt; 60 and score &gt; 90) is captured by the area of the corresponding shaded region.
If the class average is 80 with a standard deviation of 15, there is only a minuscule probability that a student scores less than 0 or greater than 120. Such outrageous scores can be categorized as anomalies. The scores used in our examples act as thresholds. The areas depicting these probabilities can be seen in our chart.
In summary, the rule of thumb for hypothesis testing is as follows:</p>
  </li>
  <li>Identify the Null and Alternative hypotheses.</li>
  <li>Choose a significance level by setting the threshold.</li>
  <li>Decide to reject based on the significance level.
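<p>The three steps above can be sketched with the school example. This is a minimal illustration, assuming the null hypothesis is that a score comes from a normal distribution with mean 80 and standard deviation 15, a two-sided test, and a significance level of 0.05. The function names are hypothetical.</p>

```python
import math

def two_sided_p_value(x, mean, sigma):
    """Probability of a value at least as far from the mean as x under N(mean, sigma^2)."""
    z = abs(x - mean) / sigma
    return math.erfc(z / math.sqrt(2))

def reject_null(x, mean, sigma, significance=0.05):
    """Reject the null hypothesis when the p-value falls below the significance level."""
    return significance > two_sided_p_value(x, mean, sigma)
```

<p>With these assumptions, <code>reject_null(120, 80, 15)</code> rejects the null hypothesis, while <code>reject_null(85, 80, 15)</code> does not.</p>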
    <h3 id="extension">Extension</h3>
    <p>Our contribution begins here. We simplify the algorithm by ignoring the details of evolutionary computation in the <a href="http://www.cmap.polytechnique.fr/~nikolaus.hansen/ACECMUaa1p1CMAfES.pdf">paper</a>. We interpret the premise of evolution described in the paper, moving from one generation to the next, as equivalent to moving from one state to another. This is analogous to how online algorithms handle dynamic changes as new data enters the stream. Cholesky decomposition is used extensively in the algorithms. The <a href="http://www.cmap.polytechnique.fr/~nikolaus.hansen/ACECMUaa1p1CMAfES.pdf">paper</a> provided the basis for the online covariance matrix used in this work.</p>
    <h5 id="online-covariance-matrix">Online Covariance matrix</h5>
    <p>The mathematical formulation can be found here.</p>
    <ol>
      <li>Estimate the covariance matrix for the initial data, $X \in R^{n \times m}$, where $n$ is the number of samples and $m$ is the number of dimensions. The initial covariance matrix is $C \in R^{m \times m}$.
\(\begin{equation}
C = {X}^T * X
\end{equation}\)</li>
      <li>Perform Cholesky factorization on the initial covariance matrix, $C$.
\(\begin{equation}
C_{t} = A_{t} * {A_{t}}^T
\end{equation}\)
We make use of Scipy’s Cholesky decomposition. The input matrix must be positive-definite, meaning that all of its eigenvalues are positive; this is a requirement for the Cholesky decomposition. For a quick primer on positive-definite and positive semi-definite matrices and their variants, see the <a href="https://www.cse.iitk.ac.in/users/rmittal/prev_course/s14/notes/lec11.pdf">tutorial</a>. The covariance matrix of a multivariate distribution is positive semi-definite. For more insight on how to create a positive-definite covariance matrix, take a look at this <a href="https://www.researchgate.net/post/How_to_generate_positive-definite_covariance_matrices">discussion board</a>. The approach taken in this work is to convert the covariance matrix to the nearest positive-definite matrix, which is sufficient for our purpose. Evaluate and adapt this to your use case. Another reasonable default is to begin from a random positive-definite matrix as your covariance.</li>
      <li>The general form of incremental covariance. The updated covariance in the presence of new data is the weighted average of the past covariance (without the new data) and the covariance of the transformed input.
\(\begin{equation}
C_{t+1} = \alpha * C_{t} + \beta * v_t * {v_t}^T
\end{equation}\)
Where $v_t = A_t * z_t$ and $z_t \in R^m$ is, in our implementation, the current data. $\alpha$ and $\beta$ are positive scalar values.</li>
      <li>Increment the Cholesky factor of the covariance matrix
\(\begin{equation}
A_{t+1} = \sqrt{\alpha} * A_t + \frac{\sqrt{\alpha}}{\Big\|z_t \Big\|^2} * \left( \sqrt{1 + \frac{\beta * \Big\|z_t \Big\|^2}{\alpha}} - 1 \right) * v_t * {z_t}^T
\end{equation}\)</li>
      <li>Setting the values of $\alpha$ and $\beta$ is difficult. Requiring $\alpha + \beta = 1$ gives an explicit form of exponential moving average. The author chose to set the values of $\alpha$ and $\beta$ using the statistics of the data stream.
The parameters are set as $\alpha = {C_{a}}^2$ and $\beta = 1 - {C_{a}}^2$, where ${C_{a}} = \sqrt{1 - C_{cov}}$, $C_{cov} = \frac{2}{n^2+6}$, and $n$ is the size of the original data in the static setting.
Substituting these values into the update of the Cholesky factor gives the expression below. The implementation follows.
\(\begin{equation}
A_{t+1} = {C_{a}} * A_t + \frac{C_{a}}{\Big\|z_t \Big\|^2} * \left( \sqrt{1 + \frac{(1 - {C_{a}}^2) * \Big\|z_t \Big\|^2}{{C_{a}}^2}} - 1 \right) * v_t * {z_t}^T
\end{equation}\)
        <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math
import numpy as np

def updateCovariance(alpha, beta, C_t, A_t, z_t):
  """
  @param: alpha, beta, A_t are parameters of the model
  @param: C_t is the old covariance matrix, z_t is the new data vector with shape (1, m)
  @return: C_tplus1 is the updated covariance matrix
  """
  v_t = np.dot(A_t, z_t.T)  # column vector with shape (m, 1)
  C_tplus1 = (alpha * C_t)  + (beta * np.matmul(v_t, v_t.T))
  print ("v_t: {}, C_tplus1: {}".format(v_t.shape, C_tplus1.shape))
  return C_tplus1
def updateCholeskyFactor(alpha, beta, A_t, z_t):
  """
  @param: alpha, beta, A_t are parameters of the model
  @param: z_t is the new data vector with shape (1, m)
  @return: A_tplus1 is the updated Cholesky factor
  """
  v_t = np.dot(A_t, z_t.T)
  norm_z_sq = np.linalg.norm(z_t) ** 2  # squared norm, as in the update formula
  x = math.sqrt(alpha) * A_t
  w = beta * norm_z_sq / alpha
  # np.dot(v_t, z_t) is the outer product v_t * z_t^T with shape (m, m)
  y = math.sqrt(alpha) * (math.sqrt(1 + w) - 1) * np.dot(v_t, z_t) / norm_z_sq
  A_tplus1 = x + y
  print ("A_t: {}, A_tplus1: {}".format(A_t.shape, A_tplus1.shape))
  return A_tplus1
</code></pre></div>        </div>
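<p>One way to gain confidence in the factor update is to check numerically that the updated factor reproduces the updated covariance, that is $A_{t+1} * {A_{t+1}}^T = \alpha * C_t + \beta * v_t * {v_t}^T$. The sketch below is illustrative only: the function names, the random test data, and the small ridge added to keep the matrix positive-definite are assumptions, and $z_t$ is treated as a column vector.</p>

```python
import numpy as np

def cholesky_factor_update(alpha, beta, A_t, z_t):
    """Rank-one update of a factor A_t (C_t = A_t A_t^T); z_t has shape (m, 1)."""
    v_t = A_t @ z_t
    norm_sq = float(np.sum(z_t ** 2))
    scale = np.sqrt(alpha) / norm_sq * (np.sqrt(1.0 + beta * norm_sq / alpha) - 1.0)
    return np.sqrt(alpha) * A_t + scale * (v_t @ z_t.T)

def check_factor_update(m=4, n=50, seed=0):
    """Compare A_next A_next^T with the directly updated covariance."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, m))
    C_t = X.T @ X / n + 1e-6 * np.eye(m)   # small ridge keeps C_t positive-definite
    A_t = np.linalg.cholesky(C_t)
    c_cov = 2.0 / (n ** 2 + 6)             # C_cov from the text
    alpha, beta = 1.0 - c_cov, c_cov       # alpha = C_a^2, beta = 1 - C_a^2
    z_t = rng.standard_normal((m, 1))
    v_t = A_t @ z_t
    C_next = alpha * C_t + beta * (v_t @ v_t.T)
    A_next = cholesky_factor_update(alpha, beta, A_t, z_t)
    return np.allclose(A_next @ A_next.T, C_next)
```

<p>Note that the updated factor is in general no longer triangular; it is still a valid factor of the updated covariance, which is all the update needs.</p>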
        <p><br /><br /></p>
        <h5 id="online-inverse-covariance-matrix">Online Inverse Covariance matrix</h5>
        <p>The mathematical formulation can be found here</p>
      </li>
      <li>Estimate the covariance matrix for the initial data, $X \in R^{n \times m}$, where $n$ is the number of samples and $m$ is the number of dimensions. The initial covariance matrix is $C \in R^{m \times m}$.
\(\begin{equation}
C = {X}^T * X
\end{equation}\)
Invert the covariance matrix to obtain $C^{-1}$.
\(\begin{equation}
C^{-1} = \left( {X}^T * X \right)^{-1}
\end{equation}\)</li>
      <li>Perform Cholesky factorization on initial covariance matrix, $C$.
\(\begin{equation}
C_{t} = A_t * {A_t}^T
\end{equation}\)
We make use of Scipy’s Cholesky decomposition.</li>
      <li>General form of incremental covariance.
The updated covariance in the presence of new data is the weighted average of the past covariance (without the new data) and the covariance of the transformed input.
\(\begin{equation}
C_{t+1} = \alpha * C_{t} + \beta * v_t * {v_t}^T
\end{equation}\)
Where $v_t = A_t * z_t$ and $z_t \in R^m$ is, in our implementation, the current data. $\alpha$ and $\beta$ are positive scalar values.</li>
      <li>Increment the inverse of the covariance matrix</li>
    </ol>
  </li>
</ul>

\[\begin{equation}
C_{t+1}^{-1} = ({\alpha * C_t} + \beta * v_t * {v_t}^T)^{-1}
\end{equation}\]

\[\begin{equation}
C_{t+1}^{-1} = \alpha^{-1} * \{C_t + \frac{\beta * v_t * {v_t}^T}{\alpha}\}^{-1}
\end{equation}\]

<p>Let us set $\hat{v_t} = \frac{\beta * v_t}{\alpha}$. Applying the Sherman-Morrison formula simplifies the expression to</p>

\[\begin{equation}
C_{t+1}^{-1} = \frac{1}{\alpha} * \left({{C_t}^{-1}} - \frac{{{C_t}^{-1}} * \hat{v_t} * {v_t}^T * {{C_t}^{-1}}}{1 + ({\hat{v_t}}^T * {{C_t}^{-1}} * v_t)} \right)
\end{equation}\]

<p>The Implementation can be found here</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from numpy.linalg import multi_dot

def updateInverseCovariance(alpha, beta, invC_t, A_t, z_t):
    """
    @param: alpha, beta, A_t are parameters of the model
    @param: invC_t is the old inverse covariance matrix, z_t is the new data vector with shape (1, m)
    @return: invC_tplus1 is the updated inverse covariance matrix
    """
    print ("A_t: {}, z_t: {}".format(A_t.shape, z_t.shape))
    v_t = np.dot(A_t, z_t.T)
    hat_vt = (beta * v_t) / alpha
    print ("invC_t: {}, hat_vt: {}, v_t: {}, invC_t: {}".format(invC_t.shape, hat_vt.shape, v_t.shape, invC_t.shape))
    # Sherman-Morrison: rank-one correction of the old inverse
    y = multi_dot([invC_t, hat_vt, v_t.T, invC_t]) / (1 + multi_dot([hat_vt.T, invC_t, v_t]))
    invC_tplus1 = (invC_t - y) / alpha
    print ("invC_tplus1: {}".format(invC_tplus1.shape))
    return invC_tplus1
</code></pre></div></div>
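<p>The Sherman-Morrison update can be sanity-checked against a direct matrix inversion. The following is a small sketch under stated assumptions: the function names, the random test matrix, and the values of $\alpha$ and $\beta$ are chosen for illustration, and $v_t$ is passed in directly as a column vector instead of being computed from $A_t$ and $z_t$.</p>

```python
import numpy as np

def inverse_covariance_update(alpha, beta, inv_C_t, v_t):
    """Return (alpha * C_t + beta * v_t v_t^T)^{-1} from the old inverse; v_t has shape (m, 1)."""
    hat_v = (beta / alpha) * v_t
    numerator = inv_C_t @ hat_v @ v_t.T @ inv_C_t
    # Scalar hat_v^T * inv_C_t * v_t, computed as an elementwise sum.
    denominator = 1.0 + float(np.sum(hat_v * (inv_C_t @ v_t)))
    return (inv_C_t - numerator / denominator) / alpha

def check_inverse_update(m=5, seed=0):
    """Compare the incremental inverse with a direct inversion."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((100, m))
    C_t = X.T @ X / 100 + 1e-6 * np.eye(m)   # positive-definite test covariance
    v_t = rng.standard_normal((m, 1))
    alpha, beta = 0.9, 0.1
    direct = np.linalg.inv(alpha * C_t + beta * (v_t @ v_t.T))
    incremental = inverse_covariance_update(alpha, beta, np.linalg.inv(C_t), v_t)
    return np.allclose(direct, incremental)
```

<p>The incremental form only needs matrix-vector products once the initial inverse is available, which is why it suits a data stream better than inverting the covariance at every step.</p>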
<h5 id="online-multivariate-anomaly-detection">Online Multivariate Anomaly Detection</h5>
<p>The probability density function makes use of ideas from hypothesis testing. We decide on a threshold which is a confidence level that is used to decide on the acceptance and rejection regions.</p>
<ol>
  <li>Use the covariance matrix, $C_{t+1}$ and inverse covariance matrix, ${C_{t+1}}^{-1}$.</li>
  <li>We also increment the mean vector, $\mu$, as new data arrives. It is possible to simplify the covariance matrix, $C$, which captures many of the dynamics of the system. Let $n$ represent the count of data before the new data arrives,
$\hat{x}$ the new data, and $\mu_{t+1}$ the updated moving average.
\(\begin{equation}
\mu_{t+1} = \frac{(n * \mu_t) + \hat{x}}{n+1}
\end{equation}\)</li>
  <li>Set a threshold to determine the acceptance and rejection regions. Items in the acceptance region are considered to be normal behavior.
\(\begin{equation}
p(x)=\frac{1}{\sqrt{(2\pi)^m|C|}} \exp\left(-\frac{1}{2}(x-\mu)^T{C}^{-1}(x-\mu) \right)
\end{equation}\)
Where $\mu$ is mean vector, $C$ is covariance matrix, $|C|$ is the determinant of $C$ matrix, $x \in R^{m}$ is data vector, and $m$ is the dimension of $x$ respectively.
The Implementation can be found here.
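    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from scipy.stats import multivariate_normal

def anomaly(x, mean, cov, threshold=0.001):
 """
 @param: x is the current data vector
 @param: mean is mean vector
 @param: cov is covariance matrix
 @param: threshold is the density below which the caller treats x as an anomaly
 @return: score (probability density at x)
 """
 score = multivariate_normal.pdf(x, mean=mean, cov=cov)
 return score
def updateMean(mean, z, n):
 # n is the count of data seen before the new vector z arrived
 mean_tplus1 = ((n * mean) + z) / (n + 1)
 return mean_tplus1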
</code></pre></div>    </div>
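<p>Since the online procedure already maintains the inverse covariance, the density can also be evaluated from it directly. The sketch below is one possible way to do this and is not the implementation used in the post: the function names are hypothetical, and it assumes the determinant of $C$ is available separately.</p>

```python
import numpy as np

def update_mean(mean, x_new, n):
    """Running mean after seeing one more vector; n is the count before x_new."""
    return (n * mean + x_new) / (n + 1)

def gaussian_density(x, mean, inv_cov, det_cov):
    """Multivariate normal density evaluated from the inverse covariance and determinant."""
    d = x - mean
    m = d.shape[0]
    mahalanobis_sq = float(d @ inv_cov @ d)
    return np.exp(-0.5 * mahalanobis_sq) / np.sqrt((2 * np.pi) ** m * det_cov)

def is_anomaly(x, mean, inv_cov, det_cov, threshold=0.001):
    """A point falls in the rejection region when its density is below the threshold."""
    return bool(threshold > gaussian_density(x, mean, inv_cov, det_cov))
```

<p>For a two-dimensional standard normal, the density at the mean is $\frac{1}{2\pi}$, so the mean is accepted, while a point such as $(5, 5)$ has a density far below the default threshold and is rejected.</p>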
    <p><br /><br />
We have provided a clean object-oriented programming-based solution with a cleaner API.</p>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>seed = 0
np.random.seed(seed)
X = np.random.rand(1000,15)
z_t0 = np.random.rand(1,15) # new data
# single case predict
anom = probabilisticMultiEWMA()
anom.init(X)
z_t0 = np.random.rand(1,15) # new data
anom.update(z_t0)
z_t1 = np.random.rand(1,15) # next new data
print ("score: {}".format(anom.predict(z_t1)))
Z = np.random.rand(1000,15)
# bulk predict
anom = probabilisticMultiEWMA()
anom.init(X)
pred = anom.bulkPredict(Z)
print (pred)
</code></pre></div>    </div>
    <h3 id="experiments">Experiments</h3>
    <p>We evaluated the usefulness of our algorithm with a simulation of 10,000,000 vectors of dimension 15. Repeated trials show that our algorithm is not sensitive to initialization seeds or to the dimensions of the matrix. This requirement was a deciding factor in the choice of the evaluation metric. More information on the metric is provided in the Discussion section.
The experiments aim to find the trade-off between the static window and the update window. The source code for the experiments can be found <a href="withheld due to dobule blind requirements">here</a>.</p>
    <h5 id="experiment-1">Experiment 1</h5>
    <p>The goal of this experiment is to check the effect of varying the size of the initial static window versus the update window.
The experiment setup loosely follows the description below.</p>
    <ul>
      <li>Split the data into 5 segments.</li>
      <li>Train on the 1st segment (static), update the covariance on the 2nd (online), and compare with the static covariance to get the error.</li>
      <li>Train on segments 1 and 2 (static), update the covariance on the 3rd (online), and compare with the static covariance to get the error.</li>
      <li>Train on segments 1, 2, and 3 (static), update the covariance on the 4th (online), and compare with the static covariance to get the error.</li>
      <li>Train on segments 1, 2, 3, and 4 (static), update the covariance on the 5th (online), and compare with the static covariance to get the error.
<img src="https://iclr.iro.umontreal.ca/8e139ca5-2b60-4e3a-bed5-615d8a08a381_1639791001/public/images/anomaly/2021-12-01-chart1.png" alt="Experiment 1" />
        <h5 id="experiment-2">Experiment 2</h5>
        <p>The goal of this experiment is again to check the effect of varying the size of the initial static window versus the update window, this time with the online update running over all remaining segments.
The experiment setup loosely follows the description below.</p>
      </li>
      <li>Split the data into 5 segments.</li>
      <li>Train on the 1st segment (static), update the covariance on the remaining segments (2, 3, 4, 5) (online), and compare with the static covariance to get the errors on segments (2, 3, 4, 5).</li>
      <li>Train on segments 1 and 2 (static), update the covariance on the remaining segments (3, 4, 5) (online), and compare with the static covariance to get the errors on segments (3, 4, 5).</li>
      <li>Train on segments 1, 2, and 3 (static), update the covariance on the remaining segments (4, 5) (online), and compare with the static covariance to get the errors on segments (4, 5).</li>
      <li>Train on segments 1, 2, 3, and 4 (static), update the covariance on the remaining segment (5) (online), and compare with the static covariance to get the error on segment (5).
<img src="https://iclr.iro.umontreal.ca/8e139ca5-2b60-4e3a-bed5-615d8a08a381_1639791001/public/images/anomaly/2021-12-01-chart2.png" alt="Experiment 2" />
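<p>The two experiment setups differ only in which segments receive online updates. As a small illustration, the schedule of (static segments, online segments) pairs can be generated as follows. The function name and the <code>update_all_remaining</code> flag are hypothetical and are not taken from the withheld experiment code.</p>

```python
def experiment_schedule(num_segments=5, update_all_remaining=False):
    """List (static_segments, online_segments) pairs, using 1-based segment numbers."""
    schedule = []
    for k in range(1, num_segments):
        static = list(range(1, k + 1))
        if update_all_remaining:
            # Experiment 2: update online over every segment after the static ones.
            online = list(range(k + 1, num_segments + 1))
        else:
            # Experiment 1: update online on the next segment only.
            online = [k + 1]
        schedule.append((static, online))
    return schedule
```

<p>Calling <code>experiment_schedule()</code> gives the four runs of Experiment 1, and <code>experiment_schedule(update_all_remaining=True)</code> gives the four runs of Experiment 2.</p>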
        <h3 id="discussion">Discussion</h3>
        <p>The covariance matrix was flattened to a vector, which is used as input to the loss. Averaging over the length of the vector makes the loss metric agnostic to the dimension of the matrix. The loss function used in the evaluation is Absolute Average Deviation (AAD) because it gives a tighter bound on the error in comparison to MSE or MAD. This is because we take the average of the residuals divided by the ground truth for every sample in our evaluation set. If the residual is close to zero, it contributes almost nothing to the measure. However, if the residual is large, we want to know how large it is relative to the ground truth. This scaling by the ground truth may explain why the metric tends to be conservative in regression analysis.
\(\begin{equation}
AAD = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{\hat{Y_i} - Y_i}{Y_i} \right|
\end{equation}\)
Where $\hat{Y_i}$ is the predicted value, $Y_i$ is ground truth, and $n$ is the length of the flattened matrix.
From our experiments, we can see that building your model with more data in the init (static) phase tends to lead to smaller errors in comparison to having smaller data in the init phase and using more of the data for an update. The observation matches our intuition because when you operate in an online mode, you tend to use smaller storage space, but there is a performance trade-off in comparison to batch mode.
<br /><br />
The error at the beginning of our training is huge in both charts. This insight shows that rather than performing the expensive operation of converting a covariance matrix to have the property of positive definiteness, it is better to just use random matrices that are positive definite. More data would help us get to convergence as more data arrives.
There are many challenges with anomaly detection methods in modeling the normal behavior of the system. The abnormal behavior shows a deviation from what is the anticipated normal behavior of the system. Many anomaly detections are susceptible to adversarial attacks. In a supervised setting, getting labeled data can be expensive. The definition of noise can be ambiguous.
The success of the experiments has given us the confidence that we would have similar characteristics to the univariate case described in the original paper concerning the enhancement to support the multivariate case.</p>
        <h3 id="conclusion">Conclusion</h3>
        <p>There is no generic anomaly detection that works for every possible task. It has to be tuned for your purpose. The underlying assumption in this work is that the features that are used capture significant information about the underlying dynamic of the system. Future work can include extending the multivariate to be production-ready.</p>
        <h3 id="references">References</h3>
      </li>
      <li><a href="">[1]</a> Kevin M. Carter and William W. Streilein. Probabilistic reasoning for streaming anomaly detection. In Proceedings of the Statistical Signal Processing Workshop, pages 377–380, 2012.</li>
      <li><a href="">[2]</a> Victoria Hodge and Jim Austin. A survey of outlier detection methodologies. Artificial Intelligence Review, 22(2):85–126, 2004.</li>
      <li><a href="">[3]</a> Gregory Ditzler and Robi Polikar. Incremental learning of concept drifts from streaming imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 25(10):2283–2301, 2013.</li>
    </ul>
  </li>
</ol>

 -->
    <p>The paper titled “Probabilistic reasoning for streaming anomaly detection” from MIT CSAIL proposed a framework for performing online anomaly detection on univariate data. Unfortunately, most real-world data are multivariate, which calls for more research into online anomaly detection on multivariate data. Inspired by their work, we have extended the framework to support multivariate data, with some optimizations aimed at building a scalable system.
<br /><br />
Why did we choose this paper <a href="">[1]</a> for our study? To the best of our knowledge, it provides a simple framework based on basic statistics for real-time anomaly detection on a stream. Furthermore, I have previously used a derivative of their work to detect breaking news in an aggregated news stream.
<br /><br />
The blog will begin by introducing the topic of anomaly detection, followed by a discussion of the original paper <a href="">[1]</a>, and will then describe our extensions of the existing work to handle multivariate data streams. The new formulation depends on an online version of the covariance matrix, so we provide an implementation of the online covariance matrix alongside an online inverse covariance matrix based on the Sherman–Morrison formula. We provide both the mathematical formulation and source code.
<br /><br />
My implementation of the original paper and of the modified algorithm that is the subject of this blog post accompanies the post. Accompanying source code: [withheld due to double blind requirements]
Furthermore, we provide a set of detailed experiments on the proposed algorithms in different realistic scenarios. We retain the statistical framework of the original paper <a href="">[1]</a>, as it is already tested. We will not fall into the trap of turning this post into a survey paper: we discuss a few interesting developments in the space, and the post is not expected to be exhaustive. This blog focuses on statistical models, so neural networks and their variants are outside its scope. For more information on those, see the <a href="https://arxiv.org/abs/1901.03407">paper</a>.
<br /><br />
Anomaly detection is the task of classifying patterns that depict abnormal behavior. Therefore, the notion of normal behavior has to be quantified objectively. The task goes by several names, such as outlier detection, novelty detection, noise detection, and deviation detection. We treat these names as equivalent and use them interchangeably for the remainder of this post. Outliers can arise as a result of human error, equipment error, and faulty systems. Anomaly detection is well-suited for unbalanced data, where the ideal scenario is to predict the behavior of the minority class. Applications of anomaly detection include detecting default on loans, fraud detection, and network intrusion detection, among others.
There are <a href="http://cucis.ece.northwestern.edu/projects/DMS/publications/AnomalyDetection.pdf">different types</a> of anomalies, which are discussed as follows.</p>
<ul>
  <li>Point anomaly: A single instance is classified as an anomaly with respect to the entire data set. This is ideal for univariate data.</li>
  <li>Contextual anomaly: A data instance is anomalous only in a given context (attributes and position in the stream). This is ideal for multivariate data where, for example, in the snapshot reading of a machine, an attribute of a single reading may seem anomalous but can be normal when the entire data is considered.</li>
  <li>Collective anomaly: A collection of data points is anomalous as a group, although each point individually exhibits normal behavior.
<br /><br />
There are many approaches for performing anomaly detection. Our categorization loosely follows the groupings described in this <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.109.1943&amp;rep=rep1&amp;type=pdf">paper</a>:</li>
  <li>Unsupervised: This is classifying an outlier with training on unlabelled data.</li>
  <li>Supervised: This is classifying an outlier with training on labeled data.</li>
  <li>Hybrid: This is a mix of both schemes, including semi-supervised learning, self-supervised learning, etc.
<br /><br />
Anomaly detection algorithms can operate in several settings. The choice should be thought through carefully and be problem-specific.</li>
  <li>Static: These algorithms are designed to work in static datasets. Every item is loaded into memory at a time to perform computation.</li>
  <li>Online: These algorithms are designed to work in real-time data streams. Items are incrementally loaded into memory.</li>
  <li>Static + Online: The model can operate in two stages. The initial parameters are estimated in the static setting. Once the parameters are set, as more data arrives, these parameters are incrementally updated. Our extensions and the original work are of this type.
<br /><br />
Our discussion would be incomplete without describing how to maintain the collection of stream data that is to be processed. Hence, we briefly review windows. <a href="https://www.kdd.org/exploration_files/20-1-Article2.pdf">Windows</a> provide a way to manage data streams. There are several window techniques for streaming analytics:</li>
  <li>Fixed window: Uses a window of fixed size to store some past information for processing.</li>
  <li>Adaptive window (ADWIN): Keeps two windows and drops the older one if the past distribution deviates from the current distribution.</li>
  <li>Landmark window: Keeps a history of data points that is representative of the distribution of the stream.</li>
  <li>Damped window: Uses a weighting factor on the recent and past samples to intensify or dampen the signal. To forget the past, increase the weight of the present.
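<p>As a rough illustration of two of the window types above, the sketch below keeps a fixed window with a bounded queue and a damped window with a single exponentially weighted value. The class names <code>FixedWindow</code> and <code>DampedWindow</code> and the parameter <code>weight</code> are hypothetical and are not taken from our implementation.</p>

```python
from collections import deque


class FixedWindow:
    """Hypothetical fixed window: keeps only the most recent `size` items."""

    def __init__(self, size):
        self.items = deque(maxlen=size)  # older items fall off automatically

    def add(self, x):
        self.items.append(x)

    def mean(self):
        return sum(self.items) / len(self.items)


class DampedWindow:
    """Hypothetical damped window: one running value, newest sample weighted by `weight`."""

    def __init__(self, weight):
        self.weight = weight  # a larger weight forgets the past faster
        self.value = None

    def add(self, x):
        if self.value is None:
            self.value = x
        else:
            self.value = self.weight * x + (1.0 - self.weight) * self.value
        return self.value


fixed = FixedWindow(size=3)
for x in [1.0, 2.0, 3.0, 4.0]:
    fixed.add(x)
print(fixed.mean())  # mean of the last three items only

damped = DampedWindow(weight=0.5)
damped.add(0.0)
print(damped.add(10.0))  # halfway between the past value and the new sample
```

<p>The fixed window needs memory proportional to its size, whereas the damped window stores a single value, which is the trade-off that makes the damped form attractive for streams.</p>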
<br /><br />
We have decided to work on anomaly detection algorithms that work in an unsupervised manner. The normal behavior is represented using the PDF of a multivariate normal distribution. Thresholds are set to specify the significance level. We use the online formulation so that the algorithm keeps working when concept drift occurs. Unsupervised learning has an advantage where labeled data is challenging or even impossible to obtain. This fits a data stream well, since you do not know what to expect from your test distribution.
<br /><br />
We will proceed to describe the algorithm in the original paper <a href="">[1]</a> and then provide extensions in upcoming sections.
    <h3 id="background-work">Background work</h3>
    <p>An anomaly detection algorithm can be aimed at identifying outliers in any one, or a combination, of the following signal changes: abrupt transient shift, abrupt distributional shift, and gradual distributional shift <a href="">[1]</a>, labeled “A”, “B”, and “C” respectively in the figure.
<img src="https://iclr.iro.umontreal.ca/8e139ca5-2b60-4e3a-bed5-615d8a08a381_1639791001/public/images/anomaly/2021-12-01-timeseries.png" alt="Signal Changes" />
Online algorithms are useful for real-time applications, as they operate incrementally which is ideal for analyzing the data streams. These algorithms incrementally receive input and make a decision based on an updated parameter that conveys the current state of the data stream. This philosophy contrasts with offline algorithms that assume the entire data is available in memory. The issue with an offline algorithm is that the data may not fit in memory. The online algorithm should be both time and space-efficient.
<br /><br />
Anomaly detection algorithms may work in diagnosis or accommodation mode <a href="">[2]</a>. The diagnosis method identifies the outlier in the data for further processing of the outlier. The outlier is removed from the data sample so it does not skew the distribution. This is useful when the exact parameters of the distribution are known, so the outlier is excluded from the further estimation of the parameters of the distribution <a href="">[2]</a>. The accommodation method identifies the outliers and uses them for estimating the parameters of the statistical model. This is suitable for data streams that account for the effect of concept drift <a href="">[3]</a>.
<br /><br />
Exponential Weighted Moving Average (EWMA) is well suited to keeping a set of running moments of the data stream, but its limitations led the authors to introduce the Probabilistic Exponentially Weighted Moving Average (PEWMA). The slide below, from my presentation, contrasts the two algorithms (EWMA and PEWMA).
<img src="https://iclr.iro.umontreal.ca/8e139ca5-2b60-4e3a-bed5-615d8a08a381_1639791001/public/images/anomaly/2021-12-01-pewma_emwa.png" alt="PEWMA vs EMWA" />
The PEWMA <a href="">[1]</a> algorithm works in the accommodation mode. It allows for the concept drift <a href="">[3]</a> that occurs in data streams by updating the set of parameters that convey the state of the stream. PEWMA <a href="">[1]</a> is suitable for detecting anomalies under an abrupt transient shift, where EWMA fails.
<br /><br />
The parameters of the anomaly detection algorithm consist of $X_{t}$ the current data, $\mu_{t}$ the mean of the data, $\hat{X_{t}}$ the current estimate of the mean, $\hat{\alpha_{t}}$ the current standard deviation, $P_{t}$ the probability density function, $\hat{X_{t+1}}$ the mean of the next data (incremental aggregate), $\hat{\alpha_{t+1}}$ the next standard deviation (incremental aggregate), $T$ the data size, and $t$ a point in $T$. The process is initialized from the first data point by setting $s_{1} = X_{1}$ and $s_{2} = X_{1}^{2}$.
<img src="https://iclr.iro.umontreal.ca/8e139ca5-2b60-4e3a-bed5-615d8a08a381_1639791001/public/images/anomaly/2021-12-01-original.png" alt="PEWMA" />
The processed data is fed to the anomaly detection algorithm with the parameters $\alpha = 0.98, \beta = 0.98$, and $\tau = 0.0044$. The threshold $\tau$ is chosen to flag outliers that lie more than 3 standard deviations from the mean in normally distributed data, since the standard normal density at $z = 3$ is approximately $0.0044$. PEWMA in the original paper was designed to work for point anomalies.</p>
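<p>The univariate procedure described above can be sketched roughly as follows. This is a minimal sketch based on our reading of <a href="">[1]</a>, not the reference implementation: the function name <code>pewma_flags</code>, the warm-up length <code>training</code>, and the handling of a zero standard deviation are assumptions made for illustration.</p>

```python
import math


def pewma_flags(stream, alpha=0.98, beta=0.98, tau=0.0044, training=30):
    """Return one boolean per observation: True where the point is flagged."""
    s1 = s2 = 0.0
    flags = []
    for t, x in enumerate(stream, start=1):
        if t == 1:
            # initialise the running moments from the first data point
            s1, s2 = x, x * x
            flags.append(False)
            continue
        mean = s1
        std = math.sqrt(max(s2 - s1 * s1, 0.0))
        if std > 0.0:
            z = (x - mean) / std
        else:
            # assumption: with no spread yet, any different value is infinitely far
            z = 0.0 if x == mean else math.inf
        # standard normal density of the z-score
        p = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        warming_up = training >= t
        flags.append((not warming_up) and tau > p)
        # plain running average while warming up, probabilistic weight afterwards
        a = 1.0 - 1.0 / t if warming_up else alpha * (1.0 - beta * p)
        s1 = a * s1 + (1.0 - a) * x
        s2 = a * s2 + (1.0 - a) * x * x
    return flags


# toy stream that alternates between 10 and 11, with one spike at index 40
stream = [10.0 + (i % 2) for i in range(45)]
stream[40] = 50.0
print([i for i, flagged in enumerate(pewma_flags(stream)) if flagged])
```

<p>Because an unlikely point has a density close to zero, its weight stays close to $\alpha$ and it barely moves the running moments, which is how PEWMA avoids being dragged along by an abrupt transient shift.</p>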
    <h5 id="hypothesis-testing">Hypothesis Testing</h5>
    <p>A hypothesis is a subjective intuition about the problem. It can be guided by current best practices or by transferable skills from adjacent domains. Such educated guesses have to be verified empirically, so that preconceived intuitions are checked against reality. Let us look at some examples of hypotheses:</p>
  </li>
  <li>Will this vaccine work on a new virus?</li>
  <li>Will it rain today?
<br /><br />
Consider a school where the Physics teacher is generous with marks. The end-of-semester report has a class average of 80 with a standard deviation of 15. Assuming the scores are normally distributed, what is the probability that a student scores</li>
  <li>score &lt; 60 ?</li>
  <li>score &gt; 90 ?
<br /><br />
The following code snippet solves the problem: the probabilities of a student scoring less than 60 and greater than 90 are 0.09 and 0.25 respectively.
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math
def cumfunc(mean, sigma, xval):
  """
  @summary: cumulative distribution function of a normal distribution, i.e. the area under the curve to the left of xval.
  """
  z = (xval - mean) / (sigma * math.sqrt(2))
  y = 0.5 * (1 + math.erf(z))
  return y
if __name__ == '__main__':
  mean = 80; sigma = 15
  x = 60
  res = cumfunc(mean, sigma, x)  # &lt; 60
  print (round(res, 2)) # 0.09
  x = 90
  res = 1 - cumfunc(mean, sigma, x) # &gt; 90
  print (round(res, 2)) # 0.25
</code></pre></div>    </div>
    <p>The following source code visualizes the probabilities of the events described in the code snippet.</p>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
def contour(plt, score, x_axis, y_axis, colour, lessthan=True):
  x_temp = [x for x in x_axis if x &lt;= score]
  if lessthan:
      y_temp = y_axis[:len(x_temp)]
  else:
      y_temp = y_axis[len(x_temp): ]
      x_temp = [x for x in x_axis if x &gt; score]
  plt.fill_between(x_temp, 0, y_temp, facecolor=colour)
if __name__ == '__main__':
  mean = 80; sd = 15
  x_axis = np.arange(10, 140, 0.001)
  y_axis = norm.pdf(x_axis, mean, sd)  # x_axis must exist before it is used here
  plt.plot(x_axis, y_axis); plt.xlabel('Scores'); plt.ylabel('PDF')
  plt.title("Gaussian Distribution with Mean: {} and STD: {}".format(mean, sd))
  colour='#4dac26'; score = 60
  contour(plt, score, x_axis, y_axis, colour, lessthan=True)
  colour='#f1b6da'; score = 90
  contour(plt, score, x_axis, y_axis, colour, lessthan=False)
  plt.show()
</code></pre></div>    </div>
    <p><img src="https://iclr.iro.umontreal.ca/8e139ca5-2b60-4e3a-bed5-615d8a08a381_1639791001/public/images/anomaly/2021-12-01-chart3.png" alt="Normal Distribution" />
The probability of each event (score &lt; 60 and score &gt; 90) is captured by the area of the corresponding shaded region.
If the class average is 80 with a standard deviation of 15, the probability that a student scores less than 0 or greater than 120 is minuscule. Such extreme scores can be categorized as anomalies. The scores used in our examples act as thresholds, and the areas depicting their probabilities can be seen in the chart.
The rule of thumb for hypothesis testing can be summarized as follows:</p>
  </li>
  <li>Identify the Null and Alternative hypotheses.</li>
  <li>Choose a significance level by setting the threshold.</li>
  <li>Decide whether to reject the null hypothesis based on the significance level.
    <h3 id="extension">Extension</h3>
    <p>Our contribution begins here. We simplify the algorithm by ignoring the details of evolutionary computation in the <a href="http://www.cmap.polytechnique.fr/~nikolaus.hansen/ACECMUaa1p1CMAfES.pdf">paper</a>. The author of the blog post treats the premise of evolution described in the paper, moving from one generation to the next, as equivalent to moving from one state to another. This is analogous to how online algorithms handle dynamic changes as new data enters the stream. Cholesky decomposition is used extensively in the algorithms. The same <a href="http://www.cmap.polytechnique.fr/~nikolaus.hansen/ACECMUaa1p1CMAfES.pdf">paper</a> provides the basis for the online covariance matrix used in this work.</p>
    <h5 id="online-covariance-matrix">Online Covariance matrix</h5>
    <p>The mathematical formulation is as follows.</p>
    <ol>
      <li>Estimate the covariance matrix for the initial data, $X \in R^{n \times m}$, where $n$ is the number of samples and $m$ is the number of dimensions.
The initial covariance matrix is $C \in R^{m \times m}$. For mean-centered $X$, and up to the $\frac{1}{n-1}$ normalization,
\(\begin{equation}
C = {X}^T * X
\end{equation}\)</li>
      <li>Perform Cholesky factorization on the initial covariance matrix, $C$.
\(\begin{equation}
C_{t} = A_{t} * {A_{t}}^T
\end{equation}\)
We make use of Scipy’s Cholesky decomposition. The input matrix must be positive-definite, meaning that all of its eigenvalues are positive; this is a requirement of the Cholesky decomposition. For a quick primer on positive-definite and positive semi-definite matrices and their variants, see the <a href="https://www.cse.iitk.ac.in/users/rmittal/prev_course/s14/notes/lec11.pdf">tutorial</a>. The covariance matrix of a multivariate distribution is positive semi-definite. For more insight on how to create a positive-definite covariance matrix, see this <a href="https://www.researchgate.net/post/How_to_generate_positive-definite_covariance_matrices">discussion board</a>. The approach taken in this work is to convert the covariance matrix to the nearest positive-definite matrix, which is sufficient for our purpose; evaluate whether it suits your use case. Another reasonable default is to begin from a random positive-definite matrix as the covariance.</li>
      <li>The general form of the incremental covariance. The updated covariance in the presence of new data is the weighted average of the past covariance (without the new data) and the covariance of the transformed input.
\(\begin{equation}
C_{t+1} = \alpha * C_{t} + \beta * v_t * {v_t}^T
\end{equation}\)
Where $v_t = A_t * z_t$, and $z_t \in R^m$ is, in our implementation, the current data. $\alpha$ and $\beta$ are positive scalar values.</li>
      <li>Increment the Cholesky factor of the covariance matrix
\(\begin{equation}
A_{t+1} = \sqrt{\alpha} * A_t + \frac{\sqrt{\alpha}}{\Big\|z_t \Big\|^2} * \left( \sqrt{1 + \frac{\beta * \Big\|z_t \Big\|^2}{\alpha}} - 1 \right) * v_t * {z_t}^T
\end{equation}\)</li>
      <li>Setting the values of $\alpha$ and $\beta$ is difficult. We impose $\alpha + \beta = 1$, which makes the update an explicit exponential moving average, and the author chose to derive both values from the statistics of the data stream.
The parameters are set as $\alpha = {C_{a}}^2$ and $\beta = 1 - {C_{a}}^2$, where ${C_{a}} = \sqrt{1 - C_{cov}}$, $C_{cov} = \frac{2}{n^2+6}$, and $n$ is the size of the original data in the static setting. Substituting these values into the update of the Cholesky factor gives
\(\begin{equation}
A_{t+1} = {C_{a}} * A_t + \frac{C_{a}}{\Big\|z_t \Big\|^2} * \left( \sqrt{1 + \frac{(1 - {C_{a}}^2) * \Big\|z_t \Big\|^2}{{C_{a}}^2}} - 1 \right) * v_t * {z_t}^T
\end{equation}\)
The implementation is as follows.
        <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math
import numpy as np
def updateCovariance(alpha, beta, C_t, A_t, z_t):
  """
  @param: alpha, beta, A_t are parameters of the model
  @param: C_t is the old covariance matrix, z_t is the new data vector of shape (1, m)
  @return: C_tplus1 is the updated covariance matrix
  """
  v_t = np.dot(A_t, z_t.T)  # transformed input, shape (m, 1)
  C_tplus1 = (alpha * C_t)  + (beta * np.matmul(v_t, v_t.T))
  print ("v_t: {}, C_tplus1: {}".format(v_t.shape, C_tplus1.shape))
  return C_tplus1
def updateCholeskyFactor(alpha, beta, A_t, z_t):
  """
  @param: alpha, beta, A_t are parameters of the model
  @param: z_t is the new data vector of shape (1, m)
  @return: A_tplus1 is the updated Cholesky factor
  """
  v_t = np.dot(A_t, z_t.T)
  norm_z_sq = np.linalg.norm(z_t) ** 2  # squared norm, as in the update equation
  x = math.sqrt(alpha) * A_t
  w = beta * norm_z_sq / alpha
  y = math.sqrt(alpha) * (math.sqrt(1 + w) - 1) * np.dot(v_t, z_t) / norm_z_sq  # np.dot(v_t, z_t) is the outer product
  A_tplus1 = x + y
  print ("A_t: {}, A_tplus1: {}".format(A_t.shape, A_tplus1.shape))
  return A_tplus1
</code></pre></div>        </div>
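<p>One way to sanity-check the two updates above is to confirm numerically that the updated factor reproduces the updated covariance, that is $A_{t+1} {A_{t+1}}^T = C_{t+1}$. The sketch below does this on random data. The helper names (<code>ema_weights</code>, <code>update_covariance</code>, <code>update_factor</code>) are hypothetical, the data vector is one-dimensional rather than of shape (1, m), and NumPy's Cholesky routine stands in for Scipy's.</p>

```python
import math

import numpy as np


def ema_weights(n):
    """alpha and beta derived from the size n of the static data."""
    c_cov = 2.0 / (n ** 2 + 6.0)
    c_a = math.sqrt(1.0 - c_cov)
    alpha = c_a ** 2
    return alpha, 1.0 - alpha


def update_covariance(alpha, beta, C, A, z):
    v = A @ z
    return alpha * C + beta * np.outer(v, v)


def update_factor(alpha, beta, A, z):
    v = A @ z
    norm_sq = float(z @ z)  # squared Euclidean norm of z
    scale = math.sqrt(alpha) / norm_sq * (math.sqrt(1.0 + beta * norm_sq / alpha) - 1.0)
    return math.sqrt(alpha) * A + scale * np.outer(v, z)


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))   # static data: 200 samples, 4 dimensions
C = np.cov(X, rowvar=False)     # initial covariance, 4 x 4
A = np.linalg.cholesky(C)       # C = A A^T
alpha, beta = ema_weights(len(X))
z = rng.normal(size=4)          # new data vector

C_next = update_covariance(alpha, beta, C, A, z)
A_next = update_factor(alpha, beta, A, z)
print(np.allclose(A_next @ A_next.T, C_next))
```

<p>The updated factor is generally no longer triangular, but it still satisfies $C_{t+1} = A_{t+1} {A_{t+1}}^T$, which is all the covariance update needs.</p>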
        <p><br /><br /></p>
        <h5 id="online-inverse-covariance-matrix">Online Inverse Covariance matrix</h5>
        <p>The mathematical formulation is as follows.</p>
      </li>
      <li>Estimate the covariance matrix for the initial data, $X \in R^{n \times m}$, where $n$ is the number of samples and $m$ is the number of dimensions.
The initial covariance matrix is $C \in R^{m \times m}$. For mean-centered $X$, and up to the $\frac{1}{n-1}$ normalization,
\(\begin{equation}
C = {X}^T * X
\end{equation}\)
Invert the covariance matrix to obtain $C^{-1}$.
\(\begin{equation}
C^{-1} = \left( {X}^T * X \right)^{-1}
\end{equation}\)</li>
      <li>Perform Cholesky factorization on initial covariance matrix, $C$.
\(\begin{equation}
C_{t} = A_t * {A_t}^T
\end{equation}\)
We make use of Scipy’s Cholesky decomposition.</li>
      <li>General form of the incremental covariance.
The updated covariance in the presence of new data is the weighted average of the past covariance (without the new data) and the covariance of the transformed input.
\(\begin{equation}
C_{t+1} = \alpha * C_{t} + \beta * v_t * {v_t}^T
\end{equation}\)
Where $v_t = A_t * z_t$, and $z_t \in R^m$ is, in our implementation, the current data. $\alpha$ and $\beta$ are positive scalar values.</li>
      <li>Increment the inverse of the covariance matrix.</li>
    </ol>
  </li>
</ul>

\[\begin{equation}
C_{t+1}^{-1} = ({\alpha * C_t} + \beta * v_t * {v_t}^T)^{-1}
\end{equation}\]

\[\begin{equation}
C_{t+1}^{-1} = \alpha^{-1} * \{C_t + \frac{\beta * v_t * {v_t}^T}{\alpha}\}^{-1}
\end{equation}\]

<p>Let us fix $\hat{v_t} = \frac{\beta * v_t}{\alpha}$. Applying the Sherman–Morrison formula then reduces the expression to</p>

\[\begin{equation}
C_{t+1}^{-1} = \frac{1}{\alpha} * \left({{C_t}^{-1}} - \frac{{{C_t}^{-1}} * \hat{v_t} * {v_t}^T * {{C_t}^{-1}}}{1 + ({\hat{v_t}}^T * {{C_t}^{-1}} * v_t)} \right)
\end{equation}\]

<p>The implementation is as follows.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from numpy.linalg import multi_dot
def updateInverseCovariance(alpha, beta, invC_t, A_t, z_t):
    """
    @param: alpha, beta, A_t are parameters of the model
    @param: invC_t is the old inverse covariance matrix, z_t is the new data vector of shape (1, m)
    @return: invC_tplus1 is the updated inverse covariance matrix
    """
    print ("A_t: {}, z_t: {}".format(A_t.shape, z_t.shape))
    v_t = np.dot(A_t, z_t.T)
    hat_vt = (beta * v_t) / alpha
    print ("invC_t: {}, hat_vt: {}, v_t: {}, invC_t: {}".format(invC_t.shape, hat_vt.shape, v_t.shape, invC_t.shape))
    # Sherman-Morrison correction term: rank-one numerator over a scalar denominator
    y = multi_dot([invC_t, hat_vt, v_t.T, invC_t]) / (1 + multi_dot([hat_vt.T, invC_t, v_t]))
    invC_tplus1 = (invC_t - y) / alpha
    print ("invC_tplus1: {}".format(invC_tplus1.shape))
    return invC_tplus1
</code></pre></div></div>
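<p>A quick way to gain confidence in the Sherman–Morrison update is to compare it with a direct inversion of the updated covariance. The sketch below does so on random data. The function name <code>update_inverse</code>, the one-dimensional data vector, and the values of <code>alpha</code> and <code>beta</code> are assumptions chosen for illustration.</p>

```python
import numpy as np


def update_inverse(alpha, beta, inv_C, A, z):
    """Rank-one update of the inverse covariance via Sherman-Morrison."""
    v = A @ z
    v_hat = (beta / alpha) * v
    left = inv_C @ v_hat   # C^{-1} v_hat
    right = v @ inv_C      # v^T C^{-1}
    correction = np.outer(left, right) / (1.0 + v @ left)
    return (inv_C - correction) / alpha


rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
C = np.cov(X, rowvar=False)
A = np.linalg.cholesky(C)
inv_C = np.linalg.inv(C)
alpha, beta = 0.95, 0.05        # illustrative weights with alpha + beta = 1
z = rng.normal(size=3)
v = A @ z

# direct route: update the covariance, then invert the full matrix
direct = np.linalg.inv(alpha * C + beta * np.outer(v, v))
# incremental route: update the inverse without a fresh inversion
incremental = update_inverse(alpha, beta, inv_C, A, z)
print(np.allclose(direct, incremental))
```

<p>The incremental route costs a few matrix-vector products per new data vector, compared with a full matrix inversion for the direct route.</p>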
<h5 id="online-multivariate-anomaly-detection">Online Multivariate Anomaly Detection</h5>
<p>The anomaly score is a probability density, and the decision makes use of ideas from hypothesis testing. We decide on a threshold, which acts as a significance level, to separate the acceptance and rejection regions.</p>
<ol>
  <li>Use the covariance matrix, $C_{t+1}$ and inverse covariance matrix, ${C_{t+1}}^{-1}$.</li>
  <li>Increment the mean vector, $\mu$, as new data arrives. It is possible to simplify the covariance matrix, $C$, which will still capture a number of the dynamics of the system. Let $n$ be the count of data seen before the new data arrived,
$\hat{x}$ the new data, and $\mu_{t+1}$ the updated moving average.
\(\begin{equation}
\mu_{t+1} = \frac{(n * \mu_t) + \hat{x}}{n+1}
\end{equation}\)</li>
  <li>Set a threshold to determine the acceptance and rejection regions. Items in the acceptance region are considered to be normal behavior.
\(\begin{equation}
p(x)=\frac{1}{\sqrt{(2\pi)^m|C|}} \exp\left(-\frac{1}{2}(x-\mu)^T{C}^{-1}(x-\mu) \right)
\end{equation}\)
Where $\mu$ is the mean vector, $C$ is the covariance matrix, $|C|$ is the determinant of $C$, $x \in R^{m}$ is the data vector, and $m$ is the dimension of $x$.
The implementation is as follows.
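    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from scipy.stats import multivariate_normal
def anomaly(x, mean, cov, threshold=0.001):
 """
 @param: x is the current data vector
 @param: mean is mean vector
 @param: cov is covariance matrix
 @param: threshold is the density below which x would be treated as anomalous (not applied here)
 @return: score, the density of x under the multivariate normal
 """
 score = multivariate_normal.pdf(x, mean=mean, cov=cov)
 return score
def updateMean(mean, z, n):
 """
 @param: mean is the current mean vector, z is the new data vector
 @param: n is the count of data seen before z arrived
 @return: mean_tplus1 is the updated mean vector
 """
 mean_tplus1 = ((n * mean) + z) / (n + 1)
 return mean_tplus1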
</code></pre></div>    </div>
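<p>To show how the maintained inverse covariance can be used directly in the density, the sketch below evaluates the multivariate normal PDF from $C^{-1}$ and $|C|$ and applies a threshold. It is a simplified illustration under our own assumptions: the names <code>gaussian_density</code>, <code>is_anomaly</code>, and <code>update_mean</code> are hypothetical, and the determinant is passed in rather than maintained online.</p>

```python
import math

import numpy as np


def update_mean(mean, x, n):
    """Running mean after seeing x, where n points were seen before it."""
    return (n * mean + x) / (n + 1)


def gaussian_density(x, mean, inv_cov, det_cov):
    """Multivariate normal density computed from the inverse covariance."""
    d = x - mean
    m = x.shape[0]
    mahalanobis_sq = float(d @ inv_cov @ d)
    return math.exp(-0.5 * mahalanobis_sq) / math.sqrt((2.0 * math.pi) ** m * det_cov)


def is_anomaly(x, mean, inv_cov, det_cov, threshold=0.001):
    """True when the density of x falls below the threshold (rejection region)."""
    return threshold > gaussian_density(x, mean, inv_cov, det_cov)


mean = np.zeros(2)
inv_cov = np.eye(2)   # identity covariance, so the inverse is also the identity
det_cov = 1.0

print(is_anomaly(np.array([0.1, -0.2]), mean, inv_cov, det_cov))  # near the mean
print(is_anomaly(np.array([4.0, 4.0]), mean, inv_cov, det_cov))   # far from the mean
print(update_mean(np.array([1.0, 1.0]), np.array([3.0, 3.0]), n=1))
```

<p>Note that a density threshold depends on the dimension $m$, so a value that works for two dimensions would need to be revisited for fifteen.</p>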
    <p><br /><br />
We also provide an object-oriented implementation with a cleaner API.</p>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
# probabilisticMultiEWMA is the detector class from our implementation
seed = 0
np.random.seed(seed)
X = np.random.rand(1000,15)
z_t0 = np.random.rand(1,15) # new data
# single case predict
anom = probabilisticMultiEWMA()
anom.init(X)
z_t0 = np.random.rand(1,15) # new data
anom.update(z_t0)
z_t1 = np.random.rand(1,15) # next new data
print ("score: {}".format(anom.predict(z_t1)))
Z = np.random.rand(1000,15)
# bulk predict
anom = probabilisticMultiEWMA()
anom.init(X)
pred = anom.bulkPredict(Z)
print (pred)
</code></pre></div>    </div>
    <h3 id="experiments">Experiments</h3>
    <p>To evaluate the usefulness of our algorithm, we created a simulation with 10,000,000 vectors of dimension 15. Repeated trials show that our algorithm is not sensitive to the initialization seed or to the dimensions of the matrix. This requirement was a deciding factor in the choice of the evaluation metric; more information on the metric is provided in the Discussion section.
The experiments are designed to find the trade-off between the static window and the update window. The source code for the experiments can be found <a href="withheld due to dobule blind requirements">here</a>.</p>
    <h5 id="experiment-1">Experiment 1</h5>
    <p>The goal of this experiment is to check the effect of varying the size of the initial static window versus the update window.
The experiment setup is as follows.</p>
    <ul>
      <li>Split the data into 5 segments.</li>
      <li>Train on segment 1 (static), update the covariance on segment 2 (online), compare with the static covariance, and record the error.</li>
      <li>Train on segments 1, 2 (static), update the covariance on segment 3 (online), compare with the static covariance, and record the error.</li>
      <li>Train on segments 1, 2, 3 (static), update the covariance on segment 4 (online), compare with the static covariance, and record the error.</li>
      <li>Train on segments 1, 2, 3, 4 (static), update the covariance on segment 5 (online), compare with the static covariance, and record the error.
<img src="https://iclr.iro.umontreal.ca/8e139ca5-2b60-4e3a-bed5-615d8a08a381_1639791001/public/images/anomaly/2021-12-01-chart1.png" alt="Experiment 1" />
        <h5 id="experiment-2">Experiment 2</h5>
        <p>The goal of this experiment is again to check the effect of varying the size of the initial static window versus the update window, this time updating on all of the remaining segments.
The experiment setup is as follows.</p>
      </li>
      <li>Split the data into 5 segments.</li>
      <li>Train on segment 1 (static), update the covariance on the remaining segments 2, 3, 4, 5 (online), compare with the static covariance, and record the errors on segments 2, 3, 4, 5.</li>
      <li>Train on segments 1, 2 (static), update the covariance on the remaining segments 3, 4, 5 (online), compare with the static covariance, and record the errors on segments 3, 4, 5.</li>
      <li>Train on segments 1, 2, 3 (static), update the covariance on the remaining segments 4, 5 (online), compare with the static covariance, and record the errors on segments 4, 5.</li>
      <li>Train on segments 1, 2, 3, 4 (static), update the covariance on the remaining segment 5 (online), compare with the static covariance, and record the error on segment 5.
<img src="https://iclr.iro.umontreal.ca/8e139ca5-2b60-4e3a-bed5-615d8a08a381_1639791001/public/images/anomaly/2021-12-01-chart2.png" alt="Experiment 2" />
        <h3 id="discussion">Discussion</h3>
        <p>Each matrix is flattened to a vector before it is used as input to the metric. Dividing by the length of the vector makes the loss metric agnostic to the dimension of the matrix. The loss function used in the evaluation is Absolute Average Deviation (AAD), because it gives a tighter bound on the error in comparison to MSE or MAD. This is because we take the average of the residuals divided by the ground truth for every sample in our evaluation set. If the residual is close to zero, it contributes almost nothing to the measure. However, if the residual is large, we want to know how large it is relative to the ground truth. This scaling by the ground truth may explain why the metric tends to be conservative in regression analysis.
\(\begin{equation}
AAD = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{\hat{Y_i} - Y_i}{Y_i} \right|
\end{equation}\)
Where $\hat{Y_i}$ is the predicted value, $Y_i$ is the ground truth, and $n$ is the length of the flattened matrix.
From our experiments, we can see that building the model with more data in the init (static) phase tends to lead to smaller errors than using less data in the init phase and more of the data for updates. The observation matches our intuition: operating in online mode uses less storage space, but there is a performance trade-off in comparison to batch mode.
<br /><br />
The error at the beginning of training is huge in both charts. This suggests that, rather than performing the expensive operation of converting a covariance matrix to be positive definite, it is better to start from a random positive-definite matrix; the estimate then converges as more data arrives.
Modeling the normal behavior of a system poses many challenges for anomaly detection methods. Abnormal behavior is a deviation from the anticipated normal behavior of the system. Many anomaly detection methods are susceptible to adversarial attacks. In a supervised setting, getting labeled data can be expensive. The definition of noise can be ambiguous.
The success of the experiments gives us confidence that the enhancement to support the multivariate case has characteristics similar to the univariate case described in the original paper.</p>
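<p>The metric described in this section, the residual divided by the ground truth and averaged over the flattened matrix, can be sketched as follows. The function name <code>aad</code> and the example matrices are hypothetical, and the sketch assumes that no ground-truth entry is zero.</p>

```python
import numpy as np


def aad(predicted, truth):
    """Average of |predicted - truth| / |truth| over the flattened matrices."""
    predicted = np.asarray(predicted, dtype=float).ravel()
    truth = np.asarray(truth, dtype=float).ravel()
    # assumption: every ground-truth entry is non-zero
    return float(np.mean(np.abs((predicted - truth) / truth)))


static_cov = np.array([[2.0, 4.0], [4.0, 8.0]])   # stands in for the ground truth
online_cov = np.array([[2.2, 4.0], [4.0, 7.2]])   # stands in for the online estimate
print(aad(online_cov, static_cov))
```

<p>Here two of the four entries are off by ten percent and two are exact, so the result is about 0.05 regardless of the size of the matrix.</p>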
        <h3 id="conclusion">Conclusion</h3>
        <p>There is no generic anomaly detection method that works for every possible task; it has to be tuned for your purpose. The underlying assumption in this work is that the features used capture significant information about the underlying dynamics of the system. Future work can include making the multivariate extension production-ready.</p>
        <h3 id="references">References</h3>
      </li>
      <li><a href="">[1]</a> Kevin M. Carter and William W. Streilein. Probabilistic reasoning for streaming anomaly detection. In Proceedings of the Statistical Signal Processing Workshop, pages 377–380, 2012.</li>
      <li><a href="">[2]</a> Victoria Hodge and Jim Austin. A survey of outlier detection methodologies. Artificial Intelligence Review, 22(2):85–126, 2004.</li>
      <li><a href="">[3]</a> Gregory Ditzler and Robi Polikar. Incremental learning of concept drifts from streaming imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 25(10):2283–2301, 2013.</li>
    </ul>
  </li>
</ol>

    <hr>
  </div>
  
</div>


      </div>
    </div>

    <label for="sidebar-checkbox" class="sidebar-toggle"></label>

    <script src='/public/js/script.js'></script>
  </body>
</html>
