% ====================================================================================================
\section{The \ac{ood} Blind Spot of Unsupervised Anomaly Detection}
\label{sec:methods}
The following describes the theoretical foundations of the unsupervised lesion detection framework based on the \ac{vae} and its limitations caused by the entanglement of domain-shift and lesion effects, before discussing proposals to overcome these difficulties. Note that while we discuss the limitations using \ac{vae}-based approaches as an example, they also extend to other unsupervised methods \cite{Schlegl2019} that equate anomaly detection with \ac{ood} detection.
The objective of unsupervised lesion detection (\emph{cf.} Fig. \ref{fig:teaser}) is to train a generative model $\textbf{f}_{\theta}(\cdot)$ on a set of healthy images $\textbf{X}=\{\textbf{x}^{(i)}\}_{i=1}^{N}$, where $\textbf{x}^{(i)}\in\mathbb{R}^{m\times n}$, to predict whether a query sample $\textbf{x}_{q}^{(i)}\in\mathbb{R}^{m\times n}$ is anomalous, i.e. contains lesions (Sec. \ref{sec:methods:pixel_wise_anomaly_detection}), and to obtain a pixel-wise lesion map $\textbf{l}\in \{0, 1\}^{m\times n}$. We hypothesize that a sample may be predicted to be anomalous either because it contains actual lesions or because of a domain shift. Since the commonly used metrics for anomaly detection do not differentiate between these two sources, a domain shift might cause the model to generate unreliable predictions.

% ====================================================================================================
\subsection{Background}
\label{sec:methods:background}
Generative models have been widely adopted for unsupervised lesion detection in MR images \cite{Baur2020a}.
The predominant approach is to approximate the generally intractable data distribution $p(\textbf{x})$ of a set of healthy images $\textbf{X}$ using the VAE framework.

VAEs \cite{Kingma2014a} address the generally intractable integral $p(\mathbf{x})=\int p(\mathbf{x}|\mathbf{z})p(\mathbf{z})\,d\mathbf{z}$. They do so by introducing a surrogate posterior distribution $q(\mathbf{z}|\mathbf{x})$ and approximating the data log-likelihood $\log p(\textbf{x})$ by maximizing the so-called \textit{Evidence Lower Bound (ELBO)} $\mathcal{L}$
\begin{equation}
\label{equ:elbo-vae}
    \log p(\textbf{x}) \geq \mathcal{L} = \underbrace{E_{q_{\phi}(\textbf{\small{z}}|\textbf{\small{x}})} [\log p_{\theta}(\textbf{x}|\textbf{z})]}_\text{Reconstruction\ Error}   - \underbrace{D_{KL}[q_{\phi}(\textbf{z}|\textbf{x}) || p(\textbf{z})]}_\text{Prior Loss},
\end{equation}
where $q_{\phi}(\textbf{z}|\textbf{x})$ and $p_{\theta}(\textbf{x}|\textbf{z})$ are modelled as neural networks with parameters $\phi$ and $\theta$, respectively. The former encodes input samples into the latent space $\textbf{z}$ and thus is denoted as the \textit{encoder}. The latter is trained to reconstruct the input from this latent representation and hence is called the \textit{decoder}.
This framework allows for posterior inference via the learned approximate posterior distribution $q_{\phi}(\textbf{z}|\textbf{x})$. During training, this posterior is pushed towards a prior $p(\textbf{z})$, in our case $\mathcal{N}(\textbf{0}, I)$, by minimizing their KL divergence, which we denote as the prior loss.
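
As an illustration of the prior loss, the following sketch gives its closed form under the common assumption, not stated above, that the encoder outputs a diagonal Gaussian $q_{\phi}(\textbf{z}|\textbf{x})=\mathcal{N}(\boldsymbol{\mu}, \mathrm{diag}(\boldsymbol{\sigma}^{2}))$ over a $d$-dimensional latent space. The symbols $\boldsymbol{\mu}$, $\boldsymbol{\sigma}$ and $d$ are introduced here for illustration only. For the prior $\mathcal{N}(\textbf{0}, I)$, the KL divergence then reads
\begin{equation*}
    D_{KL}[q_{\phi}(\textbf{z}|\textbf{x}) || p(\textbf{z})] = \frac{1}{2}\sum_{k=1}^{d}\left(\mu_{k}^{2} + \sigma_{k}^{2} - 1 - \log \sigma_{k}^{2}\right),
\end{equation*}
which vanishes exactly when $\boldsymbol{\mu}=\textbf{0}$ and $\boldsymbol{\sigma}^{2}=\textbf{1}$, i.e. when the approximate posterior coincides with the prior.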

Given a VAE trained on the healthy images $\textbf{X}$, a potentially lesional test sample $\tilde{\textbf{x}}^{(i)}$ is passed through the generative model to obtain a reconstruction $\hat{\textbf{x}}^{(i)}$. Since the model is trained exclusively on healthy samples, it is expected to fail at reconstructing lesional components while succeeding in reconstructing the healthy parts of the input. A pixel-wise anomaly segmentation map can thus be retrieved from the pixel-wise residual $\textbf{r}=\|\hat{\textbf{x}}^{(i)} - \tilde{\textbf{x}}^{(i)}\|_p\in\mathbb{R}^{m\times n}$, where $\|\cdot\|_{p}$ is the $\ell_{p}$-norm applied per pixel and $p$ is chosen to be $1$. A pixel is marked as anomalous if its value in the residual image $\textbf{r}$ exceeds some threshold $\tau$ (\emph{cf.}~Sec.~\ref{sec:appendix}).
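
To make the thresholding step concrete, the following sketch writes it out per pixel. The pixel indices $j,k$ and all numerical values are hypothetical and serve for illustration only. With $p=1$, the residual at pixel $(j,k)$ is $r_{jk}=|\hat{x}^{(i)}_{jk}-\tilde{x}^{(i)}_{jk}|$, and the binary lesion map follows as
\begin{equation*}
    l_{jk}=\begin{cases}1 & \text{if } r_{jk}>\tau,\\ 0 & \text{otherwise.}\end{cases}
\end{equation*}
For example, for a $2\times 2$ residual and an assumed threshold $\tau=0.2$,
\begin{equation*}
    \textbf{r}=\begin{pmatrix}0.02 & 0.40\\ 0.10 & 0.05\end{pmatrix}
    \quad\Rightarrow\quad
    \textbf{l}=\begin{pmatrix}0 & 1\\ 0 & 0\end{pmatrix},
\end{equation*}
so that only the pixel with residual $0.40$ is marked as anomalous.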

% ====================================================================================================
\subsection{OOD Detection}
\label{sec:methods:ood_detection}
For a given dataset $\{\textbf{x}^{(i)}\}_{i=1}^{N}, \textbf{x}^{(i)} \in \mathbb{R}^{m\times n}$, sampled from a distribution $p_{data}(\textbf{x})$, OOD detection aims to determine whether a novel sample $\textbf{x}_{q}$ is drawn from the same data-generating distribution $p_{data}(\textbf{x})$ or from some other, unknown distribution.

At test time, the model might be exposed to samples from the following three categories (\emph{cf.}~Fig.~\ref{fig:loss-term-hists}):
\begin{enumerate}[label=(\roman*),nosep]
\item \textbf{healthy \& in-distribution}, anomaly-free images from the same domain as the training data (e.g. CamCAN T2),
\item \textbf{healthy \& \ac{ood}}, anomaly-free images with a domain shift relative to the training data (e.g. BraTS T2 healthy),
\item \textbf{lesional \& \ac{ood}}, images with lesions regardless of domain shift (e.g. BraTS T2 lesional).
\end{enumerate}

% === Loss term statistics histograms evaluation on multiple OOD datasets
\begin{figure}[t]
\centering
  \includegraphics[height=0.3\textheight]{figures/loss_histograms.png}%
  \caption{Densities for sample-wise loss term contributions (cf.~Equ.~\ref{equ:elbo-vae}) for various OOD datasets. \textit{CamCAN T2} (teal) represents the in-distribution training data containing only healthy slices. \textit{CamCAN T2 lesion} holds samples from \textit{CamCAN T2} with artificially added Gaussian blobs that simulate lesional samples, as explained in Sec.~\ref{sec:experiments} and shown in Fig.~\ref{fig:brain_samples}. All other datasets can be regarded as \ac{ood}. We conclude that none of the commonly used metrics, that is, $D_{KL}$, $\ell_{1}$ (reconstruction error) or $\mathcal{L}$, is able to differentiate whether a sample is detected as abnormal due to a domain shift or due to actual lesions.}
  \label{fig:loss-term-hists}
\end{figure}

To gain an intuition for the capabilities of each \ac{ood} score, we first assess its overall capability to distinguish between in- and out-of-distribution samples. The following metrics act as OOD scores, for each of which we report the areas under the ROC and PRC curves, $AU_{ROC}$ and $AU_{PRC}$: (1) the mean reconstruction error $\ell_{1}$ per pixel for a whole sample, (2) the KL divergence $D_{KL}$ between posterior and prior, and (3) the ELBO $\mathcal{L}$ from Equ.~\ref{equ:elbo-vae}. Furthermore, we exploit recent OOD metrics, namely (4) the $WAIC=E_{\theta}[\log p_{\theta}(\textbf{x})]-Var_{\theta}[\log p_{\theta}(\textbf{x})]$ score \cite{Choi2019}, where $\theta$ denotes the parameters of the models in an ensemble, and (5) the $DoSE=\sum_{j}KDE_{j}(\textbf{x})$ score \cite{Morningstar2020}. For $DoSE$, metrics (1)--(3) are used as training statistics. Interestingly, we find that one of the recently proposed scores indeed outperforms all metrics commonly used in unsupervised lesion detection.

\subsection{Disentangling Lesional and Non-Lesional \ac{ood} Samples}
Our results (cf.~Sec.~\ref{sec:results:ood_detection}) suggest that these classical OOD detection scores are incapable of discriminating between healthy and lesional samples, which is in agreement with the findings from Fig.~\ref{fig:loss-term-hists}. However, the \textit{DoSE} score offers the possibility to craft statistics that potentially help to disentangle the OOD scores of samples originating from groups (i)--(iii).

\paragraph{Assumption} Lesional samples are expected to show large residual errors confined to relatively small regions, since the pixels within a tumor dominate the reconstruction error. A (healthy) \ac{ood} sample, on the other hand, should show a steady but spread-out error due to global domain-shift effects. This corresponds to a large uncertainty in the pixel-wise error distribution, which might be captured by the following entropy measure.
We extend the framework with an entropy statistic $H_{\ell_{1}}$ and investigate its suitability for disentangling the underlying reasons why a sample is marked as OOD. The normalized sample-wise entropy score is calculated on a normalized residual map $\mathbf{r}$ via
\[
    H_{\ell_{1}}(\mathbf{r})=-\sum_{j=1}^{n_{p}}\frac{\mathbf{r}_{j}\log \mathbf{r}_{j}}{\log n_p},
\]
where $n_p$ is the number of pixels within the brain mask.
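
The behaviour of this statistic can be illustrated with a small worked example. It is a sketch under two assumptions that are not spelled out above: the normalized residual is non-negative and sums to one over the brain mask, $\sum_{j}\mathbf{r}_{j}=1$, and $0\log 0$ is taken to be $0$. The number $k$ of affected pixels is a hypothetical quantity introduced for illustration. If the residual is spread uniformly over $k$ of the $n_p$ pixels, i.e. $\mathbf{r}_{j}=1/k$ on those pixels and $0$ elsewhere, then
\begin{equation*}
    H_{\ell_{1}}(\mathbf{r}) = -\sum_{j=1}^{k}\frac{\frac{1}{k}\log\frac{1}{k}}{\log n_p} = \frac{\log k}{\log n_p}.
\end{equation*}
A residual concentrated in a single pixel ($k=1$) yields $H_{\ell_{1}}=0$, whereas a residual spread evenly over the whole brain mask ($k=n_p$) yields the maximum $H_{\ell_{1}}=1$. For instance, with $n_p=10^{4}$ and $k=10^{2}$, one obtains $H_{\ell_{1}}=0.5$. Under these assumptions, a localized lesion would therefore tend towards lower entropy values than a global domain shift.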

To conclude, our framework provides two scores: a global OOD score in the form of $WAIC$ or $DoSE$, and a second score based on $H_{\ell_{1}}$ that specifically addresses the problem of slice-wise anomaly detection.



