\section{Introduction}
MRI (Magnetic Resonance Imaging) is the primary screening method used to detect, assess, and segment brain pathologies for diagnosis and subsequent treatment planning. While supervised Deep Learning approaches provide state-of-the-art lesion segmentation techniques \cite{valverde2017improving,BraTS17}, they are constrained to the distribution of anomalies seen during training and require corresponding pixel-wise ground-truth labels from domain experts. Obtaining these labels is expensive and subject to inter- and intra-rater ambiguity \cite{moraal2010}. Unsupervised methods, on the other hand, have the potential to serve as a class-agnostic anomaly detection framework and might act as a quality assurance tool for practitioners without the need to curate large, specialized training datasets.

Nevertheless, unsupervised anomaly detection techniques exploit \ac{ood} detection as their working principle, which makes them vulnerable to actual \ac{ood} samples during inference. These are samples that may or may not contain lesions but, more importantly, originate from a different domain than the training distribution, for example because a different scanner model was used or because acquisition parameters such as the magnetic field strength changed. While unreliable model behavior on \ac{ood} samples is well known and has received much attention for supervised approaches \cite{Martensson2020}, the issue is by nature even more pronounced for unsupervised anomaly detection, which equates anomaly detection with \ac{ood} detection. Models performing unsupervised anomaly detection on MRI will be presented with data whose generating process is governed by numerous factors that might provoke a domain-shift. This is of great concern for safe model deployment, yet it is usually not addressed because test samples are assumed to follow the same distribution as the training data. Furthermore, datasets that contain both healthy and lesional samples are typically not publicly available. This complicates the investigation of how severely model performance degrades under such domain-shifts and makes the problem particularly challenging, since there is no access to lesional samples originating from the same domain. Thus, it is important to investigate whether common approaches can disentangle the underlying factors that cause a sample to be \ac{ood}, i.e., whether it is lesional or not. To this end, we also examine whether novel scores that arise from \ac{ood} detection using neural networks yield improvements with respect to finding the underlying cause.

\begin{figure}[t]
\centering
  \includegraphics[height=0.2\textheight]{figures/teaser.png}%
  \caption{Working principle of unsupervised lesion detection based on \acp{vae} as used in this work. During training, a VAE learns to approximate the distribution of healthy images by maximizing the so-called Evidence Lower Bound (ELBO) $\mathcal{L}$, cf.~Eq.~\ref{equ:elbo-vae}. During inference, the residual map of an input test image yields the pixel-wise lesion detection map. Since this process inherently equates \ac{ood} detection with lesion detection, it is blind to domain-shift effects. Thus, disentangling the sources of abnormality of the image (\ac{ood}?, lesional?), be it a domain-shift or an actual lesion, is of utmost importance and should be considered by default.}
  \label{fig:teaser}
\end{figure}

\paragraph{Unsupervised Lesion Detection}
Prior to the rise of Deep Learning-based approaches, several works proposed to detect lesions in an unsupervised manner, for example by registering images to a standardized healthy brain atlas and fitting, amongst others, mixture models based on tissue-specific densities to detect lesions as model outliers \cite{kamber1995model,VanLeemput2001,prastawa2004brain}. More recently, Deep Generative models based on \acp{vae} \cite{Kingma2014a} and Generative Adversarial Networks (GAN) \cite{Goodfellow2014, Schlegl2017, Schlegl2019} have become popular due to their ability to model high-dimensional distributions, essentially learning what is referred to as the normative data distribution. During inference, anomaly detection is then performed by assessing the deviation of test samples from the training distribution. Within \ac{vae}-based frameworks, one common way to do this is via the reconstruction error between the input sample and its reconstruction. For images, pixel-level anomaly detection is performed by thresholding the residual map between an input image and its reconstruction, as shown in Fig.~\ref{fig:teaser}. This builds upon the assumption that lesional regions have high reconstruction errors since they deviate from the training distribution. Recently, different metrics for pixel-level anomaly detection have been proposed by \cite{Zimmerer2020}. \cite{Baur2019a} introduced a \ac{vae}-based framework incorporating adversarial training to improve the realism of reconstructions and avoid memorization. \cite{Chen2018a} identified a lack of latent space consistency and improved it by adding a regularizing constraint. \cite{Chen2020} used a VAE with a mixture of Gaussians in the latent space, which is a more expressive prior distribution. In addition, they applied image restoration to the reconstructed image prior to assessing the residual image. \cite{Baur2020a} presented a comprehensive comparative study of recent approaches to unsupervised lesion detection.
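
As a minimal sketch of the residual-based detection described above, let $x$ denote an input image and $\hat{x}$ its reconstruction. The absolute-difference residual and the scalar threshold $\tau$ are illustrative assumptions in notation introduced here, not necessarily the exact formulation used by the cited works:
\[
  r_{ij} = | x_{ij} - \hat{x}_{ij} |, \qquad m_{ij} = \mathbf{1}\left[ r_{ij} > \tau \right],
\]
where $r$ is the residual map, $m$ is the resulting binary detection map, and $\mathbf{1}[\cdot]$ equals one if its argument holds and zero otherwise. Under this sketch, any pixel that is poorly reconstructed is marked as anomalous, regardless of whether the cause is a lesion or a domain-shift.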

\paragraph{Out-of-Distribution Detection}
The unsupervised lesion detection techniques introduced above are based on the idea of detecting \ac{ood} samples using generative models, since lesional samples do not fit the training distribution. There has been a rising interest in \ac{ood} detection, driven by the need to enable safe and interpretable model deployment, since machine learning models usually perform worse on \ac{ood} samples \cite{Louizos2017, Goodfellow2014}. Recently, \cite{Martensson2020} raised awareness of this issue for medical MRI data in particular. Generative models seem to offer a principled approach to detecting \ac{ood} samples by applying a single-sided threshold to the data log-likelihood under the model of the training distribution \cite{bishop1994novelty}. However, recent work \cite{Choi2019,Nalisnick2019a} has shown that generative models might assign a higher likelihood to \ac{ood} data than to in-distribution data, rendering this method problematic. \cite{Choi2019} also proposed the so-called \ac{waic} score, which gives an asymptotically correct estimate of the gap between the training set and test set expectations. While this metric does not address the notion of typicality as \cite{Nalisnick2019} does, that is, assessing where the largest amount of probability mass resides within a high-dimensional feature space, it works surprisingly well in practice. Recently, \cite{Morningstar2020} introduced another density-based \ac{ood} detection framework that aggregates various inference statistics, e.g., the reconstruction error, into the so-called Density of States Estimation (DoSE) score. Specifically, they fit a Kernel Density Estimator (KDE) to the distribution of each statistic evaluated on the training data and mark novel samples as \ac{ood} by thresholding their sum of likelihoods under said estimators.
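
For orientation, the scores discussed above can be sketched compactly. The notation is introduced here for illustration and is an assumption: $p_\theta(x)$ denotes the model likelihood of a sample $x$, expectations and variances over $\theta$ are taken over an ensemble of models (or samples from a posterior over parameters), $T_j$ denotes the $j$-th inference statistic, and $q_j$ denotes the KDE fitted to $T_j$ on the training data. The likelihood-based rule flags $x$ as \ac{ood} if $\log p_\theta(x) < \tau$ for some threshold $\tau$. The \ac{waic} score penalizes the expected log-likelihood by its variance,
\[
  \mathrm{WAIC}(x) = \mathrm{E}_{\theta}\left[ \log p_\theta(x) \right] - \mathrm{Var}_{\theta}\left[ \log p_\theta(x) \right],
\]
and a DoSE-style score aggregates the per-statistic density estimates, here written with log-densities, which is an assumption on our part,
\[
  \mathrm{DoSE}(x) = \sum_{j} \log q_j\left( T_j(x) \right).
\]
In both cases, low values indicate that $x$ is atypical with respect to the training data.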

\paragraph{Contributions}
This work raises awareness of the issue that the predominant approach to unsupervised lesion detection is particularly vulnerable to \ac{ood} samples. To this end, we assess multiple common metrics for anomaly and \ac{ood} detection and conclude that their predictions do not reflect the true underlying reason for a sample to be labeled abnormal.
While this work does not aim to provide state-of-the-art lesion segmentation performance, we explore concepts originating from recent work on \ac{ood} detection for Deep Generative Models to enhance model robustness when presented with \ac{ood} data. More precisely, we deploy and adapt recent approaches for \ac{ood} detection to answer the following questions: Are \ac{ood} detection metrics suitable for sample-wise anomaly detection, and is it possible to disentangle lesion-based \ac{ood} samples from their non-lesion-based counterparts? Finally, we explore the use of prior knowledge in the form of the entropy of the residual map in an attempt to disentangle the influencing factors of lesions and domain-shifts during inference.



