\newpage
\section{Appendix}
\label{sec:appendix}
\subsection{Implementation Details}
We follow the generalised encoder-decoder architecture from \cite{Baur2020a}. Each layer consists of a 2D convolution followed by batch normalisation and a LeakyReLU activation function. Models are trained for 80 epochs with a batch size of 128 and linear $\beta$-annealing (where $\beta$ is the weight of the KL term) from $0.0$ to $0.3$ over the first 5 epochs. The final value of $\beta$ was chosen empirically by maximizing reconstruction performance (a lower $\beta$ is better) while still producing visually coherent samples. The Adam optimizer is used with an initial learning rate of $10^{-4}$. The full source code will be made publicly available at \url{https://github.com/matthaeusheer/uncertify}.
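
For illustration, the linear $\beta$-annealing described above can be sketched as follows. This is a minimal sketch and not the released implementation: it assumes that $\beta$ is updated once per epoch with zero-based epoch indices, and the function name \texttt{beta\_schedule} and its parameters are hypothetical.

```python
def beta_schedule(epoch: int, beta_max: float = 0.3, warmup_epochs: int = 5) -> float:
    """Linearly increase the KL weight from 0 to beta_max over warmup_epochs.

    Assumption: one update per epoch, epochs counted from 0.
    After the warm-up, beta stays constant at beta_max.
    """
    if warmup_epochs <= 0:
        return beta_max
    return beta_max * min(1.0, epoch / warmup_epochs)


# KL weight for each of the 80 training epochs.
schedule = [beta_schedule(epoch) for epoch in range(80)]
```

Under these assumptions the KL term is switched off in the first epoch and reaches its full weight of $0.3$ from the sixth epoch onwards.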

\subsection{Pixel-wise Anomaly Detection}
\label{sec:appendix:pixel_wise_ad}

\paragraph{Determining the Pixel-wise Lesion Detection Threshold} In an unsupervised setting there is no access to ground truth labels, so hyper-parameters such as the detection threshold cannot be tuned with regard to the final metrics of interest, e.g. the Dice score. Instead, we follow the approach in \cite{Konukoglu2018} and assume that any training-set pixel marked as anomalous by the anomaly detection algorithm is a false positive. We set a limit $l_{FPR}$ on the acceptable false positive rate $FPR_{train}$ and determine the threshold that satisfies this constraint. The threshold is computed via the Golden Section Search algorithm \cite{Kiefer1953SequentialMS} by solving the optimization problem $\tau = \underset{t}{\arg\min}\left| FPR_{train}(t)-l_{FPR} \right|$.
Finally, the threshold $\tau$ is used to convert the residual map $\textbf{r}$ into a binary lesion map, from which the final segmentation metrics are computed per patient on unseen test data.
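
The threshold search can be sketched as below. This is an illustrative sketch under stated assumptions, not the code used for the experiments: training residuals are assumed to be available as a flat sequence of per-pixel values, the search interval is assumed to span their minimum and maximum, and all function names are hypothetical. Because the empirical false positive rate is a step function of $t$, the result is only accurate up to the spacing of the residual values.

```python
import bisect
import math


def make_fpr(train_residuals):
    """Return FPR_train(t): fraction of training pixels with residual > t."""
    sorted_res = sorted(train_residuals)
    n = len(sorted_res)

    def fpr(t):
        # bisect_right counts the values <= t; the remainder exceeds t.
        return (n - bisect.bisect_right(sorted_res, t)) / n

    return fpr


def golden_section_minimize(f, lo, hi, tol=1e-6):
    """Minimise a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    c = hi - inv_phi * (hi - lo)
    d = lo + inv_phi * (hi - lo)
    fc, fd = f(c), f(d)
    while hi - lo > tol:
        if fc < fd:
            # Minimum lies in [lo, d]; the old c becomes the new d.
            hi, d, fd = d, c, fc
            c = hi - inv_phi * (hi - lo)
            fc = f(c)
        else:
            # Minimum lies in [c, hi]; the old d becomes the new c.
            lo, c, fc = c, d, fd
            d = lo + inv_phi * (hi - lo)
            fd = f(d)
    return 0.5 * (lo + hi)


def find_threshold(train_residuals, fpr_limit):
    """Threshold tau minimising |FPR_train(t) - fpr_limit|."""
    fpr = make_fpr(train_residuals)
    return golden_section_minimize(
        lambda t: abs(fpr(t) - fpr_limit),
        min(train_residuals),
        max(train_residuals),
    )
```

A call such as \texttt{find\_threshold(residuals, 0.05)} would then return a threshold at which roughly 5\% of the training pixels are flagged as anomalous.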

\paragraph{Pixel-Wise Anomaly Detection Performance}
Tab.~\ref{tab:pixel-wise-anomaly-detection} shows the pixel-wise anomaly detection performance of the model used throughout this study, with and without post-processing, in comparison with the baseline from \cite{Chen2020}. Ground truth is obtained from the lesion segmentation mask, and only pixels within the brain mask are considered in this analysis. For clinical applicability, Dice scores are calculated per patient and reported as mean and standard deviation. Histogram matching prior to inference improves lesion detection performance slightly (Dice of 0.34 vs.\ 0.31 on BraTS T2 with post-processing). We compare against the state-of-the-art results of \cite{Chen2020} because they use the same training and test datasets. Our results outperform the baseline, mainly due to the post-processing applied, but lag behind the state-of-the-art, which implements a more powerful latent space representation and image restoration.
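
The per-patient Dice evaluation described above can be sketched as follows. This is a hedged illustration rather than the evaluation code behind the reported numbers: it assumes NumPy arrays of identical shape per patient, the function names are hypothetical, and returning a Dice score of 1 when both prediction and ground truth are empty is a convention chosen here, not one stated in the text.

```python
import numpy as np


def dice_score(pred, target, brain_mask):
    """Dice overlap of two binary maps, restricted to the brain mask."""
    mask = np.asarray(brain_mask, dtype=bool)
    p = np.asarray(pred, dtype=bool)[mask]
    g = np.asarray(target, dtype=bool)[mask]
    denom = p.sum() + g.sum()
    if denom == 0:
        # Assumed convention: two empty maps agree perfectly.
        return 1.0
    return float(2.0 * np.logical_and(p, g).sum() / denom)


def patient_dice_stats(residuals, lesion_masks, brain_masks, tau):
    """Mean and standard deviation of the per-patient Dice scores.

    Each residual volume is binarised with the threshold tau before
    it is compared with the lesion segmentation mask.
    """
    scores = [
        dice_score(np.asarray(r) > tau, y, m)
        for r, y, m in zip(residuals, lesion_masks, brain_masks)
    ]
    return float(np.mean(scores)), float(np.std(scores))
```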

\begin{table}[h]
\centering
\caption{Pixel-wise anomaly segmentation (Dice) and detection ($AU_{ROC}$ / $AU_{PRC}$) performance. \textsuperscript{*} includes post-processing (smoothing \& mask-erosion of residual map), \textsuperscript{**} no post-processing, \textsuperscript{***} results from \cite{Chen2020}.
}
\resizebox{\textwidth}{!}{%
\begin{tabular}{lcccccccccccc}
\hline
\textbf{Dataset} & \multicolumn{3}{c}{BraTS T2 HM} & \multicolumn{3}{c}{BraTS T2} & \multicolumn{3}{c}{CamCAN T2 artificial lesions} & \multicolumn{3}{c}{BraTS T1} \\ 
\textbf{Model} & Dice & $AU_{ROC}$ & $AU_{PRC}$ & Dice & $AU_{ROC}$ & $AU_{PRC}$ & Dice & $AU_{ROC}$ & $AU_{PRC}$ & Dice & $AU_{ROC}$ & $AU_{PRC}$ \\ \hline
\textbf{Baseline}\textsuperscript{***} & $0.23 \pm 0.13$ & 0.69 & - & - & - & - & - & - & - & - & - & - \\ \hline
\textbf{Ours VAE}\textsuperscript{*} & $0.34 \pm 0.12$ & 0.75 & 0.25 & 0.31 & 0.73 & 0.20 & 0.63 & 0.92 & 0.66 & 0.10 & 0.51 & 0.07 \\ \hline
\textbf{Ours VAE}\textsuperscript{**} & $0.25 \pm 0.11$ & 0.69 & 0.16 & 0.22 & 0.66 & 0.13 & 0.39 & 0.88 & 0.43 & 0.10 & 0.49 & 0.07 \\ \hline
\textbf{GMVAE}\textsuperscript{***} & $0.46 \pm 0.23$ & 0.83 & - & - & - & - & - & - & - & - & - & - \\
\hline
\end{tabular}
}
\label{tab:pixel-wise-anomaly-detection}
\end{table}


\subsection{Preprocessing and Postprocessing Steps}
\label{sec:appendix_pre_post}
\paragraph{Preprocessing}
Brain MRI training samples from healthy subjects originate from the CamCAN dataset \cite{TAYLOR2017262}, and only T2-weighted MR images are used for training. Preprocessing includes bias correction using the N4 algorithm \cite{n4_bias_correction} and centering of the brain. Patient-wise histogram matching to a randomly chosen in-distribution sample ensures similar intensity profiles across all training samples. Finally, all pixel values within the brain mask are normalized to zero mean and unit variance, again on a per-patient level. Empty MR slices are excluded during training. To obtain lesional samples that follow the same distribution as CamCAN T2, we artificially crafted lesional samples by randomly adding high-intensity Gaussian blobs to CamCAN T2-weighted images. The blobs' standard deviations range from 0 to 10 pixels (for images of size $128\times128$), and the blobs are weighted so that their maximum intensities are similar to those of real lesions. Finally, we applied vertical and horizontal flipping (V-flip, H-flip), following common practice in recent OOD detection works.
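
The construction of artificial lesions can be illustrated with the following sketch. It is not the generation code used for the dataset: the function names, the uniform sampling of blob centre and standard deviation, and the strictly positive lower bound on the standard deviation (used here to avoid a degenerate blob) are all assumptions made for this example.

```python
import numpy as np


def add_gaussian_blob(image, center, sigma, peak):
    """Add an isotropic 2D Gaussian blob with maximum value `peak`."""
    height, width = image.shape
    yy, xx = np.mgrid[0:height, 0:width]
    squared_dist = (yy - center[0]) ** 2 + (xx - center[1]) ** 2
    blob = peak * np.exp(-squared_dist / (2.0 * sigma ** 2))
    return image + blob


def add_random_blob(image, rng, sigma_range=(1.0, 10.0), peak=1.0):
    """Place one blob at a random position with a random width.

    Assumption: centre and sigma are drawn uniformly; `peak` would be
    chosen to match the intensity of real lesions.
    """
    height, width = image.shape
    center = (rng.integers(0, height), rng.integers(0, width))
    sigma = rng.uniform(*sigma_range)
    return add_gaussian_blob(image, center, sigma, peak)


# Example on an empty 128 x 128 slice with a fixed seed.
rng = np.random.default_rng(0)
lesional = add_random_blob(np.zeros((128, 128)), rng, peak=2.0)
```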

\paragraph{Postprocessing}
Postprocessing, that is, eroding the brain mask inwards and applying median filtering to the residual map to reduce false positives, yields a substantial performance boost (e.g. the Dice score on BraTS T2 HM rises from 0.25 to 0.34, cf.~Tab.~\ref{tab:pixel-wise-anomaly-detection}). We did not perform 3D connected component filtering, which is expected to increase performance even further \cite{Baur2020a}.
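
The two post-processing steps can be sketched with SciPy as shown below. This is a minimal sketch, not the authors' pipeline: the number of erosion iterations and the median filter size are hypothetical values chosen for illustration, and the function name is an assumption.

```python
import numpy as np
from scipy import ndimage


def postprocess_residual(residual, brain_mask, erosion_iterations=3, median_size=5):
    """Median-filter the residual map and restrict it to an eroded brain mask.

    Assumption: erosion_iterations and median_size are illustrative
    defaults, not the values used for the reported results.
    """
    # Shrink the brain mask inwards to suppress errors at the brain boundary.
    eroded = ndimage.binary_erosion(
        np.asarray(brain_mask, dtype=bool), iterations=erosion_iterations
    )
    # Median filtering removes isolated high-residual pixels.
    smoothed = ndimage.median_filter(residual, size=median_size)
    # Residuals outside the eroded mask are set to zero.
    return np.where(eroded, smoothed, 0.0), eroded
```

The returned residual map would then be thresholded with $\tau$ as before, using the eroded mask in place of the original brain mask.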

\begin{figure}[ht]
  \includegraphics[width=1.\textwidth]{figures/loss_histograms_hm.png}%
  \caption{Similar to Fig.~\ref{fig:loss-term-hists}, but highlighting the impact of \textit{histogram matching} before inference. While histogram matching generally improves the reconstruction and lesion segmentation performance (cf.~Tab.~\ref{tab:pixel-wise-anomaly-detection}), it does not resolve the entanglement of domain shift and lesions. That is, it does not shift the healthy \ac{ood} samples such that their distributions overlap with those of the healthy in-distribution samples.}
  \label{fig:loss_histograms_hm}
\end{figure}


\FloatBarrier
\subsection{Volumetric Consistency of the Reconstruction Error}
\label{sec:consistency}
One might question whether it is sufficient to consider individual slices in the anomaly detection process without leveraging information from neighbouring slices. That is, could a slice predicted to be healthy lie between two slices that are predicted to be lesional? Fig.~\ref{fig:l1_loss_slices} suggests that this scenario is unlikely for hyper-intense lesions. Nevertheless, 3D convolutional neural networks might be an interesting way forward, especially when the lesions are less distinct than the one shown in the example.
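
Such a consistency check could be automated along the following lines. This is a hedged sketch only: it assumes volumes stored with the axial slice index as the first axis, a mean taken over all pixels of a slice, and a hypothetical slice-level threshold for calling a slice lesional. The function names are illustrative.

```python
import numpy as np


def slice_wise_l1(volume, reconstruction):
    """Mean absolute reconstruction error per axial slice (axis 0)."""
    return np.abs(volume - reconstruction).mean(axis=(1, 2))


def isolated_healthy_slices(slice_errors, threshold):
    """Indices of slices below the threshold whose two neighbours exceed it.

    Assumption: a slice counts as lesional if its mean error is
    above `threshold`.
    """
    lesional = np.asarray(slice_errors) > threshold
    return [
        i
        for i in range(1, len(lesional) - 1)
        if not lesional[i] and lesional[i - 1] and lesional[i + 1]
    ]
```

An empty result for a given volume would indicate that no slice predicted as healthy lies between two slices predicted as lesional.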

\begin{figure}[h]
  \centering
  \includegraphics[width=.75\textwidth]{figures/l1_loss_slices.png}%
  \caption{Left: All axial slices comprising a single BraTS T2 sample. Right: Corresponding slice-wise mean per-pixel reconstruction error, $l_1$. The reconstruction error varies almost smoothly across slices, so a slice marked as healthy lying between lesional slices can be expected to be rare.}
  \label{fig:l1_loss_slices}
\end{figure}


