\subsection{Experimental analysis}
\label{sec:ImplicationsTraining}


Before turning to the empirical results, we highlight that switching between the objectives amounts in practice to a one-line code change relative to regular Bayesian neural network training with the ELBO, as shown in \Cref{lst:code_log_exchange}.


\paragraph{Experimental set-up}
We set the number of MC samples to $S=5$ for approximating the expectation during training and $\lambda =1$ (weighting of the KL divergence). 
We always compare to the `baseline' ($S=1$) for which $\VI$ and $\ML$  are equivalent.
At test time all predictions are made based on $100$ samples drawn from $q(\theta)$ to approximate $\E_{q(\theta)}[p(y|x, \theta)]$.
We validate the findings for different model architectures and hyperparameters (see below). We report means and standard deviations for each experimental setting over $10$ random seeds. 

\definecolor{codegreen}{rgb}{0,0.6,0}
\definecolor{codegray}{rgb}{0.5,0.5,0.5}
\definecolor{codepurple}{rgb}{0.58,0,0.82}
\definecolor{backcolour}{rgb}{0.95,0.95,0.92}

% \begin{listing}
% \centering  
\begin{lstlisting}[
    language=python,
    backgroundcolor=\color{backcolour},   
    commentstyle=\color{codegreen},
    keywordstyle=\color{magenta},
    numberstyle=\tiny\color{codegray},
    stringstyle=\color{codepurple},
    basicstyle=\ttfamily\scriptsize,
    breakatwhitespace=false,         
    breaklines=true,                 
    captionpos=b,                    
    keepspaces=true,                 
    numbers=left,                    
    numbersep=5pt,                  
    showspaces=false,                
    showstringspaces=false,
    showtabs=false,                  
    tabsize=2,
    caption=\textbf{Example implementation} of the `$\log\mathbb{E}$xchange' in the objectives (PyTorch).,
    label=lst:code_log_exchange
]
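import math
import torch
import torch.distributions as dists

log_p_y = []  # per-sample log-likelihoods, each of shape (batch,)
# multiple forward passes through model (S times)
for s in range(S):
    logit_s, kl = model(x)
    # pass logits by keyword: the first positional argument is `probs`
    log_p_y_s = dists.Categorical(logits=logit_s).log_prob(target)
    log_p_y.append(log_p_y_s)
log_p_y = torch.stack(log_p_y)  # shape (S, batch)
if args.objective == 'logE': # Eq.(ML): log of the MC-averaged likelihood
    E_term = torch.mean(torch.logsumexp(log_p_y, 0) - math.log(S))
elif args.objective == 'Elog': # Eq.(VI): MC average of the log-likelihood
    E_term = torch.mean(log_p_y)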
\end{lstlisting}
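
As a small numerical illustration of the two branches in \Cref{lst:code_log_exchange} (a toy example with made-up likelihood values, not taken from our experiments), consider a single data point and $S=2$ sampled networks that assign likelihoods $p_1 = 0.9$ and $p_2 = 0.1$ to the correct label:
\begin{align*}
\text{\texttt{Elog}:}\quad & \frac{1}{2}\left(\ln 0.9 + \ln 0.1\right) \approx \frac{1}{2}\left(-0.105 - 2.303\right) \approx -1.204 \enspace , \\
\text{\texttt{logE}:}\quad & \ln\left(\frac{1}{2}\left(0.9 + 0.1\right)\right) = \ln 0.5 \approx -0.693 \enspace .
\end{align*}
By Jensen's inequality the \texttt{logE} term is never smaller than the \texttt{Elog} term, and the two coincide whenever all sampled networks assign the same likelihood, which is trivially the case for $S=1$.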


\paragraph{Models and Datasets}
We conduct experiments on five different datasets. In addition to the classic computer vision benchmarks, i.e., MNIST~\citep{deng2012mnist}, FashionMNIST~\citep{xiao2017fashion}, and CIFAR10~\citep{cifar}, we use two medical datasets, namely PathMNIST~\citep{pathmnist}\footnote{Released under CC BY 4.0 license.} and DermaMNIST~\citep{dermamnist1, dermamnist2}\footnote{Released under CC BY-NC 4.0 license.} in the highest resolution from the MedMNIST benchmark dataset~\citep{medmnistv1,medmnistv2}. 
Furthermore, we used four different architectural designs for our stochastic models:
A small feedforward network, denoted `FF', with two hidden layers of size 256 and 128 with ReLU activation functions, and a multivariate normal distribution over the weights with standard normal distributions as our prior.
In addition, we use a feedforward network with two hidden layers (width 128) where we model the weight distribution as a matrix variate normal distribution as proposed by~\citet{louizos_structured_2016}, denoted `FF-MVN'. 
This model type assumes that the learned variance factorizes and therefore reduces the number of variance parameters from $d_\mathrm{in} \times d_\mathrm{out}$ to $d_\mathrm{in} + d_\mathrm{out}$.
For CIFAR10 we additionally train a ResNet20 architecture utilizing the code, hyperparameters and training procedure from \cite{krishnan2022bayesiantorch}.
% (and used exactly their default hyperparameter setting and training procedure).
Lastly, with `DINOTopping' we denote a model that uses the above-described `FF' model on top of the features extracted by DINOv2~\citep{oquab2023dinov2},\footnote{Released under Apache License 2.0.} where we extract the [CLS] token from the final transformer layer as a global representation of each image.
For the experiments, we used AdamW~\citep{adamw} with a batch size of 128 and an initial learning rate of $0.001$. For more details please see \Cref{app:training_details}.
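
To make the parameter saving of the `FF-MVN' variance parameterization concrete, consider as an illustrative assumption a single layer with $d_\mathrm{in} = 784$ inputs (a flattened $28 \times 28$ image) and $d_\mathrm{out} = 128$ units, ignoring biases:
\begin{equation*}
d_\mathrm{in} \times d_\mathrm{out} = 784 \cdot 128 = 100{,}352
\qquad \text{versus} \qquad
d_\mathrm{in} + d_\mathrm{out} = 784 + 128 = 912 \enspace .
\end{equation*}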

\paragraph{Analysing the prediction variance}
\label{subsec:emp_pred_var}
As outlined in \Cref{sec:theory}, the gap between the objectives for the same weight distribution $q(\theta)$ is characterized by the prediction variance. 
However, because $q(\theta)$ changes continuously during training, the behavior of models trained with the different objectives is not directly comparable, and hence it is not clear whether, and by how much, the prediction variance of the trained models differs. Therefore, we estimate the variance empirically as
\begin{equation*}
\max_c 
\left\{
\frac{1}{S}\sum_{s=1}^S p(y_c|x_n, \theta_s)^2 - \left(\frac{1}{S}\sum_{s=1}^S p(y_c|x_n, \theta_s)\right)^2 \right\} \enspace 
\end{equation*} 
and visualize the results in \Cref{fig:pred_variance}.
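
The estimator above can be sketched in a few lines of dependency-free Python. This is a minimal sketch for a single input $x_n$, not the code used for our experiments; the function name \texttt{max\_class\_variance} and the nested-list layout of \texttt{probs} are assumptions made for illustration.
\begin{lstlisting}[language=python, basicstyle=\ttfamily\scriptsize, breaklines=true]
def max_class_variance(probs):
    """Largest per-class variance of the predicted probabilities.

    probs: list of S lists; probs[s][c] = p(y_c | x_n, theta_s).
    """
    num_samples = len(probs)
    num_classes = len(probs[0])
    variances = []
    for c in range(num_classes):
        mean = sum(p[c] for p in probs) / num_samples
        mean_sq = sum(p[c] ** 2 for p in probs) / num_samples
        # biased (1/S) variance estimate: E[p^2] - (E[p])^2
        variances.append(mean_sq - mean ** 2)
    return max(variances)


# two sampled networks that disagree strongly on a binary problem
disagreeing = [[0.9, 0.1], [0.1, 0.9]]
# two sampled networks that make identical predictions
agreeing = [[0.7, 0.3], [0.7, 0.3]]
print(max_class_variance(disagreeing))
print(max_class_variance(agreeing))
\end{lstlisting}
In this toy example the disagreeing pair yields a variance of about $0.16$, whereas the agreeing pair yields (up to floating-point error) zero.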

\begin{figure}[!ht]
    \centering
        \begin{subfigure}{0.325\linewidth}
            \includegraphics[ width=\linewidth]{figs/MNIST_pred_var_MVN.pdf}
            \caption{MNIST}
        \end{subfigure}
      \begin{subfigure}{0.325\linewidth}
            \includegraphics[ width=\linewidth]{figs/cifar10_pred_var_resnet.pdf}
            \caption{CIFAR10}
        \end{subfigure}
        \begin{subfigure}{0.325\linewidth}
            \includegraphics[ width=\linewidth]{figs/pathMNIST_pred_var.pdf}
            \caption{PathMNIST}
        \end{subfigure}

    \caption{\textbf{Histogram of test samples binned by the prediction variance.} a) Uses the FF-MVN, b) the ResNet20, and c) the DINOTopping model.}
    \label{fig:pred_variance} 
\end{figure}
As expected, we observe significantly higher prediction variances for the models trained with $\ML$ throughout all datasets and all network designs (full results presented in \Cref{tab:test_performance}).
Models trained with the baseline and $\VI$ show comparable variance.\footnote{Regarding the KL divergence, we observed throughout all experiments that it is lower for $\VI$ during training than for the baseline or $\ML$ (see argument in Sec.~\ref{sec:theory} and Fig.~\ref{fig:trainstats}).}

However, high(er) prediction variance per se is not informative about the behavior of ensemble members: Ensemble members can behave similarly for a given input, i.e., give the same ordering of class labels, or predict entirely different labels (see \Cref{fig:schema} in \Cref{sec:Illustration_EnsembleVariance} for an illustrative example).
To further analyze the variability in prediction, we investigate the dissimilarity score between predictions of single drawn networks, as done by \mbox{\citet{fort2020deepensemble_loss_landscape}}.
That is, we measure the dissimilarity between two networks, corresponding to parameters $\theta_i$ and $\theta_j$ drawn from the learned posterior $q(\theta)$, as the fraction of disagreeing predictions, given by
\begin{equation*}
\begin{split}
    \frac{1}{N} \sum_{n=1}^N \ind [\arg \max_c p(y_c|x_n, \theta_{i}) \neq \arg \max_c p(y_c|x_n, \theta_{j}) ] \enspace .
\end{split}
\end{equation*} 

To generate the plot shown in \Cref{fig:dissimilarity_pred} 
we draw ten samples $\theta_i$ (for each learned posterior $q(\theta)$), i.e., each pixel represents the dissimilarity between the predictions of two distinct parameter draws.
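
The pairwise dissimilarity defined above can be sketched as follows. This is a minimal, dependency-free illustration rather than our experimental code; the function name \texttt{dissimilarity\_matrix} and the nested-list layout \texttt{probs[i][n][c]} are assumptions made for this example.
\begin{lstlisting}[language=python, basicstyle=\ttfamily\scriptsize, breaklines=true]
def dissimilarity_matrix(probs):
    """Pairwise fraction of disagreeing argmax predictions.

    probs[i][n][c] = p(y_c | x_n, theta_i) for draw i, input n, class c.
    """
    # predicted label of every draw for every input
    labels = [
        [max(range(len(p)), key=lambda c: p[c]) for p in draw]
        for draw in probs
    ]
    num_draws = len(labels)
    num_inputs = len(labels[0])
    matrix = [[0.0] * num_draws for _ in range(num_draws)]
    for i in range(num_draws):
        for j in range(num_draws):
            disagreements = sum(
                a != b for a, b in zip(labels[i], labels[j])
            )
            matrix[i][j] = disagreements / num_inputs
    return matrix


# two draws, four inputs, binary problem:
# the draws disagree only on the last input
probs = [
    [[0.8, 0.2], [0.3, 0.7], [0.6, 0.4], [0.9, 0.1]],
    [[0.7, 0.3], [0.4, 0.6], [0.9, 0.1], [0.2, 0.8]],
]
print(dissimilarity_matrix(probs))
\end{lstlisting}
Each entry of the returned matrix corresponds to one pixel of the plot; the diagonal is zero by construction and the matrix is symmetric.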


\begin{figure}[!hbt]
\centering
\resizebox{\columnwidth}{!}{
\begin{tabular}{ %p{2cm} b{2cm} b{2cm} b{2cm} b{2cm} b{2cm} b{2cm} b{2cm}}%
 l c c c }

& $\VI$ \phantom{ab} & baseline \phantom{ad} & $\ML$ \phantom{ab} \\
\raisebox{1cm}{\rotatebox[origin=c]{90}{MNIST} } &
\includegraphics[width=0.32\columnwidth ]{figs/MNIST_dissimilarity_BNN.pdf}  &
\includegraphics[width=0.32\columnwidth ]{figs/MNIST_dissimilarity_baseline.pdf} &
\includegraphics[width=0.32\columnwidth ]{figs/MNIST_dissimilarity_IM.pdf} \\
\raisebox{1cm}{\rotatebox[origin=c]{90}{CIFAR10} } &
\includegraphics[width=0.32\columnwidth ]{figs/cifar10_dissimilarity_BNN.pdf}  &
\includegraphics[width=0.32\columnwidth ]{figs/cifar10_dissimilarity_baseline.pdf} &
\includegraphics[width=0.32\columnwidth ]{figs/cifar10_dissimilarity_IM.pdf} \\
\raisebox{1cm}{\rotatebox[origin=c]{90}{PathMNIST}} &
\includegraphics[width=0.32\columnwidth ]{figs/Pathmnist_dissimilarity_BNN.pdf}  &
\includegraphics[width=0.32\columnwidth ]{figs/Pathmnist_dissimilarity_baseline.pdf} &
\includegraphics[width=0.32\columnwidth ]{figs/Pathmnist_dissimilarity_IM.pdf} \\
\end{tabular}
}
\caption{\textbf{Dissimilarity in predictions} measured as a fraction of disagreement between predictions of networks based on single draws from $q(\theta)$.
Darker red indicates a higher amount of disagreement, i.e., that more diverse functions are learned.}
\label{fig:dissimilarity_pred}
\end{figure}

For models trained with $\ML$ we observed notably higher function space diversity compared to those from the models trained with $\VI$ or the baseline.
Regarding MNIST and PathMNIST, the results for $\VI$ and the one-sample approximation appear similar, while $\VI$ demonstrates higher dissimilarity for CIFAR10.
This is in line with \Cref{fig:pred_variance}, where the models trained with $\VI$ show slightly higher variance than the baseline.


The higher function space diversity is an interesting property of $\ML$-trained models, as it has been found to improve ensemble predictions in many tasks.
Amongst others, it has been argued to be the reason for the good uncertainty estimates of ensembles~\citep{fort2020deepensemble_loss_landscape},
found to be relevant for bounding the PAC-Bayes error under misspecification~\citep{masegosa_model_misspecification}, 
shown to improve the uncertainty and OOD detection performance of ensembles~\citep{pagliardini2023agree}, and used as the motivation for function space variational inference~\citep{sun2018functional, wang2018function}. 

\paragraph{Analyzing the weight distributions}

To determine the origins of the differing prediction variance behaviors, we inspect the learned distribution $q(\theta)$ for the models trained on PathMNIST, as the architecture and weight distribution design allow for straightforward analysis.
We calculated the Kullback-Leibler divergence between each weight's univariate normal distribution and the standard normal distribution, see~\Cref{fig:dist}, and observe that the model trained with 
$\VI$ has the largest number of `collapsed' weights, i.e., weights following the prior distribution.
Naturally, this finding also translates when comparing the weights' variances\footnote{The mean distribution does not show interesting differences, results are therefore shown in~\Cref{fig:mean} in the Appendix.} resulting from the different objectives. 
Interestingly, we find that the weight distributions of the baseline and of the models trained with $\ML$ seem to behave more similarly.
Another finding is that the $\ML$ trained model has relatively more weights for which the variance is essentially zero (i.e., they behave almost deterministically).
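
For reference, the Kullback-Leibler divergence between a univariate normal distribution $\mathcal{N}(\mu, \sigma^2)$ and the standard normal distribution has the well-known closed form
\begin{equation*}
\mathrm{KL}\left[\mathcal{N}(\mu, \sigma^2) \,\|\, \mathcal{N}(0,1)\right] = \frac{1}{2}\left(\sigma^2 + \mu^2 - 1 - \ln \sigma^2\right) \enspace ,
\end{equation*}
where $\mu$ and $\sigma^2$ denote the mean and variance of a single weight (notation introduced here for illustration). It vanishes exactly for $\mu = 0$ and $\sigma^2 = 1$, i.e., for a `collapsed' weight, and grows without bound as $\sigma^2 \to 0$, i.e., for almost deterministic weights.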

\begin{figure}[!tbh]

    \centering
    \includegraphics[width=0.85\columnwidth]{figs/KL_Var_pathmnist.pdf}

    \caption{\textbf{Histogram of $\KL[q(\theta)|| \mathcal{N}(0,1)]$ and $\sigma^2$ for each weight in the trained network. } $\VI$ has the most weights which are essentially equal to the prior distribution.
    Weight distributions for baseline and $\ML$ seem to be more similar compared to the $\VI$ trained model.
}
    \label{fig:dist} 
\end{figure}

Thus, higher learned variances over the weights seem to correlate with lower prediction variance.
We hypothesize that the models trained with $\VI$ partly learn to `disable' high variance connections from contributing to the final prediction, effectively learning a sparser network to better comply with the KL divergence.

\section{An analysis of the effects of the prediction variance...}
Given our finding that the training objectives lead to substantial differences in the prediction variance, 
this section analyses the effects and consequences of these %properties
differences, starting with the classical performance metrics such as accuracy, negative log-likelihood, and expected calibration error.

\setlength{\tabcolsep}{6pt}
\begin{table*}[htb]%[p]
\centering
\resizebox{0.95\textwidth}{!}{%
\begin{tabular}{c c l ccccc}
\toprule
% \textcolor{green!40!gray}{NEW}
Dataset & Arch &Obj.     & Accuracy in \% $\uparrow$ & NLL $\downarrow$ & \hspace{-1mm} Avg. pred conf in \%  \hspace{0mm}  & Avg. variance & ECE $\downarrow$\\ \hline
\multirow{6}{*}{MNIST} & 
\multirow{3}{*}{FF} 
&$\VI$ &    97.94\std{0.04} & 0.081\std{0.001} & 95.41\std{0.06} & 0.012\std{0.000} & 0.025\std{0.001} \\
&&baseline& 98.12\std{0.06} & \textbf{0.072\std{0.001}} & 95.97\std{0.09} & 0.013\std{0.000} & \textbf{0.022}\std{0.001} \\
&& $\ML$ &  98.19\std{0.07} & 0.075\std{0.001} & 95.49\std{0.06} & 0.030\std{0.000} & 0.027\std{0.000} \\
\cdashline{2-8}  
%
& \multirow{3}{*}{FF-MVN} 
&$\VI$ &    97.65\std{0.06} & \textbf{0.099\std{0.001}} & 94.22\std{0.04} & 0.014\std{0.000} & 0.034\std{0.001} \\
&&baseline& 97.42\std{0.04} & 0.106\std{0.003} & 93.88\std{0.08} & 0.015\std{0.000} & 0.035\std{0.001} \\
&& $\ML$ &  97.46\std{0.09} & 0.118\std{0.001} & 92.41\std{0.13} & 0.046\std{0.001} & 0.051\std{0.002} \\
% &&& \textit{--- running ---} \\
\midrule 
%
\multirow{6}{*}{FashionMNIST} & 
\multirow{3}{*}{FF} 
&$\VI$ &     87.18\std{0.21} & 0.358\std{0.002} & 83.66\std{0.21} & 0.016\std{0.000} & 0.035\std{0.002} \\
&&baseline&  87.87\std{0.09} & 0.340\std{0.002} & 84.69\std{0.20} & 0.016\std{0.000} & 0.032\std{0.002} \\
&& $\ML$ &   \textbf{88.33\std{0.10}} & \textbf{0.328\std{0.001}} & 84.97\std{0.07} & 0.049\std{0.001} & 0.034\std{0.000} \\
\cdashline{2-8}  
& \multirow{3}{*}{FF-MVN} 
&$\VI$ &    85.94\std{0.28} & 0.393\std{0.003} & 82.81\std{0.11} & 0.014\std{0.000} & 0.032\std{0.004} \\
&&baseline& 85.73\std{0.17} & 0.398\std{0.002} & 82.55\std{0.25} & 0.015\std{0.000} & 0.032\std{0.003} \\
&& $\ML$ &  \textbf{86.53\std{0.11}} & \textbf{0.382\std{0.003}} & 82.41\std{0.26} & 0.050\std{0.001} & 0.041\std{0.002} \\
% &&& \textit{--- running ---} \\
\midrule 
%
\multirow{6}{*}{CIFAR10} & 
\multirow{3}{*}{ResNet} 
&$\VI$ &    89.95\std{0.37} & 0.314\std{0.009} & 85.88\std{0.40} & 0.049\std{0.002} & 0.041\std{0.002} \\
&&baseline& 89.59\std{0.24} & 0.312\std{0.005} & 87.63\std{0.16} & 0.036\std{0.001} & \textbf{0.021}\std{0.002} \\
&& $\ML$ &  89.48\std{0.44} & 0.347\std{0.009} & 82.94\std{0.39} & 0.077\std{0.002} & 0.065\std{0.005} \\
\cdashline{2-8} 
&\multirow{3}{*}{FF} 
&$\VI$ &    39.92\std{0.65} & 1.683\std{0.008} & 33.30\std{0.47} & 0.013\std{0.000} & 0.066\std{0.002} \\
&&baseline& 40.58\std{1.00} & 1.655\std{0.014} & 34.80\std{0.64} & 0.013\std{0.001} & 0.058\std{0.005} \\
&& $\ML$ &  \textbf{45.37\std{0.20}} & \textbf{1.550\std{0.004}} & 39.15\std{0.22} & 0.079\std{0.002} & 0.062\std{0.002} \\
\midrule 
\multirow{3}{*}{DermaMNIST} & 
\multirow{3}{*}{DINOTopping} 
&$\VI$ &      77.58\std{1.02} & 0.617\std{0.015} & 72.68\std{2.02} & 0.021\std{0.001} & 0.053\std{0.018} \\
&&baseline&   79.11\std{0.89} & 0.575\std{0.014} & 74.02\std{1.01} & 0.024\std{0.002} & 0.052\std{0.011} \\
&& $\ML$ &    \textbf{81.77\std{0.43}} & \textbf{0.515\std{0.007}} & 76.92\std{0.70} & 0.089\std{0.004} & 0.050\std{0.011} \\
% && &\multicolumn{4}{l}{ Read from Table 3 in \citep{medmnistv2}: best accuracy with Google AutoML Vision:  76.8}\\
\midrule 
\multirow{3}{*}{PathMNIST} & \multirow{3}{*}{DINOTopping} & 
$\VI$    &    94.48\std{0.38} & 0.151\std{0.009} & 93.76\std{0.38} & 0.015\std{0.001} & 0.007\std{0.002} \\
&&baseline&   94.43\std{0.11} & 0.152\std{0.004} & 93.92\std{0.29} & 0.016\std{0.001} & 0.007\std{0.002} \\
&& $\ML$ &    94.44\std{0.32} & 0.166\std{0.006} & 92.88\std{0.19} & 0.043\std{0.001} & 0.016\std{0.002} \\
% && & \multicolumn{4}{l}{ Read from Table 3 in \citep{medmnistv2}: best accuracy with ResNet-50 (28):  91.1} \\
 \bottomrule \\
\end{tabular}
}

\caption{
    \textbf{Accuracy, negative log-likelihood (NLL), average prediction confidence, average prediction variance, and expected calibration error (ECE)} for different datasets and model types on the respective test sets. 
    Previous SOTA accuracy for DermaMNIST was 76.8\% (with Google AutoML Vision), and  91.1\% for PathMNIST (with ResNet-50 (28)), see Table 3 in \cite{medmnistv2}. 
    Bold indicates the best performance in terms of accuracy, NLL or ECE whenever the effect size exceeds two standard deviations ($\ge2\sigma_\mathrm{max}$).}
    \label{tab:test_performance} 

\end{table*}


\subsection{...on Accuracy, NLL, Calibration Error and Prediction Confidence}
The relevant statistics for all objectives, model types, and datasets are presented in \Cref{tab:test_performance}, showing that the overall performance of all inspected models is decent (with FF-architecture on CIFAR10 as an intended exception).
For the DINOTopping models, we even reach state-of-the-art results on DermaMNIST and PathMNIST (which justifies our setup).

\paragraph{$\ML$ is better on `hard' tasks} While accuracy is mostly comparable, we observe a significant increase in accuracy and log-likelihood for models trained with $\ML$ for the FF architecture on CIFAR10 and DINOTopping on DermaMNIST.
This increase in accuracy can be explained by the difficulty of the task: The small fully connected feedforward network (FF) is clearly unsuited for CIFAR10 (misspecified), while for DermaMNIST only few training samples are available (cf. \Cref{tab:datasets}) and overall performance is quite low (baseline achieves accuracies below 80\%).

As found by~\citet{ortega2022diversity} for general ensembles, we suspect that these accuracy advantages stem from the combination of diverse weak learners (as found in \Cref{subsec:emp_pred_var}), which leads to better accuracies through error diversification. 
This finding resonates with that of~\citet{morningstar2022pacm}, who found that the $\ML$ objective (termed \PACm{} in their work) performs better in case of misspecification, i.e., when the true data generating distribution cannot be matched by any model in the single parameter setting ($\nexists\, \theta \in \Theta: p(y | x, \theta) = p_\mathrm{data}(y|x)$).
Experimentally they demonstrate some benefits of the $\ML$ objective for neural networks when using an explicitly ill-defined regression problem\footnote{They used the upper half of images as inputs and tried to predict independently the pixel values for the lower half of the images. Because the predictions happen independently but pixel values in images are certainly correlated, it is in the misspecified regime.}  and reached comparable accuracies to $\VI$ in classification tasks, where the prior was named as the source of misspecification. 
With our experiments on FF on CIFAR10, we add to their finding a further instance of misspecification, namely misspecification in the form of an unsuitable network architecture, for which the accuracy benefits from using $\ML$. 
Furthermore, the $\ML$ objective seems to be beneficial for difficult classification tasks (thinking of DermaMNIST), which can also be regarded as another form of misspecification.

\paragraph{High prediction variance can also hurt} Throughout all experiments, we observe---in line with the ideas outlined in \Cref{sec:theory}---that the average prediction variance is highest for models trained with $\ML$.
Models trained with the one sample approximation and $\VI$ typically show similar prediction variances.
In setups where the increased prediction variance is not beneficial, especially when the overall accuracy is already high, it reduces the average prediction confidence. In turn, this negatively impacts the negative log-likelihood as well as the expected calibration error, see the results for MNIST, CIFAR10 with ResNet20 or PathMNIST. 
This resonates well with the findings of~\cite{wei2022performance}, who found that models trained with $\ML$ usually get worse negative log-likelihood scores, which at first glance contradicts the positive findings reported, for example, by \cite{morningstar2022pacm}, \cite{futami22a}, and \cite{masegosa_model_misspecification}.
 
While \cite{wei2022performance} try to explain this finding with learning dynamics, i.e., by $\ML$ getting stuck in bad local minima (a hypothesis they falsified themselves), this behavior is expected, as the increased prediction variance naturally worsens (i.e., increases) the negative log-likelihood in settings where highly accurate and confident predictions are already made.
This is because the higher diversity between single ensemble members reduces the model confidence and therefore also the log-likelihood, which is in line with the findings of~\citet{jeffares2023joint} and \cite{abe2023pathologies}, who argue that artificially increasing prediction diversity during training of ensembles can in fact be counterproductive.
Interestingly though, we a) get comparable test accuracies (negative effects seem to be limited to NLL and ECE), b) do not directly optimize for increased prediction variance, and c) have the very same setup and only uni-modal normal distributions over the weights and still observe higher function space diversity with $\ML$.
In addition, we found that training with the baseline typically leads to the best calibrated models.
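
A toy calculation (with made-up numbers, not taken from \Cref{tab:test_performance}) illustrates how diversity can hurt the negative log-likelihood without changing the predicted label. Suppose two ensemble members assign probabilities $p_1$ and $p_2$ to the correct class of a binary problem:
\begin{align*}
p_1 = p_2 = 0.99: \quad & -\ln\left(\tfrac{1}{2}(0.99 + 0.99)\right) = -\ln 0.99 \approx 0.010 \enspace , \\
p_1 = 0.99,\; p_2 = 0.59: \quad & -\ln\left(\tfrac{1}{2}(0.99 + 0.59)\right) = -\ln 0.79 \approx 0.236 \enspace .
\end{align*}
In both cases the averaged prediction is correct, but the more diverse pair has lower confidence ($0.79$ instead of $0.99$) and therefore a markedly higher negative log-likelihood.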

\subsection{...on adversarial robustness}

\label{subsec:Adv_robustness}
Recent work by~\citet{daubener_how} suggests that higher prediction variance can have a positive effect on the adversarial robustness of models, which we test in this subsection. To this end, we attacked the FF-MVN network on MNIST, the FF network on FashionMNIST, and the ResNet20 architecture on CIFAR10 with strong attacks, namely with the projected gradient descent method~\citep{madry2018towards}, which iteratively conducts fast gradient sign method~\citep[FGSM,][]{Goodfellow_fgsm} updates with a smaller step size than the allowed maximal perturbation size. We used 10 iterations and 10 samples per approximation of the gradient.
This leads to 100 sampled $\theta$ in total per data point. We used the $l_{\infty}$-norm to quantify the maximal allowed perturbation, which we gradually increased from $0$ to $0.25$. For the models trained on CIFAR10, we instead calculated adversarial examples with FGSM, estimating each gradient based on 10 samples of $\theta$, for computational reasons.
\Cref{fig:adv_acc} shows the accuracies under adversarial attacks for the models optimized with $\VI$, $\ML$, and the baseline.
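
For concreteness, a standard $l_\infty$ projected gradient descent step can be written as follows. This is the textbook form of the update; the symbols $\alpha$ (step size), $\epsilon$ (maximal perturbation), $\mathcal{L}_\mathrm{adv}$ (attack loss), and $K$ (number of samples per gradient estimate) are notation introduced here for illustration, and averaging the per-sample gradients is an assumption about how the samples are combined:
\begin{equation*}
x^{(t+1)} = \Pi_{\{x' :\, \|x' - x\|_\infty \leq \epsilon\}}\left( x^{(t)} + \alpha \, \mathrm{sign}\left( \frac{1}{K} \sum_{k=1}^{K} \nabla_{x} \mathcal{L}_\mathrm{adv}\big(x^{(t)}, y, \theta_k\big) \right) \right), \qquad \theta_k \sim q(\theta) \enspace ,
\end{equation*}
where $\Pi$ denotes the projection onto the $l_\infty$-ball of radius $\epsilon$ around the clean input $x$. With $K = 10$ samples in each of the $10$ iterations, this amounts to the $100$ sampled $\theta$ per data point mentioned above; FGSM corresponds to a single step with $\alpha = \epsilon$.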

\begin{figure}[!bth]
    \centering
        \begin{subfigure}{0.32\columnwidth}
            \includegraphics[width=\linewidth]{figs/MNIST_adv.pdf}
        \caption{MNIST}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.32\columnwidth}
        \includegraphics[width=\linewidth]{figs/FashionMNIST_adv.pdf}
            \caption{FashionMNIST}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.32\columnwidth}
        \includegraphics[width=\linewidth]{figs/cifar10_adv.pdf}
            \caption{CIFAR10}
    \end{subfigure} 

    \caption{\textbf{Accuracy under adversarial attack with an increasing amount of allowed perturbation.} We report the mean and standard deviation (shaded area) calculated on 10 independently trained and attacked models. a) uses the FF-MVN, b) the FF, and c) the ResNet architecture.}
    \label{fig:adv_acc}
\end{figure}


We see that the adversarial accuracies of the baseline and the $\VI$-trained models are lower than those of the $\ML$-trained model on MNIST and FashionMNIST. 
This effect is not directly observable for the models trained on CIFAR10. 


\subsection{...on out-of-distribution detection}
\label{subsec:OOD}
Lastly, this subsection investigates how well the models trained with $\ML$, $\VI$, and the baseline detect out-of-distribution (OOD) data.
To create realistic OOD samples we utilize the benchmark image corruptions by~\citet{michaelis2019dragon}.
We take the CIFAR10 test set and generate 75 OOD data sets: for each of the 15 different corruption styles we generated 5 corrupted data sets with increasing severity (see~\Cref{fig:corrupted_images} in the Appendix for example images). 
Next, we let all models predict all samples in these corrupted test sets as well as in the benign test set. 
In addition, we compute the entropy of the predictive distribution (resulting from 100 draws from $q(\theta)$) for each example, as it quantifies the uncertainty in the model's output distribution over the classes:
\begin{equation*}
    \mathcal{H}(p(y \vert x)) = -\sum_c p(y_c \vert x) \ln\big(p(y_c \vert x)\big)
\end{equation*}
High entropy reflects uncertainty or lack of confidence, which is ideally elevated for OOD inputs, while entropy should be comparably lower on in-distribution data. 
Thus, it can serve as an effective score for OOD detection.
Based on the computed entropy values, the AUROC for distinguishing test from OOD data is calculated for each OOD data set, yielding 75 AUROC scores. Each experiment is repeated 10 times.
Because all objectives start from the same initialization, we conducted pairwise Wilcoxon rank-sum tests with significance level $\alpha = 0.05$ to compare the AUROC values of the objectives against each other; the results are reported in \Cref{tab:auroc}. 
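
The entropy score and the resulting AUROC can be sketched as follows. This is a minimal, dependency-free illustration with made-up predictions, not our evaluation code; the function names \texttt{predictive\_entropy} and \texttt{auroc} are assumptions, and the AUROC is computed via its interpretation as the probability that an OOD example receives a higher score than an in-distribution example.
\begin{lstlisting}[language=python, basicstyle=\ttfamily\scriptsize, breaklines=true]
import math


def predictive_entropy(p):
    """Entropy (in nats) of a categorical distribution given as a list."""
    return -sum(pc * math.log(pc) for pc in p if pc > 0.0)


def auroc(scores_in, scores_ood):
    """Probability that an OOD score exceeds an in-distribution score.

    Ties count one half; this equals the area under the ROC curve.
    """
    wins = 0.0
    for s_ood in scores_ood:
        for s_in in scores_in:
            if s_ood > s_in:
                wins += 1.0
            elif s_ood == s_in:
                wins += 0.5
    return wins / (len(scores_in) * len(scores_ood))


# made-up averaged predictions for a three-class problem
in_dist = [[0.98, 0.01, 0.01], [0.90, 0.05, 0.05]]
ood = [[0.40, 0.30, 0.30], [0.34, 0.33, 0.33]]
entropy_in = [predictive_entropy(p) for p in in_dist]
entropy_ood = [predictive_entropy(p) for p in ood]
print(auroc(entropy_in, entropy_ood))
\end{lstlisting}
The pairwise loop is quadratic in the number of examples and is meant only to make the definition explicit; for full test sets a rank-based implementation such as \texttt{roc\_auc\_score} from scikit-learn is the more practical choice.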


\begin{table}[hbt]
\centering
\resizebox{\columnwidth}{!}{
\begin{tabular}{ 
 l c c c c c}
\toprule
  & $\VI$ & baseline & $\ML$  & Avg. Acc & Avg. AUROC\\
\midrule
$\VI$ & $\times$ & 18 & 0  & 0.3942 & 0.8806 \\
baseline & 0  & $\times$ & 0 & 0.3923 & 0.8723 \\
$\ML$  & 27  & 31 & $\times$  & 0.3945  & 0.8970\\ \bottomrule
\end{tabular} 
}
\caption{\textbf{The $\ML$ objective leads to models more capable of detecting corrupted test instances}.
The first block reports the number of successful pairwise Wilcoxon rank-sum tests based on the AUROC values for discriminating between test and OOD samples. The pairwise tests compare if the row objective leads to a significantly higher AUROC than the column objective with entropy as the score function. 
The total number of tests is 75.
Example interpretation for the bottom left entry: In 27 out of all 75 cases (i.e., 36\%) the $\ML$ objective significantly outperforms the $\VI$ objective (while $\VI$ never outperformed $\ML$).
The last two columns display the average accuracy over all seeds and corruptions, and the average AUROC.}
\label{tab:auroc}
\end{table}

\Cref{tab:auroc} shows that models trained with the $\ML$ objective achieve significantly higher AUROC values in at least 36\% of the OOD detection tasks when compared to models trained with the other objectives (27 of 75 tasks against $\VI$ and 31 of 75 against the baseline). The average accuracy over all OOD datasets is similar for all models, while the average AUROC mirrors the results of the hypothesis tests, with the $\ML$-trained models reaching the highest average value. 
In this context, we see that $\VI$ performs better than the baseline (which is not the case in our other experiments).

