\section{DISCUSSION AND CONCLUSION}
\label{sec:discussion}

This work takes a closer look at the ELBO used in the variational training of Bayesian neural networks and at how this objective is approximated in practice.
The commonly used one-sample approximation of the expectation term in the ELBO, \Cref{eq:ELBO}, can be reinterpreted as the log-likelihood of a compound density model (a fact that only a subgroup of researchers in the PAC-Bayesian domain seems to be aware of).

This implies that for $S=1$ the trained stochastic model can \textit{either} be seen as a Bayesian neural network, where we try to approximate the true but unknown posterior, \textit{or} as a compound density model, where we maximize the (regularized) log-likelihood of the parameters of the mixing distribution.
In practice, the difference between these two objectives becomes evident when optimizing with more than one Monte Carlo sample ($S>1$), which has theoretical and practical implications, specifically with regard to the variance of the model predictions.
More precisely, we present a simple proposition indicating that maximizing the ELBO leads to models with lower prediction variance than training with the likelihood-based $\ML$ objective. This is verified through extensive experiments, in which models trained with $\ML$ achieve comparable accuracy but exhibit increased prediction variance and increased function space diversity compared to identically initialized models trained with the ELBO or the one-sample approximation (baseline).
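
To make the distinction concrete, the two data-fit terms can be sketched for a single observation $(x, y)$. The notation is assumed here for illustration and may differ from the definitions used earlier in the paper: $\theta_1, \dots, \theta_S$ denote weight samples drawn from the variational (mixing) distribution, $p(y \mid x, \theta_s)$ is the likelihood under sample $s$, and the regularization terms are omitted.
\[
  \frac{1}{S} \sum_{s=1}^{S} \log p(y \mid x, \theta_s)
  \;\le\;
  \log \Bigl( \frac{1}{S} \sum_{s=1}^{S} p(y \mid x, \theta_s) \Bigr)
\]
The left-hand side is the averaged-log form that appears in a Monte Carlo estimate of the ELBO, and the right-hand side is the log-of-average form of a likelihood-based objective. The inequality is Jensen's inequality applied to the concave logarithm. It holds with equality for $S=1$, which is consistent with the two interpretations coinciding in the one-sample case.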

The aforementioned properties are linked to model robustness against adversarial examples and to OOD detection performance, a link that we also observe empirically. However, encouraging function space diversity for networks that are capable of making highly confident correct predictions for a given task naturally leads to a degradation of prediction confidence and therefore also of the negative log-likelihood. Hence, the findings of~\cite{wei2022performance} do not discredit the performance of the $\ML$ objective, but lend support to \cite{jeffares2023joint,abe2023pathologies}, who state that enhancing diversity between ensemble members is not advantageous in this particular setting.

In contrast, we see that enhancing diversity helps when using a poorly suited model architecture (such as the FF in our CIFAR10 experiments) or when tackling a challenging task, as with DermaMNIST.
In such cases, $\ML$ performs favorably compared to the baseline and $\VI$. This behavior resembles the idea of boosting, where combining weak learners can yield stronger overall performance when their errors are sufficiently diverse.
In this context, we add another misspecification setting in which the $\ML$ objective is beneficial to the findings of \cite{morningstar2022pacm}, namely the suitability of the model architecture itself, in addition to that of the prior and the likelihood definition.
Another contribution of this paper is that our investigation of the gradients in \Cref{sec:ImplicationsTraining} can explain the toy regression findings of \citet{morningstar2022pacm}, where $\ML$-trained models can reproduce multi-modal predictive distributions and are more robust to outliers: because of the implicitly encouraged high function space diversity, the averaged likelihood of an outlier is increased, which reduces the influence of the gradient direction that would improve the model on this particular sample.
At the same time, the same mechanism allows the model to explore and learn multimodal distributions, as the impact of a few wrong predictions does not dominate the gradient direction when training with the $\ML$ objective.
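
One way to see this mechanism, sketched under assumed notation rather than taken from the derivation in \Cref{sec:ImplicationsTraining}, is to write the gradient of the log-of-average data-fit term as a weighted sum of per-sample gradients. Here the samples $\theta_s$ are assumed to be reparameterized so that they depend differentiably on the variational parameters $\phi$, and $p_s = p(y \mid x, \theta_s)$ is shorthand introduced for this sketch.
\[
  \nabla_\phi \log \Bigl( \frac{1}{S} \sum_{s=1}^{S} p_s \Bigr)
  = \sum_{s=1}^{S} w_s \, \nabla_\phi \log p_s ,
  \qquad
  w_s = \frac{p_s}{\sum_{j=1}^{S} p_j}
\]
The averaged-log term of the ELBO corresponds to uniform weights $w_s = 1/S$. Under the likelihood-based weighting, samples that assign a low likelihood to an observation contribute little to its gradient, so a few poorly fitting samples do not dominate the update.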

Lastly, our analysis supports the common practice of using the one-sample approximation (baseline) as a good approximation to training with the ELBO ($\VI$), as it reproduces training and test behavior similar to that of the multi-sample approximation (though its weight distributions are more similar to those of models trained with $\ML$).

The presented results show that the way to train stochastic neural networks should depend on the characteristics of the problem itself: when large prediction variance is advantageous to remedy misspecification, to tackle a hard classification task with little available data, or to increase robustness against adversarial or OOD inputs (and no Bayesian interpretation is needed), training a compound density model with the regularized maximum-likelihood objective $\ML$ with $S>1$ can indeed be a capable alternative to the ELBO.
