\section{INTRODUCTION}
\label{sec:intro}
With rapid advances in model performance over the past decade, the deep learning community has increasingly focused on developing methods to quantify model uncertainty---critical for ensuring reliable predictions, particularly in high-stakes applications like healthcare, autonomous systems, and scientific research.

Bayesian neural networks \citep{neal1993bayesian,mackay1992practical} are commonly used for this purpose; their main challenge is deriving a posterior distribution over the parameters.
A central approach for deriving an approximate posterior is variational inference, where a parametric distribution $q(\theta)$ is fitted to the unknown true posterior $p(\theta| \mathcal{D})$ by minimizing the Kullback-Leibler (KL) divergence between the two, i.e.,%:
\begin{equation}
    \KL [q(\theta) \ \Vert \ p(\theta| \mathcal{D})] \enspace .
    \label{eq:KL}
\end{equation}
This divergence is not directly tractable because the true posterior $p(\theta| \mathcal{D})$ is unknown, but it can be decomposed into the logarithm of the marginal likelihood $p(\mathcal{D})$ (also called evidence) minus a second term called the \textit{evidence lower bound} (ELBO).
Since the evidence $p(\mathcal{D})$ does not depend on the variational distribution $q(\theta)$,
one can minimize \Cref{eq:KL} by maximizing the ELBO.
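As a brief sketch of this decomposition (a standard identity, written here under the assumption of a prior $p(\theta)$, a likelihood $p(\mathcal{D}| \theta)$, and no tempering), inserting Bayes' rule $p(\theta| \mathcal{D}) = p(\mathcal{D}| \theta) \, p(\theta) / p(\mathcal{D})$ into \Cref{eq:KL} gives
\[
    \KL [q(\theta) \ \Vert \ p(\theta| \mathcal{D})] = \ln p(\mathcal{D}) - \Big( \mathbb{E}_{q(\theta)}[\ln p(\mathcal{D}| \theta)] - \KL [q(\theta) \ \Vert \ p(\theta)] \Big) \enspace ,
\]
where the term in parentheses is the ELBO; since the KL divergence is nonnegative, the ELBO is indeed a lower bound on $\ln p(\mathcal{D})$.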
For a data set consisting of $N$ input-output pairs $\{(x_n,y_n) \}_{n=1}^N$, this objective is given by\footnote{%
Theory prescribes $\lambda = 1$.
In practice, tempering is often applied, i.e., some $\lambda > 0$ with $\lambda \neq 1$ is chosen.
Values $\lambda < 1$, however, gave rise to the ``cold posterior'' discussion in BNNs, as they can increase test set accuracy~\citep{wenzel2020howgoodposterior} but alter the training assumptions.}
\begin{equation} 
\begin{split}
  \label{eq:ELBO}
        \mathcal{L}_{\text{VI}}= \sum_{n=1}^N \mathbb{E}_{q(\theta) }[\ln p(y_n| x_n, \theta)]-  \lambda \,
        \KL [  q(\theta)\ \Vert \ p(\theta)]
        \enspace ,  
\end{split} \tag{VI}
\raisetag{1.5ex} 
% \vspace{-5mm}
\end{equation}
where $p(\theta)$ is a prior distribution over the parameters.
In general, however, there is no closed-form solution for the expectation term in \Cref{eq:ELBO}, so Monte Carlo (MC) approximations are applied in practice.\footnote{
Exceptions exist for simple edge cases, local linearization \citep{goulet2021tractable}, or at stationary points \citep{dammELBOconvergence,velychko2024learning,lucke2024convergenceelboentropysums}.}
That is, for a given input-output pair $(x_n, y_n)$, the expectation is approximated by $\frac{1}{S} \sum_{s=1}^S \ln p(y_n| x_n, \theta_s)$, with $\theta_1, \dots, \theta_S$ being i.i.d.\ draws from
$q(\theta)$.
The error of this estimate is known to decrease at a rate of $1/\sqrt{S}$,
so a large number of samples is needed to obtain a small
approximation error.
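To illustrate this rate, a short sketch under the assumption that $\ln p(y_n| x_n, \theta)$ has finite variance under $q(\theta)$: independence of the draws gives
\[
    \mathrm{Var}\Big[ \frac{1}{S} \sum_{s=1}^S \ln p(y_n| x_n, \theta_s) \Big] = \frac{1}{S} \, \mathrm{Var}_{q(\theta)}\big[ \ln p(y_n| x_n, \theta) \big] \enspace ,
\]
so the standard deviation of the estimate shrinks as $1/\sqrt{S}$; halving it, for example, requires four times as many samples.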

However, during training the expectation term is usually approximated with \textit{just a single MC sample} $\theta_1 \sim q(\theta)$, 
resulting in
\begin{equation}
\label{eq:baseline}
        \sum_{n=1}^N \ln p(y_n| x_n, \theta_1)-   \lambda \, \KL[q(\theta ) \ \Vert \ p(\theta)]
        \enspace.  \tag{baseline} 
\end{equation}
This frequently utilized one-sample MC estimate\footnote{The one-sample approximation is a standard choice in Bayesian neural network training, e.g., in foundational works such as Auto-Encoding Variational Bayes \citep{kingma2014auto}, as well as dedicated libraries like Bayesian Torch \citep{krishnan2022bayesiantorch} and BayesDLL 
\citep{bayesdll_kim_hospedales_2023}.} of \Cref{eq:ELBO} is \textit{also} the one-sample approximation of a regularized maximum likelihood objective
\begin{equation} % unnumbered equation as we use a custom tag
\begin{split}
  \mathcal{L}_{\text{ML}}=
        \sum_{n=1}^N \ln\big( \mathbb{E}_{q(\theta)}[p(y_n| x_n, \theta)]\big)- \lambda \, \KL[ q(\theta) \ \Vert \ p(\theta)]  \enspace.
        \label{eq:ML_objective}
\end{split}
\tag{ML} 
 \raisetag{1.5ex}
\end{equation}
This objective, $\ML$, differs from $\VI$ in \Cref{eq:ELBO} only in the order of expectation and logarithm in the first term: $\mathbb{E}_{q(\theta)}[\ln(\cdot)]$ is replaced by $\ln( \mathbb{E}_{q(\theta)}[\cdot])$.
Maximizing $\ML$ no longer guarantees a reduction of the KL divergence between the approximate and the true posterior distribution.\footnote{An exception is the edge case where Jensen's inequality between $\mathbb{E}_{q(\theta)}[\ln(\cdot)]$ and $\ln( \mathbb{E}_{q(\theta)}[\cdot])$
becomes an equality.
}
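As a brief sketch of why the two objectives coincide for a single sample and separate otherwise, consider i.i.d.\ draws $\theta_1, \dots, \theta_S$ from $q(\theta)$ and, as one natural choice that we assume here for illustration, the plug-in estimates of the first terms of $\mathcal{L}_{\text{VI}}$ and $\mathcal{L}_{\text{ML}}$ for a single pair $(x_n, y_n)$:
\[
    \frac{1}{S} \sum_{s=1}^S \ln p(y_n| x_n, \theta_s)
    \quad \text{and} \quad
    \ln \Big( \frac{1}{S} \sum_{s=1}^S p(y_n| x_n, \theta_s) \Big) \enspace .
\]
For $S=1$ both reduce to $\ln p(y_n| x_n, \theta_1)$, the summand appearing in \Cref{eq:baseline}.
For $S>1$, the concavity of the logarithm (Jensen's inequality) yields
\[
    \frac{1}{S} \sum_{s=1}^S \ln p(y_n| x_n, \theta_s) \ \leq \ \ln \Big( \frac{1}{S} \sum_{s=1}^S p(y_n| x_n, \theta_s) \Big) \enspace ,
\]
with equality if and only if all values $p(y_n| x_n, \theta_s)$ coincide, so the gap between the two estimates grows with the spread of the sampled likelihoods.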
The first term in $\mathcal{L}_{\text{ML}}$ corresponds to the log-likelihood under a compound distribution, where the likelihood is averaged over the mixing distribution $q(\theta)$:
\begin{equation}
    p(y|x)= \int p(y|x,\theta) \ q(\theta) \, \mathrm{d} \theta \enspace .
\end{equation}
It thus corresponds to the predictive log-loss, which is also used for test-time predictions or evaluation.
The second term acts as a regularizer, encouraging the mixing distribution $q(\theta)$ to remain close to a pre-specified distribution $p(\theta)$, as measured by the KL divergence.
In contrast to the ELBO, $p(\theta)$ does not need to be a prior distribution in the Bayesian sense, but can be chosen freely.
To summarize, maximizing $\ML$ minimizes the (regularized) predictive risk (log-loss) of a compound distribution, while maximizing $\VI$ minimizes the KL divergence to the true posterior.

The objective in \Cref{eq:ML_objective} is not new. It has been shown to enable tighter generalization bounds within PAC-Bayesian theory and is known under various names, e.g., as \emph{direct loss minimization}~\citep{sheth2020PseudoBayesian,wei2021DirectLossGaussianP,wei2022performance}, {\PACm}~\citep{morningstar2022pacm}, or \emph{predictive variational Bayesian inference}~\citep{futami22a}.
Besides the theoretically grounded advantages, $\ML$ was shown to behave favorably in practice, especially in the misspecified setting \citep{morningstar2022pacm}, for (sparse) Gaussian processes \citep{sheikh2017stochastic,jankowiak2020parametric,wei2021DirectLossGaussianP}, and in capturing aleatoric uncertainty \citep{masegosa_model_misspecification}.
In contrast, for BNNs there are findings indicating that $\VI$ performs favorably \citep{wei2022performance}.

However, a thorough understanding of the effects of training stochastic neural networks with $\VI$ or $\ML$, especially in comparison to their common one-sample approximation, is still missing.
We close this gap by conducting an in-depth analysis
of the implications of the changed training objective for the multi-class classification setting.
We pay particular attention to the diversity of predictions, as it is
key for performance and generalization \citep[e.g.,][]{masegosa_model_misspecification,futami22a,ortega2022diversity}.
Besides standard performance measures (NLL, accuracy, ECE), we also investigate the effect of increased prediction variance on adversarial robustness and the capability of detecting out-of-distribution samples.

The presented insights on prediction variance also clarify conflicting findings in the literature, thereby bridging different research branches.

\paragraph*{Main Contributions}
\begin{itemize}
    \item We observe that the ELBO ($\VI$) and the regularized maximum likelihood objective ($\ML$)
    are indistinguishable when approximating them with only a single Monte Carlo sample, i.e., when $S=1$, raising the question of how the losses and the resulting models differ for $S>1$, and whether models trained with $S=1$ are better understood as optimizing $\VI$ or $\ML$.
    \item We investigate both losses theoretically and empirically in the multi-class classification setting and demonstrate that training with $\ML$ leads to significantly higher diversity in predictions. 
    \item We find that the performance of $\ML$ relative to $\VI$ and the common one-sample approximation depends on the `hardness' of the task: for `hard' tasks and tasks with model misspecification, $\ML$ typically outperforms $\VI$ and the baseline, while its NLL and ECE are typically worse on `easy' tasks. In addition, $\ML$ yields models that are more robust to OOD inputs and adversarial attacks.
    \item Finally, we confirm that the commonly used one-sample approximation closely resembles the standard training with $\VI$ (which justifies its use for training BNNs).
\end{itemize}
