\section{RELATED WORK}
\label{sec:related_work}

\paragraph{$\ML$ loss for neural networks}
Several works arrive at $\ML$ from different perspectives.
For example, \citet{morningstar2022pacm} motivate the derivation from the distinction between the predictive risk $\mathcal{P}(q) = -\E_{\nu(X)}\left[ \ln \E_{q(\Theta)} [p(x|\Theta)] \right]$ (in this work termed $\ML$) and the inferential risk $\mathcal{R}(q) = -\E_{\nu(X)}\left[ \E_{q(\Theta)} [\ln p(x|\Theta)] \right]$ (here denoted $\VI$).
Building on \citet{masegosa_model_misspecification}, who found that under model misspecification the inferential risk is not a tight bound on the predictive risk, \citet{morningstar2022pacm} leverage an expectation approximation trick following \citet{burda_IWAE_2016} to derive PAC-Bayesian-like guarantees for their \PACm-bound, which is identical to $\ML$ in their numerical approximation.
However, their derived bound is vacuous for any fixed number of samples \cite[Appendix B.2]{morningstar2022pacm}
and can therefore serve only as a theoretical motivation for the $\ML$ objective.
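
To make the relation between the two risks explicit, the following is a short sketch that uses only Jensen's inequality and the definitions above; the sample count $K$ and the samples $\theta_k$ are notation introduced here for illustration and are not taken from the cited works.
Since $\ln$ is concave, the inferential risk upper-bounds the predictive risk,
\[
\mathcal{P}(q) = -\E_{\nu(X)}\left[ \ln \E_{q(\Theta)} [p(x|\Theta)] \right] \;\le\; -\E_{\nu(X)}\left[ \E_{q(\Theta)} [\ln p(x|\Theta)] \right] = \mathcal{R}(q),
\]
with equality only if $p(x|\Theta)$ is almost surely constant under $q(\Theta)$.
In practice, the inner expectation of the predictive risk would typically be replaced by a finite-sample average,
\[
\ln \E_{q(\Theta)} [p(x|\Theta)] \;\approx\; \ln \left( \frac{1}{K} \sum_{k=1}^{K} p(x|\theta_k) \right), \qquad \theta_k \sim q(\Theta),
\]
which is a biased estimate for finite $K$ and reduces to the integrand of the inferential risk for $K = 1$.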

\mbox{\citet{wei2022performance}} examine $\ML$ (which they term \textit{direct loss minimization} with an additional regularization term) empirically for BNNs and find that models trained with $\ML$ perform and generalize worse than their counterparts trained with $\VI$. That is, models trained with $\VI$ achieve better negative log-likelihoods across all classification tasks and models they tested, which they hypothesize is due to optimization difficulties or overfitting. We confirm this finding on `easy' classification tasks with sufficient training data (although test accuracy does not appear to be affected).
In contrast, we find that $\ML$ outperforms the ELBO $\VI$ on misspecified and challenging tasks (FF-CIFAR10 and DermaMNIST, respectively).

\citet{dusenberry2020efficient} briefly evaluate empirically the impact of exchanging $\ln$ and $\E$ during training without KL regularization (here termed negative log-marginal-likelihood or mixture NLL).
They find that models trained with $\ML$ on CIFAR10 yield the worst test-set performance in terms of expected calibration error, log-likelihood, and accuracy. They hypothesize that for ``misspecified models such as overparametrized neural networks, training a looser bound on the log-likelihood leads to improved predictive performance.''


\paragraph{Maximizing variances during neural network training}
Other works explicitly use methods for enhancing variances to boost prediction performance. 
For example, based on the derivation of a second-order PAC-Bayes bound, \citet{masegosa_model_misspecification} propose to include the prediction variance in the ensemble learning objective, whereas \citet{futami2021loss} boost the variances between losses in their approach.
In related work, \citet{ortega2022diversity} investigate the interplay between generalization performance and diversity for neural network ensembles. Their main theorem gives insight into what drives ensemble diversity, namely (a) uncorrelated ensemble members and (b) different predictions across models and data samples. Interestingly, they find that the relation between generalization and diversity is not present when operating in the ``interpolation regime'' for ResNet architectures on CIFAR10, where empirical errors are close to zero. Conversely, it has been shown that artificially inflating the diversity of neural ensembles does not improve generalization and in fact degrades performance \cite{jeffares2023joint,abe2023pathologies}.
