\section{An analysis of the variance}
\label{sec:analysis_variance}

We theoretically investigate the difference between the two losses of interest and find the diversity of the prediction to be the key differentiating factor. 
We then validate these findings empirically.

\subsection{Theoretical considerations}
\label{sec:theory}
By Jensen's inequality, we see that the first term in \Cref{eq:ML_objective} is at least as large as that of \Cref{eq:ELBO}.
That is, ceteris paribus, the KL divergence has a relatively lower influence for \Cref{eq:ML_objective}, compared to \Cref{eq:ELBO}, allowing for stronger deviations from the prior. 
Further analysis shows that we can characterize the Jensen gap%, i.e.,
\begin{equation}
    J(q(\theta)) := \ML(q(\theta)) - \VI(q(\theta))
\end{equation}
by variations in the predictions:

\begin{proposition}[Bounds on the Jensen Gap]
\label{theorem:JensenGap}
Consider a parametrized distribution $p: (\mathcal{X} \times \mathcal{Y}) \times \Theta $, a posterior $q(\theta)$ over the parameter space $\Theta$, and input pairs $(x_n,y_n) \in (\mathcal{X} \times \mathcal{Y})$ for $n \in \{1,\dots,N\}$.
Assume that for each $n$, $p(y_n|x_n,\theta)$ satisfies $p(y_n|x_n,\theta) \in [a_n,1]$ for $a_n > 0$ with mean $\mu_n = \mathbb{E}_\theta [p(y_n|x_n,\theta) ]$, mean absolute deviation  $\absdev_{n} = \mathbb{E}_\theta[|p(y_n|x_n,\theta) - \mu_n|]$ and variance $ \sigma_n^2 = \mathbb{E}_\theta[(p(y_n|x_n,\theta) - \mu_n)^2]$. 
Then, the Jensen gap $J(q(\theta))$ between the objectives 
is bounded by
\begin{equation}
\sum_{n=1}^N \max\left\{\frac{ \sigma_n^2}{2},\delta_{p,n}\right\} \leq J(q(\theta)) \leq \sum_{n=1}^N \min\left\{\frac{\sigma_n^2}{2 a_n^2} , \frac{\absdev_n }{a_n} \right\}
\end{equation}
where for $p>1$ and $n \in \{1,\dots,N\}$
\begin{equation}
    \delta_{p,n} 
    :=
 \ln \left( \frac{\mathbb{E}_{q(\theta)}[p(y_n| x_n, \theta)]}{\left(\mathbb{E}_{q(\theta)}\left[{p(y_n| x_n, \theta)^\frac{1}{p}}\right]\right)^p} \right) \ge 0 \enspace .
\end{equation}
\end{proposition}

The quantity $\delta_p$, which we refer to as {$p$-compressed expectation spread}, is, like the variance, a measure of variability of $p(y_n| x_n, \theta)$.
Thus, the \emph{Jensen gap can be characterized by variations in the predictions}: variance or absolute deviation for the upper bound, and variance or $\delta_p$ for the lower bound.
The difference between $\VI$ and $\ML$ thus grows with larger variations and shrinks to zero as the variations vanish, with both the lower and the upper bound scaling linearly in the respective measure of variation.
Equality between the two objectives is reached if and only if $\forall n: \sigma_n^2 = \Var_{q(\theta)}[p(y_n|x_n,\theta)] = 0$.
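
As a minimal worked illustration (a hypothetical two-point example, not taken from our experiments), consider $N=1$ and suppose that $p(y_1|x_1,\theta)$ takes the values $0.5$ and $1$ with equal probability under $q(\theta)$, so that $a_1 = 0.5$. Assuming that the two objectives differ only in their first term, so that the gap equals $\ln \mathbb{E}_{q(\theta)}[p(y_1|x_1,\theta)] - \mathbb{E}_{q(\theta)}[\ln p(y_1|x_1,\theta)]$, and choosing the exponent $p=2$ for the spread, we obtain (rounded where not exact)
\begin{align*}
    \mu_1 &= 0.75, \qquad \sigma_1^2 = 0.0625, \qquad \absdev_1 = 0.25, \\
    J(q(\theta)) &= \ln 0.75 - \tfrac{1}{2} \ln 0.5 \approx 0.0589, \\
    \delta_{2,1} &= \ln \left( \frac{0.75}{\left(\tfrac{1}{2}\left(\sqrt{0.5} + 1\right)\right)^2} \right) \approx 0.0290, \\
    \max\left\{\frac{\sigma_1^2}{2}, \delta_{2,1}\right\} &= 0.03125 \;\leq\; 0.0589 \;\leq\; 0.125 = \min\left\{\frac{\sigma_1^2}{2 a_1^2}, \frac{\absdev_1}{a_1}\right\} .
\end{align*}
In this particular example, both bounds are attained by their variance terms; other distributions may favor $\delta_{p,n}$ or the absolute deviation instead.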

The proof is deferred to \Cref{sec:proof}, alongside further explanations on the $p$-compressed expectation spread $\delta_p$ derived from the self-improving AM-GM inequality \citep{AMGM_2009}.
Note that the Jensen gap is also investigated in other works, e.g., by \cite{masegosa_model_misspecification}, who present a lower bound on the Jensen gap in terms of the prediction variance, and by~\citet{futami2021loss}. A discussion of existing results is given in \Cref{sec:proof}, and an empirical comparison of the different bounds in \Cref{fig:JensenGapComparison}.

\Cref{theorem:JensenGap} suggests that diversity in the predictions may be the key factor in analyzing the effects of the above-described `$\log \E$xchange'.
A further inspection of the gradients adds to these findings. The gradients in their $S$-sample approximation read:

\begin{align}
\label{eq:gradVI}
   & \nabla_{\theta} \E \ln: \quad \frac{1}{S} \sum_{s=1}^S \frac{  \nabla_{\theta} \  p(y_n| x_n, \theta_s)}{  p(y_n| x_n, \theta_s)} \enspace ,  \\
\label{eq:gradML}
   & \nabla_{\theta} \ln \E: \quad \frac{1}{S} \sum_{s=1}^S  \frac{ \nabla_{\theta}  \ p(y_n| x_n, \theta_s)}{\frac{1}{S} \sum_{r=1}^S  p(y_n| x_n, \theta_r)}  \enspace .
\end{align}
The main difference between these gradients lies in how $\nabla_{\theta} \ p(y_n| x_n, \theta_s)$ is scaled. For $\VI$, it is scaled by the likelihood of the observation for each $\theta_s$ individually (\Cref{eq:gradVI}); for $\ML$, it is scaled by the \textit{average} likelihood of the observation over all $S$ draws from $q(\theta)$ (\Cref{eq:gradML}).
Suppose that a model $\theta_s$ has low confidence for a given sample $(x_n, y_n)$.
For $\VI$, this strongly impacts the gradient, since the weighting is inversely proportional to the confidence.
In contrast, because of the averaged predictions in the denominator of the $\ML$ gradient, the gradient contribution of a single low-confidence model is comparatively down-weighted whenever the overall likelihood of the mixture $\frac{1}{S} \sum_{s=1}^S p(y_n| x_n, \theta_s)$ is sufficiently high.
This effect is expected to reduce the diversity between individual posterior samples for $\VI$, while allowing for more diversity for $\ML$ (and the possibility to learn multiple modes in the posterior, as seen in the toy examples in \cite{morningstar2022pacm}).
For the one-sample approximation ($S=1$), this gradient down-weighting effect is not present and we therefore expect the one-sample approximation to behave more similarly to $\VI$.
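
To make the different scalings explicit, the $\ML$ gradient in \Cref{eq:gradML} can be rewritten as a reweighted version of the $\VI$ gradient in \Cref{eq:gradVI}; the weights $w_s$ are notation introduced here for illustration only:
\begin{align*}
    \frac{1}{S} \sum_{s=1}^S \frac{\nabla_{\theta} \ p(y_n| x_n, \theta_s)}{\frac{1}{S} \sum_{r=1}^S p(y_n| x_n, \theta_r)}
    = \frac{1}{S} \sum_{s=1}^S w_s \, \frac{\nabla_{\theta} \ p(y_n| x_n, \theta_s)}{p(y_n| x_n, \theta_s)} ,
    \qquad
    w_s := \frac{p(y_n| x_n, \theta_s)}{\frac{1}{S} \sum_{r=1}^S p(y_n| x_n, \theta_r)} \enspace .
\end{align*}
For instance, with the hypothetical values $S=2$, $p(y_n| x_n, \theta_1) = 0.9$, and $p(y_n| x_n, \theta_2) = 0.1$, the average likelihood is $0.5$, so that $w_1 = 1.8$ and $w_2 = 0.2$: the term of the low-confidence sample $\theta_2$ is scaled by $1/0.1 = 10$ under $\VI$, but only by $1/0.5 = 2$ under $\ML$. For $S=1$, we have $w_1 = 1$ and the two expressions coincide.
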
Motivated by the theoretical considerations above, we proceed to investigate the differences that manifest in practice when training with $\VI$ vs. $\ML$.
