\section{Addendum}\label{sec:addendum}
As mentioned in the \hyperref[sec:related-work]{related work} section, \citet{giordano2023black} show that the ELBO optimization problem becomes unbounded when the sample size is smaller than the dimension of the latent space, so optimizing with such a sample is futile.
In this section, we provide a proof sketch of this result, adapted to our notation.


\begin{theorem}[Theorem 2 of \citet{giordano2023black}]
  Let $q_\theta$ be a Gaussian distribution with parameters $\theta = (\mu, L\T{L})$, where $\mu \in \mathbb{R}^{d_Z}$ and $L \in \mathbb{R}^{d_Z \times d_Z}$ is a lower-triangular matrix with positive diagonal elements. 
  If we draw a sample of size $n < d_Z$ from $q_{\mathrm{base}}$, denoted by $\boldsymbol{\epsilon} = (\epsilon_1, \dots, \epsilon_n)$, then the optimization problem in Eq.~\eqref{eq:elbo-SAA} is unbounded:
  \begin{equation*}
    \sup_{\theta \in \Theta} \,\hat{\mathcal{L}}_{\boldsymbol{\epsilon}}(\theta) = \sup_{\theta \in \Theta}\, \frac{1}{n}\sum^n_{i=1}[\ln p(z_\theta(\epsilon_i), x)-\ln q_\theta(z_\theta(\epsilon_i))] = \infty.
  \end{equation*}
\end{theorem}

\begin{proof}
  Since $n < d_Z$, there exists a nonzero vector $\col{v} \in \mathbb{R}^{d_Z}$ such that $\langle \col{v}, \epsilon_i \rangle = 0$ for all $1 \leq i \leq n$.
  Without loss of generality, assume that the largest index $\ell$ with $\col{v}_\ell \neq 0$ satisfies $\col{v}_\ell = 1$.
  Define the lower triangular matrix
  \begin{equation*}
    L_\lambda = \begin{pmatrix}
      I_{\ell-1} && \col{0} \\ 
      &\lambda\row{v} \\
      \col{0} && I_{d_Z-\ell}
    \end{pmatrix}.
  \end{equation*}
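  Row $\ell$ of $L_\lambda$ is $\lambda\row{v}$ and $\langle \col{v}, \epsilon_i \rangle = 0$, so $(L_\lambda \epsilon_i)_\ell = \lambda\langle \col{v}, \epsilon_i \rangle = 0 = (L_0 \epsilon_i)_\ell$ for all $1 \leq i \leq n$.
  Since the remaining rows of $L_\lambda$ do not depend on $\lambda$, it follows that $L_\lambda \epsilon_i = L_0 \epsilon_i$.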
  Let $\theta_\lambda = (\col{0},  L_\lambda\T{ L_\lambda})$.
  For $\lambda > 0$, we obtain
  \begin{equation*}
    \begin{aligned}
      \hat{\mathcal{L}}_{\boldsymbol{\epsilon}}(\col{0},  L_\lambda\T{ L_\lambda})
      &= \frac{1}{n}\sum^n_{i=1}[\ln p(L_\lambda\epsilon_i, x)-\ln q_{\theta_\lambda}(L_\lambda\epsilon_i)] \\
      &= \frac{1}{n}\sum^n_{i=1}[\ln p(L_0\epsilon_i, x)-\ln q_{\theta_\lambda}(L_\lambda\epsilon_i)] \\
      &= c + \ln\lambda,
    \end{aligned}
  \end{equation*}
  where $c$ is a constant independent of $\lambda$.
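  As a brief sketch of why only $\ln\lambda$ depends on $\lambda$, one can evaluate the Gaussian density directly, using only that $q_{\theta_\lambda}$ is the $\mathcal{N}(\col{0}, L_\lambda\T{L_\lambda})$ density.
  The matrix $L_\lambda$ is lower triangular with unit diagonal entries except for the $\ell$-th, which equals $\lambda\col{v}_\ell = \lambda$, so $\det L_\lambda = \lambda$ and
  \begin{equation*}
    -\ln q_{\theta_\lambda}(L_\lambda\epsilon_i) = \frac{d_Z}{2}\ln(2\pi) + \ln\det L_\lambda + \frac{1}{2}\langle \epsilon_i, \epsilon_i \rangle = \frac{d_Z}{2}\ln(2\pi) + \ln\lambda + \frac{1}{2}\langle \epsilon_i, \epsilon_i \rangle.
  \end{equation*}
  Under this reading, the constant is $c = \frac{d_Z}{2}\ln(2\pi) + \frac{1}{n}\sum^n_{i=1}[\ln p(L_0\epsilon_i, x) + \frac{1}{2}\langle \epsilon_i, \epsilon_i \rangle]$.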
  
  The result follows by letting $\lambda \to \infty$.
\end{proof}
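
The construction in the proof can be illustrated with a small instance; the numbers below are hypothetical, chosen only for this sketch, and are not taken from \citet{giordano2023black}.
Take $d_Z = 2$, $n = 1$, and $\epsilon_1 = \T{(1, 1)}$.
Then $\col{v} = \T{(-1, 1)}$ satisfies $\langle \col{v}, \epsilon_1 \rangle = 0$ with $\ell = 2$ and $\col{v}_\ell = 1$, and
\begin{equation*}
  L_\lambda = \begin{pmatrix}
    1 & 0 \\
    -\lambda & \lambda
  \end{pmatrix},
  \qquad
  L_\lambda \epsilon_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}
  \quad \text{for every } \lambda > 0.
\end{equation*}
The sampled point, and hence $\ln p(L_\lambda\epsilon_1, x)$, does not move with $\lambda$, whereas $\det L_\lambda = \lambda$, so $-\ln q_{\theta_\lambda}(L_\lambda\epsilon_1)$ grows like $\ln\lambda$.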

With this result in mind, we adapted the SAA for VI algorithm for the case of a dense covariance matrix approximation: we now start from a sample of size $n$ equal to twice the smallest power of two exceeding the latent space dimension $d_Z$.
Tables~\ref{table:ratio-time-adam-min-size-adjusted} and~\ref{table:ratio-time-batched-quasi-newton-min-size-adjusted} present the new experimental results alongside the previously computed ones.
As the tables show, starting with a larger sample size substantially reduces the running time required to reach a given accuracy.
This outcome was anticipated: when the problem is unbounded, the optimization for smaller $n$ typically terminated only upon reaching the maximum number of iterations, so the entire computational budget was used.
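
The sample-size rule described above can be written in closed form.
The following is a sketch, and the values of $d_Z$ used in it are hypothetical illustrations rather than the dimensions of the models in the tables:
\begin{equation*}
  n = 2 \cdot 2^{\lfloor \log_2 d_Z \rfloor + 1} = 2^{\lfloor \log_2 d_Z \rfloor + 2}.
\end{equation*}
For instance, $d_Z = 100$ gives $\lfloor \log_2 100 \rfloor = 6$ and $n = 2^8 = 256$, while $d_Z = 500$ gives $n = 2^{10} = 1024$.
Since $2^{\lfloor \log_2 d_Z \rfloor + 1} > d_Z$, the rule yields $n > 2 d_Z$ and, in particular, excludes the regime $n < d_Z$ covered by the theorem.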


\begin{table}[ht!]
  \renewcommand{\arraystretch}{1.2}
  \begin{center}
    {
    \begin{tabular}{@{}lrcrrrcrrr@{}}
      \toprule
      % {} &  \multicolumn{3}{c}{Diagonal Covariance} & \phantom{aa} &  \multicolumn{3}{c}{Dense Covariance} \\
      {} & \multicolumn{1}{c}{Adam} & \phantom{aa} & \multicolumn{3}{c}{\begin{tabular}{@{}c}SAA for VI\\original, min $n = 32$\end{tabular}} & \phantom{aa} & \multicolumn{3}{c}{\begin{tabular}{@{}c}SAA for VI\\new, min $n > d_Z$\end{tabular}} \\
      \cmidrule{2-2} \cmidrule{4-6} \cmidrule{8-10}
      {} & \multicolumn{1}{c}{Time} & {}& \multicolumn{1}{c}{Min $n$} & \multicolumn{1}{c}{Time} & \multicolumn{1}{c}{Improvement} && \multicolumn{1}{c}{Min $n$} & \multicolumn{1}{c}{Time} & \multicolumn{1}{c}{Improvement} \\
      {} & \multicolumn{1}{c}{(i)} & &  & \multicolumn{1}{c}{(ii)} & \multicolumn{1}{c}{$\mathrm{(i)}/\mathrm{(ii)}$} &&  & \multicolumn{1}{c}{(iii)} & \multicolumn{1}{c}{$\mathrm{(i)}/\mathrm{(iii)}$} \\
        \midrule
        \textbf{Bayesian log.\ regr.}\\ 
        \hspace{1em}a1a & 19.95 &  & 32 & 19.69 & 1.01 &  & 256 & 4.69 & 4.26 \\
        \hspace{1em}australian & 14.73 &  & 32 & 4.81 & 3.06 &  & 128 & 1.14 & 12.96 \\
        \hspace{1em}ionosphere & 13.47 &  & 32 & 4.33 & 3.11 &  & 128 & 0.80 & 16.85 \\
        \hspace{1em}madelon & 223.55 &  & 32 & 58.52 & 3.82 &  & 1,024 & 2.57 & 86.90 \\
        \hspace{1em}mushrooms & 29.11 &  & 32 & 17.30 & 1.68 &  & 256 & 4.43 & 6.57 \\
        \hspace{1em}sonar & 11.74 &  & 32 & 12.17 & 0.96 &  & 128 & 2.75 & 4.27 \\
        \textbf{Stan models}\\
        \hspace{1em}congress & 50.34 &  & 32 & 0.82 & 61.46 &  & 32 & 0.78 & 64.40 \\
        \hspace{1em}election88 & 1,465.89 &  & 32 & 199.76 & 7.34 &  & 256 & 45.72 & 32.06 \\
        \hspace{1em}election88Exp & --- &  & 32 & 83.68 & --- &  & 256 & 5.59 & --- \\
        \hspace{1em}electric & 235.40 &  & 32 & 42.14 & 5.59 &  & 256 & 13.27 & 17.74 \\
        \hspace{1em}electric-one-pred & 70.62 &  & 32 & 0.62 & 114.40 &  & 32 & 0.60 & 117.46 \\
        \hspace{1em}hepatitis & 264.52 &  & 32 & 96.09 & 2.75 &  & 512 & 11.49 & 23.02 \\
        \hspace{1em}hiv-chr & --- &  & 32 & 29.74 & --- &  & 512 & 4.11 & --- \\
        \hspace{1em}irt & 210.05 &  & 32 & 94.80 & 2.22 &  & 1,024 & 15.38 & 13.65 \\
        \hspace{1em}mesquite & 48.54 &  & 32 & 0.27 & 179.91 &  & 32 & 0.26 & 185.76 \\
        \hspace{1em}radon & 252.85 &  & 32 & 18.66 & 13.55 &  & 256 & 7.43 & 34.03 \\
        \hspace{1em}wells & 18.33 &  & 32 & 0.08 & 221.36 &  & 32 & 0.08 & 232.47 \\
        \bottomrule
        \end{tabular}  
    }
  \caption{
    Comparison of \textbf{running time}, in seconds, for Adam and SAA for VI across various datasets, using a Gaussian approximating distribution with a dense covariance matrix, showing the running time improvement of SAA for VI over Adam.
    The \textbf{minimum sample size} $n$ for SAA for VI is also displayed.
    We consider two settings: one where the minimum $n$ is set to 32 for all datasets, which corresponds to the configuration used in this paper (cf.~Table~\ref{table:ratio-time-adam}), and another where the minimum sample size is the smallest power of 2 greater than twice $d_Z$, the dimension of the latent space.
    The results indicate that avoiding small sample sizes can significantly reduce the running time of SAA for VI.
  }
  \label{table:ratio-time-adam-min-size-adjusted}
  \end{center}
\end{table}

\begin{table}[ht!]
  \renewcommand{\arraystretch}{1.2}
  \begin{center}
    {
    \begin{tabular}{@{}lrcrrrcrrr@{}}
      \toprule
      % {} &  \multicolumn{3}{c}{Diagonal Covariance} & \phantom{aa} &  \multicolumn{3}{c}{Dense Covariance} \\
      {} & \multicolumn{1}{c}{\begin{tabular}{@{}c}Batched \\quasi-Newton\end{tabular}} & \phantom{aa} & \multicolumn{3}{c}{\begin{tabular}{@{}c}SAA for VI\\original, min $n = 32$\end{tabular}} & \phantom{aa} & \multicolumn{3}{c}{\begin{tabular}{@{}c}SAA for VI\\new, min $n > d_Z$\end{tabular}} \\
      \cmidrule{2-2} \cmidrule{4-6} \cmidrule{8-10}
      {} & \multicolumn{1}{c}{Time} & {}& \multicolumn{1}{c}{Min $n$} & \multicolumn{1}{c}{Time} & \multicolumn{1}{c}{Improvement} && \multicolumn{1}{c}{Min $n$} & \multicolumn{1}{c}{Time} & \multicolumn{1}{c}{Improvement} \\
      {} & \multicolumn{1}{c}{(i)} & &  & \multicolumn{1}{c}{(ii)} & \multicolumn{1}{c}{$\mathrm{(i)}/\mathrm{(ii)}$} &&  & \multicolumn{1}{c}{(iii)} & \multicolumn{1}{c}{$\mathrm{(i)}/\mathrm{(iii)}$} \\
        \midrule
        \textbf{Bayesian log.\ regr.}\\ 
        \hspace{1em}a1a & 8.40 &  & 32 & 20.31 & 0.41 &  & 256 & 5.32 & 1.58 \\
        \hspace{1em}australian & 2.55 &  & 32 & 4.81 & 0.53 &  & 128 & 1.14 & 2.24 \\
        \hspace{1em}ionosphere & 2.35 &  & 32 & 4.33 & 0.54 &  & 128 & 0.80 & 2.93 \\
        \hspace{1em}madelon & 384.02 &  & 32 & 62.98 & 6.10 &  & 1,024 & 7.22 & 53.22 \\
        \hspace{1em}mushrooms \hfill\xmark& 7.31 &  & 32 & 18.84 & 0.39 &  & 256 & 5.94 & 1.23 \\
        \hspace{1em}sonar & 3.72 &  & 32 & 12.48 & 0.30 &  & 128 & 2.95 & 1.26 \\
        \textbf{Stan models}\\
        \hspace{1em}congress & 4.99 &  & 32 & 0.82 & 6.10 &  & 32 & 0.78 & 6.39 \\
        \hspace{1em}election88  \hfill\xmark\\
        \hspace{1em}election88Exp  \hfill\xmark\\
        \hspace{1em}electric  \hfill\xmark\\
        \hspace{1em}electric-one-pred & 4.53 &  & 32 & 0.62 & 7.33 &  & 32 & 0.60 & 7.53 \\
        \hspace{1em}hepatitis  \hfill\xmark\\
        \hspace{1em}hiv-chr  \hfill\xmark\\
        \hspace{1em}irt \hfill\xmark& 663.15 &  & 32 & 89.94 & 7.37 &  & 1,024 & 7.24 & 91.55 \\
        \hspace{1em}mesquite & 0.95 &  & 32 & 0.27 & 3.51 &  & 32 & 0.26 & 3.63 \\
        \hspace{1em}radon & 648.76 &  & 32 & 22.06 & 29.41 &  & 256 & 10.67 & 60.78 \\
        \hspace{1em}wells & 0.50 &  & 32 & 0.08 & 6.08 &  & 32 & 0.08 & 6.38 \\
        \bottomrule
        \end{tabular}  
    }
  \caption{
    Comparison of \textbf{running time}, in seconds, for batched quasi-Newton and SAA for VI across various datasets, using a Gaussian approximating distribution with a dense covariance matrix, showing the running time improvement of SAA for VI over batched quasi-Newton.
    The \textbf{minimum sample size} $n$ for SAA for VI is also displayed.
    Among the models for which the batched quasi-Newton method did not fully converge (\xmark), we show results only for \texttt{mushrooms} and \texttt{irt}, as the others diverged.
    Two settings are considered: one with a minimum $n$ of 32 for all datasets, which is the configuration used in this paper (cf.~Table~\ref{table:runtime-comparison-b-quasi-newton}), and another with the minimum sample size set to the smallest power of 2 greater than twice $d_Z$, the dimension of the latent space.
    As in Table~\ref{table:ratio-time-adam-min-size-adjusted}, the results indicate that avoiding small sample sizes can significantly reduce the running time of SAA for VI.
  }
  \label{table:ratio-time-batched-quasi-newton-min-size-adjusted}
  \end{center}
\end{table}
