\section{Detailed comparison with batched quasi-Newton}\label{appendix:batched-quasi-Newton}
This section gives further details on the experiments with the batched quasi-Newton method of \citet{liu2021quasi}.
We compare the maximum ELBO attained by the batched quasi-Newton method with that attained by SAA for VI.
We make this comparison for Gaussian approximations with a diagonal covariance matrix (Table~\ref{table:batched-quasi-Newton-diagonal}) and with a dense covariance matrix (Table~\ref{table:batched-quasi-newton}).
In the diagonal case the batched quasi-Newton results align closely with ours, whereas in the dense case the method often converges to a suboptimal solution.

We also report the wall-clock time for each experiment in Table~\ref{table:runtime-comparison-b-quasi-newton}.
Each experiment was run for 40,000 iterations and repeated in 20 independent runs.
Because our method stops according to a convergence-based criterion, a fair comparison requires an estimate of when batched quasi-Newton converges.
We approximate this in three steps.
First, we compute the highest ELBO attained in each of the 20 independent runs, for both batched quasi-Newton and SAA for VI.
Second, we take the median of these values across the runs for each method.
Third, we take the smaller of the two medians and measure the total time each algorithm takes to come within 1 nat of it.
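The three steps above can be written compactly as follows. This is only a sketch: the notation is introduced here for illustration and is not used elsewhere in the paper. Let $m \in \{\mathrm{BQN}, \mathrm{SAA}\}$ index the method, $r = 1, \dots, 20$ the run, and let $\mathcal{L}_{m,r}(s)$ denote the ELBO of method $m$ in run $r$ after $s$ seconds of wall-clock time. We read ``within 1 nat'' as ``at most 1 nat below'', which is an assumption.
\[
  \mathcal{L}^{\star}_{m,r} = \max_{s} \mathcal{L}_{m,r}(s),
  \qquad
  \widetilde{\mathcal{L}}_{m} = \mathrm{median}_{r}\, \mathcal{L}^{\star}_{m,r},
\]
\[
  \tau = \min\bigl(\widetilde{\mathcal{L}}_{\mathrm{BQN}},\, \widetilde{\mathcal{L}}_{\mathrm{SAA}}\bigr) - 1,
  \qquad
  T_{m,r} = \min\bigl\{ s : \mathcal{L}_{m,r}(s) \ge \tau \bigr\}.
\]
Here $\tau$ is the threshold 1 nat below the smaller of the two medians, and $T_{m,r}$ is the time at which run $r$ of method $m$ first reaches it. How the per-run times are aggregated into a single table entry is not stated in the description above, so the sketch leaves that step open.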


As in the experiments with Adam, this calculation does not account for the time spent on sample sizes that were not useful.
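As a reading aid for the Improvement columns of Table~\ref{table:runtime-comparison-b-quasi-newton}, write $T_{\mathrm{BQN}}$ and $T_{\mathrm{SAA}}$ for the reported running times of batched quasi-Newton and SAA for VI (symbols introduced here for illustration only). The tabulated values are consistent with the ratio
\[
  \mathrm{Improvement} = \frac{T_{\mathrm{BQN}}}{T_{\mathrm{SAA}}},
\]
so that values greater than 1 mean SAA for VI is faster. For example, with the dense approximation, \texttt{madelon} gives $384.02 / 62.98 \approx 6.10$, while \texttt{a1a} gives $8.40 / 20.31 \approx 0.41$, meaning batched quasi-Newton is faster on that model. Small mismatches in other rows are presumably due to the ratio being computed from unrounded times, which is an assumption on our part.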


\begin{table}[ht]
  \renewcommand{\arraystretch}{1.2}
  \centering
  \begin{tabular}{@{}lS[round-mode=places, round-precision=2]cS[round-mode=places, round-precision=2]@{}}
    \toprule
    & \multicolumn{3}{c}{Diagonal Gaussian}\\
    \cmidrule{2-4}
    & \multicolumn{1}{r}{Batched quasi-Newton 16} &\phantom{aa} & {SAA for VI} \\
    % \cmidrule{2-2} \cmidrule{4-4}
    \midrule
    \textbf{Bayesian log.\ regr.}\\
    \hspace{1em}a1a                & -654.94 & & -655.51 \\
    \hspace{1em}australian         & -268.47 & & -269.35 \\
    \hspace{1em}ionosphere         & -138.49 & & -139.62 \\
    \hspace{1em}madelon            & -2466.58 & & -2466.15 \\
    \hspace{1em}mushrooms          & -210.26 & & -211.43 \\
    \hspace{1em}sonar              & -150.14 & & -151.69 \\
    \textbf{Stan models}\\
    \hspace{1em}congress           & 421.91 & & 421.79 \\
    \hspace{1em}election88         & -1426.01 & & -1420.01 \\
    \hspace{1em}election88Exp      & -1382.64 & & -1380.18 \\
    \hspace{1em}electric           & -788.89 & & -788.89 \\
    \hspace{1em}electric-one-pred  & -818.33 & & -818.36 \\
    \hspace{1em}hepatitis          & -560.58 & & -560.44 \\
    \hspace{1em}hiv-chr            & -608.58 & & -608.77 \\
    \hspace{1em}irt                & -15888.14 & & -15887.92 \\
    \hspace{1em}mesquite           & -30.08 & & -30.15 \\
    \hspace{1em}radon              & -1210.73 & & -1210.70 \\
    \hspace{1em}wells              & -2042.37 & & -2042.45 \\
    \bottomrule
\end{tabular}
    \caption{
      Comparison of the \textbf{ELBOs} obtained by batched quasi-Newton and SAA for VI when using a diagonal Gaussian distribution as the approximating distribution. 
      The batched quasi-Newton method of \citet{liu2021quasi} is executed using a sample size of 16. 
      Median results are reported from 20 independent runs for each model. 
      The corresponding results for SAA for VI can also be found in column (ii) of Table~\ref{table:comparison-adam-elbo}.
    }
    \label{table:batched-quasi-Newton-diagonal}
\end{table}

\begin{table*}[t!]
  \renewcommand{\arraystretch}{1.2}
  \begin{center}
    \begin{tabular}{@{}lrrrrrr@{}}
      \toprule
      {} & \multicolumn{6}{c}{Dense Covariance} \\
      \cmidrule{2-7}
      {} & \multicolumn{4}{c}{Batched quasi-Newton---Sample Size} & \phantom{a} & \multirow{2}{*}{SAA for VI} \\
      \cmidrule{2-5} %\cmidrule{7-7}
      & 16 & 32 & 64 & 128 & {} \\
      \midrule
      \textbf{Bayesian log.\ regr.}\\
      \hspace{1em}a1a & -636.49 &  &  &  & {} & -636.40 \\
      \hspace{1em}australian & -256.80 &  &  &  & {} & -256.73 \\
      \hspace{1em}ionosphere & -124.44 &  &  &  & {} & -124.35 \\
      \hspace{1em}madelon \hfill\xmark & -2,418.04 & -2,412.23 & -2,407.44 & -2,406.27 & {} & -2,399.65 \\
      \hspace{1em}mushrooms & -179.96 &  &  &  & {} & -179.89 \\
      \hspace{1em}sonar & -110.09 &  &  &  & {}&-110.04\\
      \textbf{Stan models}\\
      \hspace{1em}congress & 423.59 &   &  &  & {} & 423.55 \\
      \hspace{1em}election88 \hfill\xmark & $-1.15 \times 10^{12}$ & $-8.26 \times 10^{11}$ & $-7.23 \times 10^{11}$ & $-5.87 \times 10^{11}$ & {} & -1,398.03 \\
      \hspace{1em}election88Exp \hfill\xmark & $-3.47 \times 10^{19}$ & $-1.15 \times 10^{18}$ & $-3.72 \times 10^{16}$ & $-1.86 \times 10^{16}$ & {} & -1,381.79 \\
      \hspace{1em}electric \hfill\xmark & $-5.44 \times 10^{10}$ & $-6.20 \times 10^{9}$ & $-5.05 \times 10^{9}$ & $-6.08 \times 10^{9\phantom{0}}$ & {} & -786.91 \\
      \hspace{1em}electric-one-pred & -1,145.79 & -818.00 &  &  & {} & -818.01 \\
      \hspace{1em}hepatitis \hfill\xmark & $-1.99 \times 10^{10}$ & $-1.03 \times 10^{10}$ & $-9.56 \times 10^{9\phantom{0}}$ & $-1.64 \times 10^{10}$ & {} & -557.36 \\
      \hspace{1em}hiv-chr \hfill\xmark & $-6.44 \times 10^{15}$ & $-1.47 \times 10^{16}$ & $-3.59 \times 10^{15}$ & $-1.87 \times 10^{15}$ & {} & -582.78 \\
      \hspace{1em}irt \hfill\xmark & -20,481.68 & -18,573.30 & -17,263.15 & -16,099.44 & {} & -15,884.67 \\
      \hspace{1em}mesquite & -29.78 &  &  &  & {} & -29.83 \\
      \hspace{1em}radon  & $-1.58 \times 10^{6}$ & $-5.50 \times 10^{5}$ & -4,473.35 & -1,209.47 & {}& -1,209.46 \\
      \hspace{1em}wells & -2,041.90 &  &  & & {}& -2,041.95 \\
      \bottomrule
      \end{tabular}
  \caption{
     Final \textbf{ELBO} attained by the batched quasi-Newton method for VI using a Gaussian distribution with a dense covariance matrix \citep{liu2021quasi}. The results for SAA for VI are included as a benchmark (column (v) of Table~\ref{table:comparison-adam-elbo}).
     The batched quasi-Newton method frequently converges to suboptimal solutions, indicated by {\xmark}, especially on models from the Stan examples repository.
     On \texttt{election88}, for example, its ELBO is still around $-5.87 \times 10^{11}$ at a sample size of 128, whereas SAA for VI attains $-1{,}398.03$. The sample size for the batched quasi-Newton method was initially set to 16 and increased when necessary to improve the method's ELBO.
  }
  \label{table:batched-quasi-newton}
\end{center}

\end{table*}



\begin{table}[th]
  \renewcommand{\arraystretch}{1.2}
\centering
\begin{tabular}{@{}lrrrcrrr@{}}
  \toprule
  {} &  \multicolumn{3}{c}{Diagonal Covariance} & \phantom{} &  \multicolumn{3}{c}{Dense Covariance} \\
  \cmidrule{2-4} \cmidrule{6-8}
   {} &  \multicolumn{1}{c}{SAA for VI} & \multicolumn{1}{c}{\begin{tabular}{@{}c@{}}Batched\\quasi-Newton\end{tabular}} & \multicolumn{1}{c}{Improvement}  && \multicolumn{1}{c}{SAA for VI} & \multicolumn{1}{c}{\begin{tabular}{@{}c@{}}Batched\\quasi-Newton\end{tabular}}& \multicolumn{1}{r}{Improvement}  \\
   {} & \multicolumn{1}{r}{(i)}       & \multicolumn{1}{r}{(ii)}            & \multicolumn{1}{r}{$\mathrm{(ii)}/\mathrm{(i)}$} && \multicolumn{1}{r}{(iv)}  &  \multicolumn{1}{r}{(v)} & \multicolumn{1}{r}{$\mathrm{(v)}/\mathrm{(iv)}$}  \\
    \midrule
   \textbf{Bayesian log.\ regr.}\\
   \hspace{1em}a1a & 0.38 & 2.10 & 5.60 &  & 20.31 & 8.40 & 0.41 \\
   \hspace{1em}australian & 0.21 & 1.08 & 5.03 &  & 4.81 & 2.55 & 0.53 \\
   \hspace{1em}ionosphere & 0.17 & 1.10 & 6.50 &  & 4.33 & 2.35 & 0.54 \\
   \hspace{1em}madelon \hfill\xmark& 0.81 & 7.82 & 9.71 &  & 62.98 & 384.02 & 6.10 \\
   \hspace{1em}mushrooms & 0.37 & 2.26 & 6.07 &  & 18.84 & 7.31 & 0.39 \\
   \hspace{1em}sonar & 0.30 & 1.28 & 4.28 &  & 12.48 & 3.72 & 0.30 \\
   \textbf{Stan models}\\
   \hspace{1em}congress & 0.95 & 2.93 & 3.08 &  & 0.82 & 4.99 & 6.10 \\
   \hspace{1em}election88 \hfill\xmark& 8.96 & 1,660.06 & 185.34 &  & --- & --- & --- \\
   \hspace{1em}election88Exp \hfill\xmark& 9.75 & 799.40 & 82.02 &  & --- & --- & --- \\
   \hspace{1em}electric \hfill\xmark& 1.92 & 18.35 & 9.57 &  & --- & --- & --- \\
   \hspace{1em}electric-one-pred & 0.51 & 3.45 & 6.73 &  & 0.62 & 4.53 & 7.33 \\
   \hspace{1em}hepatitis \hfill\xmark& 2.74 & 22.29 & 8.13 &  & --- & --- & --- \\
   \hspace{1em}hiv-chr \hfill\xmark& 2.27 & 30.57 & 13.44 &  & --- & --- & --- \\
   \hspace{1em}irt \hfill\xmark& 1.70 & 37.66 & 22.09 &  & 89.94 & 663.15 & 7.37 \\
   \hspace{1em}mesquite & 0.73 & 1.39 & 1.90 &  & 0.27 & 0.95 & 3.51 \\
   \hspace{1em}radon & 1.57 & 9.80 & 6.25 &  & 22.06 & 648.76 & 29.41 \\
   \hspace{1em}wells & 0.69 & 1.04 & 1.49 &  & 0.08 & 0.50 & 6.08 \\
   \bottomrule
\end{tabular}
  \caption{
    Comparison of \textbf{running times}, in seconds, that SAA for VI and batched quasi-Newton need to come within 1 nat of the smaller of the two methods' median ELBOs, across various models and approximating distributions.
    For the dense covariance approximation, batched quasi-Newton is run with a sample size of 128. Models marked with \xmark{} are those where batched quasi-Newton fails under the dense covariance approximation; among these, running times are reported only for \texttt{madelon} and \texttt{irt}, since they come close to the maximum ELBO. The Improvement columns give the running time of batched quasi-Newton divided by that of SAA for VI; values greater than 1 imply that SAA for VI is faster.
  }
  \label{table:runtime-comparison-b-quasi-newton}
\end{table}

\FloatBarrier