\section{Experiments}\label{sec:experiments}

We now present experiments comparing SAA to other methods in terms of optimization quality and running time.
We examine two types of models following the setup of \citet{burroni2023ustatistics}.
These include 11 models from the Stan model library\footnote{congress, election88, election88Exp, electric, electric-one-pred, hepatitis, hiv-chr, irt, mesquite, radon, and wells}~\citep{standevstancore, carpenter2017stan} as well as Bayesian logistic regression models applied to 6 UCI datasets \citep{Dua:2019}.\footnote{a1a, australian, ionosphere, madelon, mushrooms, and sonar}
Details of the datasets are in Appendix \ref{sec:datasets-description}.
For each model $p(\col Z, x)$, where $\col Z$ is a $d_Z$-dimensional random vector, the approximating distribution $q_\theta$ is either a diagonal Gaussian or a $d_Z$-dimensional multivariate Gaussian with dense covariance.
The former is a product of $d_Z$ independent Gaussians, with parameters $\mu_i$ and $\sigma^2_i > 0$ specific to each $Z_i$.
The latter has mean $\mu \in \R^{d_Z}$ and covariance $L\T{L}$, where $L \in \R^{d_Z\times d_Z}$ is a lower-triangular matrix whose diagonal elements are kept positive by applying the \texttt{softplus} transformation.\footnote{Following \citet{kucukelbir2017automatic}, we transform the model $p$ into one with unconstrained real-valued latent variables, using PyTorch's~\citep{paszke2019pytorch} constraints framework.}
We run all our experiments on GPUs.
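
For concreteness, one standard way to write a draw from the dense family described above is sketched below. The symbols $\epsilon$ and $\tilde{L}$ are illustrative notation introduced here for the standard-normal noise and the unconstrained entries of the factor; the parameterization used in the actual implementation may differ in its details.
\[
  \col Z = \mu + L\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I_{d_Z}),
\]
\[
  L_{ii} = \log\bigl(1 + \exp(\tilde{L}_{ii})\bigr), \qquad
  L_{ij} = \tilde{L}_{ij} \;\; (i > j), \qquad
  L_{ij} = 0 \;\; (i < j).
\]
Under this sketch the covariance of $\col Z$ is $L\T{L}$, which is positive definite because every diagonal entry of $L$ is strictly positive.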

We conduct performance comparisons and an ablation study. 
We compare primarily to Adam with a fixed step size, which is commonly used for black-box VI optimization, and batched quasi-Newton, a newer method that introduces second-order information in the optimization process.
We also compare to Adagrad.
For all baseline methods, we use the na\"ive gradient estimator described in Section~\ref{sec:background}.
When using Gaussian approximating distributions, this estimator corresponds to the one obtained when the entropy term is computed in closed form.
In the ablation study, we explore how our decisions affect the algorithm's performance. 

\begin{figure*}[t]
    \centering
    \includegraphics[width=\textwidth, trim=0 3mm 0 2mm, clip]{plots/nats_comparison.pdf}
    \caption{ELBO comparison of different methods on Stan models with dense Gaussian approximation.
      For each model, ELBOs are shifted so that the best method has value 100; methods more than 100 nats worse are not shown. Only SAA achieves robust performance across all models. For quasi-Newton, we report the best-performing sample size.}
    \label{fig:nats_comparison}
\end{figure*}

\subsection{Performance comparison}\label{sec:exp-comparison}


\subsubsection{Adam}\label{sec:adam}
Adam \citep{adam} is a standard default optimizer for BBVI.
The choice of step size matters less for Adam than for SGD, but it is still a factor to consider.
For each experiment (combination of model and approximating family) we ran Adam with three different step sizes (0.1, 0.01, and 0.001) and ran 20 repetitions of each combination.
At each iteration, we estimated the gradient of the ELBO by taking 16 samples from $q_\theta$. 
For each model and approximating family, we selected the step size in hindsight that provided the highest median ELBO across the 20 repetitions.
(See Appendix~\ref{appendix:adam} for more details on the Adam experiments.) 
For SAA for VI, we used the algorithm described in Section~\ref{subsec:algorithm} with the default parameter values of Table~\ref{table:hyperparameters} in the appendix.


We conducted two comparisons. 
First, we assessed the median ELBO, obtained across 20 repetitions, at the end of the optimization process using both Adam and SAA for VI.
We initially ran Adam for $40{,}000$ iterations, but found that more iterations were needed for some models and increased the number for models such as election88, electric, irt, madelon, and radon; see details in the appendix.

In the second comparison, we focused on the time required to reach a specified ELBO.
For each model and approximating distribution we identified as a ``benchmark ELBO'' the smaller of the two median ELBO values achieved by Adam and SAA, respectively, across their 20 repetitions. 
In other words, this ELBO value was achieved in at least half of the runs by both optimizers.
We then evaluated how long it took for each method to reach an ELBO value within 1 nat of the benchmark ELBO.
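
The time-to-target criterion above can be written compactly as follows. The symbols $\hat{\mathcal{L}}_m$, $b$, and $T_m$ are notation introduced here for illustration only, and how the times are aggregated across repetitions is left unspecified in this sketch. Let $\hat{\mathcal{L}}_m$ denote the median final ELBO of method $m \in \{\mathrm{Adam}, \mathrm{SAA}\}$ over its 20 repetitions, and let $\mathrm{ELBO}_m(t)$ be the ELBO attained by a run of method $m$ after wall-clock time $t$. Then
\[
  b = \min\bigl(\hat{\mathcal{L}}_{\mathrm{Adam}},\, \hat{\mathcal{L}}_{\mathrm{SAA}}\bigr),
  \qquad
  T_m = \min\bigl\{\, t \ge 0 \,:\, \mathrm{ELBO}_m(t) \ge b - 1 \,\bigr\},
\]
so that $b$ is the benchmark ELBO and $T_m$ is the time method $m$ needs to come within 1 nat of it.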


Figure~\ref{fig:elbo-adam-summary} and Table~\ref{table:comparison-adam-elbo-dense-covariance-stan-models} show summary results for Stan models and Bayesian logistic regression with dense Gaussian distributions; detailed ELBO comparisons and additional models appear in Table~\ref{table:comparison-adam-elbo} in the appendix.
See also Figure~\ref{fig:nats_comparison}, which compares the final ELBO values obtained by all methods evaluated.
Although Adam occasionally attains marginally higher median ELBO values for certain models (owing to the stopping criterion of SAA for VI), SAA for VI consistently achieves higher median ELBOs for complex models.
We noticed that Adam's performance was erratic for models such as election88Exp and that it tended to diverge, especially on the hepatitis model when optimized beyond $40{,}000$ iterations.
This divergence partially accounts for the pronounced disparity in median ELBOs between Adam and SAA for VI.
Adam might achieve higher ELBO values with a search over a finer step-size grid; however, it is exactly this kind of difficult and time-intensive tuning that we seek to avoid with SAA.
Table~\ref{table:ratio-time-adam} in the appendix lists the time each method takes to come within 1 nat of the benchmark ELBO, together with the ratio of the two times.
SAA for VI is almost always faster, often by factors of 10 to 100. 
For instance, optimizing the electric model using Adam takes about a minute, whereas SAA for VI accomplishes the same in under 2 seconds, making SAA more than 30 times faster.
Note that for Adam we counted only the compute time of the best-performing of the three step sizes; accounting for the full step-size search would make the comparison even more favorable for SAA for VI.
Since GPUs allow for vectorized multi-sample model evaluation, wall-clock time in seconds is the most meaningful metric for comparing the compute time of both methods.
Given these results, we confidently conclude that SAA for VI is a faster alternative to Adam in these scenarios.


Appendix~\ref{appendix:additional-adam-adagrad} provides additional results that explore the effect of different sample sizes (ranging from 1 to 256) for Adam, as well as a different optimizer (Adagrad, \citealt{duchi2011adaptive}; see also Figure~\ref{fig:nats_comparison}).
Across all settings, SAA for VI was consistently fast and robust compared to these alternatives.

Finally, to show that SAA for VI can also be effective in larger models, we learned an approximate posterior for a stochastic volatility model from \citet{chib2009multivariate}; see also \citep{naesseth2018variational,lai2022variational}. 
To make the task more challenging, we switched from monthly to daily data, which increases both the number of data points and the number of latent variables.
Since this model has $17{,}228$ latent variables, a dense covariance matrix would require on the order of $10^8$ parameters, making that approach impractical. However, with a diagonal covariance matrix, SAA for VI finds a solution in less than 30 seconds, while Adam takes up to 2 minutes. (See Figure~\ref{fig:comparison-adam-elbo-diagonal-covariance-stochastic-volatility} in the appendix.)
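
As a rough check on these sizes (our own arithmetic, assuming the mean-plus-lower-triangular-factor parameterization described at the start of this section), with $d_Z = 17{,}228$ the two families have
\[
  \underbrace{d_Z + \frac{d_Z(d_Z+1)}{2}}_{\mathrm{dense}} = 148{,}427{,}834
  \qquad\mbox{and}\qquad
  \underbrace{2\,d_Z}_{\mathrm{diagonal}} = 34{,}456
\]
free parameters, respectively; storing $L$ as a full $d_Z \times d_Z$ array would instead require $d_Z^2 = 296{,}803{,}984$ entries.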







\begin{table}[ht!]
  \renewcommand{\arraystretch}{1.2}
  \begin{center}
    {
      \begin{tabular}{@{}lrrr@{}}
        \toprule
         & \multicolumn{3}{c}{Dense Covariance} \\
        \cmidrule{2-4}
        {} & \multicolumn{1}{c}{SAA for VI} & \multicolumn{1}{c}{Adam} & \multicolumn{1}{c}{Impr.} \\
        %  {} & \multicolumn{1}{r}{(i)}  &  \multicolumn{1}{r}{(ii)} & \multicolumn{1}{r}{$\text{(i)}-\text{(ii)}$} \\
        \midrule
        \textbf{Stan models}\\
         \hspace{0.2em}congress & 423.55 & 423.58 & -0.03 \\
         \hspace{0.2em}election88 & -1,398.03 & -1,645.18 & 247.15 \\
         \hspace{0.2em}election88Exp & -1,381.79 & --- & --- \\
         \hspace{0.2em}electric & -786.91 & -859.26 & 72.35 \\
         \hspace{0.2em}electric-one-pred & -818.01 & -818.00 & -0.01 \\
         \hspace{0.2em}hepatitis & -557.36 & -618.76 & 61.40 \\
         \hspace{0.2em}hiv-chr & -582.78 & --- & --- \\
         \hspace{0.2em}irt & -15,884.67 & -15,936.06 & 51.39 \\
         \hspace{0.2em}mesquite & -29.83 & -29.78 & -0.05 \\
         \hspace{0.2em}radon & -1,209.46 & -1,216.92 & 7.46 \\
         \hspace{0.2em}wells & -2,041.95 & -2,041.90 & -0.05 \\
        \bottomrule
       \end{tabular}      
    }
    \caption{\textbf{ELBO} of SAA for VI and Adam for Stan models using a dense covariance matrix, highlighting the \emph{improvement} (Impr.) in ELBO of SAA for VI over Adam, computed as the SAA for VI value minus the Adam value. Various step sizes were explored for Adam, and the best results are reported. For additional datasets and approximating distributions, see Appendix~\ref{appendix:adam}.}
  \label{table:comparison-adam-elbo-dense-covariance-stan-models}
  \end{center}
  \vspace{-2em}
\end{table}






\subsubsection{Batched quasi-Newton}
\label{sec:exp-bqn}
As noted in Section~\ref{sec:related-work}, our method differs from the batched quasi-Newton approach by \citet{liu2021quasi}, which also incorporates second-order information into VI.
We now empirically show the impact of these differences, specifically the use of a sequence of sample average approximations with an increasing number of samples.
We implemented the batched quasi-Newton method in PyTorch (without quasi-Monte Carlo sampling) and ran 20 independent runs of 40,000 iterations in each experiment.
We started with a sample size of 16, then repeatedly doubled the number of samples up to a maximum of 128 for models where the method encountered difficulties. 
We set the update frequency $B$ (see Section~\ref{sec:related-work}) to $20$ as recommended in the original paper.

With diagonal-covariance Gaussians, the batched quasi-Newton method shows performance on par with SAA for VI (Table~\ref{table:batched-quasi-Newton-diagonal} in the appendix).
However, it struggles significantly with dense Gaussians and fails to find good solutions for many models, as shown in Figure~\ref{fig:nats_comparison} and Table~\ref{table:batched-quasi-newton} in the appendix.
%displays the median final ELBO across runs for various models. 
The batched quasi-Newton method reaches optimal performance for most Bayesian logistic regression models but faces difficulties with models from the Stan example library.
Even with a sample size of 128, a significantly larger value than commonly employed with SGD, the method still falls short of the best ELBO values achieved by other methods.
Additionally, the wall-clock time of the batched quasi-Newton method is often similar to or longer than that of SAA for VI (Table~\ref{table:runtime-comparison-b-quasi-newton} in the appendix).

\subsection{Ablation study}
\paragraph{Impact of warm start.}
The optimization process requires a decision on whether to use warm start or draw fresh parameters for each iteration. 
Once the inner optimization process $\opt(\cdot)$ converges to parameters $\theta_t^*$, it may still be necessary to increase the sample size and run it more times, as described in Section~\ref{subsec:algorithm}.
Should we initialize the parameters with $\theta_t^*$ or instead draw a new set of parameters?


\citet{pasupathy2010choosing} provides an intuition for why a warm start is helpful: in principle, the optimization process for a larger sample size begins from a point that is probably close to a solution.
%However, we wanted to empirically verify this intuition. 
To empirically verify this intuition, we conducted an experiment to compare the performance of warm start and drawing fresh parameters across different models and approximating distributions. 
For each combination of model and approximating distribution, we ran the sequence of SAA problems until convergence, either using a warm start or sampling new parameters at the beginning of each inner optimization.
Specifically, at each iteration $t$, we initialized the process either with the previously computed optimal parameters $\theta_{t-1}^*$ (warm start) or by drawing a new random set of parameters (fresh start).
Our results, presented in Table~\ref{table:ratio-time-refresh-Q} in the appendix, show that using warm start results in a significant reduction in the total time taken to converge.
For example, on the election88 model, a fresh start takes $20{\times}$ longer than a warm start.
