\section{Introduction}

Developing robust inference methods that perform well across a wide range of real models is a long-standing research direction.
This is of immense practical interest in fields like astrophysics, epidemiology, political science, psychology, ecology, and others, where a scientist supplies a model and data, and the goal is to recover the posterior distribution of latent variables.
However, inference is extremely challenging in general and formally intractable except for restricted cases, so approximations are needed.

Variational inference (VI) is one of the main approximate inference approaches. 
It poses inference as an optimization problem to find a distribution from a specified family that is as close as possible to the posterior by maximizing the evidence lower bound (ELBO) \citep{wainwright2008graphical, jaakkola1997variational, beal2003variational}, or, equivalently, minimizing the KL-divergence to the posterior.
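As a brief illustration, in notation that we assume here only for concreteness (observed data $x$, latent variables $z$, joint density $p(z, x)$, and a variational distribution $q$ from the chosen family), the equivalence follows from the identity
\[
  \log p(x) = \underbrace{\mathrm{E}_{q(z)}\bigl[\log p(z, x) - \log q(z)\bigr]}_{\mathrm{ELBO}(q)} + \mathrm{KL}\bigl(q(z) \,\|\, p(z \mid x)\bigr).
\]
Because $\log p(x)$ does not depend on $q$, increasing the ELBO by some amount decreases the KL-divergence to the posterior by the same amount.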

In the quest to make VI broadly applicable and ``automatic'', recent work has focused on ``black box'' variational inference (BBVI)~\citep{ranganath2014black,titsias2014doubly, kucukelbir2017automatic,pmlr-v80-yin18b, hoffman2020black, buchholz2018quasi}.
BBVI performs ELBO maximization using only ``black box'' access to the model in the form of evaluations of the log joint density or the gradient thereof. 
This allows VI to be applied to a wide range of models, especially when paired with recent modeling frameworks such as Stan~\citep{carpenter2017stan} that make it easy for users to specify models that are converted to routines for log-densities and gradients.

To achieve this generality, BBVI treats ELBO maximization as a stochastic optimization problem, which it solves via stochastic gradient descent (SGD)~\citep{wingate2013automated, blei2017variational, kucukelbir2017automatic, ranganath2014black, rezende2014stochastic, kingma2013auto} or a variant such as Adam \citep{adam} or AdaGrad \citep{duchi2011adaptive}. 
%with an unbiased gradient estimator. 
However, in practice, the difficulty of solving this stochastic optimization problem reliably and robustly has severely limited the applicability of BBVI~\citep{agrawal2020advances, welandawe2022robust}. A particular challenge is selecting step size sequences that both allow rapid progress and avoid convergence to suboptimal solutions.
%\footnote{It's worth noting that the use of stochastic gradient methods in BBVI is qualitatively very different from their use standard supervised deep learning. In BBVI, the objective is intrinsically stochastic, we can only obtain an unbiased estimate via sampling from the variational distribution, and the properties of the gradient distribution are largely unknown; in supervised learning, the true objective is deterministic and stochasticity is \emph{introduced} through data subsampling to accelerate learning process. Qualitatively, the practical issues can be quite different across these settings.}
This motivates the consideration of alternative stochastic optimization methods that can perform more reliably for BBVI problems.

In this paper, we propose an alternative optimization approach for BBVI based on the sample average approximation (SAA) \citep{Healy, robinson1996analysis, shapiro1996convergence, saa_Shapiro, kim2015guide}.
A key feature of SAA is that it draws a fixed random sample and then solves a \emph{deterministic} optimization problem. This enables tools such as line search and second-order optimization, which are traditionally unavailable for BBVI but can substantially improve performance.
We focus on the application of quasi-Newton methods with line search to BBVI with Gaussian approximating families.
This is well suited to problems with up to several hundred latent variables, which covers a very large number of applied statistical models such as those that appear in the Stan model library, many of which remain very challenging for BBVI.
Quasi-Newton SAA can also scale to much larger models when using diagonal Gaussian approximating families.
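As a sketch of this idea, in notation that we assume here only for illustration, suppose a draw from the approximating distribution $q_\theta$ can be written as $z = T_\theta(\epsilon)$ with $\epsilon \sim q_0$ for a fixed base distribution $q_0$ (for a Gaussian family, for example, $T_\theta(\epsilon) = \mu + L\epsilon$ with $\theta = (\mu, L)$ and $q_0$ a standard normal).
SAA draws $\epsilon_1, \dots, \epsilon_n \sim q_0$ once and then maximizes
\[
  \hat{\mathcal{L}}_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} \bigl[ \log p(T_\theta(\epsilon_i), x) - \log q_\theta(T_\theta(\epsilon_i)) \bigr],
\]
which is a deterministic function of $\theta$ because the $\epsilon_i$ are held fixed, whereas SGD-based BBVI draws fresh samples at every iteration.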


\begin{figure}[t]
  \centering
 \includegraphics[width=\linewidth, trim={0cm 0cm 0cm 0cm},clip]{plots/elbo-improvement.pdf}
%  \fbox{\includegraphics[width=\linewidth, trim={.7cm 0cm 0cm 0cm},clip]{plots/electric-new.pdf}}
 \includegraphics[width=\linewidth, trim={.25cm 0cm 0cm 0cm},clip]{plots/electric-new.pdf}
 \caption[Top: ELBO improvement (nats) vs. running-time improvement of \textcolor{seaborn-3}{SAA for VI} compared to \textcolor{seaborn-2}{Adam}.
 Bottom: Optimization traces for the ``electric'' model.]%
 {
  Top: ELBO improvement (nats) vs. running-time improvement (number of times faster) for \textcolor{seaborn-3}{SAA for VI} compared to \textcolor{seaborn-2}{Adam}, across 9 Stan models and 6 Bayesian logistic regression models using a dense-covariance Gaussian distribution. The bordered point \tikz[baseline=-0.75ex]\draw[black, thin, fill=orange] (0,0) circle (2.5pt); indicates that the models ``australian'' and ``ionosphere'' share the same coordinates.
  Bottom: Optimization traces for the ``electric'' model. 
  See Section~\ref{sec:experiments} and Appendix~\ref{appendix:adam} for details.
  }
  \label{fig:elbo-adam-summary}
\end{figure}


Figure~\ref{fig:elbo-adam-summary} illustrates the speed and accuracy benefits of SAA compared to Adam when approximating the posterior distributions of 9 real Stan models and 6 Bayesian logistic regression models (see Table~\ref{table:dataset-description}) using Gaussian distributions with dense covariance matrices.
SAA is always comparable to or better than Adam in terms of solution quality, and, for 10 out of 15 models, either achieves a much better solution or achieves a comparable solution much faster.
Notably, nearly a third of the models are failure cases for Adam, where SAA finds a solution that is hundreds of nats better. 

To achieve this robustness, we design the SAA for VI algorithm, which applies SAA to BBVI in an efficient and automatic way whenever the approximating family is reparameterizable.
To address the Monte Carlo error introduced by using a fixed random sample within SAA, we adapt techniques from the SAA literature to solve a sequence of problems with increasing sample sizes until a stopping criterion is reached~\citep{chen2001stochastic}.
We also develop a custom stopping criterion for BBVI, as well as default schedules for sample sizes and optimization tolerances, to achieve robust out-of-the-box performance.
SAA for VI also leverages the GPU-friendly nature of the SAA objective to increase optimization efficiency. 

Our empirical results demonstrate that, on our benchmark, SAA for VI is competitive with state-of-the-art BBVI optimization methods, including first-order methods (Adam and AdaGrad) and a prior second-order stochastic optimization algorithm for BBVI~\citep{liu2021quasi}, while simplifying the variational inference process.

Concurrently with our work, \citet{giordano2023black} proposed a sample average approximation algorithm for variational inference, motivated by the same challenges of stochastic gradient methods that limit the robustness and broad applicability of BBVI. 
We discuss the relationship between our method and theirs in Section~\ref{sec:related-work}.
