\section{Related work}
\label{sec:related-work}

Several prior works incorporate second-order information into stochastic optimization, and some of these methods have been applied to VI.
\citet{byrd2016stochastic} introduced a method that employs the L-BFGS update formula through subsampled Hessian-vector products, referred to as batched-L-BFGS or batched quasi-Newton. 
\citet{liu2021quasi} applied the method from \citet{byrd2016stochastic} to address the variational inference problem, with the optional inclusion of quasi-Monte Carlo (QMC) sampling to further decrease the variance of the gradient estimator.
Both approaches involve a two-step algorithm: (1) updating the parameters at each iteration using L-BFGS's two-loop recursion, and (2) updating the displacement vector $\col{s}$ and gradient difference vector $\col y$ of L-BFGS every $B$ steps by employing the average of the parameters from the preceding $B$ iterations. 
In the work of \citet{liu2021quasi}, each iteration involves drawing a fixed-size sample of noise $\epsilon$ from $q_{\mathrm{base}}$ to estimate the ELBO gradient and conduct the line search. 
The sample size is not extensively discussed in their work; however, the experiments were conducted with sample sizes of $128$ or $256$. 
These values are larger than those typically used in the literature, suggesting that the sample size could indeed be a relevant factor to consider.
Our method deviates from the approach proposed by \citet{liu2021quasi} in two key ways. 
First, we run a complete deterministic optimization using a fixed set of noise samples, so that the objective being optimized is no longer stochastic.
Second, we build the choice of sample size into the algorithm itself, reducing the need for user input.
As we demonstrate in Section~\ref{sec:exp-bqn}, these differences lead to significant improvements when handling complex target and approximating distributions.


An alternative approach to incorporating second-order information into the variational inference problem can be found in the work of \citet{zhang2022pathfinder}.
Their method employs L-BFGS to identify modes or poles of the posterior distribution.
Subsequently, the data generated by L-BFGS is utilized to estimate the posterior covariance around the mode, which is then used to parameterize an approximating distribution.
This approach more closely resembles the Laplace approximation than methods that seek approximations to a global optimizer of the ELBO from a fixed parametric family.

% MAYBE WE CAN SHORTEN THIS PARAGRAPH: We share a common goal with \citet{welandawe2022robust}, who develop a system for variational inference that requires minimal user input.
We share a common goal with \citet{welandawe2022robust}, who also drew inspiration from \citet{agrawal2020advances} to develop a system for variational inference that requires minimal user input.
However, their method employs SGD for optimizing the ELBO and uses a heuristic schedule to update the step size $\gamma_t$ during the optimization process.
They initially use a fixed step size and incorporate tools to detect when the SGD process reaches stationarity, at which point they decrease the step size by a factor $\rho$.
During the stationary regime, they average the parameters and take this average as the optimal parameters $\theta^*_{\gamma_t}$ for the given step size $\gamma_t$.
They repeat the process of decreasing the step size until the symmetrized KL divergence between the current distribution and the optimal distribution $q_*$ (for the approximating family) falls below a threshold $\xi$. 
Notably, since the optimal distribution $q_*$ is not known, the authors estimate the divergence between $q_*$ and the current distribution $q_{\theta^*_{\gamma_t}}$.
The authors observe that averaging the parameters in the stationary regime significantly improves the approximation quality compared to using the parameters from any single iteration.


In the machine learning literature, the application of sample average approximation has been relatively rare.
Some early works include \textsc{Pegasus} by \citet{ng2000pegasus}, in which the authors addressed partially observable Markov decision processes by replacing the \emph{value of a policy} (an expectation) with the sample average of the value function applied to a finite number of states for optimization purposes.
In a different context, \citet{sheldon2010maximizing} explicitly utilized the sample average approximation technique in a network design setting, where a na\"ive greedy approach was not applicable. 
More recently, \citet{balandat2020botorch} adopted sample average approximation to optimize the acquisition function in Bayesian optimization.
SAA was previously used for VI in a specialized capacity in several papers~\citep{giordano2018covariances,domke2018importance,giordano2019swiss,domke2019divide,giordano2022evaluating}; our work and the concurrent work of \citet{giordano2023black} are the first to explore its general applicability.


As mentioned in the introduction, \citet{giordano2023black} concurrently and independently developed a method based on the sample average approximation for black-box variational inference.
The two papers employ the same basic algorithmic idea but have several differences in scope.
Unlike \citet{giordano2023black}, we focus substantially on the case where SAA with a fixed sample size has significant error and therefore one needs to solve a sequence of problems with increasing sample sizes. 
We introduce heuristics that guide the selection of sample sizes and the decision of when to halt the process.
On the other hand, \citet{giordano2023black} exploit the determinism of the SAA problem to develop techniques based on sensitivity analysis and the theory of ``linear response covariances''~\citep{giordano2015linear, giordano2018covariances} to improve posterior covariance estimates of black-box VI and to estimate the Monte Carlo error of the SAA procedure, which are outside the scope of our work.
%On the other hand, \citet{giordano2023black} present an alternative approach to ours, which is based on fresh noise, by using a formula that estimates the Monte Carlo error of the objective based on an inverse Hessian approximation. 
%They also develop techniques to correct the covariance estimate of the diagonal Gaussian approximations using the same inverse Hessian estimation and ``linear response covariances'' \citep{giordano2015linear, giordano2018covariances}, which is outside the scope of our work.
They present a theoretical result indicating a failure mode for SAA when the number of samples is too few compared to the dimension of the latent variables.
Specifically, for a Gaussian approximation with a dense covariance matrix, the sample size $n$ must be at least equal to the dimension $d_Z$ of the latent space for the SAA problem to be bounded.
Interestingly, although they conclude that this limitation prevents the use of SAA for VI with a dense Gaussian approximation, we show in the \hyperref[sec:experiments]{experiments section} that, for interesting models, it is indeed feasible.
Two reasons for this discrepancy are: (1) our SAA sample sizes can reach up to \(2^{18}\), unlike their usual size of 30, and (2) our largest model has 501 latent variables, whereas theirs have up to 15K.
Thus, their theoretical result provides useful guidance on the limitations of SAA for VI, while our empirical work shows that SAA for VI can be practical up to quite large sample sizes.
In Appendix~\ref{sec:addendum}, we provide an addendum that uses their theoretical result to improve our method: when using a dense approximation, the sequence of SAA problems should begin with a sample size larger than \(d_Z\); this makes SAA for VI even faster by avoiding wasted effort for small sample sizes.
