
\section{Methods}
\subsection{Sample Average Approximation}
The problem of ELBO maximization in the reparameterization setting of Eq.~\eqref{eq:elbo-reparam} is formulated as an SOP where the stochasticity comes from a fixed probability distribution, i.e., a probability distribution which does not depend on $\theta$.
Furthermore, the function inside the expectation is a smooth function of the parameters $\theta$.
Solutions to these problems can be approximated using the \emph{sample average approximation} (SAA): a sample average over a \emph{fixed sample} replaces the expectation, effectively transforming the SOP into a deterministic optimization problem.

We propose to use SAA for black-box VI.
To use SAA, we take $n$ \iid\ samples $\boldsymbol{\epsilon} = \epsilon_1, \dots, \epsilon_n$ from the distribution $q_{\mathrm{base}}$ and define the deterministic \textsl{training objective} function
\begin{equation*}
  \hat{\L}_{\boldsymbol{\epsilon}}\colon \theta \mapsto \frac{1}{n}\sum^n_{i=1}[\ln p(z_\theta(\epsilon_i), x)-\ln q_\theta(z_\theta(\epsilon_i))],
\end{equation*}
which is a function of $\theta$ alone. 

Then, the optimization problem in Eq.~\eqref{eq:elbo-reparam} can be transformed into a deterministic optimization problem
\begin{align}\label{eq:elbo-SAA}
  \max_{\theta\in \Theta}\,\hat{\L}_{\boldsymbol{\epsilon}}(\theta) &= \max_{\theta \in \Theta}\, \frac{1}{n}\sum^n_{i=1}[\ln p(z_\theta(\epsilon_i), x)-\ln q_\theta(z_\theta(\epsilon_i))] \nonumber\\
  &= \max_{\theta\in \Theta} \frac{1}{n}\sum_{i=1}^{n}v_{\theta}(\epsilon_i),
\end{align}
where $v_{\theta}(\epsilon_i) = \ln p(z_{\theta}(\epsilon_i), x)-\ln q_{\theta}(z_{\theta}(\epsilon_i))$ denote the \textsl{log-weights}, also known as \textsl{log-importance ratios}. 
Since the optimization is performed with the fixed sample $\boldsymbol{\epsilon}$, we refer to $\boldsymbol{\epsilon}$ as the \textsl{training noise}.
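
As a concrete illustration (an assumed example, not a restriction of the method), consider a one-dimensional Gaussian family with $\theta = (\mu, \sigma)$, $\sigma > 0$, $q_{\mathrm{base}} = \mathcal{N}(0, 1)$, and $z_\theta(\epsilon) = \mu + \sigma\epsilon$.
Then $\ln q_\theta(z_\theta(\epsilon)) = -\ln\sigma - \tfrac{1}{2}\epsilon^2 - \tfrac{1}{2}\ln(2\pi)$, and the training objective reduces to
\begin{equation*}
  \hat{\L}_{\boldsymbol{\epsilon}}(\mu, \sigma) = \frac{1}{n}\sum_{i=1}^{n}\ln p(\mu + \sigma\epsilon_i, x) + \ln\sigma + C_{\boldsymbol{\epsilon}},
  \qquad
  C_{\boldsymbol{\epsilon}} = \frac{1}{2}\ln(2\pi) + \frac{1}{2n}\sum_{i=1}^{n}\epsilon_i^2,
\end{equation*}
where $C_{\boldsymbol{\epsilon}}$ does not depend on $\theta$.
Once $\boldsymbol{\epsilon}$ has been drawn, every term is an ordinary deterministic function of $(\mu, \sigma)$.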

We want to recover the maximizer $\theta^*_{\boldsymbol{\epsilon}}$ of $\hat{\L}_{\boldsymbol{\epsilon}}$.
In an unconstrained smooth optimization setting, we need to specify how to compute a search direction and a step size.
For the search direction, we will use L-BFGS \citep{broyden1970convergence, fletcher2013practical, goldfarb1970family, shanno1970conditioning, nocedal1980updating}.

In contrast to the SGD setting, deterministic optimization allows us to choose the step size by line search and to require that it satisfy the \emph{strong Wolfe conditions} \citep{nocedal1999numerical}. Specifically, for $0 < c_1 < c_2 <1$ and a search direction $\col r$, the step size $\gamma$ must simultaneously satisfy the modified curvature (MC) and sufficient increase (SI) conditions, that is,
\begin{align*}
  \big\lvert\nabla\hat{\L}_{\boldsymbol{\epsilon}}\T{(\theta + \gamma \col r)} \col r\big\rvert &\leq c_2 \big\lvert\nabla \hat{\L}_{\boldsymbol{\epsilon}}\T{(\theta)}\col r\big\rvert,\qquad\text{and,} \tag*{(MC)}\\
  \hat{\L}_{\boldsymbol{\epsilon}}(\theta + \gamma \col r) &\geq \hat{\L}_{\boldsymbol{\epsilon}}(\theta) +c_1 \gamma \nabla \hat{\L}_{\boldsymbol{\epsilon}}\T{(\theta)}\col r.\tag*{(SI)}
\end{align*}
We will use L-BFGS with line search to find a local optimum of Eq.~\eqref{eq:elbo-SAA}, and denote the process that does so by $\opt(\theta, n, \boldsymbol{\epsilon}, \tau)$.
Here, $\theta$ is the initial value of the parameters, $n$ and $\boldsymbol{\epsilon}$ are the sample size and training noise that define $\hat{\L}_{\boldsymbol{\epsilon}}$, and $\tau$ is the maximum number of iterations for which L-BFGS will run.
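
For reference, a generic form of one such iteration can be written as follows; the iteration index $k$ and the matrix $H_k$ are notation introduced here for illustration.
Since quasi-Newton methods are usually stated for minimization, we assume that L-BFGS is applied to $-\hat{\L}_{\boldsymbol{\epsilon}}$ and that $H_k$ is a positive-definite approximation of the inverse Hessian of $-\hat{\L}_{\boldsymbol{\epsilon}}$ built from recent parameter and gradient differences:
\begin{equation*}
  \col r_k = H_k \nabla\hat{\L}_{\boldsymbol{\epsilon}}(\theta_k),
  \qquad
  \theta_{k+1} = \theta_k + \gamma_k \col r_k,
\end{equation*}
where $\gamma_k$ is chosen by line search to satisfy (MC) and (SI).
Positive definiteness of $H_k$ gives $\nabla\hat{\L}_{\boldsymbol{\epsilon}}\T{(\theta_k)}\col r_k > 0$ whenever the gradient is nonzero, so $\col r_k$ is an ascent direction.
Because the noise is fixed, the gradient $\nabla\hat{\L}_{\boldsymbol{\epsilon}}(\theta_k) = \frac{1}{n}\sum_{i=1}^{n}\nabla_\theta v_\theta(\epsilon_i)\big\rvert_{\theta = \theta_k}$ is computed exactly rather than estimated.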





\paragraph{Sandwiching the optimal {ELBO}}
Critically, the training objective $\hat{\L}_{\boldsymbol{\epsilon}}(\theta)$ and the ELBO $\L(\theta)$ may differ for a fixed $\theta$.
The ELBO, as defined in Eq.~\eqref{eq:elbo}, is an expectation over the distribution $q_{\theta}$, while the training objective is computed based on an average over a fixed sample $\boldsymbol{\epsilon}$.
In contrast, the optimal ELBO refers to the value of the ELBO achieved by the maximizer of Eq.~\eqref{eq:max-elbo}, denoted as $\theta^*$, and depends only on the target distribution and the approximating family.

During optimization with a fixed sample of training noise $\boldsymbol{\epsilon}_n = \epsilon_1, \dots, \epsilon_n$, one might wonder how much the learned parameters $\theta^*_{\boldsymbol{\epsilon}_n}$ and the distribution $q_{\theta^*_{\boldsymbol{\epsilon}_n}}$ depend on these noise samples and, in particular, how this dependence affects the gap between the ELBO $\L(\theta^*_{\boldsymbol{\epsilon}_n})$ of the learned approximation and its upper bound, the optimal ELBO $\L(\theta^*)$.
Fortunately, two results by \citet{mak1999monte} address this question.
Note that until the noise variables $\epsilon_1, \dots, \epsilon_n$ are realized, the quantity $\theta^*_{\boldsymbol{\epsilon}_{n}}$ and all functions of it are random.
Let $\hat{\boldsymbol{\epsilon}}_{n+1} = \hat\epsilon_1, \dots, \hat\epsilon_{n+1}$ be a sample of size $n+1$ taken \iid\ from $q_{\mathrm{base}}$.
Assuming the deterministic optimization with fixed noise converges to a global optimum, it holds that: (i) the ELBO and training objective sandwich the optimal ELBO (in expectation), that is, $\L(\theta^*_{\boldsymbol{\epsilon}_{n}}) \leq \L(\theta^*) \leq \E\hat{\L}_{\boldsymbol{\epsilon}_{n}}(\theta^*_{\boldsymbol{\epsilon}_{n}})$; and (ii) the training objective converges monotonically to the optimal ELBO from above (in expectation), that is, $\E\hat{\L}_{\hat{\boldsymbol{\epsilon}}_{n+1}}(\theta^*_{\hat{\boldsymbol{\epsilon}}_{n+1}}) \leq \E\hat{\L}_{\boldsymbol{\epsilon}_{n}}(\theta^*_{\boldsymbol{\epsilon}_{n}})$.
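
A brief sketch of the standard argument behind (i), under the same global-optimum assumption, is as follows.
The lower bound holds for every realization of the noise because $\theta^*$ maximizes $\L$ over $\Theta$.
For the upper bound, $\hat{\L}_{\boldsymbol{\epsilon}_{n}}(\theta)$ is an unbiased estimate of $\L(\theta)$ for each fixed $\theta$, and the expectation of a maximum is at least the maximum of the expectations, so
\begin{equation*}
  \E\hat{\L}_{\boldsymbol{\epsilon}_{n}}(\theta^*_{\boldsymbol{\epsilon}_{n}})
  = \E\Big[\max_{\theta\in\Theta}\hat{\L}_{\boldsymbol{\epsilon}_{n}}(\theta)\Big]
  \geq \max_{\theta\in\Theta}\E\hat{\L}_{\boldsymbol{\epsilon}_{n}}(\theta)
  = \max_{\theta\in\Theta}\L(\theta)
  = \L(\theta^*).
\end{equation*}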


\begin{figure}[t]
  \centering
 \includegraphics[width=\columnwidth, trim={0.7cm .8cm 0.6cm 0.5cm},clip]{plots/violin_plus.pdf}
 \caption{Distribution of log-weights as a function of optimization sample size $n$ (\texttt{mushrooms} dataset).
 The violin plot shows the distributions, with overlaid lines indicating means for both fresh and training samples.
 These means provide estimates of the \textcolor{seaborn-0}{ELBO} and the \textcolor{seaborn-1}{training objective}.}
  \label{fig:log-weights-distribution}
\end{figure}

In particular, these results mean that we can use standard statistical techniques to quantify the discrepancy between the ELBO $\L(\theta^*_{\boldsymbol{\epsilon}_{n}})$ and the training objective $\hat{\L}_{\boldsymbol{\epsilon}_{n}}(\theta^*_{\boldsymbol{\epsilon}_{n}})$.
We do so by comparing the distribution of the log-weights $v_1, \dots, v_n$ computed on the training noise with the distribution computed on a fresh sample of noise, referred to as \textsl{testing noise}, a technique first used by \citet{mak1999monte}.
Figure~\ref{fig:log-weights-distribution} displays the distribution of log-weights for a growing sample size.
As the number of samples increases, the training objective decreases toward the ELBO estimate, which in turn increases.
This indicates progress toward the ultimate goal of ELBO maximization and a tightening of the gap around the optimal ELBO.
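
One way to make this comparison concrete is sketched below; the test sample size $m$, the testing noise $\tilde\epsilon_1, \dots, \tilde\epsilon_m$, and the symbols $\hat{\L}_{\mathrm{test}}$, $\hat{G}$, and $s_m$ are notation introduced here for illustration.
With the training noise held fixed and the testing noise drawn \iid\ from $q_{\mathrm{base}}$, define
\begin{equation*}
  \hat{\L}_{\mathrm{test}} = \frac{1}{m}\sum_{j=1}^{m} v_{\theta^*_{\boldsymbol{\epsilon}_{n}}}(\tilde\epsilon_j),
  \qquad
  \hat{G} = \hat{\L}_{\boldsymbol{\epsilon}_{n}}(\theta^*_{\boldsymbol{\epsilon}_{n}}) - \hat{\L}_{\mathrm{test}},
\end{equation*}
so that $\hat{\L}_{\mathrm{test}}$ is an unbiased estimate of $\L(\theta^*_{\boldsymbol{\epsilon}_{n}})$ and $\hat{G}$ estimates the gap between the training objective and the ELBO.
If the log-weights have finite variance, the standard error of $\hat{\L}_{\mathrm{test}}$ is approximately $s_m/\sqrt{m}$, where $s_m$ is the sample standard deviation of the test log-weights, which gives a rough normal-approximation interval for the ELBO.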

We adopt the classical approach of tightening this gap by solving a sequence of SAA approximations for an increasing sequence of sample sizes $\seq[t]{n_t} \subseteq \N$, which creates a sequence of solutions $\seq[t]{\theta_{n_t}^*}$.
\citet{shapiro2003monte} give general conditions for the set of optimal solutions (or critical points) of SAA problems to converge to the corresponding set for the original stochastic optimization problem.
The conditions include uniform convergence of the SAA objective functions and compactness of the solution set (see also \citealt{kim2015guide}).
While these conditions could likely be applied to VI problems, verifying them, especially compactness of the solution set, would be problem specific and depend, for example, on the particular parameterization of the variational distribution, so we do not explore them further.
