\subsection{Algorithm}\label{subsec:algorithm}
We present an algorithm that uses SAA to approximate the solution to the optimization problem of maximizing the ELBO. 
Our objective is to find a good approximation to the solution at a reasonable computational cost and to reduce the gap between the ELBO and the training objective after optimization, as described above.
To this end, we build our stopping criteria based on comparing distributions of log-weights.
The algorithm, described in Algorithm~\ref{alg:main}, consists of two procedures: the optimizer $\opt$ and the convergence checker. 
The optimizer, described previously, is a quasi-Newton method.
The convergence checker is a function that determines whether the optimization process should continue; we describe it later in this section.

The algorithm is initialized with an initial guess $\theta_0$, a sample size $n_0$, and a maximum number of optimizer iterations $\tau_0$.
In each iteration $t$, we double the sample size to tighten the gap between the training objective and the ELBO.
We draw training noise $\boldsymbol{\epsilon}_{n_t} = \epsilon_1, \dots, \epsilon_{n_t}$ from the base distribution $q_{\mathrm{base}}$ and then use the optimizer to find the maximizer $\theta_{t}^*$ of the deterministic objective $\hat{\L}_{\boldsymbol{\epsilon}_{n_t}}$.
If the optimizer reaches the iteration limit $\tau_t$, we double the limit for subsequent iterations.
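
As a compact summary of this schedule, and assuming (as in Algorithm~\ref{alg:main}) that the sample size is doubled before each call to the optimizer, iteration $t \geq 1$ uses
\[
  n_t = 2^{t} n_0,
  \qquad
  \hat{\L}_{\boldsymbol{\epsilon}_{n_t}}(\theta) = \frac{1}{n_t} \sum_{i=1}^{n_t} v_{\theta}(\epsilon_i),
  \qquad
  \tau_{t+1} =
  \begin{cases}
    2\tau_t & \text{if } \eta_t = \tau_t, \\
    \tau_t & \text{otherwise,}
  \end{cases}
\]
where $v_{\theta}(\epsilon_i)$ denotes the log-weight of sample $\epsilon_i$, $\tau_t$ is the iteration limit in force at iteration $t$ (so that $\tau_1 = \tau_0$), and $\eta_t$ (a symbol introduced here only for illustration) is the number of iterations the optimizer used at iteration $t$.
The middle expression assumes that the training objective is the sample mean of the log-weights, consistent with the quantity $\mathrm{obj}$ in Algorithm~\ref{alg:converged}.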

When the optimizer $\opt$ finishes in a small number of iterations, the parameters may remain almost unchanged, resulting in nearly identical log-weights.
Consequently, a convergence test based on these log-weights may be uninformative.
Although such behavior could signal convergence, it might also be due to chance.
To address this uncertainty, we require a minimum of \texttt{VERY\_SMALL\_ITER} optimizer iterations before testing for convergence.
However, if the optimizer finishes in fewer than this number of iterations for three consecutive sample sizes, we stop the process.



\begin{algorithm}[h]
  \centering
  \caption{SAA for VI}
  \label{alg:main}
  \begin{algorithmic}[1]
  \renewcommand{\baselinestretch}{1.1}\selectfont
  \State \textbf{Input:} $\theta$, $n$, $\tau$ \hfill \textbf{Output:} parameters $\theta^*$
  % \State \textbf{Output:} $\theta^*$
  \State $t \gets 0$, $\mathrm{count} \gets 0$
  \While{$\mathrm{count} < 3$}
  \State $t \gets t + 1$, $n \gets 2n$
  \State $\boldsymbol{\epsilon}_{n} \gets \epsilon_1, \dots, \epsilon_{n}$, \hfill $\epsilon_i \sim q_{\mathrm{base}}$
  \State $\theta \gets \opt(\theta, n, \boldsymbol{\epsilon}_{n}, \tau)$
  \State{$\eta \gets $ number of iterations used by the optimizer}
  \If {$\eta = \tau$}
  \State $\tau \gets 2\tau$
  \EndIf
  \If {$\eta < \mathrm{VERY\_SMALL\_ITER}$}
  \State $\mathrm{count} \gets \mathrm{count} + 1$
  \Else
  \State $\mathrm{count} \gets 0$
  \EndIf
  \If {$\mathrm{count} = 0$ \textbf{and} converged?$(\theta, \boldsymbol{\epsilon}_{n}, t)$}
  \State \textbf{break}
  \EndIf
  \EndWhile
  \State\Return $\theta^* \gets \theta$
  \end{algorithmic}    
  \end{algorithm}

  \begin{algorithm}[ht]
    \centering
    \caption{converged?}
    \label{alg:converged}
    \begin{algorithmic}[1]
      \renewcommand{\baselinestretch}{1.1}\selectfont
      \State \textbf{Input:} $\theta$, $\boldsymbol{\epsilon}_{n}$, $t$ \hfill \textbf{Output:} $\mathrm{True}$ if converged
      % \State \textbf{Params:} $\mathrm{max\_t}$, $\delta$
      % \State \textbf{Output:} $\mathrm{converged}$, a boolean
      % \State $\mathrm{converged} \gets \mathrm{False}$
      \State $\hat{\boldsymbol{\epsilon}}_{10\mathrm{k}} \gets \hat\epsilon_1, \dots, \hat \epsilon_{10\mathrm{k}}$, \hfill $\hat \epsilon_i \sim q_{\mathrm{base}}$
      % \State Draw $\hat\epsilon_1, \dots, \hat\epsilon_{10\mathrm{k}}$ from $q_{\mathrm{base}}$
      \State $\mathrm{obj} \gets \mathrm{mean}(v_{\theta}(\boldsymbol{\epsilon}_{n}))$
      \State $\mathrm{elbo} \gets \mathrm{mean}(v_{\theta}(\hat{\boldsymbol{\epsilon}}_{10\mathrm{k}}))$
      \LineComment{Statistically compare means:} 
      \State $\mathrm{p}_{\mathrm{value}} \gets \mathrm{t\_test}(v_{\theta}(\boldsymbol{\epsilon}_{n}), v_{\theta}(\hat{\boldsymbol{\epsilon}}_{10\mathrm{k}} ))$
      \If{$\mathrm{p}_{\mathrm{value}} > 0.01$}
          \State \Return $\mathrm{True}$
      \EndIf
      \If{$\abs{\mathrm{obj} - \mathrm{elbo}} < \delta$ % \\ 
      \textbf{or} $t \geq \mathrm{max\_t}$}
        \State \Return $\mathrm{True}$
      \EndIf
      \State \Return $\mathrm{False}$
      \end{algorithmic}
    \end{algorithm}


\paragraph*{Stopping}
Algorithm~\ref{alg:converged} defines the stopping criteria for our optimization process, which are based on log-weights.
Specifically, given the training noise $\boldsymbol{\epsilon}_{n_t}$ and the parameters $\theta_t$, we compute the log-weights $v_{\theta_t}(\epsilon_1), \dots, v_{\theta_t}(\epsilon_{n_t})$, which we denote as $v_{\theta_t}(\boldsymbol{\epsilon}_{n_t})$.
We also compute a new set of log-weights using $10\mathrm{k}$ fresh samples of testing noise, denoted by $v_{\theta_t}(\hat{\boldsymbol{\epsilon}}_{10\mathrm{k}})$.

To decide whether to halt or continue the optimization process, we use a two-sided t-test to compare the means of log-weights.
We compare the mean of the training log-weights, \(v_{\theta_t}(\boldsymbol{\epsilon}_{n_t})\), with the mean of the testing log-weights, \(v_{\theta_t}(\hat{\boldsymbol{\epsilon}}_{10\mathrm{k}})\).
The null hypothesis asserts that these means are the same.
If we cannot reject the null hypothesis, we terminate the optimization process.
Although the assumptions required for the t-test (e.g., that the training log-weights are \iid) might not strictly hold in all cases, we employ this statistical test as a heuristic for stopping the optimization.
Our approach is inspired by the methodology of~\cite{mak1999monte}.
Alternatively, statistical tests such as the Kolmogorov--Smirnov or Cram\'er--von Mises tests could be used to directly compare the log-weight distributions.
In Appendix~\ref{appendix:ablation-statistical-test}, we evaluate the alternatives and show that the t-test is a reasonable choice.
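
For concreteness, one common form of the two-sample statistic is sketched below.
We assume the unequal-variance (Welch) variant, which is not specified above; the symbols $m$, $\bar{v}$, $s^2$, $T$, and $\nu$ are introduced only for this illustration.
Writing $m = 10^4$ for the number of testing samples, $\bar{v}_{\mathrm{train}}$ and $\bar{v}_{\mathrm{test}}$ for the sample means of the training and testing log-weights, and $s^2_{\mathrm{train}}$ and $s^2_{\mathrm{test}}$ for their sample variances, the statistic is
\[
  T = \frac{\bar{v}_{\mathrm{train}} - \bar{v}_{\mathrm{test}}}
           {\sqrt{\dfrac{s^2_{\mathrm{train}}}{n_t} + \dfrac{s^2_{\mathrm{test}}}{m}}}.
\]
Under this assumption, the two-sided p-value is $\Pr\left(\lvert T_{\nu} \rvert \geq \lvert T \rvert\right)$, where $T_{\nu}$ follows a Student $t$ distribution whose degrees of freedom $\nu$ are given by the Welch--Satterthwaite approximation.
A large p-value indicates that the training objective and the estimated ELBO are statistically indistinguishable at the current parameters.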



Our optimization process terminates when the null hypothesis cannot be rejected at a significance level of $1\%$.
Checking for convergence only when $\mathrm{count} = 0$ avoids meaningless tests, as without optimizer updates, the distributions of training log-weights $v_{\theta_t}(\boldsymbol{\epsilon}_{n_t})$ and testing log-weights $v_{\theta_t}(\hat{\boldsymbol{\epsilon}}_{10\mathrm{k}})$ would be nearly identical.
We also introduce two additional stopping conditions: the maximum number of iterations $\mathrm{max\_t}$ and the threshold $\delta$ for the difference between the training objective $\hat{\L}_{\boldsymbol{\epsilon}_{n_t}}(\theta_t)$ and the estimated ELBO $\hat{\L}_{\hat{\boldsymbol{\epsilon}}_{10\mathrm{k}}}(\theta_t)$.
In our experiments, we set $\mathrm{max\_t}$ to ensure that the maximum sample size was $n_{\max} = 2^{18}$, and $\delta$ to $0.01$.
In Appendix~\ref{appendix:hyperparams}, we provide a more detailed discussion of the hyperparameters used in our experiments.
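
As a summary of Algorithm~\ref{alg:converged}, the convergence check at iteration $t$ returns $\mathrm{True}$ exactly when
\[
  \left(\mathrm{p}_{\mathrm{value}} > 0.01\right)
  \;\lor\;
  \left(\lvert \mathrm{obj} - \mathrm{elbo} \rvert < \delta\right)
  \;\lor\;
  \left(t \geq \mathrm{max\_t}\right).
\]
The relationship between $\mathrm{max\_t}$ and $n_{\max}$ can be illustrated as follows, assuming that the sample size is doubled before each optimizer call (so that iteration $t$ uses $2^{t} n_0$ samples) and ignoring the $\mathrm{count}$ safeguard:
\[
  2^{\mathrm{max\_t}}\, n_0 = n_{\max} = 2^{18}
  \quad\Longrightarrow\quad
  \mathrm{max\_t} = \log_2 \frac{n_{\max}}{n_0} = 18 - \log_2 n_0 .
\]
For a hypothetical initial sample size of $n_0 = 32 = 2^{5}$, chosen here purely for illustration, this gives $\mathrm{max\_t} = 13$.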