\section{Background}
\label{sec:background}
We are interested in approximating the posterior distribution of a latent variable given some observed data, i.e., $p(Z \given x)$, where $Z$ is the latent variable and $x$ is the observed data.
To achieve this, we approximate the posterior with a distribution from an indexed family of approximations $\mathcal{Q} = \set{q_{\theta}\mid \theta \in \R^d}$, where $\theta$ is the $d$-dimensional parameter vector of the approximation $q_{\theta}(Z)$.
% we want $\R^d$, because we need an open set to have gradients and a closed set to have converging sequences

VI approximates the posterior distribution by finding the member of $\mathcal Q$ that is closest in Kullback-Leibler divergence to the true posterior.
This is achieved by maximizing the evidence lower bound (ELBO), which is a function of the parameters:
\begin{equation}\label{eq:elbo}
  \L(\theta) = \E [\ln p(Z, x) - \ln q_\theta(Z)],\qquad Z\sim q_\theta.
\end{equation}
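To see why maximizing the ELBO yields the closest member of $\mathcal Q$, it may help to recall the standard decomposition of the log evidence, written here with $\mathrm{KL}$ denoting the Kullback-Leibler divergence (a notation assumed for this remark):
\begin{equation*}
  \ln p(x) = \L(\theta) + \mathrm{KL}\bigl(q_\theta \,\|\, p(\cdot \given x)\bigr).
\end{equation*}
Since $\ln p(x)$ does not depend on $\theta$ and the divergence is nonnegative, $\L(\theta) \le \ln p(x)$, and any increase in $\L(\theta)$ decreases the divergence by the same amount.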
The optimization problem can be formulated as:
\begin{equation}
  \max_{\theta \in\Theta} \L(\theta) = \max_{\theta \in \Theta} \E [\ln p(Z, x) - \ln q_\theta(Z)],\qquad Z\sim q_\theta.\label{eq:max-elbo}
\end{equation}
Under smoothness assumptions, black-box VI treats this problem as a smooth stochastic optimization problem (SOP) and solves it using methods based on stochastic gradient descent (SGD).
Specifically, it maximizes the ELBO by stochastic gradient ascent.
At every iteration $t$, samples $z_1, \dots, z_n$ are drawn from $q_{\theta_t}$ and the sample mean of $g_{\theta_t}(z_i)$ is computed, where $g_{\theta_t}(Z)$ is an $\R^d$-valued random vector whose expectation equals the gradient of the ELBO at $\theta_t$.
This estimate is then used, along with a step size $\gamma_t \in \R_+$, to update the parameters according to:
\begin{equation}\label{eq:sgd-step}
  \theta_{t+1} \gets \theta_t + \gamma_t \frac{1}{n}\sum_{i=1}^n g_{\theta_t}(z_i).
\end{equation}
The estimator $g_{\theta_t}$ can be constructed in several ways, including the score function estimator \citep{wingate2013automated, ranganath2014black} or, if the distribution is reparameterizable, the `reparameterization trick'~\citep{kingma2013auto, fu2006gradient, kingma2019introduction, rezende2014stochastic}, among others.
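As an illustration, the score function estimator is commonly written as follows; this is a sketch that assumes $q_\theta$ is differentiable in $\theta$ and that differentiation and expectation can be interchanged:
\begin{equation*}
  g_{\theta}(z) = \nabla_\theta \ln q_\theta(z)\,\bigl(\ln p(z, x) - \ln q_\theta(z)\bigr).
\end{equation*}
Its expectation under $Z \sim q_\theta$ equals $\nabla_\theta \L(\theta)$ because $\E [\nabla_\theta \ln q_\theta(Z)] = 0$, and it only requires evaluating $\ln p(z, x)$, not differentiating it.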
A random variable $Z$ comes from a reparameterizable distribution $q_\theta$ if there exist a $C^1$ function $z_{\theta}$ and a density $q_{\mathrm{base}}$ that does not depend on $\theta$ such that $Z = z_{\theta}(\epsilon)$ for $\epsilon \sim q_{\mathrm{base}}$.
We refer to these $\epsilon$ values as noise.
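For example, a univariate Gaussian family is reparameterizable. The parameterization below, with $\theta = (\mu, \omega) \in \R^2$ and standard deviation $e^{\omega}$, is chosen here only for illustration:
\begin{equation*}
  z_\theta(\epsilon) = \mu + e^{\omega}\epsilon, \qquad \epsilon \sim q_{\mathrm{base}} = \mathcal N(0, 1),
\end{equation*}
so that $z_\theta(\epsilon) \sim \mathcal N(\mu, e^{2\omega})$, and the map is $C^1$ (indeed smooth) in both $\theta$ and $\epsilon$.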
When $q_\theta$ is reparameterizable, the stochastic optimization problem becomes
\begin{equation}\label{eq:elbo-reparam}
  \max_{\theta \in\Theta} \L(\theta) = \max_{\theta \in \Theta} \E [\ln p(z_\theta(\epsilon), x)-\ln q_\theta(z_\theta(\epsilon))],
\end{equation}
where $\epsilon\sim q_{\mathrm{base}}$.
It then follows that, at every step $t$ of the optimization, the update rule of Eq.~(\ref{eq:sgd-step}) is
\begin{equation*}
  \theta_{t+1} \gets \theta_t + \gamma_t\frac{1}{n}\sum_{i=1}^n g_{\theta_t}(z_{\theta_t}(\epsilon_{ti})),\qquad \epsilon_{ti} \sim q_{\mathrm{base}}.
\end{equation*}
Simple as this description is, it hides the difficulty of choosing hyperparameters, particularly the step size $\gamma_t$, also known as the learning rate.
The user can opt for a step size schedule $\bm{\gamma} = \seq[t]{\gamma_t}\subset \R_{+}$ that meets the Robbins-Monro conditions ($\norm{\bm{\gamma}}_1 = \infty$ and $\norm{\bm{\gamma}}_2 < \infty$), under which SGD with unbiased gradient estimators can converge to a critical point \citep{robbins1951stochastic, ranganath2014black,jankowiak2018pathwise}.
However, these conditions do not determine a specific schedule, and different schedules may affect the speed of convergence differently \citep[cf.][]{agrawal2020advances}.
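For instance, polynomially decaying schedules satisfy both conditions; they are used here purely as an illustration, with hypothetical constants $\gamma_0 > 0$ and $\alpha$:
\begin{equation*}
  \gamma_t = \gamma_0\, t^{-\alpha},\quad t \ge 1,\quad \tfrac{1}{2} < \alpha \le 1
  \quad\Longrightarrow\quad
  \norm{\bm{\gamma}}_1 = \sum_{t \ge 1} \gamma_t = \infty,
  \qquad
  \norm{\bm{\gamma}}_2^2 = \gamma_0^2 \sum_{t \ge 1} t^{-2\alpha} < \infty.
\end{equation*}
The conditions leave $\gamma_0$ and $\alpha$ free, so both must still be tuned.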
Critically, because the loss function and its gradient are only available through noisy estimates, traditional line-search methods are impractical.
Additionally, the number of samples $n$ drawn at each iteration affects the optimization process: a larger $n$ provides a more accurate gradient estimate but increases the computational cost.
Balancing this trade-off is an important aspect of algorithm design.
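The trade-off can be quantified under the assumption, made here for illustration, that the samples $z_1, \dots, z_n$ are i.i.d.\ draws from $q_{\theta_t}$ and that $g_{\theta_t}(Z)$ has finite second moments. Writing $\hat g_n$ for the sample mean in Eq.~(\ref{eq:sgd-step}) and $\nabla_\theta \L(\theta_t)$ for its expectation,
\begin{equation*}
  \E\, \norm{\hat g_n - \nabla_\theta \L(\theta_t)}_2^2 = \frac{1}{n}\, \E\, \norm{g_{\theta_t}(Z) - \nabla_\theta \L(\theta_t)}_2^2,
\end{equation*}
so the mean squared error decays like $1/n$ while the cost per iteration grows linearly in $n$; halving the root-mean-square error requires four times as many samples.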

Moreover, controlling the variance of gradient estimates significantly influences the performance of the optimization algorithm, affecting stability and convergence properties, and further adding to the complexity of the problem.
In this context, the choice of the gradient estimator $g_{\theta_t}$ is crucial.
Instead of employing the na\"ive estimator, which averages the gradient of $\ln p(z_{\theta_t}(\epsilon), x) - \ln q_{\theta_t}(z_{\theta_t}(\epsilon))$, one can consider alternatives such as the sticking-the-landing estimator \citep{STL} or, when the entropy term $\mathbb H_{\theta_t} = -\mathbb E[\ln q_{\theta_t}(z_{\theta_t}(\epsilon))]$ is available in closed form, estimating the gradient of $\mathbb E[\ln p(z_{\theta_t}(\epsilon), x)] + \mathbb H_{\theta_t}$.
Although all these estimators are unbiased, they exhibit different variance behaviors, which can impact the optimization process.
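In the reparameterized notation, the three estimators can be sketched as follows; the superscripts and the auxiliary variable $\theta'$ are labels introduced here for illustration and are not notation from the cited works:
\begin{align*}
  g^{\mathrm{naive}}_\theta(\epsilon) &= \nabla_\theta \bigl[\ln p(z_\theta(\epsilon), x) - \ln q_\theta(z_\theta(\epsilon))\bigr],\\
  g^{\mathrm{STL}}_\theta(\epsilon) &= \nabla_\theta \bigl[\ln p(z_\theta(\epsilon), x) - \ln q_{\theta'}(z_\theta(\epsilon))\bigr]\Big|_{\theta' = \theta},\\
  g^{\mathrm{ent}}_\theta(\epsilon) &= \nabla_\theta \ln p(z_\theta(\epsilon), x) + \nabla_\theta \mathbb H_\theta.
\end{align*}
The sticking-the-landing form holds $\theta'$ fixed while differentiating, which drops the score term $\nabla_\theta \ln q_\theta(z)$ evaluated at $z = z_\theta(\epsilon)$; that term has zero expectation, so the estimator remains unbiased.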
To reduce the variance of the gradient estimator, control variates can also be applied \citep{ranganath2014black, NEURIPS2018_dead35fa}.
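As a sketch of the idea in the scalar case, let $g$ denote one coordinate of the gradient estimator and let $h$ be a random variable with known mean $\E [h]$ and finite, nonzero variance; the symbols $h$, $c$, and $\rho$ are introduced here for illustration. For any constant $c$,
\begin{equation*}
  \tilde g = g - c\,(h - \E [h]), \qquad \E [\tilde g] = \E [g], \qquad \operatorname{Var}[\tilde g] = \operatorname{Var}[g] - 2c\operatorname{Cov}[g, h] + c^2 \operatorname{Var}[h],
\end{equation*}
which is minimized at $c^\star = \operatorname{Cov}[g, h] / \operatorname{Var}[h]$, giving $\operatorname{Var}[\tilde g] = (1 - \rho^2)\operatorname{Var}[g]$ with $\rho$ the correlation between $g$ and $h$.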
These choices interact with the step size schedule and the number of samples, adding to the overall complexity of hyperparameter selection.

