\section{NUMERICAL EXPERIMENTS} \label{sec:experiment}

We corroborate our theoretical results on several examples and explore the trade-off between the complexity of the inference function, the quality of the approximation, and the convergence time of the optimization.
In all our examples, we do full Bayesian inference over $\theta$ and $\mbz$.
The code to reproduce the experiments can be found at \url{https://github.com/charlesm93/AVI-when-and-why}. % [RETRACTED].

\subsection{Experimental setup} 

As our variational approximation, we use the family of factorized Gaussians.
Our benchmarks are a \textit{constant factor algorithm}, which assigns the same Gaussian factor $\bar q$ to each latent variable $z_n$, and F-VI.
The constant factor algorithm and F-VI correspond, respectively, to the least and most expressive variational families we consider.

With A-VI, we learn $q(\theta)$, just as in F-VI, but amortize the inference for $q(\mbz)$: specifically, we fit an inference function that maps $x_n$ to the mean and variance of the Gaussian factor $q(z_n)$.
For the saw time series example (\cref{sec:sawtime}), the input of the inference function is expanded to $(x_{n - 1}, x_n)$.

To minimize the KL-divergence, we maximize the evidence lower bound (ELBO),
%
\begin{equation}
  \mathbb E_{q(\mbz, \theta \s \nu)} \left [\log p(\theta, \mbz, \mbx) - \log q(\theta, \mbz) \right],
\end{equation}
%
estimated via Monte Carlo.
For all experiments we use a conservative 100 draws to estimate the ELBO, except when training a Bayesian neural network, where we use a mini-batching strategy instead.
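As an illustrative sketch of this estimator, the ELBO may be approximated with $S = 100$ draws as shown below. The symbols $S$, $\epsilon^{(s)}$, $\mu_\nu$ and $\sigma_\nu$ are introduced here only for exposition, and we assume the usual location-scale parameterization of the factorized Gaussian.
%
\begin{align*}
  \widehat{\mathrm{ELBO}}(\nu) & = \frac{1}{S} \sum_{s = 1}^{S} \left [ \log p(\theta^{(s)}, \mbz^{(s)}, \mbx) - \log q(\theta^{(s)}, \mbz^{(s)} \s \nu) \right], \\
  (\theta^{(s)}, \mbz^{(s)}) & = \mu_\nu + \sigma_\nu \odot \epsilon^{(s)}, \qquad \epsilon^{(s)} \sim \mathcal N(0, I),
\end{align*}
%
where $\mu_\nu$ and $\sigma_\nu$ collect the means and standard deviations of the Gaussian factors and $\odot$ denotes the elementwise product.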
  
We employ the Adam optimizer \citep{Kingma:2015} in \texttt{PyTorch} \citep{Pytorch:2019} and use the reparameterization trick to evaluate the gradients \citep{Kingma:2014}.
For F-VI, the optimization is directly performed over the parameters of the factorized Gaussian $q(\theta) q(\mbz)$.
For A-VI, the optimization is over the parameters of the Gaussian $q(\theta)$ and over the parameters of the inference function (e.g. the weights of the inference neural network) which maps $x_n$ to the parameters of $q(z_n)$.
We find that a learning rate of $10^{-3}$ works well across applications.
The optimizer is stochastic because of the random initialization and the Monte Carlo estimation of the ELBO, and so we repeat each experiment 10 times.
%
Depending on the choice of variational family, the computation cost per optimization step can vary.
Therefore we report the ELBO against the wall time when evaluating the performance of each algorithm. 


\subsection{Linear probabilistic model}
  %
  We begin with the example from \Cref{sec:linear}, using $N = 10{,}000$ simulated observations, obtained by drawing $\theta$ and $\mbz$ from standard normals.
  A-VI's inference function is a polynomial of degree $d$ and we require $d \ge 1$ to learn the optimal variational parameters (Proposition~\ref{thm:linear}).
  \Cref{fig:optimization_paths} shows the optimization paths over 5,000 steps for a single seed and \Cref{fig:iter_to_convergence} summarizes the wall time to convergence across seeds for each VI algorithm.
  Consistent with our analysis in \Cref{sec:linear}, A-VI attains the same outcome as F-VI for $d \ge 1$.
  Furthermore, we find that A-VI requires an order of magnitude less time to converge.
  Naturally, using $d = 2$ also yields an optimal solution; however, we observe that A-VI then converges more slowly.

  \begin{figure*}
\centering
      \includegraphics[width=1.5in]{figures/conv_time_lin_boxplot.pdf}
      \includegraphics[width=1.5in]{figures/conv_time_nonlin_boxplot.pdf}
      \includegraphics[width=1.5in]{figures/conv_time_BNN_boxplot.pdf}
      \includegraphics[width=1.5in]{figures/conv_time_sawtime_boxplot.pdf}
      \caption{ \textit{Wall time to convergence. We run each experiment 10 times and summarize the wall time required for the ELBO to converge for each VI algorithm. For the Bayesian neural network, we report convergence in terms of the MSE of the image reconstruction.
      Algorithms with a collapsed box plot on the right do not close the amortization gap.
      }}
      \label{fig:iter_to_convergence}
\end{figure*}

  \subsection{Nonlinear probabilistic model}
  This is a variation on the previous model, with a nonlinear likelihood.
  The joint distribution is then,
  %
  \begin{align}
   p(\theta) & = \mathcal N(0, 1) \nonumber \\
   p(z_n) & = \mathcal N(0, 1) \nonumber \\
   p(x_n \mid z_n, \theta) & = \mathcal N \left (\theta + z_n (1 + \sin(z_n)), \cos^2(z_n) \right).
  \end{align}
  %
  We simulate $N = 10{,}000$ observations.
  The inference function $f_\phi$ is a neural network with two hidden layers of width $k$ and ReLU activations.
  A-VI can match F-VI's solution for $k \ge 4$.
  In contrast to the linear probabilistic example, an overparameterized inference function yields faster convergence as measured by the median over 10 seeds (\Cref{fig:iter_to_convergence}).
  However, A-VI is more sensitive to the seed and, in some cases, fails to converge within 20,000 iterations.
  Given the large number of Monte Carlo samples used to estimate the ELBO, we attribute this variability to the random initialization rather than to Monte Carlo noise.
  Robustness to initialization is therefore a strength of F-VI relative to A-VI, particularly in this example.
  The choice $k=16$ produces a fast and reasonably stable algorithm.
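  To give a sense of scale, consider a rough parameter count. The architectural details here are assumptions made for illustration: a scalar input $x_n$, bias terms in every layer, and two outputs per observation (the mean and the variance of $q(z_n)$). Under these assumptions, the inference network has
  %
  \begin{equation*}
    (k + k) + (k^2 + k) + (2k + 2) = k^2 + 5k + 2
  \end{equation*}
  %
  parameters, that is, 38 for $k = 4$ and 338 for $k = 16$, whereas F-VI optimizes $2N = 20{,}000$ local variational parameters for $q(\mbz)$.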
  

\subsection{Bayesian neural network}

\begin{figure}
    \centering
    \includegraphics[width=6.5cm]{figures/mse_time_BNN_1954.pdf}
    \caption{\textit{Image reconstruction error, as measured by the MSE over pixels, for a trained Bayesian neural network.
    The MSE and the ELBO are not in one-to-one correspondence.
    For a sufficiently expressive inference network, A-VI achieves the same error as F-VI and converges faster.
    The figure shows the paths for a single seed; for results across several seeds, see Figure~\ref{fig:iter_to_convergence}.
    }}
    \label{fig:BNN_mse}
\end{figure}


Next we consider a deep generative model applied to the FashionMNIST data set \citep{Xiao:2017}.
We associate with each image $x_n \in \mathbb R^{784}$ a low-dimensional representation $z_n \in \mathbb R^{64}$.
The joint distribution is then,
%
\begin{align}
  p(\theta) & = \mathcal N(0, I) \nonumber \\
  p(z_n) & = \mathcal N(0, I) \nonumber \\
    p(x_n \mid z_n, \theta) & = \mathcal N (\Omega(z_n \s \theta), I),
\end{align}
%
where $\Omega$ is a neural network with two hidden layers of width 256 and a leaky ReLU activation function.
The vector $\theta \in \mathbb R^{57{,}232}$ stores the weights and biases of the network.
This generative model underlies the traditional VAE; however, we estimate a posterior over $\theta$ in order to confirm that A-VI can indeed close the amortization gap when fitting a Bayesian neural network.
%
We train the model on 10,000 images and at each iteration, evaluate the ELBO on a mini-batch of 1,000 images.
Hence a single epoch contains 10 iterations, and we run each VI algorithm for 5,000 epochs.
From a pilot run, we found estimating the ELBO with a single Monte Carlo sample worked reasonably well.
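One common way to form such a mini-batch estimate is sketched below. We assume here that the local terms are rescaled to keep the estimator unbiased; the notation $\mathcal B$ for the mini-batch and the tildes for sampled values are introduced for illustration.
%
\begin{equation*}
  \widehat{\mathrm{ELBO}} = \log p(\tilde \theta) - \log q(\tilde \theta) + \frac{N}{|\mathcal B|} \sum_{n \in \mathcal B} \left [ \log p(\tilde z_n) + \log p(x_n \mid \tilde z_n, \tilde \theta) - \log q(\tilde z_n) \right],
\end{equation*}
%
where $\tilde \theta \sim q(\theta)$ and $\tilde z_n \sim q(z_n)$ are single draws.
With $N = 10{,}000$ training images and $|\mathcal B| = 1{,}000$, the scaling factor is $N / |\mathcal B| = 10$.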

Because the optimization landscape is non-convex, we cannot guarantee that any of the algorithms converge.
However, we find that after 5,000 epochs, A-VI achieves the same ELBO as F-VI when using a width $k \ge 64$ for the inference network (\Cref{fig:optimization_paths}).
We also study the image reconstruction error measured by the mean squared error (MSE) over pixels on the training set.
(The MSE on the test set is not available for F-VI; however, in Appendix~B we provide the test error for A-VI.)
For this calculation, we use the Bayes estimators $\mathbb E(\theta \mid \mbx)$ and $\mathbb E (\mbz \mid \mbx)$.
\Cref{fig:BNN_mse} plots the MSE against wall time for the same seed used in \Cref{fig:optimization_paths}.
In \Cref{fig:iter_to_convergence}, we report the wall time required to achieve an MSE below 0.03, which corresponds to F-VI's best solution.
For $k \ge 64$, A-VI requires 2--3 times fewer iterations to converge; however, each iteration is considerably more expensive.
As a result, the speed-up in wall time is roughly 25\%.
Overparameterizing the inference network slightly improves the convergence speed.

% (An analysis on the MNIST data set \citep{LeCun:1998} is in the Appendix and yields similar conclusions.)
% %: that is we jointly estimate the posterior distribution for the latent variables $\mbz$ and for the weights of the decoder $\theta$.
% The likelihood is a Gaussian with fixed variance and the mean is given by a two-layer neural network of width 40 with a leaky ReLU activation function.
% % We consider two data sets: MNIST \citep{LeCun:1998} and FashionMNIST. 
% To alleviate the computational burden, we use a subset of 10,000 images and run the optimization for 500 steps. %, with each step using the full data.

% \begin{figure}
%     % \vspace{-0.5cm}
%     \center
%     % \includegraphics[width=1.7in]{figures/k_vae_MNIST.pdf} \ \
%     \includegraphics[width=6cm]{figures/k_vae_FashionMNIST.pdf}
%     % \includegraphics[width=1.7in]{figures/elbo_vae_FashionMNIST_1792.pdf}
%     \caption{\textit{Achieved ELBO after 500 steps for varying widths $k$ of the inference network. The solid line is the median ELBO and the shaded region extends between the minimum and maximum ELBO over 10 seeds. 
%     For FashionMNIST, A-VI consistently outperforms F-VI, with a wider inference network yielding better results.} 
%     } \label{fig:k}
% \end{figure} 
  
% We do not believe Adam maximizes the ELBO in 500 iterations, given the non-convex loss functions (Figure~\ref{fig:optimization_paths}).
% This points at a limitation of our theoretical analysis, which focused on the optimal KL-divergence, rather than the solution obtained under a computational budget.
% % For many problems, the relevant question is not how far \textit{can} the optimizer go, rather how far does it actually go.
% As a ``silver'' benchmark, we run F-VI for another 500 iterations, initialized at the best output obtained with A-VI.
% %, which is a semi-amortized approach \citep{Kim:2018}.
% We then report the highest achieved ELBO over 10 seeds as ELBO$^*$. 
% %We do not interpret this as the true optimal ELBO of F-VI, rather as evidence that we are not likely to find the optimal solution for F-VI nor A-VI without more computational resources. 
% We estimate the achieved ELBO by averaging the final 20 steps and report our results in Figure~\ref{fig:k}.
% A-VI consistently outperforms F-VI with a wider network yielding better results.
%For MNIST, A-VI returns results comparable to F-VI even when using a narrow neural network. In FashionMNIST, A-VI consistently outperforms F-VI with a wider network yielding better results.

 \subsection{Saw time series} \label{sec:sawtime}
 In this final example, we explore the benefits of extending the domain of the inference function. 
 We simulate $N = 1{,}000$ observations from a saw time series (\cref{eq:saw_time}), with $x_0 = 0$ and
 %
\begin{align}
  p(\theta) & = \mathcal N(0, 1) \nonumber \\
  p(z_0) & = \mathcal N(0, 1) \nonumber \\
  p(z_n \mid x_{n - 1}) & = \mathcal N(x_{n - 1}, 1) \nonumber \\
  p(x_n \mid z_n, \theta) & =  \mathcal N(\alpha (\theta + z_n), 1).
\end{align}
%
Once again, we fit an inference neural network with two hidden layers of width $k$.
Additionally, we allow the network to take either $x_n$ or $(x_{n - 1}, x_n)$ as its input.
Only with the expanded input does A-VI attain F-VI's optimum, which it does for $k \ge 4$.
Using only $x_n$ produces a suboptimal approximation even with a comparatively large inference network (e.g. $k = 20$). 
Figure~\ref{fig:saw_time} demonstrates this behavior for one optimization path.
Across seeds, we find that A-VI consistently outperforms F-VI in wall time to convergence (\Cref{fig:iter_to_convergence}).
While A-VI is sensitive to the seed for $k = 2$, the algorithm stabilizes once we overparameterize the inference network, with $k \ge 4$.


