\section{INTRODUCTION}

A latent variable model is a probabilistic model of observations $\mbx = x_{1:N}$ with corresponding local latent variables $\mbz = z_{1:N}$ and global latent parameters $\theta$.
% Consider for example a model that factorizes as
% %
% \begin{equation}  \label{eq:simple-hier}
%   p(\theta, \mbz, \mbx) = p(\theta) \prod_{n = 1}^N p(x_n \mid z_n, \theta) p(z_n \mid \theta),
% \end{equation}
% %
% where $p(z_n \mid \theta)$ and $p(x_n \mid z_n, \theta)$ have the same distributional forms for all $n$.
% This factorization, termed the \textit{simple hierarchical model} \citep{Agrawal:2021}, underlies many probabilistic models.
% In a deep generative model \citep{Kingma:2014, Tomczak:2022}, 
% %  Should we cite Rezende:2014a too? Removed for space.
% $z_n$ is a low-dimensional latent representation of $x_n$, while $\theta$ parameterizes a neural network that maps $z_n$ to the mean and variance of the likelihood $p(x_n \mid z_n, \theta)$.
% In a multilevel regression model, $z_n$ describes an effect specific to the covariate $x_n$, while $\theta$ encodes effects common to all observations \citep{Gelman:2006}.
%
With a model $p(\theta, \mbz, \mbx)$ and an observed dataset $\mbx$, the central computational problem is to approximate the posterior distribution of the latent variables $p(\theta, \mbz \mid \mbx)$. 

One widely-used method for approximate posterior inference is \textit{variational inference} (VI)~\citep{Jordan:1999, Blei:2017}.
VI posits a parameterized family of distributions $\cQ$ and finds the member of the family that minimizes the Kullback-Leibler (KL) divergence to the posterior,
\begin{align}
  \label{eq:vi-optim}
  q^* =
  \arg \min_{q \in \cQ}
  \kl{q(\theta, \mbz) \, || \,
  p(\theta, \mbz \g \mbx)}.
\end{align}
VI then approximates the posterior with the optimized $q^*$. (In practice, VI finds a local optimum of \Cref{eq:vi-optim}.)
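For reference, \Cref{eq:vi-optim} is usually optimized through the standard identity relating the KL divergence to the evidence lower bound (ELBO). The shorthand $\mathrm{ELBO}(q)$ is introduced here only for illustration:
\begin{align*}
  \kl{q(\theta, \mbz) \, || \, p(\theta, \mbz \g \mbx)}
  &= \log p(\mbx) - \mathrm{ELBO}(q), \\
  \mathrm{ELBO}(q)
  &= \mathbb{E}_{q} \left[ \log p(\theta, \mbz, \mbx) - \log q(\theta, \mbz) \right].
\end{align*}
Because $\log p(\mbx)$ does not depend on $q$, minimizing the KL divergence over $\cQ$ is equivalent to maximizing the ELBO over $\cQ$.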

\begin{figure*}
    \centering
    \begin{tikzpicture}
    [
      Empty/.style={circle, draw=cyan!10, fill=cyan!10, thick, minimum size=0mm},
      Round/.style={circle, draw=black!, fill=cyan!10, thick, minimum size=30mm},
      Round_large/.style={circle, draw=black!, fill=orange!10, thick, minimum size=30mm},
      Round_small/.style={circle, draw=black!, fill=orange!10, thick, minimum size=10mm},
      Empty_A/.style={circle, draw=orange!10, fill=orange!10, thick, minimum size=0mm},
    ]

    \node[Round] (QF) at (0, 0) { };
    \draw (-0.7, -0.7) node {$\mathcal Q_\text{F}$};
    \node[Round_small] (QA) at (0, 0.5) { };
    \draw (0, 0.5) node {$\mathcal Q_\text{A}$};
    \filldraw [black] (1.3, -0.7) circle (2pt);
    \filldraw [black] (2.4, -1.2) circle (2pt);
    \draw[black, dotted] (1.3, -0.7) -- (2.4, -1.2);
    \draw (1.35, -1) node {$q^*$};
    \draw (2.7, -1.2) node {$p$};
    \draw (-1.7, -1) node {(a)};

    \node[Round] (QF) at (6, 0) { };
    \node[Round_small] (QA) at (6.85, -0.5) { };
    \draw (5.3, -0.7) node {$\mathcal Q_\text{F}$};
    \draw (6.85, -0.5) node {$\mathcal Q_\text{A}$};
    \filldraw [black] (7.3, -0.7) circle (2pt);
    \filldraw [black] (8.4, -1.2) circle (2pt);
    \draw[black, dotted] (7.3, -0.7) -- (8.4, -1.2);
    \draw (7.35, -1) node {$q^*$};
    \draw (8.7, -1.2) node {$p$};  
    \draw (4.3, -1) node {(b)};

    \end{tikzpicture}
    \caption{\textit{The variational family $\mathcal Q_\text{A}$ for A-VI is a subset of the variational family $\mathcal Q_\text{F}$ for F-VI. (a) In general, F-VI can achieve a lower KL-divergence than A-VI.
    (b) Under certain conditions, however, A-VI may still achieve the same optimal solution $q^*$ as F-VI.
    }}
    \label{fig:ordering}
\end{figure*}

To fully specify the VI objective of \Cref{eq:vi-optim}, we must decide on the variational family $\cQ$ over which to optimize. 
Many applications of VI use the \textit{fully factorized family}, also known as the \textit{mean-field family}.
It is the set of distributions where each variable is independent,
\begin{align}
  \label{eq:Q_F}
  \cQ_{\rmF} =
  \left\{
  q: q(\theta, \mbz) = q_0(\theta) \textstyle \prod_{n = 1}^N q_n(z_n)
  \right\},
\end{align}
and where the notation $q_n$ clarifies that there is a separate factor for each latent variable.
The factorized family underpins many applications where fast computation is desired to fit high-dimensional models to large data sets \citep[e.g.,][]{Bishop:2002, Blei:2012, Giordano:2023}.
We call an algorithm that optimizes \Cref{eq:vi-optim} over $\cQ_{\rmF}$ \textit{factorized variational inference} (F-VI). 

While the factorized family involves a separate factor $q_n(z_n)$ for each latent variable, recent applications of VI have explored the \textit{amortized variational family} \citep[e.g.,][]{Kingma:2014, Tomczak:2022}. In this family, the latent variables are again independent, but the variational distribution of $z_n$ is now governed by an \textit{inference function} $f_{\phi}(x_n)$, 
%
\begin{align}
  \label{eq:Q_A}
  \cQ_{\rmA} =
  \left\{
  q: q(\theta, \mbz) = q_0(\theta)
  \textstyle \prod_{n = 1}^N q(z_n \s f_{\phi}(x_n)) 
  % ; \ f_\phi \in \mathcal F
  \right\}.
\end{align}
%
The inference function $f_\phi$ maps each $x_n$ to the parameters of its corresponding latent variable's approximating factor $q_n(z_n)$.
Optimizing \Cref{eq:vi-optim} now amounts to fitting an approximate posterior $q_0(\theta)$ and the inference function $f_{\phi}$. Such an algorithm is called \textit{amortized variational inference} (A-VI).
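As a concrete illustration, suppose that each local factor is Gaussian with scalar $z_n$. This is an assumption made here for exposition only, and the symbols $\mu_n$, $\sigma_n$, $\mu_\phi$ and $\sigma_\phi$ are hypothetical. The two families then parameterize the factor for $z_n$ as
\begin{align*}
  \text{F-VI:} \quad & q_n(z_n) = \mathcal N \left(z_n ; \, \mu_n, \sigma_n^2 \right), \\
  \text{A-VI:} \quad & q(z_n \s f_\phi(x_n)) = \mathcal N \left(z_n ; \, \mu_\phi(x_n), \sigma_\phi^2(x_n) \right),
  \quad f_\phi(x_n) = \left( \mu_\phi(x_n), \sigma_\phi(x_n) \right).
\end{align*}
In this sketch, F-VI optimizes the $2N$ free parameters $(\mu_n, \sigma_n)_{n = 1}^N$ in addition to $q_0(\theta)$, whereas A-VI optimizes the shared parameters $\phi$, whose dimension is set by the choice of $f_\phi$ rather than by $N$.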

The canonical application of amortization is in the \textit{variational autoencoder} (VAE)~\citep{Kingma:2014, Rezende:2014a}, where A-VI is used to do variational expectation-maximization.
In this application, $p(\mbx \mid \theta, \mbz)$ is specified by a deep generative model.
The inference function, termed the ``encoder'', is used to approximate the conditional posterior $p(z_n \mid x_n, \theta)$ for the expectation step.
We then estimate $\theta$ by maximizing the approximated marginal likelihood $p(\mbx \mid \theta)$, which is the maximization step.
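One common way to write this procedure, assuming the pairs $(x_n, z_n)$ are independent given $\theta$ and using $\mathcal L(\theta, \phi)$ as hypothetical shorthand, is as the maximization of a lower bound on the log marginal likelihood:
\begin{align*}
  \log p(\mbx \mid \theta) \ge
  \sum_{n = 1}^N \mathbb{E}_{q(z_n \s f_\phi(x_n))}
  \left[ \log p(x_n, z_n \mid \theta) - \log q(z_n \s f_\phi(x_n)) \right]
  =: \mathcal L(\theta, \phi).
\end{align*}
Increasing $\mathcal L$ with respect to $\phi$ tightens the bound and plays the role of the (approximate) expectation step, while increasing it with respect to $\theta$ plays the role of the maximization step.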

There exist several motivations for A-VI.
One of them is scaling.
While F-VI requires fitting a separate variational factor for each of the data points, A-VI can be more efficient since what we learn about $\phi$ can be amortized across data points.
However, if A-VI's inference function is not sufficiently expressive, it may fail to produce as accurate a solution as F-VI. 
We will formalize this intuition and go one step further, showing that, no matter how expressive the inference function, $\mathcal Q_\text{A}$ is always a subset of $\mathcal Q_\text{F}$ and hence never a richer family.

While A-VI is typically understood as a cog in the VAE, its formulation suggests a more general algorithm for approximate posterior inference.
In this paper, we study A-VI as a general-purpose alternative to F-VI. We ask: Under what conditions can A-VI achieve the same solution as F-VI?

In more detail, because $\mathcal Q_\text{A}$ is a poorer family than $\mathcal Q_\text{F}$, A-VI cannot achieve a lower KL divergence than F-VI's optimal approximation. Our goal is therefore to distinguish the two scenarios illustrated in \Cref{fig:ordering}. In scenario (a), the amortized family does not contain the optimal factorized variational distribution; in scenario (b), it does. In the VAE literature, A-VI's potential suboptimality relative to F-VI is known as the \textit{amortization gap} 
\citep{Cremer:2018}. 
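The ordering between the two families can be sketched in one line, assuming that both families use the same parametric form for the local factors. Any member of $\cQ_{\rmA}$ is a member of $\cQ_{\rmF}$ with $q_n(z_n) = q(z_n \s f_\phi(x_n))$, and therefore
\begin{align*}
  \cQ_{\rmA} \subseteq \cQ_{\rmF}
  \quad \Longrightarrow \quad
  \min_{q \in \cQ_{\rmF}} \kl{q(\theta, \mbz) \, || \, p(\theta, \mbz \g \mbx)}
  \le
  \min_{q \in \cQ_{\rmA}} \kl{q(\theta, \mbz) \, || \, p(\theta, \mbz \g \mbx)}.
\end{align*}
Equality holds when an optimum of F-VI lies in $\cQ_{\rmA}$, which is the situation depicted in panel (b) of \Cref{fig:ordering}.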

First, we characterize the class of models where A-VI can close the amortization gap and show that this class corresponds to \textit{simple hierarchical models} \citep{Agrawal:2021}, i.e., latent variable models that factorize as
\begin{equation} \label{eq:simple-hier}
  p(\theta, \mbz, \mbx) = p(\theta) \prod_{n=1}^N p(z_n) p(x_n \mid z_n, \theta).
\end{equation}
This class includes the deep generative model that underlies the VAE and many other models in machine learning and in Bayesian statistics. Our analysis also shows that A-VI is appropriate for \textit{full Bayesian inference}, meaning we approximate $p(\theta, \mbz \g \mbx)$, rather than approximate $p(\mbz \mid \mbx, \theta)$ and point-estimate $\theta$ as in variational expectation-maximization.

Second, we generalize A-VI by expanding the domain of the inference function beyond a single data point $x_n$. We establish verifiable conditions for when an expanded function can close the amortization gap, and we provide a time-series example. Finally, we show that there are important examples, such as the hidden Markov model and the Gaussian process, where A-VI cannot attain F-VI's optimal solution, even when the domain of the inference function is expanded.

% Our analysis also shows that A-VI can close the gap when approximating the joint posterior distribution $p(\theta, \mbz \mid \mbx)$ rather than the conditional posterior $p(\mbz \mid \theta, \mbx)$.
% A-VI is therefore viable for \textit{full Bayesian inference} \citep{Gelman:2013}, meaning that we construct a posterior over $\theta$, rather than produce a point estimate $\hat \theta$ as we would when doing variational expectation maximization.
% Therefore A-VI can be used to fit Bayesian hierarchical models and train Bayesian neural networks.

%
% Our main question is when can A-VI achieve the optimize F-VI --- this is known as the amortization gap.
%
% We characterize the class of models where A-VI can close the amortization gap.
%
% Our theory also shows that A-VI for full Bayesian inference


% We are interested in two generalizations from the VAE application: (i) Bayesian inference on both $\mbz$ and $\theta$, through the approximation of the joint posterior $p(\theta, \mbz \mid \mbx)$, thereby enabling the fitting Bayesian hierarchical models and the training of Bayesian neural networks; and (ii) latent variable models beyond the factorization of \Cref{eq:simple-hier}, including saw time series, hidden Markov models and Gaussian processes. 


\paragraph{Plan.} In \Cref{sec:ordering} we show that the potential for A-VI to achieve F-VI's solution amounts to implicitly solving an \textit{amortization interpolation problem} between $x_n$ and the optimal variational factors of F-VI.   For a solution to exist, two conditions must be met: (i) the interpolation problem must be well-posed, which is a condition on the model $p(\theta, \mbz, \mbx)$; and (ii) the class of inference functions over which we learn $f_\phi$ must be sufficiently expressive, which is a condition on the inference algorithm. 
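Schematically, writing $\nu_n^*$ as hypothetical notation for the optimal parameters of F-VI's factor $q_n(z_n)$, the amortization interpolation problem asks for an inference function in the chosen class such that
\begin{align*}
  f_\phi(x_n) = \nu_n^*, \qquad n = 1, \dots, N.
\end{align*}
A necessary condition for such an $f_\phi$ to exist is that $x_n = x_m$ implies $\nu_n^* = \nu_m^*$, since a function cannot map two identical observations to different variational factors.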

In \Cref{sec:existence}, we investigate condition (i) theoretically. We show that, in general, the amortization interpolation problem admits a solution if and only if $p(\theta, \mbz, \mbx)$ is a simple hierarchical model. We then show how to expand the inference function to accommodate more models, and that there are models for which the gap cannot be closed.  

In \Cref{sec:experiment}, we empirically study condition (ii).
We find that the number of parameters of the inference function does not need to scale with $N$ for A-VI to achieve F-VI's solution, whereas the number of parameters for F-VI must scale with $N$.
We demonstrate this phenomenon across several models, including a Bayesian neural network. 
We also find that when the class of inference functions is sufficiently expressive, A-VI often converges faster than F-VI to the optimal solution. However, in some problems, the performance of A-VI may be much more sensitive to the random seed than that of F-VI.

% It is remarkable that this result holds in the full Bayesian inference case.
% Indeed, when estimating $p(\theta, \mbz \mid \mbx)$, the factor $p(z_n \mid \mbx)$, approximated by $q(z_n \s f_\phi(x_n))$, no longer depends on $x_n$ alone but on the entire data set; that is $p(z_n \mid \mbx) \neq p(z_n \mid x_n)$, a phenomenon termed \textit{partial pooling} \citep{Gelman:2013}.
% We show that, all the same, an inference function $f_\phi$, which only takes $x_n$ as its input, can close the amortization gap.

% Our theoretical analysis suggests A-VI is suboptimal when $p(\theta, \mbx, \mbz)$ does not factorize as in \Cref{eq:simple-hier}.
% We also show how, in some cases, we can expand the domain of $f_\phi$ beyond $x_n$ to close the amortization gap, where more general forms of A-VI can attain F-VI's solution.
% In other cases, such as for the hidden Markov model and the Gaussian process model, we show this strategy is not viable.

% Last we examine an example where a learnable inference function exists only if we increase the inference function's domain; accordingly, extending the input of the learning inference function improves the performance of A-VI.

% \begin{table}
%   \begin{center}
%     \begin{tabular}{c l r}
%     {\bf Method} & {\bf Inference function} & {\bf Final ELBO} \\
%     \hline
%     \rowcolor{Gray} F-VI &  & {\bf -3.24e+4}  \\
%     A-VI & Constant &  -3.670e+4 \\
%     A-VI & NN, width $k = 1$ & -3.56e+4 \\
%     A-VI & NN, width $k = 4$ & {\bf -3.24e+4} \\
%     \end{tabular}
%     \caption{\textit{Achieved ELBO when doing a full Bayesian analysis on a hierarchical nonlinear probabilistic model with $N = 10,000$ observations.
%     Because $\theta$ is not fixed, $p(z_n \mid \mbx) \neq p(z_n \mid x_n)$.
%     Nonetheless an inference network with width $k = 4$ which only takes in $x_n$ as its input closes the amortization gap between A-VI and F-VI.
%     See \Cref{sec:experiment} for details.
%     }}
%     \label{tab:gap}
%   \end{center}
% \end{table}

\paragraph{Related work.} The amortization gap has been extensively studied in the context of VAEs \citep{Hjelm:2016, Cremer:2018, Kim:2018, Marino:2018, Krishnan:2018, Kim:2021}. This paper goes beyond the VAE, seeking to understand when and how the amortization gap closes for latent variable models in general.
The accuracy of A-VI has also been studied for the computation of held-out likelihoods \citep{Shu:2018}. Our focus here, by contrast, is on using A-VI for posterior inference rather than for predictive distributions.

% We also consider doing full Bayesian inference on both the local variable $z_{1:N}$ and the global variable $\theta$.
% Existing work on VAEs typically uses a point estimator $\hat \theta$ which maximizes the (approximated) marginal likelihood $p(\mbx \mid \theta)$ \citep[e.g][Chapter 4.3]{Tomczak:2022}.
% In our theoretical analysis on VI's optimum, we can recover the point estimation setup by restricting the variational factor $q_0(\theta)$ to assign all its probability mass to $\hat \theta$.
% In this sense, the full Bayesian analysis is more general.

There has been some interest in applying A-VI to models other than standard VAEs \citep{Gershman:2014}, including dynamic VAEs \citep{Girin:2021}, latent Dirichlet allocation models \citep{Srivastava:2017}, and Bayesian hierarchical models \citep{Agrawal:2021}.
In this paper, we study latent variable models in general, rather than focus on a specific model.

% db : below is our discussion at the marshall chess club.

In some applications of A-VI, researchers have expanded the domain of the inference function beyond a single data point.
The conventional wisdom is that the inference function should take as input the same data that the exact posterior of the local variable $z_n$ depends on \citep[e.g.,][Chapter 4]{Girin:2021}.
When doing full Bayesian inference, each latent variable $z_n$ typically depends on the entire data set $\mbx$; however, we argue that it can be sufficient to pass only $x_n$ to $f_\phi$.
Hence the amortization interpolation problem provides a weaker condition than \textit{a posteriori} dependence for determining when the amortization gap can be closed.

In addition to passing more data points, it is also possible to pass latent variables to $f_\phi$, notably in hierarchical models \citep[e.g.,][]{Webb:2018, Agrawal:2021, Girin:2021}.
This strategy changes the factorization of $q(\theta, \mbz)$ and is aimed at closing the inference gap, i.e., further reducing $\text{KL}(q \, || \, p)$ towards 0 or, equivalently, increasing the evidence lower bound (ELBO), rather than at closing the amortization gap.
This type of A-VI is beyond the scope of our paper, though extending our analysis to such inference functions is feasible.
