\section{DISCUSSION}

%For an inference function which takes in $x_n$, this means the latent variable model must be a simple hierarchical model.
%More generally, closing the gap between the two algorithms amounts to solving an amortized interpolation problem.
% For standard latent variable models, the interpolation problem is solvable if and only if $p(\theta, \mbz, \mbx)$ is a simple hierarchical model (Theorem~\ref{thm:learnable}).
% This class of models encompasses many canonical models, both in Bayesian statistics and in deep learning.

We studied amortized variational inference (A-VI) as a general method for posterior approximation.  We derived a necessary, sufficient, and verifiable condition on the model $p(\theta, \mbz, \mbx)$ under which A-VI can achieve the same optimal solution as factorized (or mean-field) variational inference (F-VI). These results establish that A-VI is a viable method for a large class of hierarchical models, including when doing a full Bayesian analysis rather than using a point estimator for the global parameter $\theta$. 

We then examined how to extend the domain of the inference function for models beyond the simple hierarchical model. We also established that there are some models for which the amortization gap cannot be closed, even after expanding the domain of the inference function and no matter how expressive the inference function is.
In such models, our results can provide justification for methods such as semi-amortized VI \citep{Kim:2018, Kim:2021}, in which A-VI is used to converge quickly to a suboptimal solution, which is then refined with F-VI.

% Our analysis on extending the domain of the inference function $f_\phi$ generalizes graphical arguments used for point estimation \citep{Girin:2021}. Conceptually, the domain of $f_\phi$ can be learned using the dependence of $z_n$, conditional on $\mbx$ \textit{and} $\theta$, even when we are marginalizing over on $\theta$ (full Bayes case) rather than conditioning on $\theta$ (point estimation case).
% This can be understood as a particular manifestation of partial pooling.
% %
% When the amortization gap can be closed, we empirically find that the number of variational parameters for A-VI does not need to scale with the number of observations $N$, unlike for F-VI.

% db : more discussio at marshall

There remain several open questions about amortized variational inference.

Even when the model admits an ideal inference function, a persistent question is how to choose the class of inference functions in order to close the amortization gap.
The ordering of the variational families $\mathcal Q_\text{A} \subset \mathcal Q_\text{F}$ suggests an informal diagnostic: after A-VI converges, run a few steps of F-VI and see if the solution improves.
If it does, then the inference function may not be sufficiently expressive to close the gap.
It is also possible that the A-VI optimizer converged to a local optimum and that switching to the F-VI objective allows the solution to improve.
On the other hand, for high-dimensional problems with highly non-convex landscapes, it may take many iterations before the solution improves, in which case the proposed diagnostic would not detect shortcomings in the inference function.
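
One way to make this diagnostic concrete is sketched below. The notation is introduced here only for illustration: $\nu_0$ denotes the variational parameters of $q(\theta)$, $\nu_n$ those of $q(z_n)$, and $\mathcal L$ the evidence lower bound (ELBO) maximized by both algorithms.
Let $(\nu_0^\star, \phi^\star)$ be the solution returned by A-VI, and initialize F-VI at
\[
  \nu^{(0)} = \big(\nu_0^\star, f_{\phi^\star}(x_1), \dots, f_{\phi^\star}(x_N)\big),
\]
which is a valid starting point because $\mathcal Q_\text{A} \subset \mathcal Q_\text{F}$. If the domain of the inference function has been extended, $x_n$ is replaced by the corresponding input.
After $T$ steps of F-VI, which yield $\nu^{(T)}$, one can monitor
\[
  \Delta_T = \mathcal L\big(\nu^{(T)}\big) - \mathcal L\big(\nu^{(0)}\big).
\]
A value of $\Delta_T$ that clearly exceeds the Monte Carlo error of the ELBO estimate would suggest that the amortization gap has not been closed; a value near zero is inconclusive, for the reasons given above.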

% This could also hint at other problems, and we should keep in mind that the variational objective is usually non-convex and so an optimizer may only find a local optimum.

% db : the comment about the local optimum is a little vague. "other problems"? "keep in mind"?

A related question: For a choice of the class of inference functions, how does A-VI change the optimization landscape relative to F-VI?
Our experimental results also raise the question of whether an overparameterized class of inference functions burdens the optimization, as seen for the linear probabilistic model, or improves the convergence rate, as seen for the Bayesian neural network and the saw time series.
Along similar lines, it is of interest to study the advantages and drawbacks of A-VI on more complex data sets than the ones we have considered, particularly in cases for which convergence may not be achieved within a reasonable computational budget.

% the role of the class of inference functions. even when the model admits an ideal inference function, how expressive does the class need to be to close the amortization gap.  a related question: how does A-VI, including the choice of inference function, change the optimization landscape? in practice, when we do VI, we solve a difficult non-convex optimization problem to a local optimum. Our empirical studies here suggest that A-VI might improve the landscape othat optimization.  one diagnostic: run F-VI from A-VI's solution and see if the solution improves. and what about the role of overparameterization?

% a second open area is around prediction...


% A direction for future work is to better understand the optimization paths induced by different choices of inference functions.
% When does a more expressive inference class burden the optimization, as seen for the linear probabilistic model, and when does overparameterization lead to more efficient optimization, as in the case of the nonlinear probabilistic model?
% A more in-depth analysis could help us understand how to choose the class of inference functions, which remains an important design choice for A-VI.

% db : i don't love the word "believe" below (but i don't have an alternative).

\begin{figure}
 \centering
 \includegraphics[width=5.5cm]{figures/Elbo_time_time_series.pdf}
 \caption{\textit{Optimization path for saw time series.
 An inference network which only takes in $x_n$ as its input ($i = 1$ case) cannot close the amortization gap, even when using a relatively large network.
 On the other hand, a network which takes in $x_{n - 1}, x_n$ ($i = 2$ case) closes the gap with a relatively small network.}}
\label{fig:saw_time}
\end{figure}

Finally, how accurate is A-VI when applied to held-out data? We expect that the \textit{generalization gap} \citep{Shu:2018, Ganguly:2022} can also be analyzed by setting up an implicit interpolation problem, this time with constraints that prevent overfitting the data.
In a full Bayesian context, how can we understand the role of A-VI for online learning? In other words, how well can $f_\phi(x_{N + 1})$ approximate $p(z_{N + 1} \mid \mbx, x_{N + 1})$?
