%\documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023}
\usepackage{adjustbox}
\usepackage{tikz,graphics,color,fullpage,float,epsf,caption,subcaption}
%\usepackage[pdftex]{graphicx} 
\usepackage{sidecap}
\usepackage{amsmath,amsfonts}

\usepackage{hyperref}       % hyperlinks
\hypersetup{
    colorlinks,
    citecolor=blue,
    filecolor=blue,
    linkcolor=blue,
    urlcolor=black,
}

\newcommand{\shrink}[1]{}
\newcommand{\fromsakshi}[2][]{#1\frombody{red}{Sakshi}{#2}}
\newcommand{\fromali}[2][]{#1\frombody{blue}{Ali}{#2}}
\newcommand{\KL}{\text{KL}}
\usepackage{colortbl}
\newcommand{\frombody}[3]{
	\noindent
	\textcolor{#1}{
		{$\bf [\!\![\!\![$}\underline{\scshape{#2}}
		%{\scshape says:} 
		\textsl{#3}{$\bf ]\!\!]\!\!]$}}
	{}}
\usepackage{colortbl}
\usepackage[round]{natbib}
\renewcommand{\bibname}{References}
\renewcommand{\bibsection}{\subsubsection*{\bibname}}

% If you use BibTeX in apalike style, activate the following line:
\bibliographystyle{apalike}


% \documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
   % \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{A Decoder Suffices for Query-Adaptive Variational Inference}
%\title{Query-Adaptive Variational Inference for Deep Generative Models} %Enables Missing Data Imputation\\in Deep Generative Models

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1*]{\href{mailto:<sakshia1@uci.edu>}{Sakshi Agarwal}{}}
\author[1*]{Gabriel Hope}
\author[1]{Ali Younis}
\author[1]{Erik B. Sudderth}
% Add affiliations after the authors
\affil[1]{%
    Department of Computer Science\\
    University of California, Irvine
}
\affil[*]{%
    Equal contribution
}
%\affil[2]{%
%    Second Affiliation\\
%    Address\\
%    …
%}
%\affil[3]{%
%    Another Affiliation\\
%    Address\\
%    …
%  }

% Reduce spacing around floats 
\setlength{\floatsep}{10pt plus 4pt minus 2pt}
\setlength{\textfloatsep}{10pt plus 4pt minus 2pt}

\begin{document}

% Reset vertical space for equations 
% (must be after \begin{document})
\setlength{\abovedisplayskip}{2pt plus 3pt}
\setlength{\belowdisplayskip}{2pt plus 3pt}

\maketitle

\begin{abstract}
    Deep generative models like variational autoencoders (VAEs) are widely used for density estimation and dimensionality reduction, but infer latent representations via amortized inference algorithms, which require that all data dimensions are observed. VAEs thus lack a key strength of probabilistic graphical models: the ability to infer posteriors for test queries with arbitrary structure.
    %Prior approaches to imputation with VAEs include pre-processing heuristics, Monte Carlo samplers, and the learning of new conditional models. 
    We demonstrate that many prior methods for imputation with VAEs are costly and ineffective, and achieve superior performance via query-adaptive variational inference (QAVI) algorithms based directly on the generative decoder.  By analytically marginalizing arbitrary sets of missing features, and optimizing expressive posteriors including mixtures and density flows, our non-amortized QAVI algorithms achieve excellent performance while avoiding expensive model retraining.  
    On standard image and tabular datasets, our approach substantially outperforms prior methods in the plausibility and diversity of imputations.  We also show that QAVI effectively generalizes to recent hierarchical VAE models for high-dimensional images.
\end{abstract}

\section{Introduction}\label{sec:intro}

Structured probabilistic models, like graphical models~\citep{koller2009probabilistic}, are the basis for many machine learning applications.  Formalizing the data generation process enables incorporation of domain knowledge and allows unsupervised learning from unlabeled data.  Probabilistic models also enable diverse inference queries by users, for instance \emph{imputation} queries (or, inpainting in the image domain) where arbitrary subsets of features are observed and the values of missing features are predicted.  %where arbitrary subsets of features may be predicted given novel missing data patterns.
%These conditional queries are often a configuration of some variables which are observed and some which are unobserved or missing.  
A number of inference algorithms~\citep{pearl88,koller2009probabilistic} have been developed for models with discrete or Gaussian latent variables, which efficiently compute marginals of query variables given heterogeneous observations.   While exact inference is intractable for many complex models,
%simulation-based Monte Carlo methods~\citep{andrieu03} and 
optimization-based variational methods~\citep{wainwright08} often provide effective approximations.
%It is quite intuitive to perform conditional probability queries of the form $p(x_M|x_O)$ for different configurations of evidence ($x_O$) and query $x_M$ in VI. 

%Variational inference (VI)~\citep{journals/ml/JordanGJS99, wainwright08}  chooses a family of tractable surrogates, parameterized by free variational parameters to approximate the intractable conditional densities in a graphical model. 
Variational bounds were classically optimized via \emph{coordinate ascent variational inference} (CAVI,~\citet{journals/ml/JordanGJS99}) algorithms that iteratively update posterior approximations for individual variables. %while holding all others fixed to their current values.
CAVI updates are effective for many parametric models composed from conjugate priors, and can have efficient message-passing implementations~\citep{ghahramani01,winn05}. However, CAVI updates are based on integrals that often lack closed forms, requiring Monte Carlo approximations~\citep{Paisley2012VariationalBI,kucukelbir2017automatic} of uncertain quality.

Stochastic subsampling of data helps scale variational learning to big datasets~\citep{hoffman2013stochastic}, but iterative CAVI updates may still be slow for complex models. 
\emph{Amortized} variational inference \citep{10.5555/3044805.3045092}  seeks to boost training efficiency by determining variational posteriors via an inference (or recognition) network, which is shared (or amortized) across many similar inference tasks. \emph{Variational autoencoders} (VAEs,~\citet{Kingma2014,pmlr-v32-rezende14})
are deep generative models that utilize amortized inference to
jointly train a generative ``decoder'' and inference ``encoder''. Sophisticated generalizations to the encoder and decoder
networks~\citep{kingma2019introduction,sonderbyLadderVariationalAutoencoders2016, vahdat2020NVAE_nvae, child2021very} have produced hierarchical VAEs that realistically model complex image data via dozens of stochastic layers.

%show practically how we do non-amortized with deep generative models, show a gap, show amortization fails for these tasks and the right way to do it would be the classical inference approach

While amortized inference has enabled the learning of impressive deep generative models, it sacrifices the flexibility of CAVI to handle arbitrary inference queries.  Because VAEs are typically trained from fully-observed data, the encoder assumes \emph{complete} and \emph{uncorrupted} observation of every data dimension (e.g., pixel).  Simple heuristics (see Sec.~\ref{sec:background}) are sometimes used for learning VAEs with missing data \citep{pmlr-v97-mattei19a, Nazbal2020HandlingIH, collier2020vaes}, such as filling missing features with 
zeros. However, we show that in addition to requiring expensive encoder retraining for peak performance, these approaches are inaccurate unless the test inference queries are simple or known in advance (during model training). 
%
Amortized inference also produces sub-optimal variational bounds, 
and this ``amortization gap'' may be significant~\citep{cremer2018inference, Krishnan2017OnTC}. 
While quick-and-approximate inference may be sufficient to provide a noisy gradient signal in the midst of a long training process, it is problematic when applied to test queries, especially in domains like medicine where accurate uncertainty quantification is crucial.
%On the other hand, classical variational inference algorithms were designed to naturally handle queries where some dimensions are missing or noisy, a crucial requirement for probabilistic models deployed in domains from medicine to collaborative filtering [cite some?], but amortized inference is less flexible. 
 %Other semi-amortized works employ this idea but have not been studied in this context

%Moreover, existing methods to tackle conditional queries in VAEs at test time are limited and fail to address to this issue, it it exists. Previous works from \cite{pmlr-v97-mattei19a} and \citep{Nazbal2020HandlingIH}  train a VAE on conditional query formulations, however once trained, the performance of such VAEs are poor when varying queries are made at test time. %Other training methods for VAEs  might \emph{only} allow for missing data completely at random, a strong assumption to generalize to other random queries.  
% Conditional VAEs (CVAEs) \cite{NIPS2015_8d55a249}, propose a general framework to design VAEs to fit incomplete heterogeneous data, handling only missing data \emph{completely} at random. 
%Some prior works (see Sec.~\ref{sec:background}) use simple heuristics such as filling missing features with zeros or re-tuning encoder with missing data.  %~\citep{pmlr-v97-mattei19a,Collier2020VAEsIT,Nazbal2020HandlingIH} 
%Unfortunately, these heuristics are inaccurate unless the data and missing feature patterns are very simple. %, we show in our experiments that these approaches are  .


\shrink{
\begin{SCfigure}%[t!]
\includegraphics[width=0.5\linewidth]{plots/fill0s-random.pdf}
  \caption[]
{\protect\rule{0ex}{5ex} \small{\bf Left} shows an image from MNIST with $50\%$ pixels masked out, filled with $0$s. {\bf Right} shows completions from a pre-trained VAE.
} 
\label{fig:imputations-fill0s-random}
\end{SCfigure}
}

To address these challenges, we propose \emph{query-adaptive variational inference} (QAVI) methods that approximate the posterior of missing data with arbitrary patterns, given only a trained generative decoder and sparse observations.  Our QAVI approach has the same inferential flexibility as classic CAVI algorithms, and critically does not require a database with many examples of the missing-data pattern of interest.  But unlike classic CAVI algorithms, QAVI is applicable to any (differentiable) model with continuous latent variables, including deep generative models like hierarchical VAEs.
While some prior work has improved VAE training by reducing amortization gaps~\citep{pmlr-v80-kim18e,pmlr-v80-marino18a}, our application of non-amortized inference to missing-data queries for deep generative models is novel. 

We begin in Sec.~\ref{sec:background} by reviewing prior work on handling missing data with (hierarchical) VAEs.  Sec.~\ref{sec:variationalFeat} then develops QAVI algorithms that optimize variational bounds for the missing feature values, rather than filling them via greedy heuristics.  In Sec.~\ref{sec:nonamortized}, we develop an alternative QAVI algorithm  that directly optimizes the posterior of the latent data encoding, without amortization. Doing this allows exact marginalization of missing feature values, and enables flexible posterior approximations including mixture models~\citep{10.5555/308574.308663,DBLP:conf/icml/GershmanHB12} and normalizing flows~\citep{pmlr-v37-rezende15}. 
Results in Sec.~\ref{sec:resultVAE} then show substantial qualitative and quantitative improvements in capturing multimodal posterior uncertainty for VAE models of tabular data, as well as state-of-the-art hierarchical VAE models of images~\citep{child2021very}. 
%Here, we also show that our approach generalizes to state-of-the-art hierarchical VAE models of images~\citep{child2021very}. %Our results suggest that our method enables accurate  \emph{and} flexible inference queries for VAEs. 
%We show that in addition to requiring expensive encoder retraining for peak performance, these approaches are inaccurate unless the data and missing feature patterns are very simple.
%Finally, we conclude in Sec.~\ref{sec:resultDeep}. 
%Our experiments suggest that performing variational inference localised to a data point helps represent the uncertainty arisen by missing data. Our contributions are: a) we investigate learning local variational distributions over missing data at `test time' and derive an ELBO for it, b) we study different choices of posterior approximation and it's capability to  represent a multi-modal distribution over missing variables, c) we demonstrate that some parametrized distributions improve expressiveness of the approximation and play a significant role in generating multiple hypothesis for missing data.  



\section{Background and Related Work}
\label{sec:background}
\shrink{

\subsection{Variational Inference}
Consider the following generative model for $x$,
\begin{equation}
 \\
z \sim p(z), \qquad x \sim p_\theta(x \mid z). %= \phi(X|f_\theta(z)).
\end{equation}

where the unobserved $d$ dimensional random vector $z \in R^d$ follows a prior distribution $p(z)$.
}
%\pagebreak
%\pagebreak
\subsection{The Variational Autoencoder (VAE)}
VAEs model the distribution of typically high-dimensional observed data $x$ using continuous, lower-dimensional latent variables $z$, via the following generative model:
\begin{equation}
z \sim p(z), \qquad x \sim p_\theta(x \mid z). %= \phi(X|f_\theta(z)).
\end{equation}
For standard VAEs, the latent code $z \in \mathbb{R}^d$ has a factorized Gaussian prior $p(z)$.
%The unobserved $d$ dimensional random vector $z \in R^d$ follows a prior distribution $p(z)$, typically a simple factorized Gaussian. %or could partition $z$ to $L$ disjoint groups $z = \{z_1, z_2, ... z_L \}$ to increase the expressiveness of the generative model. The latter case, known as Hierarchical VAE~\citep{sonderbyLadderVariationalAutoencoders2016, 10.5555/3454287.3454545} then model $p(z) = p(z_1) \prod_{l=2}^{L} p(z_l|z_{<l})$.  
Given $z$, data is generated via a \emph{decoder} (deep) neural network with weights $\theta$. The decoder maps $z$ to a likelihood $p_\theta(x|z)$ such as a factorized Gaussian. 
%or a distribution for constrained data like the continuous Bernoulli~\citep{10.5555/3454287.3455477}.

%\subsubsection{Ammortized Variational Inference}
The VAE log-likelihood 
$\log p_\theta(x) = \log \int p_\theta(x|z)p(z)\,dz$ 
%$\log p_\theta(x) = \log \int_{R^d} p_\theta(x|z)p(z)dz$ 
is intractable. Learning thus typically employs \emph{amortized} VI, where an \emph{encoder} with parameters $\phi$ approximates the posterior over latent codes $q_\phi(z|x) \approx p_\theta(z|x)$. We jointly learn $\theta,\phi$ by maximizing the \emph{evidence lower-bound} (ELBO):
\shrink{
\begin{align}
    \mathcal{L}(\theta, \phi;x) &= E_{q_\phi(z|x)} [\log p_\theta(x,z) - \log q_\phi(z|x)] 
\label{eq:ELBO-start} \\
   &= E_{q_\phi(z|x)} [\log p_\theta(x|z)] - \KL(q_\phi(z|x)||p(z|x)).
 \label{eq:ELBO}
\end{align}
}
\begin{equation}
    \mathcal{L}(\theta, \phi;x) = E_{q_\phi(z|x)} [\log p_\theta(x|z)] - \KL(q_\phi(z|x)||p(z)).
\label{eq:ELBO}
\end{equation}
Here, $\KL$ is the Kullback--Leibler divergence, calculated analytically when $q_\phi(z|x)$ and $p(z)$ are both Gaussian. %whenever possible and otherwise approximated using Monte Carlo sampling. 
The ELBO %$\mathcal{L}(\theta, \phi;x)$ 
provides a lower bound on the evidence $\log p_\theta(x)$ that is tight when the variational posterior $q_\phi(z|x)$ is exact.
The expectation in Eq.~\eqref{eq:ELBO} can be approximated via Monte Carlo samples from $q_{\phi}(z|x)$.
Gradients with respect to $\theta, \phi$ can then be estimated by the reparameterization ``trick'' of sampling from $q_\phi(z|x)$ via linear transforms of standard normal variables~\citep{Kingma2014,pmlr-v32-rezende14}.  %Inference for a new fully-observed data point $\hat x$ similarly involves computing the posterior $q_\phi(z|\hat x)$ via the encoder. %Hence, VAEs cannot readily handle missing data because the trained encoder is a function of fully-observed data. 
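
As a minimal sketch of this estimator, assume a diagonal Gaussian encoder $q_\phi(z|x) = \mathcal{N}(z \mid \mu_\phi(x), \mathrm{diag}(\sigma_\phi^2(x)))$ and a standard normal prior $p(z) = \mathcal{N}(z \mid 0, I)$. The symbols $\mu_\phi$, $\sigma_\phi$, $\epsilon^{(s)}$, and the sample count $S$ are illustrative notation rather than part of the model definition above. Writing $\mu_j, \sigma_j$ for the components of $\mu_\phi(x), \sigma_\phi(x)$, Eq.~\eqref{eq:ELBO} may then be estimated as
\begin{align*}
z^{(s)} &= \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon^{(s)}, \qquad \epsilon^{(s)} \sim \mathcal{N}(0, I), \\
\mathcal{L}(\theta,\phi;x) &\approx \frac{1}{S}\sum_{s=1}^{S} \log p_\theta(x|z^{(s)}) - \KL(q_\phi(z|x)||p(z)), \\
\KL(q_\phi(z|x)||p(z)) &= \frac{1}{2}\sum_{j=1}^{d} \big( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \big).
\end{align*}
Because each $z^{(s)}$ is a deterministic, differentiable function of $\phi$ and $\epsilon^{(s)}$, gradients pass through the samples to both $\theta$ and $\phi$.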

%Since the encoder operates on fully-observed data, handling VAEs with missing data  it is not straightforward to ha
%how vae requires all observations. In case missing, retunign enocder, collier + posterior matching retraines an encoder to handle missing data.. 

\subsection{Hierarchical VAEs}

Hierarchical VAEs (HVAEs,~\citet{sonderbyLadderVariationalAutoencoders2016, 10.5555/3454287.3454545}) %are another family of probabilistic latent variable models that 
extend the VAE by partitioning the latent code into $L$ disjoint groups $z = (z_1, z_2, \ldots, z_L)$, increasing model expressiveness for complex data like images~\citep{vahdat2020NVAE_nvae, child2021very}. %Further, these models employ hierarchies of conditionally dependent latent variables. 
HVAEs generate these stochastic codes sequentially, giving the joint distribution $p_\theta(x, z) = p_\theta(z_1) (\prod_{\ell=2}^{L} p_\theta(z_\ell|z_{<\ell})) p_\theta(x|z_L)$, with a similarly structured encoder: $q_\phi(z|x) = q_\phi(z_1|x) \prod_{\ell=2}^{L} q_\phi(z_\ell|z_{<\ell}, x)$. Each conditional in the decoder, $p_\theta(z_\ell|z_{<\ell})$, and in the encoder, $q_\phi(z_\ell|z_{<\ell}, x)$, is typically Gaussian with mean and variance determined by (non-linear) neural networks.
%
The HVAE ELBO equals
\begin{multline}
\!\!\!\!\!\mathcal{L}_{H}(\theta, \phi; x) =  E_{q_\phi(z|x)} [\log p_\theta(x|z)] - \KL(q_\phi(z_1|x)||p_\theta(z_1)) \\
  - \sum_{\ell=2}^L E_{q_\phi(z_{<\ell}|x)} [\KL(q_\phi(z_\ell|z_{<\ell}, x)||p_\theta(z_\ell|z_{<\ell}))],
\label{ELBO-H}
\end{multline}
where $q_\phi(z_{<\ell}|x) = \prod_{i=1}^{\ell-1} q_\phi(z_i|z_{<i}, x)$ is the approximate posterior up to latent group $(\ell-1)$. 
Reparameterization is then used to provide Monte Carlo gradient estimates. %Again, the reparameterization trick is used to train the network parameters.
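
To see how Eq.~\eqref{ELBO-H} relates to the standard bound of Eq.~\eqref{eq:ELBO}, the following sketch applies the chain rule for the KL divergence to the sequential factorizations of the encoder and the prior. Here we write $p_\theta(z) = p_\theta(z_1)\prod_{\ell=2}^{L} p_\theta(z_\ell|z_{<\ell})$ for the hierarchical prior, a shorthand introduced only for this derivation:
\begin{align*}
&\KL(q_\phi(z|x)||p_\theta(z)) \\
&\; = E_{q_\phi(z|x)}\Big[\log \frac{q_\phi(z_1|x)}{p_\theta(z_1)} + \sum_{\ell=2}^{L} \log \frac{q_\phi(z_\ell|z_{<\ell},x)}{p_\theta(z_\ell|z_{<\ell})}\Big] \\
&\; = \KL(q_\phi(z_1|x)||p_\theta(z_1)) \\
&\;\quad + \sum_{\ell=2}^{L} E_{q_\phi(z_{<\ell}|x)}\big[\KL(q_\phi(z_\ell|z_{<\ell},x)||p_\theta(z_\ell|z_{<\ell}))\big].
\end{align*}
When the conditionals are Gaussian, each inner KL term has a closed form given $z_{<\ell}$, so only the outer expectations over $z_{<\ell}$ require sampling.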

We can rewrite the conditional prior and approximate posterior for layer $\ell$ to make the set of relevant networks explicit: %In this form the conditional distributions are specified as:
\begin{align}
\label{eq:HVAEnet}
    p_{\theta}(z_\ell|z_{<\ell}) &= \mathcal{N}(z_\ell|\ \mu_{\theta_\ell}(z_{<\ell}),\ \sigma_{\theta_\ell}(z_{<\ell})), \\
    q_\phi(z_\ell|z_{<\ell}, x) &= \mathcal{N}(z_\ell|\ \mu_{\phi_\ell}(f_{\phi_\ell}(x), g_{\phi_\ell}(z_{<\ell})),\ \sigma_{\phi_\ell}(...)). \notag%f_{\phi_l}(x), g_{\phi_l}(z_{<l}))). 
\end{align}
Here, $f_{\phi_\ell}$ and $g_{\phi_\ell}$ are networks that extract feature representations of the observation $x$ and the previous layers $z_{<\ell}$, respectively. These features determine the mean and scale of the conditional Gaussian posterior via $\mu_{\phi_\ell}$, $\sigma_{\phi_\ell}$. 
%are other networks that  generate the mean and scale of the conditional Gaussian posterior given the features extracted by $f_{\phi_l}$ and $g_{\phi_l}$. 
Networks $\mu_{\theta_\ell}$, $\sigma_{\theta_\ell}$ similarly generate the prior parameters for layer $\ell$.


\subsection{Methods for Inferring Missing Data}
After training, synthetic data may be easily generated from $p_\theta(x)$ by sampling $z$ from the learned VAE or HVAE prior, and $x \sim p_\theta(x|z)$. However, the learned encoder does \emph{not} provide a direct mechanism for conditional queries about missing features, given partial or corrupted test data.

 %Throughout the paper, we consider this case where the parameters of a the trained encoder/decoder ($\theta, \phi$) are fixed. 
Let $x = (x_O,x_M)$ be a test data point with observed features $x_O$ and missing features $x_M$.  We are particularly interested in scenarios where the specific feature dimensions that are missing and observed will vary across test instances.
Given a trained (hierarchical) VAE, for which $x_O$ and $x_M$ are conditionally independent given $z$, missing data is optimally predicted via the conditional distribution:
%
\begin{equation}
    p_\theta(x_M \mid x_O) = \int p_\theta (x_M \mid z) p_\theta(z \mid x_O) \;dz.
\label{eq:sampling}
\end{equation}
%
This approach, like QAVI, is valid as long as data is \emph{missing-at-random} (MAR,~\citet{little2019statistical,pmlr-v97-mattei19a}); that is, the mechanism that removes features must be independent of the missing values $x_M$ given the observed values $x_O$.
Like most related work, our experiments use \emph{missing-completely-at-random} (MCAR) feature masks whose distribution is independent of $x$.
%
Exactly evaluating the predictive distribution~\eqref{eq:sampling} is infeasible due to the non-linear decoder and the intractable code posterior.  %A number of approximations have thus been proposed to address this problem.
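
For completeness, the identity in Eq.~\eqref{eq:sampling} follows directly from the conditional independence of $x_O$ and $x_M$ given $z$, as the following short sketch shows:
\begin{align*}
p_\theta(x_M \mid x_O) &= \int p_\theta(x_M \mid z, x_O)\, p_\theta(z \mid x_O) \;dz \\
&= \int p_\theta(x_M \mid z)\, p_\theta(z \mid x_O) \;dz.
\end{align*}
If samples $z^{(1)}, \ldots, z^{(S)}$ from some approximation to $p_\theta(z \mid x_O)$ were available (the approximation and the sample count $S$ are assumptions of this sketch), the predictive distribution could be estimated by the mixture
\begin{equation*}
p_\theta(x_M \mid x_O) \approx \frac{1}{S} \sum_{s=1}^{S} p_\theta(x_M \mid z^{(s)}).
\end{equation*}
Moreover, when the likelihood factorizes across features, as for a factorized Gaussian decoder, the missing features can be marginalized analytically by dropping their likelihood terms:
\begin{equation*}
p_\theta(x_O \mid z) = \int p_\theta(x_O, x_M \mid z) \;dx_M = \prod_{i \in O} p_\theta(x_i \mid z).
\end{equation*}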

\textbf{Heuristic Preprocessing.\,} Imputation heuristics, such as replacing missing features with statistical summaries like their mean or mode, are widely used.  %In practice, observations with missing values may also be simply removed altogether. 
Some work on training VAEs given partially missing data~\citep{pmlr-v97-mattei19a, Nazbal2020HandlingIH, collier2020vaes} proposes a \emph{Fill-Zeros} heuristic, simply replacing missing features with zeros as input to the VAE encoder.
While Fill-Zeros may be effective for learning models of MNIST digits (where zeros are common) when pixels are missing uniformly at random, our experiments show that in even slightly more complex scenarios, its performance is very poor.

\textbf{Monte Carlo methods.\,} \citet{pmlr-v32-rezende14} first proposed a simple scheme to approximately sample from $p_\theta(x_M \mid x_O)$ by starting with a random imputation, which is then stochastically encoded and decoded (or autoencoded) several times. 
%At each iteration of their \textbf{pseudo-Gibbs} sampler, a latent code is sampled from $q_\phi(z \mid x_M,x_O)$ given a current hypothesis for the missing features $x_M$, and then the missing features are resampled from $p_\theta(x_M \mid z)$. 
Because the encoder only approximates the true posterior distribution $p_\theta(z\mid x)$, this \emph{pseudo-Gibbs} sampler will not sample from the true posterior of missing features, and encoder inaccuracies may cause its equilibrium distribution to be far from $p_\theta(x_M \mid x_O)$. This approach was improved by~\citet{NEURIPS2018_0609154f}, who proposed a Metropolis-Hastings correction to each step of the pseudo-Gibbs sampler, inducing a \emph{Metropolis-in-Gibbs} sampler that asymptotically samples from $p_\theta(x_M \mid x_O)$. While Metropolis-in-Gibbs converges to the true posterior, it does so at a rate that may be impractically slow. %The dependence of these Monte Carlo methods on an inexact amortized posterior approximation $q_\phi(z|x)$ and it's dependency on fully-observed data lead to significant disadvantages.
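
One iteration of each sampler may be sketched as follows, where the iteration index $t$, the proposal $\tilde{z}$, and the completed data point $\hat{x}^{(t)} = (x_O, x_M^{(t)})$ are notation introduced here for illustration. Pseudo-Gibbs alternates the two sampling steps
\begin{equation*}
z^{(t)} \sim q_\phi(z \mid \hat{x}^{(t-1)}), \qquad x_M^{(t)} \sim p_\theta(x_M \mid z^{(t)}).
\end{equation*}
Under our reading of the method, Metropolis-in-Gibbs instead treats the encoder sample $\tilde{z} \sim q_\phi(z \mid \hat{x}^{(t-1)})$ as an independence proposal targeting $p_\theta(z \mid \hat{x}^{(t-1)})$, and accepts it with probability $\min(1, \rho_t)$, where
\begin{equation*}
\rho_t = \frac{p_\theta(\hat{x}^{(t-1)} \mid \tilde{z})\, p(\tilde{z})}{p_\theta(\hat{x}^{(t-1)} \mid z^{(t-1)})\, p(z^{(t-1)})} \cdot \frac{q_\phi(z^{(t-1)} \mid \hat{x}^{(t-1)})}{q_\phi(\tilde{z} \mid \hat{x}^{(t-1)})}.
\end{equation*}
If the proposal is rejected, $z^{(t)} = z^{(t-1)}$; in either case the missing features are then resampled from $p_\theta(x_M \mid z^{(t)})$. The ratio $\rho_t$ stays close to one only when the encoder closely approximates the true code posterior, which suggests why mixing can be slow when it does not.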


%expensive to re-train on test data, pm -> train once but performance significantly depends on the distribution it is trained on and tested on 

\textbf{Amortized Inference for Imputation.\,} %Some works, instead design encoders to allow for partially-observed data as inputs. This introduced the
Heuristic preprocessing may be avoided by learning a new ``partial'' encoder approximating $p_\theta(z\mid x_O)$. 
\citet{collier2020vaes} %trains an additional encoder on 
concatenates zero-filled data with a binary mask indicating missing features, and optimizes the standard VAE ELBO of Eq.~\eqref{eq:ELBO}, but with the log-likelihood term calculated on only the observed data $x_O$. While this approach was developed to \emph{train} VAEs with missing data, it is trivially applicable to test-time missing data imputation by retraining the encoder with the masked test data, generating a \emph{Re-tuned Encoder}. %before evaluation. 
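
In symbols, this objective may be sketched as follows, writing $m \in \{0,1\}^D$ for the binary mask of observed features among $D$ data dimensions and $\tilde{x} = m \odot x$ for the zero-filled input. This notation is assumed here for illustration and is not taken from the cited work:
\begin{equation*}
\mathcal{L}(\theta, \phi; x_O) = E_{q_\phi(z|\tilde{x}, m)} [\log p_\theta(x_O|z)] - \KL(q_\phi(z|\tilde{x}, m)||p(z)).
\end{equation*}
For a factorized likelihood, $\log p_\theta(x_O|z) = \sum_{i:\, m_i = 1} \log p_\theta(x_i|z)$, so the bound is a lower bound on $\log p_\theta(x_O)$ that never evaluates the likelihood of the zero-filled entries.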

\emph{Posterior Matching}~\citep{strauss2022posterior} instead artificially masks the complete training data $x$ as $x_O$, and tunes a partial-encoder $q_\psi(z \mid x_O)$ to ``match'' the pre-trained encoder by maximizing $ E_{z \sim q_\phi(z|x)}[\log q_\psi(z \mid x_O)]$. In concurrent work, \citet{harvey2022conditional_conditional_image_generation} introduced an equivalent approach, calling it \emph{Inference in a Pretrained Artifact} (\emph{IPA}).  %These methods ensure the capability of a single training of the inference network to handle missing queries on the fly. %However, our experiments in Sec \ref{sec:resultVAE} suggest that its inference capability is heavily dependent on the masking distribution it is trained on. 
\citet{ivanov2018variational} optimize the partial-encoder as in Posterior Matching, but simultaneously retrain the encoder and decoder to produce a conditional model $p(x_M \mid x_O)$. Their model includes skip connections between the partial-encoder and decoder networks, as in the HVAE feature representations $f_{\phi_\ell}$ in Eq.~\eqref{eq:HVAEnet}. We adapt their \emph{VAEAC} (arbitrarily conditioned) training to HVAEs by fine-tuning all three networks, starting from a pre-trained HVAE. 
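
As a brief sketch of why the Posterior Matching objective ``matches'' the two encoders, note that for fixed $\phi$,
\begin{multline*}
E_{q_\phi(z|x)}[\log q_\psi(z \mid x_O)] = \\ -\KL(q_\phi(z|x)||q_\psi(z \mid x_O)) - H[q_\phi(z|x)],
\end{multline*}
where $H[\cdot]$ denotes differential entropy (notation introduced here). Since the entropy term does not depend on $\psi$, maximizing this objective over $\psi$ is equivalent to minimizing the KL divergence from the full encoder to the partial-encoder, averaged over the training data and the masks used to construct $x_O$.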

%Finally, \citet{ivanov2018variational} propose the \emph{Variational Autoencoder with Aribitrary Conditioning} (\emph{VAEAC}). This model optimizes a similar loss to Posterior Matching, but also simultaneously trains the generative model to produce a conditional model $p(x_M|x_O)$. The original VAEAC architecture includes explicit skip connections between the partial encoder and decoder networks. In our experiments we adapt VAEAC to HVAEs that include similar skip connections, allowing for a level comparison to other approaches based on the same architecture. We apply the VAEAC training procedure at test-time missing data imputation by fine-tuning the encoder, partial encoder and decoder networks.  

%approximate Eq. \ref{eq:sampling} by training an additional "partial" encoder (Gabe, what eq. is optimized?), along with fine-tuning the decoder $p_\theta(x_M \mid z)$. 


These amortized approaches to missing data imputation with VAEs have several drawbacks compared to our QAVI method. \emph{1)} They incur a substantial initial overhead for training the partial-encoder. \emph{2)} Partial-encoder training requires access to a relatively large set of partially-observed examples, and/or continued access to the training set along with a known missing-feature distribution. \emph{3)} They are sensitive to shifts in the distribution of queries (missing-feature patterns) between training and evaluation. If evaluated on previously unseen missing-feature patterns, performance may suffer substantially (see Fig.~\ref{fig:iwae_plot_1000}). \emph{4)} As we will demonstrate, even without distribution shifts, performance can be sub-optimal.
%is poor compared to non-amortized VI. In this work we show that closing the amortization gap with QAVI leads to more effective and robust imputation and is better choice where applicable.  
\shrink{
These amortized approaches to missing queries with VAEs have several distinct drawbacks. They require a prior knowledge of the query distribution during train time and are sensitive to shifts in the query patterns between training and test time. We show in fig. \ref{fig:iwae_plot_1000} that if evaluated on previously unseen query patterns, performance may suffer substantially. Optimal performance may require re-training of the partial encoder with large amounts of data when new queries are encountered at test time. %While doing this still leads to inaccuracies, it also incurs a substantial computational overhead, especially in the case of deep hierarchical VAEs, and  increase in the number of trained models for such tasks. This is inefficient algorithm designing for such deep learning models in the domain of graphical models. 
}

%In this work we instead develop systematic ways to perform inference with pre-trained VAEs without the need to differentiate between queries at test time, while also customizing inference for different queries across the test set. We show that our Query-Adaptive Variational Inference, \emph{QAVI} is a more effective and robust design strategy for query adaptation and is a better choice where applicable.  

%it  is per-observa

%Such amortized inference procedures are restrictive and the approximated posteriors are still constrained to be Gaussian. Reversing this choice of AVI for answering general queries at test time, we will below discuss QAVI. %In section --, we show empirically that our non-amortized approaches capture the true posterior, notably its multi-modal nature.

%problem : diverse (multiple imputations) and valid (a correct reconstruction, samples from the posterior not contradicting the observed pixels, etc..), 

\begin{figure}[t]
    \centering
   \includegraphics[width=1.03\linewidth]{plots/fig1.pdf}
\caption{ Overview of Feat.~QAVI (\emph{a}, Sec.~\ref{sec:variationalFeat}) and non-amortized QAVI (\emph{b}, Sec.~\ref{sec:nonamortized}). \emph{Top:} Generative model for unobserved variables $x_M, z$ and observed data $x_O$ (shaded). We show two possible inference models for the joint variational distribution $q_\lambda(x_M, z)$ on latent variables. In (\emph{a}), missing features are modeled directly by specifying $q_\lambda(x_M)$. In (\emph{b}), queries are modeled indirectly via a tunable latent code distribution $q_\lambda(z)$. QAVI is easily adapted to existing VAEs by reusing generative (decoder) networks $p_\theta$, and possibly inference (encoder) networks $q_\phi$. \emph{Bottom:} Computational flow of QAVI for each posterior factorization. In Feat.~QAVI (\emph{a}), samples from $q_\lambda(x_M)$ are autoencoded by the pre-trained VAE to compute the ELBO. In non-amortized QAVI (\emph{b}), samples from $q_\lambda(z)$ are only passed through the decoder; the pre-trained encoder is not needed. %In both cases sampling imputations uses the same compatational path.
} 
\label{fig:latent}
\end{figure}


\section{Query-Adaptive VI}

\label{sec:proposed}
%Local inference for partially-observed test data $x_O$ 
Our \emph{query-adaptive variational inference} (QAVI) approach reuses the pre-trained VAE decoder, defines free variational parameters for each inference query, and optimizes them. Fig.~\ref{fig:latent} gives an overview; QAVI adapts to different queries without the need for additional partial-encoder networks. 
%This classical VI technique then determines a variational distribution on latent variables $q_\lambda(z)$ that is as close to the true posterior $p_\theta(z|x)$ for each data point $x$. 
%Extending the classical VI formulation to two latent variables case in a graphical model: 1. missing data $x_M$ and 2. latent variable in low dimension $z$, we
QAVI defines an explicit variational posterior over the unobserved variables $q_\lambda(z,x_M)$ with query-specific parameters $\lambda$. Fig.~\ref{fig:latent} shows two possible factorizations of the latent codes $z$ and missing features $x_M$, leading to a pair of QAVI algorithms that we elaborate below.

\subsection{VI via Missing Feature Posteriors}
\label{sec:variationalFeat}
Suppose first that the variational posterior factorizes as 
%Here, $q_\lambda(x_M)$ is a variational distribution over $x_M$ with parameters, $\lambda$: 
    \[ 
    q_{\lambda}(x_M,z \mid x_O) = q_\lambda(x_M) q_\lambda(z \mid x_M,x_O).
    \]
We fix $q_\lambda(z \mid x_M,x_O)$ to the amortized encoder $q_\phi(z | x)$, which has been pre-trained to approximate the relationship between $x$ and $z$.  Our \emph{Feat.~QAVI} method then defines an explicit posterior $q_\lambda(x_M)$ on missing features.  This approach directly captures uncertainty in the posterior of missing features (see Fig.~\ref{fig:latent}) and their impact on the latent code. 

In experiments, we found that fitting $q_\lambda(x_M)$ via the standard ELBO (see supplement for derivation) resulted in posteriors with unrealistically small variance.  
Following~\citet{higgins2017betavae}, we thus employ hyperparameters $\beta$ to more strongly encourage the latent-code posterior to align with the prior, and $\gamma$ to increase the entropy of the missing-feature posterior. %to capture the uncertainty. %By weighting the KL-divergence and entropy terms, we can direct the latent-space posterior to be closer to the prior and feature-space posterior to have larger entropy. 
The variational objective $\mathcal{L}_M$ is then:
 %Previous works like $\beta$-VAE similarly employ this originally proposed in as : 
\begin{multline}
    \mathcal{L}_M(\lambda; x) = E_{q_{\lambda}(x_M)}\big[E_{q_{\phi}(z|x_O,x_M)} [\log p_\theta(x_O, x_M|z)] \\
   - \beta \KL(q_{\phi}(z|x_M,x_O)\,||\,p(z))\big] + \gamma H(q_\lambda (x_M)),
\label{eq:beta-variationalFeat}
\end{multline}
where $H$ is the entropy. 
We approximate $\mathcal{L}_M$ with $S$ Monte Carlo  samples as in standard VAE training: 
\begin{multline}
    \mathcal{L}_M(\lambda; x) \approx  \frac{1}{S} \sum_{s=1}^S \big[ \log p_\theta(x_O,x_M^{(s)}|z^{(s)}) \\
- \beta \KL(q_\phi (z|x_O,x_M^{(s)}) \,||\, p(z))\big] + \gamma H(q_\lambda (x_M)),
\end{multline}
where $x_M^{(s)} \sim q_\lambda(x_M)$, $z^{(s)} \sim q_\phi(z|x_O,x_M^{(s)})$, 
and $\KL$ and $H$ are calculated in closed form. Automatic differentiation is used to compute gradients with respect to $\lambda$, tuning the missing-feature distributions to observed data indirectly. 
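
As a sketch of the closed-form terms, assume (this section does not state it explicitly) that $q_\lambda(x_M)$ and the encoder output are both diagonal Gaussians, $q_\lambda(x_M) = \mathcal{N}(x_M \mid m, \mathrm{diag}(s^2))$ and $q_\phi(z \mid x) = \mathcal{N}(z \mid \mu, \mathrm{diag}(\sigma^2))$, with a standard normal prior $p(z) = \mathcal{N}(0, I)$. The symbols $m$, $s$, $\mu$, $\sigma$ are introduced here only for illustration. Then
\begin{align*}
\KL(q_\phi(z \mid x)\,||\,p(z)) &= \tfrac{1}{2} \sum_{j} \big(\sigma_j^2 + \mu_j^2 - 1 - \log \sigma_j^2\big), \\
H(q_\lambda(x_M)) &= \tfrac{1}{2} \sum_{i \in M} \log\big(2\pi e\, s_i^2\big).
\end{align*}
Under these assumptions $\lambda = \{m, s\}$, and the entropy term increases with $\log s_i$, which is how the weight $\gamma$ counteracts posteriors with unrealistically small variance.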

While Feat.~QAVI %correctly captures 
directly models the uncertainty in the posterior distribution of missing features, its optimization requires repeated computation of the encoder and decoder.  It also inherits any suboptimalities of the trained encoder.  Perhaps surprisingly, we will show that in this context, a fully non-amortized inference method can have both greater computational efficiency and greater accuracy. 

\subsection{Non-amortized VI}
\label{sec:nonamortized}
We can construct an alternative variational posterior via the following factorization (see Fig.~\ref{fig:latent}):
    \[ 
    q_{\lambda}(x_M,z) = q_\lambda(z) q_\lambda(x_M | z),
    \]
%\subsubsection{Training  Objective}
%Each datapoint is a collection of two latent variables $<z,x_M>$, and an observed variable, $x_O$. 
 %where we have a prior distribution for the observed data available ($p_\theta$) and we wish to approximate the variational distribution $q_\lambda(z,x_M)$ parameterized by $\lambda$. 
 By defining a variational posterior on $z$, we no longer use the encoder (except possibly for initialization), and re-use the pre-trained decoder by fixing $q_\lambda(x_M \mid z) = p_\theta(x_M \mid z)$. 
 As derived in the supplement, this leads to the following \emph{non-amortized QAVI} variational objective:
\begin{equation}
    \mathcal{L}_N(\lambda; x) = E_{q_\lambda(z)}[\log p_\theta(x_O|z)]  - \beta \KL(q_{\lambda}(z)||p(z)).
\label{eq:ELBO-NA-beta}
\end{equation}
 \shrink{
\begin{equation}
    \mathcal{L}_N(\lambda; x) = E_{q_\lambda(z)}[\log p_\theta(x_O|z)]  - \KL(q_{\lambda}(z)||p(z)).
\label{eq:ELBO-NA}
\end{equation}}
To evaluate $\mathcal{L}_N$ we need only sample the latent code explicitly; the missing data is marginalized \emph{analytically}. Non-amortized QAVI seeks latent-code distributions that assign high likelihood to the observed features $x_O$, while the weight $\beta > 1$ keeps them aligned with the prior.
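
For intuition, the case $\beta = 1$ can be sketched as follows, under the assumption (made here for illustration) that the decoder likelihood factorizes across features given $z$, so that $p_\theta(x_O, x_M \mid z) = p_\theta(x_O \mid z)\, p_\theta(x_M \mid z)$. The ELBO for the joint posterior $q_\lambda(z)\, p_\theta(x_M \mid z)$ is
\begin{align*}
& E_{q_\lambda(z)\, p_\theta(x_M \mid z)}\!\left[ \log \frac{p(z)\, p_\theta(x_O \mid z)\, p_\theta(x_M \mid z)}{q_\lambda(z)\, p_\theta(x_M \mid z)} \right] \\
&\quad = E_{q_\lambda(z)}[\log p_\theta(x_O \mid z)] - \KL(q_\lambda(z)\,||\,p(z)).
\end{align*}
The factor $p_\theta(x_M \mid z)$ cancels inside the logarithm, so no samples of $x_M$ are required; weighting the KL term by $\beta$ then gives Eq.~\eqref{eq:ELBO-NA-beta}. The full derivation is given in the supplement.
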
%Similar to the previous case, we employ the  $\beta$ hyper-parameter as follows,
%Intuitively, the modified $\mathcal{L}_N$ encourages maximizing the log-likelihood of the \emph{observed data} under the variational distribution for the latent code $q_\lambda(z)$ 
%while also keeping them close to the prior with a factor of $\beta$ ($>1$). %When $q_\lambda(z)$ is a simple diagonal Gaussian, 
In the simplest case where $q_\lambda(z)$ is a diagonal-covariance Gaussian,
Eq.~\eqref{eq:ELBO-NA-beta} is approximated via $S$ samples from $q_\lambda(z)$, and the code mean and variance $\lambda$ optimized via stochastic gradient ascent.
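
A minimal sketch of this estimator, writing $\lambda = \{\mu, \sigma\}$ (notation introduced here for illustration) and assuming the standard reparameterization $z^{(s)} = \mu + \sigma \odot \epsilon^{(s)}$ with $\epsilon^{(s)} \sim \mathcal{N}(0, I)$:
\begin{multline*}
\mathcal{L}_N(\lambda; x) \approx \frac{1}{S} \sum_{s=1}^S \log p_\theta\big(x_O \mid \mu + \sigma \odot \epsilon^{(s)}\big) \\
- \beta\, \KL(q_\lambda(z)\,||\,p(z)),
\end{multline*}
where the KL term is available in closed form when the prior $p(z)$ is also Gaussian. Because the noise $\epsilon^{(s)}$ does not depend on $\lambda$, gradients with respect to $\mu$ and $\sigma$ pass through the fixed decoder by automatic differentiation.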

While VAE training implicitly encourages approximately Gaussian posteriors given \emph{complete} observations, for queries given missing data, posteriors are often multi-modal and poorly approximated by a Gaussian $q_\lambda(z)$.
%Often missing queries can induce multi-modal uncertainties which cannot be captured by any single mode Gaussian distribution. 
To address this, we extend non-amortized QAVI to more expressive variational distributions that better capture the true posterior. %a variant of normalizing flows called inverse autoregressive flows (IAFs) (\cite{kingmainverseflows}) and a mixture of Gaussians to provide $q_\lambda(z)$ with some flexibility.
\shrink{
For hierarchical VAEs, it is also possible to use a factorized Gaussian variational approximation: $q_\lambda(z) = \prod_{l=1}^L q_\lambda(z_l)$. In practice, we found that this approach does not adequately capture the complex dependencies between the latent variables and is highly susceptible to very poor local optima in optimization. Instead, we retain the expressiveness of the hierarchical models %pre-trained encoder and generative networks
, and define $q_\lambda(z) = q_\lambda(z_1) \prod_{l=2}^{L} q_\lambda(z_l|z_{<l})$. Here, we follow the same variational lower bound in Eq. \ref{eq:ELBO-NA} and optimize the parameters $\lambda = \{\lambda_1, \lambda_2,.... \lambda_L \}$, constituting a vector at each layer $l$ which defines $q_l(z)$. 
}

\textbf{Flow Posteriors.\,} The flow-based variational posterior aims to construct a complex distribution by transforming a simple Gaussian through a series of invertible mappings. We let 
$z_t = \mathcal{T}_t(z_{t-1},\lambda_t)$ for $t = 1,\ldots,T$,
%$z_T = \prod_{t=1}^T \mathcal{T}_t(z_{t-1}, \lambda_t)$, 
where $z_0$ is sampled from a Gaussian base distribution with parameters $\lambda_0$. Here $\lambda_t$ is the set of parameters specifying flow layer $\mathcal{T}_t$, $T$ is the total number of flow transformations, and $\lambda= \{\lambda_0, \lambda_1, \ldots ,\lambda_T \}$. 

The idea of improving amortized variational inference in VAEs with the help of \emph{normalizing flows} \citep{Tabak2013AFO, Tabak2010DENSITYEB} was first proposed by \citet{pmlr-v37-rezende15}. We instead employ autoregressive transformations in each layer $\mathcal{T}_t$ to capture high-dimensional dependencies in the latent space, producing \emph{inverse autoregressive flows} \citep[IAF;][]{kingmainverseflows}. We approximate both terms in $\mathcal{L}_N(\lambda; x)$ of Eq.~\eqref{eq:ELBO-NA-beta} via $S$ samples from $q_\lambda(z_T)$; each Gaussian sample from the base distribution is transformed by $T$ flow layers.  Query-specific parameters $\lambda$ are optimized by stochastic backpropagation.  
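
For reference, the density needed for the KL term follows from the change-of-variables formula,
\[
\log q_\lambda(z_T) = \log q_{\lambda_0}(z_0) - \sum_{t=1}^T \log \left| \det \frac{\partial \mathcal{T}_t(z_{t-1}, \lambda_t)}{\partial z_{t-1}} \right| .
\]
As an illustration, assume the affine IAF update $z_t = \mu_t(z_{t-1}) + \sigma_t(z_{t-1}) \odot z_{t-1}$, where $\mu_t$ and $\sigma_t > 0$ are outputs of an autoregressive network with parameters $\lambda_t$; this is a sketch of the standard form, not a specification of the implementation. The Jacobian is then triangular and each log-determinant reduces to $\sum_i \log \sigma_{t,i}(z_{t-1})$, so evaluating $\log q_\lambda(z_T)$ for a sample costs no more than the forward pass that generated it.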

\textbf{Gaussian Mixture Posteriors.\,}
Parameterizing the latent space distribution $q_\lambda(z)$ as a \emph{mixture of Gaussians} enables us to explicitly model different hypotheses in the latent space. Let $q_{\lambda}(z)= \sum_{t=1}^{T} w_t \mathcal{N}(z \mid \mu_t, \Lambda_t)$, where $\mu_t$ and $\Lambda_t$ are the means and (diagonal) covariances of the $T$ mixture components (posterior modes), $w_t$ are mixture weights ($\sum_{t=1}^{T} w_t = 1$), and $\lambda = \{w_t,\mu_t,\Lambda_t\}_{t=1}^T$.  
Increasing $T$ allows inference of more accurate posterior approximations.
%Consequently, we define $\lambda = \{\mu^{[1]}, ..., \mu^{[N]}\} \cup \{w^{[1]}, ..., w^{[N]}\} \cup \{\sigma^{[1]}, ..., \sigma^{[N]}\}$. %During inference, we wish to optimize the parameters $\lambda$ via Monte Carlo estimation to Eq. \ref{eq:ELBO-NA} and performing stochastic gradient descent. 

Optimizing the mixture parameters $\lambda$ is not straightforward, because the discrete choice of mixture component cannot be continuously reparameterized.  
%it is difficult to reparameterize the discrete distribution over mixture weights and hence, the "reparameterization trick" does not readily extend to mixture models. 
We use \emph{implicit reparameterization gradients}~\citep[IRG;][]{figurnov2018implicit} to efficiently compute gradients with respect to the mixture component means and covariances.  While in principle IRG could also be used to estimate gradients of mixture weights~\citep{graves2016stochastic_implicit_tech_report}, in practice this estimator has very high variance when posterior modes are widely separated.  We instead adapt an importance-sampling gradient estimator~\citep{scibior2021differentiable_importance} for mixture weights; see supplement for details. 
Multiple samples $S$ from the variational posterior are \emph{necessary} to capture the impact of multiple posterior modes on the ELBO~\eqref{eq:ELBO-NA-beta}.
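
A rough calculation illustrates why, assuming the $S$ samples are drawn i.i.d.\ by first selecting a component with probability $w_t$: the probability that component $t$ receives no sample at all is
\[
(1 - w_t)^S .
\]
For a hypothetical minor mode with $w_t = 0.1$, this is $0.9$ for $S=1$ but about $2.7 \times 10^{-5}$ for $S=100$, so with small $S$ the ELBO estimate frequently ignores such a mode entirely.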
%%%%%% diveristy, correctness in samples

%\emph{Inverse Autoregressive Flows} consists of a chain of invertible transformations in the latent space, where each transformation is based on an autoregressive neural network. The work in \citep{10.5555/3157382.3157627} show that IAFs significantly improve upon diagonal Gaussian approximate posteriors. 
%Hence, choosing IAF as a potential approximation for our flexible variational inference method was natural. 
%While defining $q_\lambda(z)$ as an IAF we have, $q_{IAF}(z) = \mathcal{T}(f(z_t))$ \fromsakshi{define this?} and $\lambda$ is the set of the parameters of the auto-regressive neural network. 
%We optimize the parameters similarly to our approach for the Gaussian case.
\textbf{Hierarchical VAE Posteriors.\,}
For the hierarchical VAEs~\citep{sonderbyLadderVariationalAutoencoders2016} introduced in Sec.~\ref{sec:background}, neither Gaussians nor Gaussian-mixtures are flexible enough to capture the non-linear dependencies between latent variables at different levels of the hierarchy. % present in the original inference model. 
We therefore propose a new, more expressive (non-amortized) variational family for HVAEs that removes dependency on the observation $x$, while retaining the expressive and non-linear dependencies of the HVAE model. Our variational posterior factorizes as:
\begin{equation}
    q_{\lambda}(z) = q_\lambda(z_1) \prod_{\ell=2}^L q_\lambda(z_\ell|z_{<\ell}).
\end{equation}
As in the HVAE decoder, we let $q_\lambda(z_\ell|z_{<\ell})$ be conditionally Gaussian for all $\ell$. More complex conditional distributions (such as flows or mixtures) could also be used, but this conditionally Gaussian structure alone allows for expressive, multi-modal posterior approximations.

We propose a simple, generic strategy for constructing a family of non-amortized distributions given a pre-trained HVAE. Our approach applies to many recent hierarchical VAE architectures, including the ``very-deep'' HVAE~\citep{child2021very} that we use in experiments. 
\shrink{%In general for HVAEs, the conditional prior and variational posterior for layer $l$ % and in the amortized setting 
%can be specified as:
\begin{align}
    p_{\theta}(z_\ell|z_{<\ell}) = \mathcal{N}(z_\ell|\ \mu_{\theta_\ell}(z_{<\ell}),\ \sigma_{\theta_\ell}(z_{<\ell})) \\
    q_\phi(z_\ell|z_{<\ell}, x) = \mathcal{N}(z_\ell|\ \mu_{\phi_\ell}(f_{\phi_\ell}(x), g_{\phi_\ell}(z_{<\ell})),\ \sigma_{\phi_\ell}(...)) \notag%f_{\phi_l}(x), g_{\phi_l}(z_{<l}))). 
\end{align}
Where $f_{\phi_\ell}$ and $g_{\phi_\ell}$ are parameterized functions (neural networks) that extract feature representations of the observation ($x$) and the previous layers ($z_{<\ell}$) respectively. $\mu_{\phi_\ell}$, $\sigma_{\phi_\ell}$ are further functions that  generate the mean and scale parameters of the conditional Gaussian posterior for layer $\ell$.} 
To specify the non-amortized QAVI posterior, we begin with the amortized posterior defined in Eq.~\eqref{eq:HVAEnet}. Holding $g_{\phi_\ell}$, $\mu_{\phi_\ell}$, $\sigma_{\phi_\ell}$ fixed, we replace the features extracted from the observation with a new tunable parameter $\lambda_\ell$. A further set of weighting parameters $\gamma_\ell, \gamma'_\ell \in [0,1]$ interpolate these output parameters with those of the prior. Thus our hierarchical QAVI posterior for layer $\ell$ becomes:
\begin{align}
    \mu_\ell =  \gamma'_{\ell}\mu_{\phi_\ell}(\lambda_\ell, g_{\phi_\ell}(z_{<\ell})) + (1-\gamma'_{\ell})  \mu_{\theta_\ell}(z_{<\ell}), \notag \\
    \sigma_\ell =  \gamma_{\ell}\sigma_{\phi_\ell}(\lambda_\ell, g_{\phi_\ell}(z_{<\ell})) + (1-\gamma_{\ell})  \sigma_{\theta_\ell}(z_{<\ell}), \notag \\
    q_\lambda(z_\ell\mid z_{<\ell}) = \mathcal{N}(z_\ell \mid \ \mu_\ell,\ \sigma_\ell).
\end{align}
Re-using the pre-trained networks $g_{\phi_\ell}$, $\mu_{\phi_\ell}$, $\sigma_{\phi_\ell}$ allows the full amortized encoder to be used for initialization of $\lambda_\ell$ by simply setting $\lambda_\ell \leftarrow f_{\phi_\ell}(x_O, \Tilde{x}_M)$. $\Tilde{x}_M$ may be any initialization for the missing features, even Gaussian noise. 

Our approach of interpolating posterior ($\mu_{\phi_\ell}$, $\sigma_{\phi_\ell}$) and prior ($\mu_{\theta_\ell}$, $\sigma_{\theta_\ell}$) network outputs is vital when reusing $\mu_{\phi_\ell}$ and $\sigma_{\phi_\ell}$ from the original inference model. In the original, fully-observed training phase, posterior variances may become extremely small for the highly overparameterized HVAE model. But with missing data, latent variables corresponding to unobserved features should have distributions close to the prior. Expressing the variational posterior as a weighted combination of prior and posterior network outputs allows our variational family to easily produce appropriate posteriors for latent variables corresponding to both observed and missing features, without needing to re-train $\mu_{\phi_\ell}$ and $\sigma_{\phi_\ell}$.
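
Two limiting cases make the role of the weights concrete, assuming the conditional prior is the Gaussian $\mathcal{N}(z_\ell \mid \mu_{\theta_\ell}(z_{<\ell}), \sigma_{\theta_\ell}(z_{<\ell}))$ suggested by the notation above. Setting $\gamma_\ell = \gamma'_\ell = 0$ gives $\mu_\ell = \mu_{\theta_\ell}(z_{<\ell})$ and $\sigma_\ell = \sigma_{\theta_\ell}(z_{<\ell})$, so that
\begin{align*}
q_\lambda(z_\ell \mid z_{<\ell}) &= p_\theta(z_\ell \mid z_{<\ell}), \\
\KL(q_\lambda(z_\ell \mid z_{<\ell}) \,||\, p_\theta(z_\ell \mid z_{<\ell})) &= 0,
\end{align*}
and layer $\ell$ reverts to the prior. Setting $\gamma_\ell = \gamma'_\ell = 1$ instead recovers the outputs of the pre-trained posterior networks, with $\lambda_\ell$ in place of the features extracted from the observation.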

\textbf{Hierarchical VAE Warmup.\,}
For hierarchical VAEs with a multi-scale architecture, we find that a warmup phase of optimization with a modified objective greatly accelerates posterior fitting. This idea was proposed in general form by \citet{vahdat2020NVAE_nvae}; we refine it as follows:  %Our modified objective is:
$\mathcal{L}_{H}(\lambda; x_O) =$ 
%
\begin{multline}
 \label{ELBO-HW}
%\mathcal{L}_{H}(\theta, \phi; x_O) = \\
%\begin{split}
E_{q_\lambda(z)} [\log p_\theta(x_O|z)] - \frac{1}{d_1} \KL(q_\lambda(z_1) \,||\, p_\theta(z_1)) \\
  - \sum_{\ell=2}^L \frac{1}{d_\ell} E_{q_\lambda(z_{<\ell})} [\KL(q_\lambda(z_\ell|z_{<\ell}) \,||\, p_\theta(z_\ell|z_{<\ell}))],
%\end{split}
\end{multline}
%
where $d_\ell$ is proportional to the size of the latent space at layer $\ell$. Intuitively, when inpainting large segments of an image, high-level structures should be determined first and details refined later. But under the unmodified ELBO, higher-resolution latent layers contribute substantially more to the loss, leading to slow convergence. Fig.~\ref{fig:warmup} illustrates the dramatic effect of this change during QAVI optimization.
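
As a hypothetical numerical illustration (the sizes below are not those of the model used in experiments), take $d_\ell$ equal to the number of latent dimensions in layer $\ell$, and suppose every latent dimension incurs a KL cost of $0.1$ nats. A coarse layer with $16$ dimensions and a fine layer with $16{,}384$ dimensions then contribute
\[
16 \times 0.1 = 1.6 \quad \text{and} \quad 16{,}384 \times 0.1 = 1638.4
\]
nats to the unmodified ELBO, but $0.1$ each after division by $d_\ell$, so the coarse layer is no longer swamped during warmup.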

\begin{figure}[t]
%\vspace{.3in}
\begin{subfigure}[t]{\linewidth}
   \centerline{\includegraphics[width=.95\linewidth]{supplement_plots/mean_kl_comp_cropped.pdf}}
%\vspace{.3in}
%\caption{}
 \end{subfigure}

  \caption[]
{Comparison of QAVI optimization for HVAEs without (\emph{top}) and with (\emph{bottom}) our KL-balanced warmup.
%\emph{Top row:} QAVI without rebalanced warmup phase. \emph{Bottom row:} QAVI with warmup.
%classification error on the mode class predicted by the set of samples with the true label. 
} 
\label{fig:warmup}
\end{figure}



\shrink{
\paragraph{Hierarchical VAE Posteriors.}
\shrink{
For the hierarchical VAE models \citep{sonderbyLadderVariationalAutoencoders2016} introduced in section \ref{sec:intro}, Gaussian and mixture of Gaussian posteriors are unable to capture the inter-group conditional structure of the amortized posteriors shown in equation \ref{eq:ELBO}. We}
This posterior is parameterized as $q_{\lambda}(z) = q_\lambda(z_1) \prod_{l=2}^L q_\lambda(z_l|z_{<l})$, retaining the hierarchical %conditional 
structure of the latent variables. %amortized network output, while removing the dependence on the observation ($x_O$). 
%\begin{equation}
%    q_{\lambda}(z) = q_\lambda(z_1) \prod_{l=2}^L %q_\lambda(z_l|z_{<l})
%\end{equation}
As in the original model, we let $q_\lambda(z_l|z_{<l})$ be conditionally Gaussian for all $l$. More complex conditional distributions (IAF, Mixtures, etc.) could also be used, but we find that the conditional Gaussian is able to capture a considerable uncertainty in the posterior. %alone produces good results.
In general for HVAEs, the conditional prior and variational posterior for layer $l$ % and in the amortized setting 
can be specified as:
\begin{equation}
    p_{\theta}(z_l|z_{<l}) = p(z_l| \mu_{\theta_l}(z_{<l}),\ \sigma_{\theta_l}(z_{<l}))
\end{equation}
$$
    q_\phi(z_l|z_{<l}, x) = (z_l| \mu_{\phi_l}(f_{\phi_l}(x), g_{\phi_l}(z_{<l})),\ \sigma_{\phi_l}(f_{\phi_l}(x), g_{\phi_l}(z_{<l}))).
$$

Where $f_{\phi_l}$ and $g_{\phi_l}$ are parameterized functions (neural networks) that extract feature representations of the observation ($x$) and the previous layers ($z_{<l}$) respectively. $\mu_{\phi_l}$, $\sigma_{\phi_l}$ are further functions that  generate the mean and scale parameters of the conditional Gaussian posterior for layer $l$. To specify the Gaussian QAVI distribution we hold $g_{\phi_l}$, $\mu_{\phi_l}$, $\sigma_{\phi_l}$ fixed and replace the features extracted from the observation with a new tune-able parameter $\lambda_l$. We further Thus our hierarchical QAVI posterior becomes:
\begin{equation}
    q_\lambda(z_l|z_{<l}) = q(z_l| \mu_{\phi_l}(\lambda_l, g_{\phi_l}(z_{<l})),\ \sigma_{\phi_l}(\lambda_l, g_{\phi_l}(z_{<l}))
\end{equation}
Re-using the pre-trained networks $g_{\phi_l}$, $\mu_{\phi_l}$ and $\sigma_{\phi_l}$ allows the full amortized encoder to be used to initialize $\lambda_l$ as for standard VAE, by simply setting $\lambda_l \leftarrow f_{\phi_l}(x_O, \Tilde{x}_M)$, where $\Tilde{x}_M$ is any initialization for the missing features (e.g. Gaussian noise). 
%discussion on time vs quality
We introduce a simple, generic strategy for constructing a family of non-amortized distributions of this form given a pre-trained hierarchical VAE. Our approach applies to many recent hierarchical VAE architectures, including the "very-deep" architecture of \cite{child2021very} that we use in our experiments. }
\shrink{
\subsection{Evaluation Metric}
It is important to be able to quantify the ability of the different inferred posteriors  to be able to enclose the true image sample. Hence, We  compute the marginal log-likelihood on the missing dimensions of the test data ($x_M$) by importance sampling using $S$ samples from the inferred posterior, $q_\lambda(z)$. This metric, originally proposed by \cite{pmlr-v32-rezende14}, converges to the actual marginal likelihood when $S\rightarrow\infty$ and is described as follows: 
\begin{equation}
    %IWAE(q(z)) 
    \log p_\theta(x_M) =  \log \frac{1}{S} \sum_{s=1}^S  \frac{ p_\theta(x_M,z^{(s)})}{ q(z^{(s)})}; z^{(s)} \sim q_\lambda(z)
\label{eq:IWAE}
\end{equation}
}
%\begin{equation}
    %IWAE(q(z)) 
%    \log p_\theta(x_M) =  \log \frac{1}{S} \sum_{s=1}^S  \frac{ p_\theta(x_M,z^{(s)})}{ q(x_M^{(s)})}; x_M^{(s)} \sim q_\lambda(x_M)
    
%\label{eq:IWAE-qxm}
%\end{equation}



\section{Experiments \& Results}
\label{sec:resultVAE}
\subsection{Experimental Setup}
We evaluate QAVI\footnote{Code available: {\scriptsize \url{https://github.com/SakshiAgarwal/QAVI}}} using %with hierarchical VAEs on the FFHQ-256 dataset \cite{karras2019style_ffhq}.  In this section, we empirically investigate QAVI to impute missing data from 
six tabular datasets from the UCI Machine Learning Repository \citep{Dua:2019}.
For tabular data, we follow the experimental setup of \citet{pmlr-v97-mattei19a} to train our VAEs, but use a Gaussian variational posterior instead of Student's t. 

We also consider three image datasets: real-valued MNIST \citep{lecun-mnisthandwrittendigit-2010}, Street View House Numbers (SVHN) \citep{svhn}, and FFHQ-256 \citep{karras2019style_ffhq}. We use a single-stochastic-layer VAE for MNIST and SVHN, with a WideResNet architecture \citep{Zagoruyko2016WideRN} %for the choice of VAE 
that is known to work well with images. We train them on fully-observed images from the training set (70,000 MNIST images and 73,257 SVHN images) by maximizing the ELBO of Eq.~\eqref{eq:ELBO}.  For FFHQ-256, we adopt the ``very-deep'' hierarchical architecture of \citet{child2021very}, but for efficient comparison we retrain a smaller variant of their original HVAE (5.9M vs.~115M parameters). 

%For image datasets, queries comprise of a part of an image being observed and we aim to fill the missing pixels with visually consistent imputations. Prior works (\cite{pmlr-v97-mattei19a}, \cite{Nazbal2020HandlingIH}) assume $50\%$ pixels are missing completely at random. This seems to be a simple inference task where each pixel has at-least one observed pixel in its total $8$ neighbourhood pixels with a ($(1 - \frac{1}{2^8}) = 0.996$) probability. Simple imputation methods like "Fill Zeros" %or "fill mean" 
%perform well on such tasks, however we show in Fig. \ref{fig:imputations} that they output the input missing image when contiguous patches of pixels are missing. 
While QAVI handles missing-at-random (MAR) data naturally, we set up our experiments to be consistent with most prior work on imputation with VAEs. For MNIST and SVHN we consider two missing-completely-at-random (MCAR) patterns: 1) two randomly placed patches (each of size $10 \times 10$ for MNIST, $15 \times 15$ for SVHN); 2) a randomly rotated mask of half of the image. We use a more challenging random mask distribution~\citep{zhao2021comodgan_comodgan} for FFHQ-256. For tabular datasets, we corrupt the test set by removing half of the features in each row uniformly at random. 

\textbf{Baselines.\,} We compare QAVI with several methods from the literature: \emph{i)} The \emph{Fill Zeros} heuristic \citep{Nazbal2020HandlingIH};
%where the unobserved features are set to zero and passed through the pre-trained VAE, 
\emph{ii)} Monte Carlo methods: \emph{pseudo-Gibbs} \citep{pmlr-v32-rezende14} %(\cite{Kingma2014}) 
and \emph{Metropolis-in-Gibbs} \citep{NEURIPS2018_0609154f};
\emph{iii)} Amortized inference methods: \emph{Re-tuned Encoder} \citep{collier2020vaes} % where an encoder is trained on a masked test set 
and \emph{Posterior Match[ing]} \citep{strauss2022posterior}.%, where another encoder is trained with random masking distribution \cite{zhao2021comodgan_comodgan} applied to the training set. 
We consider three variants of posterior matching to evaluate the importance of knowing the missing-feature pattern during training. \emph{Posterior Match (True)} trains the posterior matching encoder with exactly the same query distribution used for test evaluation. \emph{Posterior Match (Rand.)} assumes that only the fraction of missing features is known; features are removed at random with this probability. The generic \emph{Posterior Match} uses the image masking distribution of \cite{zhao2021comodgan_comodgan}, which assumes contiguous masked regions without specific knowledge of queries to be evaluated. 

%Since we see in our experiments that "Fill Zeros" and sampling methods perform poorly for even simpler VAEs, in the case of HVAEs, 
We also compare QAVI to state-of-the-art approaches to inpainting with HVAEs: \emph{Posterior Match[ing]} and \emph{VAEAC}~\citep{ivanov2018variational}. Both \emph{Posterior Match} and \emph{VAEAC} benefited from training with the same random mask distribution \citep{zhao2021comodgan_comodgan} used in evaluation. As our ``very-deep'' HVAE architecture already links the encoder and decoder, we do not include extra deterministic skip connections as in the original VAEAC architecture.

\begin{figure}[t]
  %\begin{subfigure}{\textwidth}
    \includegraphics[width=.99\linewidth]{plots/uci_posterior.pdf}
    %\vspace*{-8pt}
    %\label{fig:figure1}
  %\end{subfigure}%
 % \hfill
  \caption{Kernel density visualizations of the prior and approximate posteriors for five QAVI variants, fit using $S=1000$ samples from the optimized variational distribution.  We show one test sample from each of two UCI tabular datasets, Banknote (\emph{top}) and Concrete (\emph{bottom}), each with two missing features (labeled on axes). We observe that the true data point (\emph{red}) is often enclosed by the different posteriors, but sometimes missed by Feat.~QAVI.  We approximate the true posterior via a highly expressive QAVI mixture of 100 Gaussians, fit by extended optimization. %, and see that other variants it fits the true data point  %fit using 1000 Monte Carlo samples, leading to a slightly  
  }
  \label{fig:posterior-uci}
\end{figure}

\begin{table*}[p]
    %\renewcommand{\arraystretch}{1.02}
    \scriptsize
    \centering
    %\refstepcounter{table}
    \caption{Test log-likelihoods of missing data (LL, higher is better) and normalized root mean-square error (NRMSE, lower is better) for 6 tabular datasets from the UCI repository, estimated using $S=10{,}000$ samples. NRMSE per test row is the minimum across $S$ samples. QAVI variants have superior performance (best values in bold) on almost all datasets.}
    \label{table0}
    \arrayrulecolor{black}
    \begin{tabular}{lllllllllllll} 
        \hline
        & \multicolumn{2}{c}{Breast Cancer \nocite{misc_breast_cancer_wisconsin_(diagnostic)_17}} & \multicolumn{2}{c}{Red wine \nocite{misc_wine_109}} & \multicolumn{2}{c}{White wine} & \multicolumn{2}{c}{Banknote \nocite{misc_banknote_authentication_267}} & \multicolumn{2}{c}{Concrete \nocite{misc_concrete_compressive_strength_165}} & \multicolumn{2}{c}{Yeast \nocite{misc_yeast_110}}             \\ 
        \cline{2-13}
         & LL   & NRMSE & LL   & NRMSE  & LL  & NRMSE  & LL  & NRMSE & LL & NRMSE  & LL & NRMSE  \\ 
        \hline

        Mix. QAVI & \textbf{-9.16} & 0.39 & \textbf{-6.35} &  \textbf{0.26} & -7.03 & 0.35 & \textbf{-2.25} & \textbf{0.10} & -2.70 & \textbf{0.17} & +2.42 & \textbf{0.46} \\
        Flow QAVI & \textbf{-9.14} & 0.39 & -6.47 & \textbf{0.26} & -7.03 & 0.35 & \textbf{-2.24} & \textbf{0.10} & \textbf{-2.65} & \textbf{0.17} & +2.55 & \textbf{0.46}  \\ 
        Gaus. QAVI & -9.21 & 0.39 & -6.56 & 0.30 & \textbf{-6.94} & \textbf{0.30} & -2.35 & \textbf{0.10} & -2.82 &  \textbf{0.17} & \textbf{+2.82} & \textbf{0.46} \\
        Feat. QAVI & -16.32 & 0.37 %0.097 
        & -13.20 & 0.28 & -8.80 & 0.38 & -8.27  & 0.15 & -15.14 & 0.23 %0.007 
        & -7.81 &  0.47 \\
        \hline
        Posterior Match & -15.14 & \textbf{0.33} & -14.83 & 0.42 & -9.19 & 0.39 & -4.42 & 0.14 & -15.05 & 0.40 &  -430.36 & 0.47   \\
        Re-tuned Encoder & -12.82 & 0.40 & -12.96 & 0.40 & -9.59 & 0.39 & -4.26 & 0.14 & -11.48 & 0.35 & -33.62 & 0.50 \\
        \hline
        Metropolis-in-Gibbs & -23.94 & 0.44 & -74.22 & 0.68 & -19.83 & 0.58 &  -49.51 & 0.68 & -265.62 & 0.62 &  -1.86 & 0.54 \\
        pseudo-Gibbs & -12.61 & \textbf{0.33} & -33.53 & 0.36 & -10.68 & 0.36 & -44.16 & 0.40 & -249.06 & 0.36 &  -5.11 & 0.48 \\
        Fill Zeros  & -32.56 & 0.40 & -21.97 & 0.36   & -10.09 & 0.37  & -17.46 & 0.32 & -23.33 & 0.32 & -11.27 & 0.48 \\
        %$q(x_M)$ & -15.91 & 0.16 %0.097 
        %& -31.66 & 0.44 & -8.48 & 0.10 & -21.82 &0.11 & -405 & 0.86 %0.007 
        %& -57.47 &  0.62 \\
        %\hline
        \hline 
        %Gaussian & -9.05 & 0.15 & -6.56 & 0.09 & \textbf{-6.91} & \textbf{0.09} & -2.34 & \textbf{0.01} &  -2.74 &  \textbf{0.03} & 2.81 & \textbf{0.21} \\
        %IAF & -9.20 & 0.15 & -6.56 & 0.09 & -6.95 & \textbf{0.09} & -2.34 & \textbf{0.01} & -2.77 & \textbf{0.03} & \textbf{2.82} & \textbf{0.21} \\ 
        %Mixture & \textbf{-8.88 }& 0.15 & -6.59 & \textbf{0.09} & -6.96 & 0.09 & -2.35 & \textbf{0.01} & -2.79 & 0.04 & 2.81 & \textbf{0.21} \\
        %\hline 
        %True & -8.94 & 0.1 & -7.31 & 0.03 & -7.37 & 0.08 & -3.02 & 0.01 & -4.46 & 0.01 & 2.18 & 0.19 \\
        \arrayrulecolor{black}\hline
    \end{tabular}
\end{table*}


\begin{figure*}[p]
%\vspace{.3in}
\begin{subfigure}[t]{\linewidth}
   \centerline{\includegraphics[width=0.95\linewidth]{plots/iwae_plot.pdf}}
%\vspace{.3in}
%\caption{}
 \end{subfigure}

  \caption[]
{For two image mask distributions and several inference methods (\emph{top}), we plot importance-weighted log-likelihood estimates for missing pixels as a function of the number of samples $S$.  We average over $1000$ MNIST (\emph{left}) or SVHN (\emph{right}) test images.} 
\label{fig:iwae_plot_1000}
\end{figure*}

\begin{figure*}[p]
  \centering
  \includegraphics[width=0.95\linewidth]{plots/imputations.pdf}
  \caption{Digit completion results on MNIST (\emph{left}) and SVHN (\emph{right}) images for inference queries with pixels obscured by the Rotating-Half or Random-Patches distributions. %We use the observed pixels per inference method,
We show 5 samples from each inferred posterior. %the inferred posterior over $x_M$. 
%Fill Zeros fails to infer any valid values for missing features. 
Monte Carlo %like pseudo-Gibbs and Metropolis-in-Gibbs 
and amortized inference methods propose one sometimes-valid digit completion; amortized VI is typically more accurate.  The performance of Posterior Match varies widely depending on the missing-feature distribution it is trained on. In contrast, QAVI automatically adapts to queries, and proposes multiple valid imputations that effectively capture posterior uncertainty. %that are typically closer to the true digit. Also, with expressive posteriors defined by Flow or Mixtures, QAVI effectively captures uncertainty in the digits. %Please refer to supplements for more visualizations.
} 
\label{fig:imputations}
\end{figure*}


%\vgap
\textbf{Hyperparameters.\,}
QAVI optimization for VAE models uses $S=100$ Monte Carlo samples to estimate our variational objective and gradients, %: $\mathcal{L}_N(\lambda; x)$ in Eq \ref{eq:ELBO-NA-beta} and $\mathcal{L}_M(\lambda; x)$ in Eq \ref{eq:beta-variationalFeat}. 
and optimizes variational parameters $\lambda$ for 300 steps using Adam~\citep{kingma2014method}. We refer to our posterior on missing features (Sec.~\ref{sec:variationalFeat}) as \emph{Feat.~QAVI}, and to our Gaussian/Flow/Mixture non-amortized variational posteriors (Sec.~\ref{sec:nonamortized}) on latent codes $z$ as \emph{Gaus./Flow/Mix.~QAVI}.
For HVAE models, QAVI optimization uses $S=28$ samples to estimate the ELBO, and optimizes for 1000 total steps (including 500 warmup steps as in Fig.~\ref{fig:warmup}).
See supplement for additional details.
%More architecture and training details can be found in the supplementary material. 

 %\vgap
\textbf{Metrics.\,} For a quantitative analysis, we estimate the marginal log-likelihood of the missing features $x_M$ using the importance sampling estimator of \citet{burdaiwae}:% is shown in eq. \ref{eq:IWAE}.
\begin{multline}
    %IWAE(q(z)) 
    \log p_\theta(x_M) \geq \mathbb{E}_{z^{(s)} \sim q(z)} \left[ \log \frac{1}{S} \sum_{s=1}^S  \frac{ p_\theta(x_M,z^{(s)})}{ q(z^{(s)})} \right] \\
    \approx \log \frac{1}{S} \sum_{s=1}^S  \frac{ p_\theta(x_M,z^{(s)})}{ q(z^{(s)})}, \qquad z^{(s)} \sim q(z).
\label{eq:IWAE}
\end{multline}
The true log-likelihood of missing features is constant for all inference methods, since the generative model $p_\theta(x)$ is fixed, but better posterior approximations lead to tighter lower bounds for a fixed number of samples $S$. Fig.~\ref{fig:iwae_plot_1000} shows estimated log-likelihoods versus the number of samples.
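
Estimators of this form are commonly evaluated in log-space for numerical stability. As a minimal sketch (the symbols $w^{(s)}$ and $m$ are introduced here only for illustration), let $\log w^{(s)} = \log p_\theta(x_M,z^{(s)}) - \log q(z^{(s)})$ and $m = \max_s \log w^{(s)}$. The estimate in Eq.~\eqref{eq:IWAE} can then be written as
\begin{multline*}
    \log \frac{1}{S} \sum_{s=1}^S w^{(s)} = m - \log S \\
    + \log \sum_{s=1}^S \exp\bigl(\log w^{(s)} - m\bigr),
\end{multline*}
which avoids underflow when individual importance weights are very small, as can happen for high-dimensional $x_M$.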
%, over 1000 held-out test data points, by importance sampling using $S$ samples from the inferred posterior in the latent space, $q(z)$ . For instance, to sample from $q(z)$ for the case of "Feat. QAVI" means we first draw
%$S$ samples $x \sim q_\lambda(x_M)$, followed by one sample $z \sim q(z \mid x)$. 
%Refer to section 2 in the supplement for more details regarding this posterior for individual baseline methods.
%This metric, originally proposed by \cite{pmlr-v32-rezende14} is shown in eq. \ref{eq:IWAE}.
%\begin{equation}
    %IWAE(q(z)) 
%    \log p_\theta(x_M) =  \log \frac{1}{S} \sum_{s=1}^S  \frac{ p_\theta(x_M,z^{(s)})}{ q(z^{(s)})}; z^{(s)} \sim q(z)
%\label{eq:IWAE}
%\end{equation}

\begin{figure*}[t!]
%\vspace{.3in}
\begin{subfigure}[t]{0.5\linewidth}
   \centerline{\includegraphics[width=\linewidth]{plots/vd_imgs/img_0-2.pdf}}
%\vspace{.3in}
%\caption{}
 \end{subfigure}
 \begin{subfigure}[t]{0.5\linewidth}
   \centerline{\includegraphics[width=\linewidth]{plots/vd_imgs/img_1-1.pdf}}
%\vspace{.3in}
%\caption{}
 \end{subfigure}
 \begin{subfigure}[t]{0.5\linewidth}
   \centerline{\includegraphics[width=\linewidth]{plots/vd_imgs/img_2-1.pdf}}
%\vspace{.3in}
%\caption{}
 \end{subfigure}
 \begin{subfigure}[t]{0.5\linewidth}
   \centerline{\includegraphics[width=\linewidth]{plots/vd_imgs/img_3-1.pdf}}
%\vspace{.3in}
%\caption{}
 \end{subfigure}
 \begin{subfigure}[t]{0.5\linewidth}
   \centerline{\includegraphics[width=\linewidth]{plots/vd_imgs/img_4-1.pdf}}
%\vspace{.3in}
%\caption{}
 \end{subfigure}
 \begin{subfigure}[t]{0.5\linewidth}
   \centerline{\includegraphics[width=\linewidth]{plots/vd_imgs/img_5-1.pdf}}
%\vspace{.3in}
%\caption{}
 \end{subfigure}
  \caption[]
{Inpainting results on the FFHQ-256 dataset, comparing our non-amortized deep QAVI inpainting with VAEAC and Posterior Matching. We also compare QAVI results for the reduced-size models used in Table~\ref{table0} to inpaintings from the original ``very-deep'' HVAE of \citet{child2021very}. We show the true and masked images, and 5 posterior samples for each method.
} 
\label{fig:hvae}
\end{figure*}

Because the importance-weighted likelihood estimator becomes expensive and unreliable for high-dimensional models like HVAEs, we instead use perceptual metrics to evaluate inpainting results for HVAEs on the FFHQ dataset. Table~\ref{table1} reports three metrics on a test set of 1000 images: FID~\citep{heusel2017gans_fid_score}, as well as P-IDS and U-IDS~\citep{zhao2021comodgan_comodgan}. We modify P-IDS and U-IDS slightly to reduce sensitivity to the test set size; see the supplement for details. %We evaluate at 3 different sampling temperatures: 0.7, 0.85, and 1.0 (see \cite{child2021very}) for "Posterior Match", "VAEAC" and "Gaus. QAVI" methods.  
%We then quantify and characterise the differences in inferred latent space posteriors induced by all inference techniques using the metric in Eq. \ref{eq:IWAE} for all datasets. 

%We also apply the inference methods on a variety of VAEs, from single-stochastic layer VAEs to very deep hierarchical VAEs. 

%We compare our non-amortized inference methods to a variety of approaches that have been previously proposed. 


%The particular VAEs used in our experiments were trained separately for MNIST, SVHN and individual tabular datasets, by maximizing the evidence lower bound (ELBO) in Eq. -- . We use a pre-trained "very deep" hierarchical VAE from \cite{child2021very} as our generative model for the FFHQ-256 dataset.  %Specific implementation details are provided as supplementary material. 

\renewcommand{\arraystretch}{0.75}

% \begin{table*}
% %\renewcommand{\arraystretch}{1.02}
% \scriptsize
% \centering
% %\refstepcounter{table}
% \caption{Test missing data log-likelihoods (LL) and normalized root mean-square error (NRMSE) for 6 UCI datasets using 10,000 samples. NRMSE per test row is the minimum mean-square error calculated across these 10,000 samples. Lower is better for MSE, and higher is better for LL. We highlight the best performing inference methods in bold and, observe that QAVI outperforms across datasets.}
% \label{table0}
% \arrayrulecolor{black}
% \begin{tabular}{lllllllllllll} 
% \hline
% & \multicolumn{2}{c}{Breast Cancer} & \multicolumn{2}{c}{Red wine} & \multicolumn{2}{c}{White wine} & \multicolumn{2}{c}{Banknote} & \multicolumn{2}{c}{Concrete} & \multicolumn{2}{c}{Yeast}             \\ 
% \cline{2-13}
%  & LL   & NRMSE & LL   & NRMSE  & LL  & NRMSE  & LL  & NRMSE & LL & NRMSE  & LL & NRMSE  \\ 
% \hline

% Fill 0s  & -32.56 & 0.4 & -21.97 & 0.36   & -10.09 & 0.37  & -17.46 & 0.32 & -23.33 & 0.32 & -11.27 & 0.48 \\
% Re-tuned  & -9.78 & \textbf{0.33} & -13.60  & 0.4 & -10.90  & 0.4 & -4.23 & 0.14  & -13.89  & 0.41 & -1.11 &  0.48  \\
% Collier & -12.82 & 0.4 & -12.96 & 0.4 & -9.59 & 0.39 & -4.26 & 0.14 & -11.48 & 0.35 & -33.62 & 0.5 \\
% PM & -15.14 & \textbf{0.33} & -14.83 & 0.42 & -9.19 & 0.39 & -4.42 & 0.14 & -15.05 & 0.4 &  -430.36 & 0.47   \\
% \hline
% PG & -12.61 & \textbf{0.33} & -33.53 & 0.36 & -10.68 & 0.36 & -44.16 & 0.4 & -249.06 & 0.36 &  -5.11 & 0.48 \\
% MWG & -23.94 & 0.44 & -74.22 & 0.68 & -19.83 & 0.58 &  -49.51 & 0.68 & -265.62 & 0.62 &  -1.86 & 0.54 \\
% \hline
% %$q(x_M)$ & -15.91 & 0.16 %0.097 
% %& -31.66 & 0.44 & -8.48 & 0.10 & -21.82 &0.11 & -405 & 0.86 %0.007 
% %& -57.47 &  0.62 \\
% Feat. QAVI & -16.32 & 0.37 %0.097 
% & -13.2 & 0.28 & -8.8 & 0.38 & -8.27  & 0.15 & -15.14 & 0.23 %0.007 
% & -7.81 &  0.47 \\
% %\hline
% Gaus. QAVI & -9.21 & 0.39 & -6.56 & 0.3 & -6.94 & \textbf{0.3} & -2.35 & \textbf{0.1} & -2.82 &  \textbf{0.17} & \textbf{2.82} & \textbf{0.46} \\
% Flow QAVI & -9.14 & 0.39 & -6.47 & \textbf{0.26} & -7.03 & 0.35 & \textbf{-2.24} & \textbf{0.1} & \textbf{-2.65 }& \textbf{0.17} & 2.55 & \textbf{0.46}  \\ 
% Mix. QAVI & -9.16 & 0.39 & \textbf{-6.35} &  \textbf{0.26} & -7.03 & 0.35 & -2.25 & \textbf{0.1} & -2.70 & \textbf{0.17} & 2.42 & \textbf{0.46} \\
% \hline 
% %Gaussian & -9.05 & 0.15 & -6.56 & 0.09 & \textbf{-6.91} & \textbf{0.09} & -2.34 & \textbf{0.01} &  -2.74 &  \textbf{0.03} & 2.81 & \textbf{0.21} \\
% %IAF & -9.20 & 0.15 & -6.56 & 0.09 & -6.95 & \textbf{0.09} & -2.34 & \textbf{0.01} & -2.77 & \textbf{0.03} & \textbf{2.82} & \textbf{0.21} \\ 
% %Mixture & \textbf{-8.88 }& 0.15 & -6.59 & \textbf{0.09} & -6.96 & 0.09 & -2.35 & \textbf{0.01} & -2.79 & 0.04 & 2.81 & \textbf{0.21} \\
% %\hline 
% %True & -8.94 & 0.1 & -7.31 & 0.03 & -7.37 & 0.08 & -3.02 & 0.01 & -4.46 & 0.01 & 2.18 & 0.19 \\
% \arrayrulecolor{black}\hline
% \end{tabular}
% \end{table*}




 

\subsection{Results}
%\vgap
\textbf{QAVI improves imputation quality.\,} Fig.~\ref{fig:iwae_plot_1000} and Table~\ref{table0} compare the log-likelihood of missing features across MNIST, SVHN, and tabular datasets. Heuristic imputation and Monte Carlo methods perform poorly. Re-tuned Encoder and Posterior Match achieve relatively higher likelihoods, but do not match QAVI on any dataset or missingness pattern. Feat.~QAVI is competitive for tabular data, but for high-dimensional images it is susceptible to local optima. Fig.~\ref{fig:accuracies2} similarly shows that QAVI provides reliably strong performance for downstream tasks. % Overall, QAVI designed to optimize the latent space parameters and disregarding any amortization, have the highest log-likelihoods across the datasets. Moreover, expressive posteriors like Flow and Mixture of Gaussians, further seem to improve log-likelihoods across datasets.   

%Table \ref{table0} reports the test missing data log-likelihoods (LL) from Eq. \ref{eq:IWAE} and normalized root mean-square error (NRMSE) over 10,000 samples across the different inference methods for 6 tabular datasets. We observe that "Feat. QAVI" is competitive for low-dimensional tabular data, but for high-dimensional image data it is susceptible to local optima, and hence, alternatively, posteriors on the latent space seem to be a better choice to impute missing features in images. 
%in fact, using the pre-trained amortized inference network while

%\vgap
\textbf{QAVI can capture multi-modal posterior uncertainty.\,} We show imputations for four test examples in Fig.~\ref{fig:imputations} to highlight differences between inference methods, and to explore the uncertainty in the inferred posterior over the missing features. We see that Gaussian and Flow QAVI defined on the latent space are capable of capturing uncertainty in the missing features. With a mixture of Gaussians variational family, QAVI produces multiple visually plausible imputations. Fig.~\ref{fig:accuracies2} shows that expressive posteriors achieve high relative classification accuracy despite the increased variance in samples. %For a toy visualization of the posteriors resulting with QAVI, we visualize Kernel Density estimates for 2 test examples from the UCI datasets in Fig. \ref{fig:posterior-uci} and observe that QAVI is able to model the variance in the plausible missing values. 

%\vgap
\textbf{QAVI benefits extend to HVAEs.\,} We find that QAVI also integrates well with hierarchical VAE models. The results in Table~\ref{table1} show that our QAVI approach for HVAEs produces imputations with better perceptual scores than prior methods that adapt HVAEs to inpainting. Fig.~\ref{fig:hvae} shows that samples produced by QAVI are qualitatively more plausible, and also capture substantial diversity in possible feature imputations. 
%Further results in the supplement show the diversity of plausible samples. 

%\vgap
\textbf{Amortized imputation is sensitive to training queries.\,} Fig.~\ref{fig:iwae_plot_1000} compares the performance of Posterior Matching when trained with different masking distributions. We see that in the absence of any prior knowledge of the true masking distribution at train time, the performance of Posterior Matching can be as poor as simple heuristics like Fill Zeros. Even in the unrealistic case where the exact distribution of missingness is known at train time, Posterior Matching does not outperform QAVI. This sensitivity to the choice of missingness for training is significant: adaptation to new patterns of missingness requires full retraining. In sensitive domains such as medicine, access to the original training set may be restricted or even impossible. QAVI is indifferent to the structure of queries, requires no retraining, and \emph{still} outperforms the ``best case'' amortized imputation.

\textbf{QAVI smoothly trades off time and performance.\,} Fig.~\ref{fig:timecomplexity} illustrates the tradeoff between performance and optimization time for three variants of non-amortized QAVI, for 100 images from the MNIST dataset. We see that Posterior-Matching performs far worse than QAVI, \emph{and} requires substantial overhead to train the partial encoder (over 3.5 hours). Gaussian and Flow QAVI converge in about one minute. Mixture QAVI converges a bit more slowly, but ultimately reaches the best solutions of any method. Posterior-Match amortization would have computational advantages for very large query sets (thousands of images), but would still have inferior inference accuracy.  % We see that QAVI converges quickly and outperforms the Posterior Match baseline, even when allowing Posterior Match training to fully converge.
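
As a rough, back-of-the-envelope illustration of this tradeoff (a sketch under stated assumptions, not a measured result), suppose that QAVI cost grows linearly at about $c_Q$ per query, and that amortized inference has a one-time training cost $T_A$ and negligible per-query cost. The symbols $c_Q$, $T_A$, and the number of queries $N$ are introduced here only for illustration. Amortization is then cheaper only when
\begin{equation*}
    N \, c_Q > T_A \iff N > T_A / c_Q .
\end{equation*}
Taking the approximate figures above, $T_A \approx 3.5$ hours $= 210$ minutes, and assuming the one-minute convergence time of Gaussian QAVI applies to the full batch of 100 images (so $c_Q \approx 0.01$ minutes per image), the break-even point is roughly $N \approx 210 / 0.01 = 21{,}000$ images. Hardware parallelism and the slower convergence of Mixture QAVI would shift this estimate.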
%we can observe the trade-off between time and performance for QAVI and the best performing amortized approach, Posterior Match for a batch of 100 MNIST images with patches missing. For a small batch of data, QAVI performs better than Posterior Match and with less time. We also observe that Mix. QAVI outperforms Gaus. QAVI when given more training time.

\begin{table}[t]
\renewcommand{\arraystretch}{1.01}
\centering
%\refstepcounter{table}
\caption{Quantitative comparison of perceptual inpainting quality on the FFHQ-256 dataset. We compare QAVI against two state-of-the-art adaptations of HVAEs to inpainting, using the same base ``very-deep'' HVAE architecture.}
\label{table1}
\arrayrulecolor{black}
\begin{tabular}{llll}
\hline
Method          & FID $\downarrow$ & P-IDS* $\uparrow$ & U-IDS* $\uparrow$ \\ \hline
QAVI            & \textbf{21.21}   & \textbf{6.20}     & \textbf{24.98}    \\
Posterior Match & 23.68            & 3.36              & 21.55             \\
VAEAC           & 26.41            & 2.31              & 18.19             \\
\hline
\end{tabular}
\end{table}


%, we compare the methods using FID scores %(\cite{heusel2017gans_fid_score}) computed with the true FFHQ-256 images.

%For tabular datasets, we report the estimated log-likelihoods for missing data with $10k$ samples and additionally report minimum mean squared error over $10,000$ imputed samples with the true values, averaged across all test images. 
%e present this in Table \ref{table:uci}. 




\section{Conclusion}
\label{sec:resultDeep}
We have presented a simple and general framework that has been unexplored in prior work employing VAEs for the imputation of missing data. Previous state-of-the-art approaches rely on a restrictive inference network as an imputation strategy. We instead take an existing VAE generative model (decoder), allocate variational parameters for the latent code of each missing data point, and train the parameters stochastically to optimize the induced variational bound. The simple structure of our bounds enables efficient and accurate approximation of the posterior distribution of missing features, given any pattern of observed features. %he derived ELBO that, when combined with the pre-trained model, is able to represent $p_\theta(x_M|x_O)$ when any subset of data $x$ is missing. 

We evaluated QAVI on a variety of VAEs, including current state-of-the-art hierarchical VAEs, and several datasets. We found that non-amortized QAVI with Gaussian, and especially Flow or Mixture, posterior approximations outperforms previous heuristic and amortized inference methods for data imputation with VAEs. Importantly, we found that a Gaussian Mixture posterior is able to effectively capture the multi-modality that often arises given missing data. 

\begin{figure}[t]
%\vspace{.3in}
\begin{subfigure}[t]{\linewidth}
   \centerline{\includegraphics[width=\linewidth]{plots/bar_chart_without_bottom_row.pdf}}
%\vspace{.3in}
%\caption{}
 \end{subfigure}

  \caption[]
{Classification accuracies using 100 samples from the inferred posterior for randomized missing queries on MNIST (\emph{top}) and SVHN (\emph{bottom}). We use a trained discriminative model, with WRN-28-2 architecture~\citep{Zagoruyko2016WideRN}, to predict class labels.
%classification error on the mode class predicted by the set of samples with the true label. 
} 
\label{fig:accuracies2}
\end{figure}


  \begin{figure}[h]
%\vspace{.3in}
\begin{subfigure}[t]{\linewidth}
   \centerline{\includegraphics[width=\linewidth]{supplement_plots/pmvsQAVI.pdf}}
%\vspace{.3in}
%\caption{}
 \end{subfigure}

  \caption[]
{Importance-weighted log-likelihood (IWAE) estimates for missing pixels versus wall-clock training time (in hours). Likelihoods are estimated at each step of optimization for 100 MNIST images with pixels missing via Random-Patches. We plot the mean and standard deviation across 10 runs of QAVI methods. We compare to Posterior-Match, whose amortized inference network requires over 3.5 hours to train. The dip in mixture log-likelihoods occurs at random re-initialization of mixture parameters to avoid local optima; see supplement Sec.~2.4 for details.
} 
\label{fig:timecomplexity}
\end{figure}

In this work, we do not consider \emph{missing-not-at-random} (MNAR) data~\citep{ipsen2020not}, but we conjecture that QAVI will provide a foundation for future advances in MNAR inference. QAVI provides a simple, effective, and general approach for inferring missing data with arbitrary patterns that is attractive when queries are unknown during training and uncertainty in missing data is high. 

\begin{acknowledgements} 
This research was supported in part by NSF Robust Intelligence Award No.~IIS-1816365, and by the HPI Research
Center in Machine Learning and Data Science at UC Irvine.
\end{acknowledgements}

\clearpage

% References
\bibliography{agarwal_747}
\end{document}
