\documentclass[accepted]{uai2022}
\usepackage[table]{xcolor}

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros

% -- begin our custom commands and packages not included in uai2022.cls or its template  --
\usepackage{amsfonts,amssymb}
\usepackage{graphicx}
\usepackage{wrapfig}
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}{Lemma}
\newtheorem{definition}{Definition}
\newcommand{\KL}{{\rm KL}}     % KL
\newcommand{\CE}{\mathcal{CE}} % Cross entropy
\renewcommand{\H}{\mathcal{H}} % Entropy
\newcommand{\MI}{\mathcal{I}}  % Mutual Information
\newcommand{\E}{\mathbb{E}}    % Expectations
\renewcommand{\L}{\mathcal{L}}   % Lagrange
\newcommand{\q}{{\rm q}}       % component
\newcommand{\p}{{\rm p}}       % true p
\newcommand{\Z}{{\rm Z}}       % normalizing constant
\newcommand{\x}{\mathbf{x}}    % the var we're inferring
\newcommand{\randomt}{\mathbf{t}}    % random direction in x-space
\newcommand{\D}{\mathcal{D}}   % condition on data
\newcommand{\m}{{\rm m}}       % mixture
\newcommand{\rd}{{\rm d}}      % derivative
\newcommand{\FIM}{\mathcal{F}} % Fisher Information Matrix
\newcommand{\thetastar}{\theta^*} % Best theta (VI solution)
\DeclareMathOperator*{\argmin}{arg\,min}
\newcommand{\qed}{\hfill\ensuremath{\blacksquare}}

% Enable tighter layout of figures and text on the same page (see https://aty.sdsu.edu/bibliog/latex/floats.html)
\renewcommand{\topfraction}{0.9}    % max fraction of floats at top
\renewcommand{\bottomfraction}{0.8} % max fraction of floats at bottom
%   Parameters for TEXT pages (not float pages):
\setcounter{topnumber}{2}
\setcounter{bottomnumber}{2}
\setcounter{totalnumber}{4}     % 2 may work better
\setcounter{dbltopnumber}{2}    % for 2-column pages
\renewcommand{\dbltopfraction}{0.9} % fit big float above 2-col. text
\renewcommand{\textfraction}{0.07}  % allow minimal text w. figs
%   Parameters for FLOAT pages (not text pages):
\renewcommand{\floatpagefraction}{0.7}  % require fuller float pages
% N.B.: floatpagefraction MUST be less than topfraction !!
\renewcommand{\dblfloatpagefraction}{0.7}   % require fuller float pages

\usepackage{xr}
\makeatletter
\newcommand*{\addFileDependency}[1]{% argument=file name and extension
  \typeout{(#1)}
  \@addtofilelist{#1}
  \IfFileExists{#1}{}{\typeout{No file #1.}}
}
\makeatother
\newcommand*{\myexternaldocument}[1]{%
    \externaldocument{#1}%
    \addFileDependency{#1.tex}%
    \addFileDependency{#1.aux}%
}
\myexternaldocument{lange_658-supp}
% -- end custom definitions --

\title{Interpolating Between Sampling and Variational Inference with Infinite Stochastic Mixtures}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
% \author[1]{\href{mailto:<jj@example.edu>?Subject=Your UAI 2022 paper}{Jane~J.~von~O'L\'opez}{}}
% \author[1]{Harry~Q.~Bovik}
% \author[1,2]{Further~Coauthor}
% \author[3]{Further~Coauthor}
% \author[1]{Further~Coauthor}
% \author[3]{Further~Coauthor}
% \author[3,1]{Further~Coauthor}
\author[1]{\href{mailto:lange.richard.d@gmail.com}{Richard D. Lange}{}}
\author[1]{Ari S. Benjamin}
\author[2]{Ralf M. Haefner$^*$}
\author[3]{\href{mailto:xaq@rice.edu}{Xaq Pitkow$^*$}{}}
% % Add affiliations after the authors
\affil[1]{%
    Dept. of Neurobiology\\
    University of Pennsylvania\\
    Philadelphia, Pennsylvania, USA
}
\affil[2]{%
    Dept. of Brain and Cognitive Sciences\\
    University of Rochester\\
    Rochester, New York, USA
}
\affil[3]{%
    Baylor College of Medicine\\
    Rice University\\
    Houston, Texas, USA
  }
\affil[*]{equal contribution}

\begin{document}

\maketitle

\begin{abstract}
Sampling and Variational Inference (VI) are two large families of methods for approximate inference that have complementary strengths. Sampling methods excel at approximating arbitrary probability distributions, but can be inefficient. VI methods are efficient, but may misrepresent the true distribution. Here, we develop a general framework where approximations are stochastic mixtures of simple component distributions. Both sampling and VI can be seen as special cases: in sampling, each mixture component is a delta-function and is chosen stochastically, while in standard VI a single component is chosen to minimize divergence. We derive a practical method that interpolates between sampling and VI by analytically solving an optimization problem over a mixing distribution. Intermediate inference methods then arise by varying a single parameter. Our method provably improves on sampling (reducing variance) and on VI (reducing bias+variance despite increasing variance). We demonstrate our method's bias/variance trade-off in practice on reference problems, and we compare outcomes to commonly used sampling and VI methods. This work takes a step towards a highly flexible yet simple family of inference methods that combines the complementary strengths of sampling and VI.
\end{abstract}

\section{Introduction}

\begin{figure*}[ht]
    \centering
    \includegraphics{figures/figure_concept_v2}
    \caption{Conceptual introduction on a simple 2D example -- the ``banana'' distribution.
    \textbf{a)} Sampling methods approximate the underlying $\p(\x)$ with a stochastic set of representative points, $\x \sim \p(\x)$.
    \textbf{b)} Variational Inference (VI) methods begin by selecting an approximating distribution family, $\q(\x;\theta)$, here a Gaussian with diagonal covariance plotted as an ellipse at its $1\sigma$ contour. The optimal parameters $\thetastar$ are chosen to minimize $\KL(\q(\x;\theta)||\p(\x))$.
    \textbf{c)} We propose using a stochastic mixture of component distributions, where \emph{parameters} $\theta$ are sampled from a ``mixing distribution'' $\psi(\theta)$, i.e. $\theta \sim \psi(\theta)$.
    }
    \label{fig:concept}
\end{figure*}

We are concerned with the familiar and general case of approximating a probability distribution, such as occurs in Bayesian inference when both the prior over latent variables and the likelihood function connecting them to data are known, but computing the posterior exactly is intractable. There are two largely separate families of techniques for approximating such intractable inference problems: Markov Chain Monte Carlo (MCMC) sampling, and Variational Inference (VI) \citep{Bishop2006,Murphy2012}.

Sampling-based methods, including MCMC, approximate a distribution with a finite set of representative points. MCMC methods are stochastic and sequential, generating a sequence of sample points that, given enough time, becomes increasingly representative of the underlying distribution. MCMC sampling is (typically) asymptotically unbiased, at the expense of high variance, leading to long run times in practice. Similar to the approach we take here, sampling methods are studied at different scales: both in terms of their asymptotic limit (i.e. their bias at infinitely many samples) and their practical behavior for finite samples or other resource limits \citep{Korattikara2013,Angelino2016}.

Variational Inference (VI) refers to methods that produce an approximate distribution by minimizing some quantification of divergence between the approximation and the desired posterior distribution \citep{Blei2017,Zhang2019}. For the purposes of this paper, we will use VI to refer to the most common flavor of variational methods, namely minimizing the Kullback-Leibler ($\KL$) divergence between an approximate distribution from a fixed family and the desired distribution \citep{Bishop2006,Wainwright2008,Murphy2012,Blei2017}. %Whereas sampling methods can be thought of as taking place ``in the space of random variables,'' VI takes place ``in the space of parameters.'' For instance, samples of a real-valued scalar random variable are themselves real-valued scalars, but the variational problem of finding the best-fitting Gaussian is an optimization problem on the two-dimensional manifold defined by the Gaussian's mean and variance. 
The best-fitting approximate distribution is often used directly as a proxy for the true posterior in subsequent calculations, which can greatly simplify those downstream calculations if the approximate distribution is itself easy to integrate. In contrast to MCMC, VI is often used in cases where speed is more important than asymptotic bias \citep{Angelino2016,Blei2017,Zhang2019}.

Our goal is to develop an intermediate family of methods that ``interpolate'' between MCMC and VI, inspired by a simple and intuitive picture (Figure \ref{fig:concept}): we propose applying sampling methods \emph{in the space of variational parameters} such that the resulting approximation is a stochastic mixture of variational ``component'' distributions \citep{Yin2018}. This extends sampling by replacing the sampled points with extended components, %analogous to kernel density estimation but in the space of inferred rather than observed variables. It 
and it extends VI by replacing the single best-fitting variational distribution with a stochastic mixture of more localized components. This is qualitatively distinct from previous variational methods that use \emph{stochastic optimization}: rather than stochastically optimizing a single variational approximation \citep{Hoffman2013,Salimans2015}, we use stochasticity to construct a \emph{random mixture} of variational components that achieves lower asymptotic bias than any one component could. As we will show below, this framework generalizes both sampling and VI, which emerge as special cases of a single optimization problem.

This paper is organized as follows. In section \ref{sec:setup}, we set up the problem and our notation, and describe how both classic sampling and classic VI can be understood as special cases of stochastic mixtures. In section \ref{sec:framework}, we introduce an intuitive framework for reasoning about infinite stochastic mixtures and define an optimization problem that captures the trade-off between sampling and VI. Section \ref{sec:algorithm} introduces an approximate objective and closed-form solution and describes a simple practical algorithm. Section \ref{sec:bias_variance} gives empirical and theoretical results that show how our method interpolates the bias and variance of sampling and VI. Finally, section \ref{sec:discussion} concludes with a summary, related work, limitations, and future directions.

\section{Setup and Notation}\label{sec:setup}

Let $\p^*(\x)=\Z\p(\x)$ denote the unnormalized probability distribution of interest, with unknown normalizing constant $\Z$. For instance, in the common case of a probabilistic model with latent variables $\x$, observed data $\D$, and joint distribution $\p(\x,\D)$, we are interested in approximations to the posterior distribution $\p(\x|\D)$. This is intractable in general, but we assume that we have access to the unnormalized posterior $\p^*(\x|\D) = \p(\D|\x)\,\p(\x) = \Z\,\p(\x|\D)$.\footnote{To reduce clutter, $\D$ will be dropped in the remainder of the paper, and we will use only $\p(\x)$ and $\p^*(\x)$.} %We will use $\x^{(i)}$ to denote the $i$th sample of $\x$ in a sequence of samples, and $\q(\x;\theta)$ to denote a single variational component with parameters $\theta$. Samples of $\theta$ will likewise be denoted $\theta^{(i)}$.
Let $\q(\x;\theta)$ be any ``simple'' distribution that may be used in a classic VI context (such as mean-field or Gaussian), and let $\m_T(\x)$ be a mixture containing $T$ of these simple distributions as components, defined by a set of $T$ parameters $\lbrace{\theta^{(1)}, \ldots, \theta^{(T)}\rbrace}$:
\begin{equation}\label{eqn:define_m_t}
    \m_T(\x) \equiv \frac{1}{T} \sum\limits_{t=1}^T \q(\x ; \theta^{(t)}) \, .
\end{equation}
For example, if $\q$ is a multivariate normal with mean $\mu$ and covariance $\Sigma$, then $\theta^{(t)}=\lbrace{\mu^{(t)},\Sigma^{(t)}}\rbrace$ and $\m_T(\x)$ would be a mixture of $T$ component normal distributions \citep{Gershman2012b,Zobay2014}.

We will study properties of mixing distributions, or distributions over component parameters, which we denote $\psi(\theta)$ \citep{Ranganath2016}. If the set of $\theta^{(t)}$ is drawn randomly from $\psi(\theta)$, then as $T \rightarrow \infty$, $\m_T(\x)$ approaches the idealized infinite mixture,
\begin{equation}\label{eqn:define_m}
    \m(\x) \equiv \int_\theta \q(\x;\theta) \psi(\theta) \rd \theta \, .
\end{equation}
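As a concrete illustration of (\ref{eqn:define_m}), under assumptions chosen purely for tractability and not used elsewhere in this paper, let $\x$ be a scalar, written $x$, let each component be a Gaussian with fixed variance $\sigma^2$ and mean $\theta$, and let the mixing distribution itself be Gaussian, $\psi(\theta)=\mathcal{N}(\theta;0,\tau^2)$. The infinite mixture is then available in closed form,
\begin{equation*}
    \m(x) = \int_\theta \mathcal{N}(x;\theta,\sigma^2)\,\mathcal{N}(\theta;0,\tau^2)\,\rd\theta = \mathcal{N}(x;0,\sigma^2+\tau^2) \, ,
\end{equation*}
so the mixture is broader than any single component, with the spread of $\psi(\theta)$ adding to the component width. Taking $\tau\rightarrow 0$ collapses the mixture onto a single component, while taking $\sigma\rightarrow 0$ leaves a mixture of point masses whose locations are distributed according to $\psi(\theta)$.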

% Consider approximating the intractable distribution of interest, $\p(\x)$, with a mixture of variational components as follows:
% \begin{equation}
% \begin{split}
% \text{such that} \quad \m(\x) \approx \p(\x)
% \end{split}
% \end{equation}
% For example, if $\x$ is real-valued and $n-$dimensional, then $\q(\x;\theta)$ could be an $n-$dimensional multivariate normal with isotropic covariance, in which case $\theta$ would live in $\mathbb{R}^{n+1}$, consisting of $\{\mu_1, \mu_2, \ldots, \mu_n, \sigma \}$. We refer to $\theta$ as the parameters of the variational component $\q(\x;\theta)$. With these components, $\m$ is a mixture of $T$ such multivariate normal distributions \citep{Gershman2012b}.

\paragraph{Sampling and VI as special cases of the mixing distribution.} Let $\thetastar=\argmin_\theta \KL(\q(\x;\theta)||\p(\x))$ be the parameters corresponding to the classic single-component variational solution. VI corresponds to the special case where the mixing distribution $\psi(\theta)$ is a Dirac delta at $\thetastar$, i.e. $\psi(\theta) = \delta(\theta-\thetastar)$, in which case the mixture $\m_T(\x)$ is equivalent to $\q(\x;\thetastar)$ regardless of the number of components $T$. Sampling can also be seen as a special case of $\psi(\theta)$ in which each component narrows to a Dirac delta ($\psi(\theta)$ places negligible mass on regions of $\theta$-space where components have appreciable width), and the means of the components are distributed according to $\p(\x)$. This requires that the component family $\q(\x;\theta)$ be capable of expressing a Dirac delta at any point $\x$, as is the case for a location-scale family. Thus, both sampling and VI can be seen as limiting cases of stochastic mixture distributions, $\m_T(\x)$, defined by a distribution over component parameters, $\psi(\theta)$. In what follows, we will show how designing the mixing distribution $\psi(\theta)$ allows us to create mixtures that trade off the complementary strengths of sampling and VI. %Note that we will analytically study $\m(\x)$, but in practice $T$ will be finite.

\begin{figure*}[ht]
    \centering
    \includegraphics{figures/figure_mi_kl_space_v2}
    \caption{\textit{Left}: Understanding mixtures in terms of Mutual Information and Expected KL.
    \textbf{a)} The quality of any infinite mixture (in terms of $\KL(\m||\p)$) is given by its distance from the $y=x$ line (black diagonal line).
    \textbf{b)} Two unreachable regions are shaded in gray: above the $y=x$ line (because $\KL(\m||\p) \geq 0$), and to the left of the single-component variational solution, since VI achieves the minimum $\KL(\q||\p)$.
    \textbf{c)} When $\psi(\theta)=\delta(\theta-\thetastar)$ as in classic VI, Expected KL is at its minimum and Mutual Information is zero. %In our framework, this is achieved in the $\lambda \rightarrow \infty$ limit. 
    Increasing the expressiveness of $\q$ corresponds to moving left along the x-axis (blue arrow).
    \textbf{d)} Because sampling is unbiased, it is a mixture that lives on the $\KL(\m||\p)=0$ or $y=x$ line. If $\x$ is discrete, the coordinates of the point marked (d) are $(\H[\x], \H[\x])$, i.e. the entropy of $\p(\x)$. When $\x$ is continuous, both Mutual Information and Expected KL grow unboundedly together as the individual components narrow. %In our framework, the $\m(\x)$ mixture behaves like classic sampling as $\lambda \rightarrow 1^+$.
    \textbf{e)} Any point on the $y=x$ line implies $\m(\x)=\p(\x)$, and this may be possible without resorting to sampling for certain combinations of $\p$ and $\q$. However, such mixtures are not guaranteed to exist for all problems, and are difficult to find due to the intractability of Mutual Information.
    \textbf{f)} We propose a family of mixture approximations, parameterized by $\lambda$, that connects VI to sampling in a natural and principled way. Points on this curve correspond to solutions to the (approximate version of the) objective in (\ref{eqn:weighted_optim}).
    \textit{Middle}: Examples in a 1D toy problem, where $\p(\x)$ is an unequal mixture of two heavy-tailed distributions (black lines), and $\q(\x;\theta)$ is a single Gaussian component with parameters $\theta=\lbrace{\mu, \log\sigma}\rbrace$ (translucent red components). \textit{Right}: Varying $\lambda$ controls the mixing distribution over $\theta$ (image). Red points correspond to the Gaussian components in the middle.
    }
    \label{fig:mi_kl_space}
\end{figure*}

\section{Conceptual Framework}\label{sec:framework}

\subsection{Decomposing $\KL(\m||\p)$ Into Mutual Information and Expected KL}

The idealized infinite mixture $\m(\x)$ is fully defined by the chosen component family $\q(\x;\theta)$ and the mixing distribution $\psi(\theta)$. Consider the variational objective with respect to the entire mixture, $\KL(\m || \p)$:
\begin{equation}\label{eqn:kl}
\begin{split}
    \KL(\m||\p) = \int_\x \m(\x) \log \frac{\m(\x)}{\p^*(\x)} \rd \x + \log \Z\, ,
\end{split}
\end{equation}
where $\Z$ is the normalizing constant of $\p^*(\x)$ and is irrelevant for constructing $\m(\x)$. Instead of (\ref{eqn:kl}), one can use the equivalent objective of maximizing the \emph{Evidence Lower BOund} or ELBO \citep{Bishop2006,Murphy2012,Blei2017}. Regardless, minimizing (\ref{eqn:kl}) or maximizing the ELBO for mixtures is intractable in general. However, as first shown by \citet{Jaakkola1998} for finite mixtures, it admits the following useful decomposition:
\begin{equation}\label{eqn:kl_mi}
\begin{split}
    \KL(\m||\p) &= \underbrace{\int_\theta \psi(\theta) \int_\x \q(\x;\theta) \log \frac{\q(\x;\theta)}{\p^*(\x)} \rd\x\,\rd\theta}_\text{(i) Expected KL} \\
    &- \underbrace{\int_\theta \psi(\theta) \int_\x \q(\x;\theta) \log \frac{\q(\x;\theta)}{\m(\x)} \rd\x\,\rd\theta}_{\text{(ii) Mutual Information } \MI[\x;\theta]}
\end{split}
\end{equation}
(dropping $\log \Z$).
The first term, (i), is the \textbf{Expected KL Divergence} for each component when the parameters are drawn from $\psi(\theta)$. This term quantifies, on average, how well the mixture components match the target distribution. In isolation, Expected KL is minimized when all components individually minimize $\KL(\q||\p)$, i.e. when $\psi(\theta) \rightarrow \delta(\theta-\thetastar)$. This tendency to concentrate $\psi(\theta)$ on the single best variational solution is balanced by the second term, (ii), the \textbf{Mutual Information} between $\x$ and $\theta$ under the joint distribution $\q(\x;\theta)\psi(\theta)$, which we will write $\MI[\x;\theta]$. This term should be \emph{maximized}, and, importantly, it does not depend on $\p^*(\x)$. Mutual Information is maximized when the components are as diverse as possible, which encourages the components to become narrow and to spread out over diverse regions of $\x$ \emph{regardless} of how well they agree with $\p(\x)$.
This decomposition of $\KL(\m||\p)$ into Mutual Information (between $\x$ and $\theta$) and Expected KL (between $\q$ and $\p$) is convenient because approximations to Mutual Information are well-studied, and minimizing Expected KL can leverage standard tools from VI.
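
For completeness, the decomposition in (\ref{eqn:kl_mi}) can be checked in two steps: first expand the outer $\m(\x)$ using (\ref{eqn:define_m}), then add and subtract $\log\q(\x;\theta)$ inside the integrand. This is only a sketch of the standard argument, with $\log\Z$ dropped as above:
\begin{align*}
    &\int_\x \m(\x) \log\frac{\m(\x)}{\p^*(\x)}\rd\x \\
    &\quad= \int_\theta \psi(\theta) \int_\x \q(\x;\theta) \log\frac{\m(\x)}{\p^*(\x)} \rd\x\,\rd\theta \\
    &\quad= \int_\theta \psi(\theta) \int_\x \q(\x;\theta) \left[\log\frac{\q(\x;\theta)}{\p^*(\x)} - \log\frac{\q(\x;\theta)}{\m(\x)}\right] \rd\x\,\rd\theta \, .
\end{align*}
The two terms in brackets are the integrands of (i) and (ii), respectively.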

\subsection{Trading Off Between Mutual Information and Expected KL}

We will refer back to this decomposition of the $\KL(\m||\p)$ objective into Expected KL and Mutual Information throughout. Figure \ref{fig:mi_kl_space} depicts a two-dimensional space with Expected KL on the x-axis and Mutual Information on the y-axis. Any given mixing distribution $\psi(\theta)$ can be placed as a point in this space, but in general many $\psi(\theta)$'s may map to the same point.

Sampling and VI live at extreme points in this space. Classic VI, where $\psi(\theta)=\delta(\theta-\thetastar)$, corresponds to the blue point (c), because by definition $\thetastar$ achieves the minimum possible $\KL$, and $\MI[\x;\theta]$ is zero. Classic sampling corresponds to the green point (d), with $\psi(\theta)$ placing mass only on Dirac-delta-like components, and selecting each component with probability $\p(\mu)$, where $\mu$ is the mean of $\q$ determined by $\theta$.

Towards the goal of constructing mixtures that trade off properties of sampling and VI, we propose to view the two terms in (\ref{eqn:kl_mi}) as separate objectives that may be weighted differently, and to maximize the objective
\begin{equation}\label{eqn:weighted_optim}
    \L(\psi,\lambda) = \MI[\x;\theta] - \lambda \E_\psi\left[\KL(\q||\p)\right]
\end{equation}
for a given hyperparameter $\lambda$ with respect to the mixing distribution $\psi(\theta)$. %\footnote{Multiplying Expected KL by $\lambda$ or Mutual Information by $\lambda^{-1}$ lead to the same results.} 
This objective may alternatively be viewed as the Lagrangian of a constrained optimization problem over the mixing density $\psi(\theta)$, where Mutual Information is maximized subject to a constraint on Expected KL. This is a concave maximization problem with linear constraints, defining a Pareto front of solutions that each achieve a different balance between Expected KL and Mutual Information. 
Maximizing Mutual Information necessitates approximations \citep{Poole2019}, so there may be good mixture approximations that are not found in practice, such as the yellow point (e) in Figure \ref{fig:mi_kl_space}. In section \ref{sec:algorithm} below, we use an approximation to Mutual Information that has the property, illustrated by the orange curve (f) in Figure \ref{fig:mi_kl_space}, of connecting VI (c) to sampling (d), controlled by varying $\lambda$. As shown on the right of Figure \ref{fig:mi_kl_space}, our method produces mixtures that behave like classic samples when $\lambda=1$, that behave like classic VI when $\lambda\rightarrow\infty$, and that exhibit intermediate behavior at intermediate values of $\lambda$.

We emphasize that this framework is quite general: any stochastic mixture can be reasoned about in terms of its Expected KL and Mutual Information, and this is a natural space in which to think about interpolating sampling and VI. A similar decomposition of $\KL(\m||\p)$ (or the ELBO) has been used by previous methods that optimize mixtures \citep{Zobay2014,Jaakkola1998,Gershman2012b,Yin2018}. These previous methods differ primarily in how they approximate (or lower-bound) Mutual Information. In the next section, we introduce a new approximation that is particularly efficient, and is the first to our knowledge that replicates sampling-like behavior with finitely many components.

\section{Approximate Objective}\label{sec:algorithm}

Maximizing Mutual Information, as is required by (\ref{eqn:weighted_optim}), is a notoriously difficult problem that arises in many domains, and there is a large collection of approximations and bounds in the literature \citep{Jaakkola1998,Brunel1998,Gershman2012b,Wei2016,Kolchinsky2017,Poole2019}. Previous work has optimized \emph{finite} mixtures by considering how each of $T$ components interacts with the other $T-1$ components, resulting in quadratic scaling with $T$ \citep{Gershman2012b,Zobay2014,Guo2016,Miller2017,Kolchinsky2017,Nalisnick2017,Yin2018,Poole2019}. Beginning instead with \emph{infinite} mixtures, we find that the local geometry of $\theta$-space is sufficient to provide an approximation to Mutual Information \emph{that can be evaluated independently for each value of $\theta$}, leading to linear scaling with $T$.

\subsection{Stam's Inequality}

Mutual Information between $\x$ and $\theta$ can be written as
\begin{align}
\MI[\x;\theta] &= \H[\theta] - \E_{\m(\x)}\left[\H[\hat{\theta}|\x]\right] \nonumber \\
    &=\H[\theta] - \E_{\psi(\theta)}\big[\underbrace{\E_{\q(\x|\theta)}[\H[\hat\theta|\x]]}_{\H[\hat\theta|\theta]}\big]\label{eqn:mi_theta_hat}
\end{align}
where $\H[\theta]$ is the entropy of $\psi(\theta)$ and $\H[\hat\theta|\x]$ is the entropy of $\q(\hat\theta|\x) = \frac{\q(\x;\hat\theta)\psi(\hat\theta)}{\m(\x)}$, i.e. the distribution of \emph{inferred} $\theta$ values for a given $\x$. The second line follows simply from expanding the definition of $\m(\x)$ in the outer expectation. The term $\H[\hat\theta|\theta]$ can be thought of in terms of a statistical estimation problem: $\hat\theta$ is the ``recovered'' value of $\theta$ after passing through the ``channel'' $\x$. Bounding the error of such estimators is a well-studied problem in statistics.

From (\ref{eqn:mi_theta_hat}), a lower bound on Mutual Information can be derived from an \emph{upper bound} on $\H[\hat\theta|\theta]$ for each $\theta$. % -- in other words, by upper-bounding the entropy of an estimator of the parameter $\theta$. 
For this, we draw inspiration from Stam's inequality \citep{Stam1959,Dembo1991,Wei2016}, which states
\begin{equation}\label{eqn:stams}
    \H[\hat\theta|\theta] \leq \frac{1}{2}\log\left|2\pi e \FIM(\theta)^{-1}\right| \, ,
\end{equation}
where $| \cdot |$ is a determinant, and $\FIM(\theta)$ is the Fisher Information Matrix, defined as
\begin{align*}
    \FIM(\theta)_{ij} = -\E_{\q(\x;\theta)}\left[\frac{\partial^2}{\partial \theta_i\partial\theta_j}\log\q(\x;\theta)\right]\, .
\end{align*}
The Fisher Information Matrix is also the local metric on the \emph{statistical manifold} with coordinates $\theta$ \citep{Amari2016}; it is used to quantify how ``distinguishable'' $\theta$ is from $\theta+d\theta$. Note that (\ref{eqn:stams}) can be viewed as the entropy of a Gaussian approximation to $\q(\hat\theta|\x)$ with precision matrix $\FIM(\theta)$; this approximation is most accurate when $\q(\x;\theta)$ itself is narrow and approximately Gaussian \citep{Wei2016}. 
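
As an illustrative special case (the parameterization here is an assumption made for this example), let $\q(\x;\theta)$ be an $n$-dimensional Gaussian with diagonal covariance and parameters $\theta=\lbrace\mu_1,\ldots,\mu_n,\log\sigma_1,\ldots,\log\sigma_n\rbrace$. Evaluating the definition above gives a diagonal Fisher Information Matrix,
\begin{align*}
    \FIM(\theta) &= \operatorname{diag}\left(\sigma_1^{-2},\ldots,\sigma_n^{-2},2,\ldots,2\right)\, , \\
    \frac{1}{2}\log|\FIM(\theta)| &= \frac{n}{2}\log 2 - \sum_{i=1}^n \log\sigma_i \, ,
\end{align*}
so $|\FIM(\theta)|$ grows without bound as any $\sigma_i\rightarrow 0$. In other words, a fixed displacement of the mean is easier to detect from $\x$ when the component is narrow.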

Combining (\ref{eqn:mi_theta_hat}) and (\ref{eqn:stams}), we propose to use
\begin{equation}\label{eqn:mi_stams}
    \MI_\FIM[\x;\theta] \equiv \H[\theta] + \frac{1}{2}\E_{\psi(\theta)}\left[\log\left|(2\pi e)^{-1}\FIM(\theta)\right|\right]
\end{equation}
as a proxy for the intractable $\MI[\x;\theta]$ in (\ref{eqn:weighted_optim}), having used $\log|\FIM^{-1}|=-\log|\FIM|$. %\footnote{It is notable that (\ref{eqn:mi_stams}) is equivalent to of $\KL(\psi(\theta)||J(\theta))$ where $J(\theta)$ is the Jeffreys prior.}

Note that $\MI_\FIM[\x;\theta]$ has not been proven to be a strict \emph{bound} on $\MI[\x;\theta]$, but may be seen as an \emph{approximation} to it \citep{Wei2016}. Briefly, this is because the original Stam's inequality, as stated in (\ref{eqn:stams}), assumes $\theta$ is a scalar location parameter, and assumes the high-precision limit where $\q(\hat\theta|\x)$ is well-approximated by a Gaussian. Despite this, $\MI_\FIM[\x;\theta]$ is well-suited for our purposes, since (i) it leads to a remarkably simple and easy to implement expression for $\psi(\theta)$ below; (ii) we can prove that it leads to sampling when $\lambda=1$ and VI when $\lambda\rightarrow\infty$; and (iii) we suspect that the inequality in (\ref{eqn:stams}) is nonetheless strict, since we neglect the prior information contained in $\psi(\theta)$ and therefore over-estimate the conditional entropy $\H[\hat\theta|\theta]$. By analogy to the Bayesian Cram{\' e}r-Rao bound \citep{Gill1995,Fauss2021}, a tighter variant of (\ref{eqn:stams}) could be derived that takes into account the prior, though possibly at the expense of added complexity; we leave this to future work.

\subsection{Closed-Form Mixing Distribution}

Substituting $\MI_\FIM[\x;\theta]$ for $\MI[\x;\theta]$ in (\ref{eqn:weighted_optim}) gives the following approximate objective,
\begin{equation}\label{eqn:weighted_optim_f}
    \L_\FIM(\psi,\lambda) = \H[\theta] + \E_\psi\left[\frac{1}{2}\log|\FIM| - \lambda\, \KL(\q||\p^*)\right]
\end{equation}
having dropped additive constants.
This now resembles a maximum-entropy problem with an expected-value constraint, which has the following simple closed-form solution:
\begin{equation}\label{eqn:log_psi_stams}
    \log\psi(\theta) = \frac{1}{2}\log|\FIM(\theta)| - \lambda\, \KL(\q(\x;\theta)||\p^*(\x))
\end{equation}
again dropping additive constants. Equation (\ref{eqn:log_psi_stams}) is strikingly simple, and amenable to many existing MCMC sampling methods for drawing samples of $\theta$ from $\psi(\theta)$. %In particular, there are highly efficient samplers such as Hamiltonian Monte Carlo \citep{Neal2010,Hoffman2014} that require only the unnormalized $\log\psi(\theta)$ and its gradient. These terms are all readily computed: $\log|\FIM(\theta)|$ is known in closed-form for certain families of $\q$ such as Multivariate Gaussians, and $\KL(\q||\p)$ is simply the objective from classic VI for which numerous estimators exist.\footnote{Some common estimators of $\KL(\q||\p)$ are stochastic, e.g. by using Monte Carlo samples from $\q$ \citep{Kucukelbir2017} or batches of data \citep{Hoffman2013}, in which case sampling parameters $\theta \sim \psi(\theta)$ can be done using stochastic gradient methods from the MCMC sampling literature \citep{Ma2015}.}
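
The step from (\ref{eqn:weighted_optim_f}) to (\ref{eqn:log_psi_stams}) is the standard maximum-entropy argument, sketched here for completeness. Write $g(\theta)=\frac{1}{2}\log|\FIM(\theta)|-\lambda\,\KL(\q(\x;\theta)||\p^*(\x))$, a shorthand introduced only for this sketch, and add a multiplier $\nu$ for the normalization constraint $\int_\theta\psi(\theta)\rd\theta=1$. The objective then reads
\begin{equation*}
    \int_\theta \psi(\theta)\left[g(\theta) - \log\psi(\theta)\right]\rd\theta + \nu\left(\int_\theta\psi(\theta)\rd\theta - 1\right) .
\end{equation*}
Setting the functional derivative with respect to $\psi(\theta)$ to zero gives $g(\theta)-\log\psi(\theta)-1+\nu=0$, i.e. $\psi(\theta)\propto\exp\left(g(\theta)\right)$, which is (\ref{eqn:log_psi_stams}). Because the objective is concave in $\psi$, this stationary point is the maximizer whenever $\exp(g(\theta))$ is normalizable.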

Despite being derived from an approximation to our original objective, (\ref{eqn:log_psi_stams}) nonetheless contains both sampling and VI as special cases. As $\lambda\rightarrow\infty$, the $\KL$ term dominates and $\psi(\theta)$ concentrates to $\delta(\theta-\thetastar)$, reproducing VI. When $\lambda=1$, the resulting mixture recovers the behavior of ``sampling'' in the following sense:
\begin{definition}[Sampling]\label{def:sampling}
A stochastic mixture, defined by the component family $\q(\x;\theta)$ and mixing distribution $\psi(\theta)$, is considered to be ``sampling'' if it is \textbf{unbiased} and it consists of \textbf{non-overlapping components}. 
An \textbf{unbiased} mixture is one where $\m(\x) = \p(\x)$. 
A mixture consists of $T$ \textbf{non-overlapping components} if $\sum_{t=1}^T\q(\x;\theta^{(t)}) \approx \max_t \q(\x;\theta^{(t)})$ with high probability.
\end{definition}
Lemma \ref{lem:sampling_approximate_solution} in Appendix \ref{app:sampling} establishes that $\psi(\theta)$ with $\lambda=1$ leads to sampling as defined here, assuming mixture components $\q$ are Gaussian. However, we conjecture that sampling arises from a broader class of $\q$ components as well, though computing and differentiating through $\FIM(\theta)$ for non-Gaussian component families poses additional challenges.

\subsection{Implementation}

Equation (\ref{eqn:log_psi_stams}) provides a closed-form unnormalized log probability density, which is straightforward to sample from using any of a large number of existing sampling methods. For example, discretized Langevin dynamics take the form
\begin{align*}
    % \theta^{(t+1)} = \underbrace{\theta^{(t)} - \gamma\lambda \nabla_\theta \KL(\q(\x;\theta)||\p^*(\x))}_{(i)} + \underbrace{\frac{\gamma}{2}\nabla_\theta \log|\FIM(\theta)|}_{(ii)} + \underbrace{\sqrt{2\gamma}\eta_t}_{(iii)}
    \theta^{(t+1)} = \underbrace{\theta^{(t)} - \gamma\lambda \nabla_\theta \KL(\q||\p)}_{(i)} + \underbrace{\frac{\gamma}{2}\nabla_\theta \log|\FIM|}_{(ii)} + \underbrace{\sqrt{2\gamma}\eta_t}_{(iii)}
\end{align*}
where $\gamma$ is the step size and $\eta_t$ is unit isotropic Gaussian noise. This update rule is remarkably simple: $(i)$ is equivalent to gradient descent of $\KL(\q||\p)$, as done in ADVI \citep{Kucukelbir2017}, $(ii)$ biases the updates towards regions where $|\FIM(\theta)|$ is large (i.e. narrower components), and $(iii)$ adds noise.
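
To make the origin of the three terms explicit, the update can be read as an Euler--Maruyama discretization of overdamped Langevin dynamics targeting $\psi(\theta)$. The following is a sketch that assumes (\ref{eqn:log_psi_stams}) has the form $\log\psi(\theta) = \frac{1}{2}\log|\FIM(\theta)| - \lambda\KL(\q||\p)$ up to an additive constant, which is the form implied by terms $(i)$ and $(ii)$:
\begin{align*}
    \theta^{(t+1)} &= \theta^{(t)} + \gamma\nabla_\theta\log\psi(\theta^{(t)}) + \sqrt{2\gamma}\,\eta_t \\
    &= \theta^{(t)} + \gamma\left(\tfrac{1}{2}\nabla_\theta\log|\FIM| - \lambda\nabla_\theta\KL(\q||\p)\right) + \sqrt{2\gamma}\,\eta_t \, .
\end{align*}
For finite $\gamma$ this unadjusted scheme targets $\psi(\theta)$ only approximately, so a small step size or a Metropolis correction is typically needed to control the discretization error.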

For our experiments below, we implemented sampling from (\ref{eqn:log_psi_stams}) in Stan \citep{Carpenter2017}, an open-source framework for probabilistic models and approximate inference algorithms. We set $\q$ to be a multivariate Gaussian with diagonal (axis-aligned) covariance, and sampled $\theta$ from $\psi(\theta)$ using Stan's default implementation of the No U-Turn Sampler (NUTS)  \citep{Hoffman2014}, but we emphasize that samples can be drawn from (\ref{eqn:log_psi_stams}) using a variety of off-the-shelf sampling methods. We computed the $\KL(\q||\p)$ term using 200 random samples from $\q$ per evaluation, using the reparameterization trick to compute the gradient $\nabla_\theta\KL(\q||\p)$ and resampling the reparameterization noise only once per NUTS trajectory. This incurred a high cost in terms of number of function evaluations per sample of $\theta$, but this cost can in principle be significantly reduced by using a sampler that accepts stochastic function evaluations \citep{Korattikara2013,Ma2015}. All comparisons to existing methods were with Stan's built-in NUTS sampler (over $\x$) and its built-in Automatic-Differentiation VI (ADVI) \citep{Kucukelbir2017}.
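
As an illustration of why both terms of $\log\psi(\theta)$ are tractable for this choice of $\q$, the following sketch assumes the parameterization $\theta=(\mu_1,\ldots,\mu_d,\sigma_1,\ldots,\sigma_d)$, with means $\mu_i$ and standard deviations $\sigma_i$. This parameterization is chosen here for exposition only; others, such as $\log\sigma_i$, change $\FIM$ by the corresponding Jacobian. The Fisher information is then diagonal, with entries $1/\sigma_i^2$ for each $\mu_i$ and $2/\sigma_i^2$ for each $\sigma_i$, so that
\begin{align*}
    \tfrac{1}{2}\log|\FIM(\theta)| &= -2\sum_{i=1}^d\log\sigma_i + \tfrac{d}{2}\log 2 \, , \\
    \KL(\q||\p) &\approx -\sum_{i=1}^d\log\sigma_i - \frac{1}{S}\sum_{s=1}^S\log\p(\mu + \sigma\odot\varepsilon_s) + \text{const} \, ,
\end{align*}
where $\varepsilon_s\sim\mathcal{N}(0,I)$ is the reparameterization noise, $S$ is the number of Monte Carlo samples ($S=200$ above), $\odot$ is the elementwise product, and the constant collects the Gaussian entropy term $-\frac{d}{2}\log(2\pi e)$ and the unknown normalizer of $\p$. The gradient of the first line with respect to $\sigma_i$ is $-2/\sigma_i$, which is the pull towards narrower components.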
%Here, we demonstrate how varying $\lambda$ interpolates sampling and VI in a suite of standard reference problems ranging from estimating the weight of a coin to Gaussian Process regression. For each problem, we select an arbitrary $f$ to serve as an example function, and report the bias and variance of estimating $\E[f(\x)]$.

\section{Navigating Bias/Variance Trade-Offs For Finite $T$}\label{sec:bias_variance}

\begin{figure*}[ht]
    \centering
    \includegraphics{figures/figure_bias_variance_v2}
    \caption{$\lambda$ controls a bias/variance tradeoff, interpolating between sampling and VI. 
    \textbf{a)} Behavior of our method on the ``banana'' distribution for low, medium, and high values of $\lambda$. %For an example 2D distribution (the ``banana'' distribution), we set $\q$ to Gaussian with diagonal covariance and sampled $\theta\sim\psi(\theta)$ using NUTS (see Appendix B.2 for sampling details).
    \textbf{b)} Example $f(\x)$ with $\alpha=-1.5$, constructed using a random mixture of sinusoids of different frequencies and directions. Green points are values of $\x$ sampled using NUTS, shown for reference.
    \textbf{c)} The expected value of $f(\x)$ from (b) using our method, compared with NUTS and ADVI. Error bars for NUTS and ours indicate standard deviation across runs with $T=30$ samples each. At low $\lambda$, our method provides an unbiased but high variance estimate of $\E[f(\x)]$, matching NUTS, while at high $\lambda$ it provides a bias near that of ADVI and a vanishing variance. Variance of ADVI is across 10 runs with random initializations.
    \textbf{d-f)} We repeated the analysis in (c) across many random $f$s (all $\alpha=-1.5$) and report the mean $\pm$ standard error of bias$^2$, variance, and MSE. MSE for our method is minimal around $\lambda \approx 1.3$ (for $T=30,\alpha=-1.5$). %As $T$ increases further, variances shrinks and this optimal lambda moves closer to sampling-like behavior or $\lambda \rightarrow 1$.
    }
    \label{fig:bias_variance}
\end{figure*}

\subsection{Reducing Mean Squared Error (MSE)}

In this section, we expound the sense in which our method ``interpolates'' sampling and VI in terms of bias and variance. In our experiments, we quantify bias and variance in terms of the Mean Squared Error (MSE) of the expectation of an arbitrary $f(\x)$ using a random mixture of $T$ components, $\m_T(\x)$. 
% That is, for each $f$,
% \begin{align*}
%     \text{bias}^2 &= \left(\E_\p[f(\x)] - \E_\m[f(\x)]\right)^2 \\
%     \text{variance} &= \E_{\theta_1,\ldots,\theta_T}\left[\left(\E_{\m_T}[f(\x)] - \E_\m[f(\x)]\right)^2\right] \, .
% \end{align*}
For ADVI, we measured variance across runs with different random initializations. In Figure \ref{fig:bias_variance}, we show empirically that by increasing $\lambda$ one can interpolate between the low bias but high variance solution, equivalent to sampling, and the low variance but high bias solution, equivalent to VI. Between these extremes, our method smoothly interpolates both bias and variance. Further implementation details can be found in Appendix \ref{app:numerical_details}.

Computing bias and variance requires choosing a class of functions $f(\x)$. We construct random smooth functions by discrete Fourier synthesis. Specifically, we select a series of sinusoidal plane waves in the space of $\x$ with increasing frequency $\omega$, random directions $\randomt_\omega$, and random phases $\phi_\omega$, such that $f(\x) = \sum_{\omega=1}^N a_\omega \sin(\omega \randomt_\omega^\top\x + \phi_\omega)$.
The amplitudes $a_\omega$ are set according to a power law with exponent $\alpha<0$: $a_\omega=\omega^{\alpha}$. Note that by choosing a random direction $\randomt_\omega$ for each frequency, it is easy to apply this definition of a random function to arbitrarily high-dimensional inference problems. An example of a 2-dimensional $f(\x)$ is shown in Figure \ref{fig:bias_variance}b with $\alpha=-1.5$. We vary the smoothness of the integrated function in Figure \ref{fig:wiggliness} by varying $\alpha$ \citep{stein2011fourier}. 

We also tested our algorithm on three reference problems from posteriordb \citep{posteriordb}, as well as on a 32-dimensional regression problem with synthetic data and known ground-truth parameters, shown in Supplemental Figure \ref{fig:posteriordb}. The conclusion is similar: across many random $f$s, our algorithm performs on average as well as or better than both sampling (by reducing variance) and VI (by reducing bias).

\subsection{Considerations for selecting $\lambda$}

A first practical consideration for the choice of $\lambda$ is the particular function $f(\x)$ to be integrated. Since MSE can be decomposed into the sum of squared bias and variance, the value of $\lambda$ that minimizes MSE occurs when $\frac{\partial\textrm{bias}^2}{\partial\lambda}=-\frac{\partial\textrm{var}}{\partial\lambda}$. Any factor that increases the variance but not the bias for a fixed number of components $T$ will push the optimal $\lambda$ towards higher values (closer to VI).
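
The decomposition can be written out explicitly. As a sketch, write $\hat{f}_T = \E_{\m_T}[f(\x)]$ for the estimate from $T$ sampled components and $\bar{f} = \E_\m[f(\x)]$ for the infinite-mixture value; both symbols are introduced here only for illustration. Assuming $\E[\hat{f}_T]=\bar{f}$, where the outer expectation is over draws of $\theta_1,\ldots,\theta_T$ that are each marginally distributed as $\psi(\theta)$,
\begin{align*}
    \text{MSE} = \E\left[\left(\hat{f}_T - \E_\p[f(\x)]\right)^2\right]
    = \underbrace{\left(\bar{f} - \E_\p[f(\x)]\right)^2}_{\text{bias}^2} + \underbrace{\E\left[\left(\hat{f}_T - \bar{f}\right)^2\right]}_{\text{variance}} \, ,
\end{align*}
where the cross term vanishes because $\E[\hat{f}_T - \bar{f}]=0$. Setting $\frac{\partial\text{MSE}}{\partial\lambda}=0$ then gives the stationarity condition stated above.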

\begin{figure*}[ht]
    \centering
    \includegraphics{figures/wiggliness_v3}
    \caption{Interactions between $\lambda$ and integrand wiggliness for the ``banana'' distribution with $T=30$.
    \textbf{a)} Example random integrands, $f(\x)$, of varying degrees of wiggliness, as in Figure \ref{fig:bias_variance}b.
    \textbf{b)} Wiggliness of $f$ is governed by $\alpha$, the slope of amplitude versus frequency of its component sinusoids in a log-log plot.
    \textbf{c)} The $\lambda$ with the smallest MSE for a fixed number of samples depends on the integrand's smoothness.
    \textbf{d)} Bias vanishes near $\lambda=1$.
    \textbf{e)} Variance is higher for smaller $\lambda$ and more wiggly integrands. 
    \textbf{f)} MSE is the sum of bias$^2$ and variance. Overlaid lines correspond to slices shown in panel (c).
    }
    \label{fig:wiggliness}
\end{figure*}

One such factor is the smoothness of $f(\x)$. Classic sampling can have problematically high variance when $f(\x)$ is very jagged, as single points are not very representative of the surrounding function. Intuitively, then, higher $\lambda$ (more VI-like mixtures) is preferred when $f(\x)$ is more ``wiggly.'' To show this, we generated random functions with varying smoothness and computed their expectations over random mixtures. The resulting MSE, bias, and variance are shown in Figure \ref{fig:wiggliness}. We adjusted smoothness by varying the power law decay, $\alpha$, for a fixed set of phases and wave directions. At any value of $\lambda$, variance can be seen to increase as $f$ is made more wiggly ($\alpha \rightarrow -1$). With all else held equal, it is better to accept some bias in exchange for lower variance when the integrand changes quickly with $\x$. 

In Figures \ref{fig:bias_variance} and \ref{fig:wiggliness}, we evaluated bias and variance on the ``banana'' distribution. Similar results on higher-dimensional problems can be found in Supplemental Figure \ref{fig:posteriordb}. For illustration purposes, we chose $T$ so that the variance of NUTS was the same order of magnitude as the bias of ADVI, and estimated bias and variance by randomly subsampling sets of size $T$ from much longer chains. This approach provides theoretical insights on bias/variance trade-offs for $T$ \emph{independent} samples, but in practice bias will be higher due to burn-in time, variance will be higher due to sampler autocorrelations, and each of these may depend nontrivially on $\lambda$. Because MCMC samplers are most effective when they have been tuned to the problem at hand, a challenge for future work will be to adapt the sampler parameters on the fly as $\lambda$ changes.

Another factor that affects the optimal $\lambda$ is the computational budget. In our experiments we set a fixed $T$ to demonstrate our algorithm's properties. However, if the time budget is not known in advance, a practitioner may wish to decrease $\lambda$ adaptively over time. Since variance is $\mathcal{O}(T^{-1})$, for sufficiently large $T$ the error is dominated by bias, and so the optimal $\lambda$ will decay towards $1$. This would result in VI-like behavior for small $T$ and sampling-like behavior for large $T$. How quickly $\lambda$ should decay will depend on the particular problem, specifically on $\frac{\partial\textrm{bias}^2}{\partial\lambda}$, and on how efficiently one can produce $T$ \emph{independent} samples of $\theta$.
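
This scaling argument can be made concrete under the idealized assumption of $T$ independent draws $\theta_t\sim\psi(\theta)$. Writing $g(\theta)=\E_{\q(\x;\theta)}[f(\x)]$ (notation introduced for this sketch only), the estimate from $T$ components is $\frac{1}{T}\sum_{t=1}^T g(\theta_t)$, so that
\begin{align*}
    \text{MSE}(\lambda,T) = \text{bias}^2(\lambda) + \frac{1}{T}\text{Var}_{\psi}\left[g(\theta)\right] \, ,
    \qquad
    \frac{\partial\,\text{bias}^2}{\partial\lambda} = -\frac{1}{T}\frac{\partial\,\text{Var}_{\psi}[g(\theta)]}{\partial\lambda} \text{ at the optimum.}
\end{align*}
As $T$ grows, the right-hand side of the optimality condition shrinks towards zero, so the minimizer moves towards the $\lambda$ at which $\text{bias}^2$ is stationary, i.e. the unbiased sampling limit $\lambda=1$. With autocorrelated samples, $T$ would roughly be replaced by an effective sample size, which is one reason the appropriate decay schedule depends on the sampler.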

\subsection{Analytical results}

While the MSE of the expected value of some $f(\x)$ is a useful way to compare approximate inference methods, it depends on the somewhat arbitrary choice of $f$, and in practice, the $f$'s of interest are often not known at the time of inference. This motivates using the following alternative definition of error that is independent of $f$ and closely related to the variational objective of minimizing $\KL$ divergence:
\begin{equation}\label{eqn:kl_error}
\begin{split}
    \text{KL error} = \E[\KL(\m_T(\x)||\p(\x))] = \\
    \underbrace{\KL(\m(\x)||\p(\x))}_\text{KL bias} + \underbrace{\E\left[\KL(\m_T(\x)||\m(\x))\right]}_\text{KL variance} \, .
\end{split}
\end{equation}
That is, we define \textbf{KL bias} as the $\KL$ divergence from the infinite mixture $\m(\x)$ to the true distribution, and \textbf{KL variance} as the average $\KL$, over realizations of $T$ independent mixture components, from $\m_T(\x)$ to the infinite mixture $\m(\x)$. Note that KL bias is identical to the infinite-mixture objective we started with in (\ref{eqn:kl_mi}).
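
The decomposition in (\ref{eqn:kl_error}) follows in two lines, assuming only that $\E[\m_T(\x)]=\m(\x)$, where the expectation is over the draws $\theta_1,\ldots,\theta_T$:
\begin{align*}
    \E\left[\KL(\m_T||\p)\right] &= \E\left[\int\m_T(\x)\log\frac{\m_T(\x)}{\m(\x)}\,\rd\x\right] + \E\left[\int\m_T(\x)\log\frac{\m(\x)}{\p(\x)}\,\rd\x\right] \\
    &= \E\left[\KL(\m_T||\m)\right] + \int\m(\x)\log\frac{\m(\x)}{\p(\x)}\,\rd\x \, .
\end{align*}
The second term equals $\KL(\m||\p)$ because $\log\frac{\m(\x)}{\p(\x)}$ does not depend on the sampled $\theta_t$, so the expectation acts on $\m_T(\x)$ alone.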

The following theorem establishes that, for all finite $T$, we can reduce the KL error relative to sampling using some $\lambda > 1$.
\begin{theorem}[Improve on sampling]\label{thm:improve_sampling}
If a mixture is sampling as in Definition \ref{def:sampling}, then $\frac{\rd}{\rd\lambda}\text{KL bias}=0$ and $\frac{\rd}{\rd\lambda}\text{KL variance} < 0$. Thus, $\frac{\rd}{\rd\lambda}\text{KL error}<0$.
Proof: see Appendix \ref{app:sampling}.
\end{theorem}
This theorem establishes the intuitive result that the variance of sampling can be reduced, minimally impacting its bias, by replacing samples with narrow mixture components. %Intuitively, if samples are given some ``width'' on a scale where $\log\p(\x)$ is still locally linear, then no bias is introduced, but variance is reduced. 
Importantly, Theorem \ref{thm:improve_sampling} is based on how $\psi(\theta)$ changes with $\lambda$ when using the closed-form expression for $\psi(\theta)$ we derived based on the approximate $\L_\FIM$ objective. For this theorem to apply, we must further show that both conditions of ``sampling'' (Definition \ref{def:sampling}) are met by $\psi(\theta)$ when $\lambda=1$. This is proved in Lemma \ref{lem:sampling_approximate_solution} in Appendix \ref{app:sampling} for Gaussian components, though we suspect it holds for other component families as well.

We can also improve on VI using our method. However, this result is slightly more subtle, as there are three cases where one should expect VI to be (locally) optimal. First, if $\q$ is in the same family as $\p$, then $\q(\x;\thetastar)=\p(\x)$, so there is no benefit to increasing $T$, and reducing $\lambda$ only adds variance. Second, if $\q$ is not in the same family but the VI solution is sufficiently close to $\p$, then a mixture of nearby $\q$s will add variance to $\m$ \citep{Lindsay1983a}, potentially making the match to $\p$ worse. Third, if $T$ is small -- in the most extreme case, if $T=1$ -- then reducing $\lambda$ will again only add variance without reducing bias. With these three cases in mind, the following theorem establishes conditions where we expect to reduce KL error relative to VI by using a large but finite $\lambda < \infty$.
\begin{theorem}[Improve on VI]\label{thm:improve_vi}
Assume that $\q(\x;\thetastar)$ is poorly matched to $\p(\x)$, in the sense that $\text{Tr}\left((\nabla^2_\theta\KL(\q||\p))^{-1} \FIM \right) > |\theta|$, and that $\lambda$ is sufficiently large to use a Laplace approximation to $\psi(\theta)$ around $\thetastar$. Then, there exists some finite $T_0>1$ such that for all $T\geq T_0$, $\frac{\rd}{\rd\lambda}\text{KL error}>0$.
Proof: see Appendix \ref{app:vi}.
\end{theorem}
Here, $|\theta|$ is the dimensionality of $\theta$. The requirement that ``$\q(\x;\thetastar)$ is poorly matched to $\p(\x)$'' is expressed in terms of the curvature of $\KL(\q||\p)$ around $\thetastar$; if this curvature is small, then many ``nearby'' $\q$s will also fit $\p$ well, and a mixture of them can improve on VI despite adding variance. On the other hand, the case where this curvature is not small corresponds to the earlier intuition that VI cannot be improved upon if the VI solution is  already close to $\p$. For further details, see the full proof in Appendix \ref{app:vi}.
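
One way to read the trace condition is sketched below; the symbol $A$ is introduced here for illustration. Under the Laplace approximation for large $\lambda$, and neglecting the more slowly varying $\frac{1}{2}\log|\FIM(\theta)|$ term, $\log\psi(\theta)\approx-\frac{\lambda}{2}(\theta-\thetastar)^\top A\,(\theta-\thetastar)$ with $A=\nabla^2_\theta\KL(\q||\p)$ evaluated at $\thetastar$, so $\theta$ is approximately Gaussian with covariance $(\lambda A)^{-1}$. Then
\begin{align*}
    \E_{\psi}\left[(\theta-\thetastar)^\top\FIM\,(\theta-\thetastar)\right] \approx \frac{1}{\lambda}\text{Tr}\left(A^{-1}\FIM\right) \, ,
\end{align*}
where the left-hand side is the mean squared distance of the sampled components from $\q(\x;\thetastar)$ in the Fisher metric. If $\p$ belonged to the family of $\q$, we would have $A=\FIM(\thetastar)$ and the trace would equal $|\theta|$ exactly, so the strict inequality excludes that case, in agreement with the first of the three cases above.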

\section{DISCUSSION}\label{sec:discussion}

\paragraph{Summary:} Our work provides a new perspective on the relationship between the two dominant frameworks for approximate inference -- sampling and VI -- by viewing both as special cases of inference using a broader class of stochastic mixtures. Our main theoretical contribution is the framework shown in Figure \ref{fig:mi_kl_space}, where mixtures that ``interpolate'' sampling and VI are analyzed in terms of how they trade off Mutual Information and Expected KL. We then derived an easy-to-use method based on an approximation to Mutual Information that uses the local geometry of the space of variational parameters. To demonstrate the ease and effectiveness of our method, we implemented it in the popular Stan language and demonstrated using a small set of reference problems how we ``interpolate'' sampling and VI by varying a single parameter, $\lambda$. Finally, we showed why such an intermediate inference scheme is useful in practice in terms of trading off bias and variance. On one hand, we proved that it is always possible to improve on classic sampling ($\lambda=1$) by increasing $\lambda$: our method provably reduces the variance of sampling while minimally impacting its bias. On the other hand, our method provably reduces the bias of VI under certain intuitive conditions.

\paragraph{Time and space complexity:}
% There are rapidly diminishing returns to increasing $T$, since Mutual Information is upper-bounded by $\MI[\x;\theta] \leq \log T$, with equality only when all components are mutually non-overlapping \citep{Jaakkola1998}. 
By approximating Mutual Information using only \emph{local} geometric information in (\ref{eqn:mi_stams}), in our method each component can be selected independently of the others. This means we can select and evaluate $T$ components in $\mathcal{O}(T)$ time and either $\mathcal{O}(T)$ space (if all are stored) or $\mathcal{O}(1)$ space (if components are evaluated online) -- identical to traditional MCMC sampling algorithms. Further, we can run independent chains sampling $\theta \sim \psi(\theta)$ for a constant factor speedup. This improves on past work using mixture approximations, which incurred $\mathcal{O}(T^2)$ time and $\mathcal{O}(T)$ space complexity, since the optimization problem for the $T$th component depends on the location of the other $T-1$ components, all of which must be in memory at once \citep{Jaakkola1998,Gershman2012b,Zobay2014,Guo2016,Nalisnick2017,Miller2017,Acerbi2018,Yin2018} (but the $\mathcal{O}(T^2)$ complexity may be hardware-accelerated). 

\paragraph{Related Work:} The trade-offs between sampling and VI are well-studied, and many methods have been proposed to ``close the gap'' between them \citep[see][for general reviews]{Angelino2016,Zhang2019}. Like these other methods, we aim to provide good approximations with high computational efficiency and low variance.

There are many methods that use mixture models to reduce the bias of variational inference.
Theorem \ref{thm:improve_vi} shows that our method only ``beats'' classic VI when $T>T_0$ for some finite but potentially large $T_0$. This is the price we pay for drawing mixture components stochastically \citep{Salimans2015,Yin2018}. When a mixture of $T$ components is \emph{optimized} rather than \emph{sampled}, bias is reduced and variance remains near zero, as in previous work \citep{Jaakkola1998,Gershman2012b,Zobay2014,Guo2016,Miller2017}, but in previous work this optimization has incurred an $\mathcal{O}(T^2)$ cost while our method is $\mathcal{O}(T)$ and can be further parallelized. Further, with some notable exceptions \citep{Anaya-Izquierdo2007,Salimans2015}, most mixture VI methods make strong assumptions about the family of components \citep{Jaakkola1998,Gershman2012b,Acerbi2018,Miller2017}. Our framework and method are somewhat agnostic to the family of $\q$, though we have only rigorously proved that our method is asymptotically unbiased when using Gaussian components.

Many methods use sampling in the service of variational inference, or vice versa, but do not provide a unifying approach to both. These typically use the samples to compute expectations used to update a variational approximation \citep{Acerbi2018,Miller2017,Kucukelbir2017}, rather than to generate the mixture components themselves. 

There is also a large number of sampling approaches that aim to improve the efficiency of sampling by reducing its variance at the cost of some bias. Some of these use variational approaches as proposal distributions, but ultimately the posterior is approximated by a set of (possibly weighted) samples of the latent variables \citep{deFreitas2001,Korattikara2013,Ma2015,Zhang2021}. By expanding each sample from a point to a distribution, our approach allows each sample to cover more space with less variance and greater efficiency.

Despite some high-level similarities to other approaches, our framework is unusual in approximating the posterior by a sampled mixture of variational approximations. The Mixture Kalman filter \citep{chen2000mixture} is a special case of this, which uses a sampled mixture of Gaussians, each constructed as a Kalman filter. A related approach is to \emph{optimize} a parameterized function that generates mixture components \citep{Salimans2015, wolf2016variational,Yin2018}, and generative diffusion models can also be seen as a case of such mixtures \citep{sohl2015deep, ho2020denoising}. Our work differs in that we derived a closed-form mixing distribution that requires no additional learning or optimization and that is readily implemented in existing inference software (Stan; \citealp{Carpenter2017}).

The trade-offs described in this section are summarized in Table \ref{tbl:prior_work} in the Appendix.

\paragraph{Limitations and future work:} Using $\MI_\FIM[\x;\theta]$ to approximate $\MI[\x;\theta]$ reduces the generality of our method, since the former is most appropriate for narrow and Gaussian-like components \citep{Wei2016}. Incorporating prior information from $\psi(\theta)$ into this bound, generalizing to other kinds of components, or even starting with alternative bounds on $\MI[\x;\theta]$ are all interesting avenues for future work. Our proof of Theorem \ref{thm:improve_vi} requires an assumption about the curvature of $\KL(\q||\p)$ near the VI solution; identifying cases where this assumption holds may also be an interesting future direction.

We currently only study mixtures with $T$ \emph{independent} mixture components without taking into account the cost of producing independent samples of $\theta$. In reality, this cost depends on the quality of the sampler, warm-up and burn-in time, and a potentially large number of calls to $\log\p(\x)$ \citep{Zhang2021}. Further, $\lambda$ dramatically changes the shape of $\log\psi(\theta)$, which may affect the efficiency of the sampler -- we mitigated this slightly by scaling the mass parameter of NUTS with $\lambda$. Other sampling algorithms besides NUTS may also be beneficial, such as ULA, which is known to have favorable scaling properties in higher dimensions \citep{Durmus2017}.

We have so far considered $\lambda$ to be constant for a run of our algorithm, and this can lead to asymptotic bias even when $T$ is large. A simple adjustment to make our method effective at both small and large $T$ would be to decay $\lambda$ as $T$ grows, but note that this may require adapting the sampler parameters on the fly. Our method also requires evaluating $\KL(\q||\p)$ many times per sample of $\theta$. This could be made more efficient by adapting the number of Monte Carlo evaluations (fewer samples from $\q$ are sufficient when $\lambda$ is low and components are narrow), by accounting for stochastic likelihood evaluations \citep{Ma2015}, or by extending our method to mean-field message-passing \citep{Jaakkola1998}, where $\nabla_\theta\KL(\q||\p)$ can be computed in closed form \citep{Hoffman2013}.

% \begin{contributions}
% RDL and XP independently conceived of the core stochastic mixtures idea. RDL and AB derived and prototyped the algorithm. RDL implemented the algorithm in Stan. AB ran the experiments and created the plots. RMH and XP supervised throughout. All authors wrote the paper.
% \end{contributions}

\begin{acknowledgements}
This material is based upon work supported by the Air Force Office of Scientific Research under award number FA9550-21-1-0422, NSF CAREER IOS-1552868, and support from the McNair Foundation to XP, as well as support from the NIH R01 EY028811 to RMH. Thanks also to Daniel Lee, who acted as our guide to the Stan codebase.
\end{acknowledgements}

% \bibliographystyle{myplainnat}
\bibliography{references-mendeley-group,additional-references}

\end{document}
