% \documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    % \bibliographystyle{plainnat}
    % \renewcommand{\bibsection}{\subsubsection*{References}}
    
    
% \usepackage[pdftex]{graphicx}
\usepackage{subcaption}
\usepackage[inline]{enumitem}
\usepackage[linesnumbered,ruled,vlined]{algorithm2e}
\usepackage{wrapfig}
\usepackage{xcolor}
\usepackage{amssymb}
\usepackage{tabularx}
\usepackage[page]{appendix}

\usepackage{xr-hyper}
\usepackage{hyperref}

\makeatletter
\newcommand*{\addFileDependency}[1]{% argument=file name and extension
  \typeout{(#1)}
  \@addtofilelist{#1}
  \IfFileExists{#1}{}{\typeout{No file #1.}}
}
\makeatother

\newcommand*{\myexternaldocument}[1]{%
    \externaldocument{#1}%
    \addFileDependency{#1.tex}%
    \addFileDependency{#1.aux}%
}

\myexternaldocument{menon_319}

\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

\begin{document}
\appendix
% \appendixpage
\onecolumn

\title{Forget-me-not! Contrastive Critics for Mitigating Posterior Collapse}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors

\author[1]{\href{mailto:<sachit.menon@columbia.edu>?Subject=Your UAI 2022 paper}{Sachit Menon}{}}
\author[1]{David Blei}
\author[1]{Carl Vondrick}
% Add affiliations after the authors
\affil[1]{%
    Computer Science Dept.\\
    Columbia University\\
    New York, New York, USA
}

\maketitle

% \section{Appendix}

 \begin{appendices}
    
    \section{Mutual Information}\label{appendix:mi}
    Recall that the mutual information under the variational joint $q_\phi(x, z)$ is defined as
    
    \begin{equation}
        \begin{aligned}
        I_q(x ; z) &=D_{\mathrm{KL}}\left(q_\phi(x, z) \,\|\, p_\mathcal{D}(x) q_\phi(z)\right) \\
        &=\mathbb{E}_{q_\phi(x, z)}\left[\log \frac{q_\phi(x, z)}{p_\mathcal{D}(x) q_\phi(z)}\right]
        \end{aligned}        
    \end{equation}
    
    % Similarly, the mutual information across the model joint is defined 
    
    % \begin{equation}
    %     \begin{aligned}
    %     I_p(x ; z) &=D_{\mathrm{KL}}(p(x, z)|| p(x) p(z)) \\
    %     &=\mathbb{E}_{p(x, z)}\left[\log \frac{p(x, z)}{p(x) p(z)}\right]
    %     \end{aligned}        
    % \end{equation}
    
    If the observations and the latents are independent, the mutual information is zero; in our case, we want to encourage the model to preserve their dependence, so we want it to be higher. Increasing $I_q(x;z)$ works against posterior collapse by preventing the posterior from always matching the prior (since this would not preserve any information). %Increasing $I_p(x;z)$ has been shown to work against posterior collapse (\citep{dieng_avoiding_2019}, discussed further in Related Work) by preventing the function approximator in the likelihood from forgetting the latent. 
    We will show that our approach provides a new way to increase both $I_q(x;z)$ and its model-side counterpart $I_p(x;z)$ (see Appendix \ref{appendix:critics}), and that it could be combined with existing approaches. % this mutual information has been shown to be an effective way to help avoid posterior collapse
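
    As a quick check of why collapse removes all information, consider the fully collapsed case. The following is only a sketch: $p(z)$ denotes the prior and $q_\phi(z \mid x)$ the approximate posterior, with $q_\phi(x, z) = p_\mathcal{D}(x)\, q_\phi(z \mid x)$ (notation assumed to match the main text). If $q_\phi(z \mid x) = p(z)$ for every $x$, then the aggregate posterior is $q_\phi(z) = \int p_\mathcal{D}(x)\, p(z)\, dx = p(z)$, and
    \begin{equation*}
        I_q(x ; z) = \mathbb{E}_{q_\phi(x, z)}\left[\log \frac{p_\mathcal{D}(x)\, p(z)}{p_\mathcal{D}(x)\, p(z)}\right] = 0 .
    \end{equation*}
    Conversely, any strictly positive value of $I_q(x;z)$ rules out this complete collapse.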

    \section{Binary classifier and MI}\label{appendix:binclas}
    We use a categorical likelihood for multi-way classification to implement the critic. Another option that provides some intuition is a binary classifier that simply takes a pair and decides, in isolation, whether its elements correspond. (See Appendix \ref{appendix:multirat} for why the multisample case is preferred.) The connection between the binary classifier and MI follows directly from the density ratio trick \citep{sugiyama_density-ratio_2012}, which tells us that a binary probabilistic classifier between two distributions estimates the density ratio between them. In our case, then, the optimal classifier would correspond to $\frac{ p(\textbf{z}, \textbf{x})}{p(\textbf{z})p(\textbf{x})}$. We highlight that, in a different context, this same binary-classification density ratio trick is what powers GANs: the discriminator estimates a density ratio between real and fake samples. In GANs, we do not want these distributions to be distinguishable, so the critic is trained adversarially; here, we \textit{want} the critic to succeed. GANs also give us some evidence that we do not need to train the critic to optimality at every step, which would be too expensive: joint training of the critic and the model can yield the desired results \citep{goodfellow_generative_2014}. See Related Work for more discussion of GAN-related techniques.
    
    Thus, applying the density ratio trick to our distributions at hand would provide the density ratio whose logarithm is the integrand of the MI expectation, and we could compute the expectation via Monte Carlo using all of the samples in the batch to get an estimate of the mutual information. The general technique of using a density ratio to estimate mutual information is introduced in \cite{suzuki_approximating_2008}, elaborated on in \cite{sugiyama_density-ratio_2012}, and traces its roots to two-sample testing via classifiers; we encourage the interested reader to refer to these for the history of the method. 
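
    To make the density ratio trick concrete, here is a sketch under an assumption made only for illustration: the binary classifier $D(x, z)$ is trained on equal numbers of corresponding pairs drawn from $p(x, z)$ and non-corresponding pairs drawn from $p(x)p(z)$, and outputs the probability that a pair corresponds. The symbols $D$, $D^{*}$, and $N$ are introduced here and do not appear elsewhere in the paper. The Bayes-optimal classifier and the ratio it recovers are then
    \begin{equation*}
        D^{*}(x, z) = \frac{p(x, z)}{p(x, z) + p(x)p(z)}, \qquad \frac{D^{*}(x, z)}{1 - D^{*}(x, z)} = \frac{p(x, z)}{p(x)p(z)} ,
    \end{equation*}
    so that, given $N$ corresponding pairs $(x_n, z_n)$ from the batch, a Monte Carlo estimate of the MI would take the form
    \begin{equation*}
        I(x ; z) \approx \frac{1}{N} \sum_{n=1}^{N} \log \frac{D(x_n, z_n)}{1 - D(x_n, z_n)} .
    \end{equation*}
    With unequal class proportions, the odds $D^{*}/(1-D^{*})$ would pick up a constant factor equal to the ratio of the class priors.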
    
    \section{Multisample Density Ratio}\label{appendix:multirat}
    
    One practical reason to use the information from all the samples is that the `one-sample' estimate of every density ratio term in the Monte Carlo expectation for MI described in Appendix \ref{appendix:binclas} will have very high variance; using information from multiple samples for each term to get a `multi-sample' estimate is more stable, per \cite{poole_variational_2019}. Moreover, optimizing the multiclass objective implicitly pushes up a lower bound on the MI; see Appendix \ref{appendix:multimi}. The tightness of this bound increases with the number of samples, which is another reason we opt for the multisample approach. 
    % This may also be true of the binary classification problem, but I am not sure, and regardless it is more computationally intensive.) 
    
    % Having observed the analogy to contrastive learning, the proof that this loss leads to estimation of the density ratio follows from the proof for CPC in [CITE] page 4. Consider a classifier that, for a latent sample $z_i$, tries to pick which observation $x$ from a set $x_1, \ldots, x_K$ it corresponds to. (We can phrase the problem as the reverse as well - it doesn't matter since the density ratio and mutual information are symmetric.) Let $[d=k]$ be the indicator that the observation $x_k$ corresponds to $z_i$, i.e. $k=i$. Then we can write the optimal probability for the classifier's loss as $p(d=k \mid x_1, \ldots, x_K, z_i)$
    % \begin{equation}
    %     \begin{aligned}
    %     p\left(d=i \mid x_1, \ldots, x_K, z_i\right) &=\frac{p\left(x_{i} \mid c_{t}\right) \prod_{l \neq i} p\left(x_{l}\right)}{\sum_{j=1}^{N} p\left(x_{j} \mid c_{t}\right) \prod_{l \neq j} p\left(x_{l}\right)} \\
    %     &=\frac{\frac{p\left(x_{i} \mid c_{t}\right)}{p\left(x_{i}\right)}}{\sum_{j=1}^{N} \frac{p\left(x_{j} \mid c_{t}\right)}{p\left(x_{j}\right)}}
    %     \end{aligned}
    % \end{equation}
    % \textcolor{red}{FIX NOTATION AND EXPLAIN}
    
    \section{Mutual Information Bound}\label{appendix:multimi}
    
    The bound follows by analogy to CPC \citep[Appendix]{oord_representation_2019}. It is an immediate application of the InfoNCE bound introduced there, which we follow here (along with \citealp{grewal_recent_2019}); \cite{poole_variational_2019} elaborate on it further from a theoretical standpoint. 
    
    % I also would like to direct the reader to [CITE] (On Variational Bounds), which provides a proof from a different angle - namely, they interpret the `additional information' from the additional samples (past the binary case) as being used by the critic to estimate the partition function
    
    Consider a classifier that, for a latent sample $z_i$, tries to pick which observation $x$ from a set $X = \{x_1, \ldots, x_i, \ldots, x_K\}$ it corresponds to. (We can equally phrase the problem in reverse; it does not matter, since the density ratio and mutual information are symmetric.) We follow the notation for the model critic (Equation \ref{eqn:sampling} left) for simplicity, but the inference critic (Equation \ref{eqn:sampling} right) follows the same steps. (Note that we also drop subscripts on densities for clarity.)
    
    Consider Equation \ref{eqn:classification}. We know (\citealp{sugiyama_density-ratio_2012}; reshown by \citealp{oord_representation_2019} and \citealp{song_multi-label_2020}) that the classifier will estimate the density ratio up to a constant. That is, %optimal $K$-way 
    \begin{equation}\label{eqn:densityprop}
        f(x, z) \propto \frac{p(x,z)}{p(x)p(z)} = \frac{p(x|z)}{p(x)}
    \end{equation}
    (where the equality follows from the product rule, $p(x, z) = p(x \mid z)\, p(z)$.)
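
    To see where this proportionality comes from, it helps to write out the optimal classifier explicitly, following the argument of \citet{oord_representation_2019}. As a sketch, let $d$ denote the index of the observation in $X$ that corresponds to $z_i$ (the symbol $d$ is introduced here only for illustration), and assume that the positive is drawn from $p(x \mid z_i)$, that the remaining $K-1$ observations are drawn independently from $p(x)$, and that every index is a priori equally likely to be the positive. Then
    \begin{equation*}
        \begin{aligned}
        p\left(d=i \mid X, z_i\right) &= \frac{p\left(x_{i} \mid z_i\right) \prod_{l \neq i} p\left(x_{l}\right)}{\sum_{j=1}^{K} p\left(x_{j} \mid z_i\right) \prod_{l \neq j} p\left(x_{l}\right)} \\
        &= \frac{\frac{p\left(x_{i} \mid z_i\right)}{p\left(x_{i}\right)}}{\sum_{j=1}^{K} \frac{p\left(x_{j} \mid z_i\right)}{p\left(x_{j}\right)}}
        \end{aligned}
    \end{equation*}
    where the second line divides the numerator and denominator by $\prod_{l=1}^{K} p(x_l)$. Comparing with the normalized form $f(x_i, z_i) / \sum_{j} f(x_j, z_i)$ of the classifier, any $f$ proportional to $p(x \mid z)/p(x)$ reproduces this posterior, since the constant cancels in the ratio.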
    
    We'll split the sum in the denominator of Equation \ref{eqn:classification} into 1) the term for the observation that corresponds to the latent at hand and 2) all the others. (In contrastive learning terminology, these are the positive and negatives respectively.) 
    
    \begin{equation}
    \begin{aligned}
        \mathcal{L} &= \mathbb{E}\left[\log \frac{f \left(x^{+}, z^{+}\right)}{\sum_{x \in X} f\left(x, z^{+}\right)}\right] \\
                    &= \mathbb{E}\left[\log \frac{f\left(x^{+}, z^{+}\right)}{f\left(x^{+}, z^{+}\right)+\sum_{x_j \in X \setminus x_i } f\left(x_{j}, z^{+}\right)}\right]
    \end{aligned}
    \end{equation}
    
    Since the classifier aims to estimate the density ratio (up to a constant), from Equation \ref{eqn:densityprop} we have, for some constant $C > 0$,
    
    \begin{equation}
    \begin{aligned}
                    &\approx \mathbb{E} \log \left[\frac{\frac{p\left(x^{+} \mid z^{+}\right)}{p\left(x^{+}\right)} C}{\frac{p\left(x^{+} \mid z^{+}\right)}{p\left(x^{+}\right)} C +\sum_{x_j \in X \setminus x_i} \frac{p\left(x_{j} \mid z^{+}\right)}{p\left(x_{j}\right)} C}\right] \\
                    &= \mathbb{E} \log \left[\frac{\frac{p\left(x^{+} \mid z^{+}\right)}{p\left(x^{+}\right)} }{\frac{p\left(x^{+} \mid z^{+}\right)}{p\left(x^{+}\right)}  +\sum_{x_j \in X \setminus x_i} \frac{p\left(x_{j} \mid z^{+}\right)}{p\left(x_{j}\right)} }\right]
    \end{aligned}
    \end{equation}
    
    We notice the `positive' term appears in the numerator and denominator. Doing some algebraic manipulation,
    
   
    \begin{equation}
        \begin{aligned}
                    &= \mathbb{E} \left( - \log \left[\frac{ \frac{p\left(x^{+} \mid z^{+}\right)}{p\left(x^{+}\right)}  +\sum_{x_j \in X \setminus x_i} \frac{p\left(x_{j} \mid z^{+}\right)}{p\left(x_{j}\right)} }{\frac{p\left(x^{+} \mid z^{+}\right)}{p\left(x^{+}\right)} }\right] \right) \\
                    &=\mathbb{E}\left(-\log \left[1+\frac{\sum_{x_j \in X \setminus x_i} \frac{p\left(x_{j} \mid z^{+}\right)}{p\left(x_{j}\right)}}{\frac{p\left(x^{+} \mid z^{+}\right)}{p\left(x^{+}\right)}}\right]\right) \\
                    &=\mathbb{E}\left(-\log \left[1+\frac{p\left(x^{+}\right)}{p\left(x^{+} \mid z^{+}\right)} \sum_{x_j \in X \setminus x_i} \frac{p\left(x_{j} \mid z^{+}\right)}{p\left(x_{j}\right)}\right]\right) \\
        \end{aligned}   
    \end{equation}
                    
    Now examine the sum over all terms but the `positive' one. This can be considered a (scaled) expectation of the density ratio over the `negative' terms, which should be $1$: the negatives are drawn independently of $z^{+}$, and for independent $x$, $z$ the joint is the product of the marginals. (Technically, since we are computing on samples, this is a Monte Carlo estimate of this expectation, but as noted by \cite{oord_representation_2019} it is nearly exact even with relatively low $K$; \cite{poole_variational_2019} shows a proof of the InfoNCE bound that does not use this approximation.)
    
    \begin{equation}
        \begin{aligned}
                    &\approx \mathbb{E}\left(-\log \left[1+\frac{p\left(x^{+}\right)}{p\left(x^{+} \mid z^{+}\right)} (K-1)\mathbb{E}_{x \sim X_{neg}} \left[ \frac{p\left(x \mid z^{+}\right)}{p\left(x\right)}\right] \right]\right) \\
                    &=\mathbb{E}\left(-\log \left[1+\frac{p\left(x^{+}\right)}{p\left(x^{+} \mid z^{+}\right)} (K-1)\right]\right) \\
                    &=\mathbb{E} \log \left[\frac{1}{1+\frac{p\left(x^{+}\right)}{p\left(x^{+} \mid z^{+}\right)} (K-1)}  \right] \\
                    & =\mathbb{E} \log \left[\frac{1}{K\frac{p\left(x^{+}\right)}{p\left(x^{+} \mid z^{+}\right)} - \frac{p\left(x^{+}\right)}{p\left(x^{+} \mid z^{+}\right)} + 1}  \right] \\
                    & =\mathbb{E} \log \left[\frac{1}{K\frac{p\left(x^{+}\right)}{p\left(x^{+} \mid z^{+}\right)} + \left(1 - \frac{p\left(x^{+}\right)}{p\left(x^{+} \mid z^{+}\right)}\right)} \right]
        \end{aligned}   
    \end{equation}
    
    % This is where the bound enters the picture - we know the density ratio can never exceed $1$ (and is non-negative, as the ratio of non-negative probabilities), so its inverse will be greater than $1$. Thus the second term in the denominator will be negative. 
    
    % \textcolor{red}{COME BACK TO THIS}    
    
    \begin{equation}
        \begin{aligned}
                    & \leq \mathbb{E} \log \left[\frac{1}{K\frac{p\left(x^{+}\right)}{p\left(x^{+} \mid z^{+}\right)}} \right]\\
                    &= \mathbb{E} \log \left[\frac{1}{K} \frac{p\left(x^{+} \mid z^{+}\right)}{p\left(x^{+}\right)} \right]\\
                    &= \mathbb{E} \left( \log \left[\frac{p\left(x^{+} \mid z^{+}\right)}{p\left(x^{+}\right)} \right] \right) - \log K \\
                    &= I(x ; z)-\log K
        \end{aligned}   
    \end{equation}
    
    where the last line is at most $I(x;z)$, since $\log K \geq 0$. Rearranging the chain, which started from $\mathcal{L}$, gives
    \begin{equation}
        I(x; z) \geq \mathcal{L} + \log K
    \end{equation}
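
    Two steps of the derivation above can be checked explicitly. The following is a sketch under stated assumptions, not a replacement for the approximation-free proof of \cite{poole_variational_2019}. First, assuming the negatives are drawn from the marginal $p(x)$ independently of $z^{+}$, the expected density ratio over negatives is exactly one:
    \begin{equation*}
        \mathbb{E}_{x \sim p(x)}\left[\frac{p\left(x \mid z^{+}\right)}{p(x)}\right] = \int p(x)\, \frac{p\left(x \mid z^{+}\right)}{p(x)}\, dx = \int p\left(x \mid z^{+}\right) dx = 1 .
    \end{equation*}
    Second, write $r = p(x^{+}) / p(x^{+} \mid z^{+})$ (a shorthand introduced only here). The inequality step drops the term $1 - r$ from the denominator:
    \begin{equation*}
        K r + (1 - r) \geq K r \quad \Longleftrightarrow \quad r \leq 1 \quad \Longleftrightarrow \quad p\left(x^{+} \mid z^{+}\right) \geq p\left(x^{+}\right) .
    \end{equation*}
    Term by term, the step therefore relies on the latent making its own observation at least as likely as it is under the marginal, which we treat as an assumption of this derivation.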
    
    As the optimal loss (where the classifier is exactly a constant multiple of the density ratio) obeys $\mathcal{L} + \log K \leq I(x;z)$, any suboptimal loss (which will be lower, since we are maximizing) obeys the same bound. The key here is the $\log K$ term, which upper bounds the estimate of the MI (as $\mathcal{L} \leq 0$): intuitively, optimizing the loss pushes up the MI with a stick that is $\log K$ long. If the MI would otherwise fall to $0$, the regularization aims to increase it by up to $\log K$; past this, the bound is loose (our stick cannot reach), so it is advantageous to use higher $K$. (This is why the binary case as a lower bound is not practical.) Empirically, we increase $I_q(x;z)$ by almost exactly $\log K$ with an inference-side critic, showing that this technique works as well as the theory suggests it can (Table \ref{table:textresults}, Table \ref{table:imgresults}). 
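
    For a sense of scale, the length of the stick can be computed for a few values of $K$. The values below are chosen purely for illustration and are not the settings used in our experiments; logarithms are natural, so the units are nats:
    \begin{equation*}
        \log 2 \approx 0.69, \qquad \log 32 \approx 3.47, \qquad \log 128 \approx 4.85, \qquad \log 512 \approx 6.24 .
    \end{equation*}
    Since $\mathcal{L} \leq 0$, the estimate $\mathcal{L} + \log K$ can certify at most these amounts of mutual information. The binary-sized case $K = 2$ caps the estimate below $0.7$ nats, while doubling $K$ extends the reach by only $\log 2$.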
    
    % \begin{equation}
    %     \begin{aligned}
    %     \mathcal{L}_{\mathrm{N}}^{\mathrm{opt}} &=-\underset{X}{\mathbb{E}} \log \left[\frac{\frac{p\left(x_{t+k} \mid c_{t}\right)}{p\left(x_{t+k}\right)}}{\frac{p\left(x_{t+k} \mid c_{t}\right)}{p\left(x_{t+k}\right)}+\sum_{x_{j} \in X_{\mathrm{neg}}} \frac{p\left(x_{j} \mid c_{t}\right)}{p\left(x_{j}\right)}}\right] \\
    %     &=\underset{X}{\mathbb{E}} \log \left[1+\frac{p\left(x_{t+k}\right)}{p\left(x_{t+k} \mid c_{t}\right)} \sum_{x_{j} \in X_{\mathrm{neg}}} \frac{p\left(x_{j} \mid c_{t}\right)}{p\left(x_{j}\right)}\right] \\
    %     & \approx \underset{X}{\mathbb{E}} \log \left[1+\frac{p\left(x_{t+k}\right)}{p\left(x_{t+k} \mid c_{t}\right)}(N-1) \underset{x_{j}}{\mathbb{E}} \frac{p\left(x_{j} \mid c_{t}\right)}{p\left(x_{j}\right)}\right] \\
    %     &=\underset{X}{\mathbb{E}} \log \left[1+\frac{p\left(x_{t+k}\right)}{p\left(x_{t+k} \mid c_{t}\right)}(N-1)\right] \\
    %     & \geq \underset{X}{\mathbb{E}} \log \left[\frac{p\left(x_{t+k}\right)}{p\left(x_{t+k} \mid c_{t}\right)} N\right] \\
    %     &=-I\left(x_{t+k}, c_{t}\right)+\log (N)
    %     \end{aligned}
    % \end{equation}
    
    
    % \begin{equation}
    %     \begin{aligned}
    %         I(x ; z)-\log N &=\mathbb{E}_{S}\left[\log \frac{p\left(x^{+}, z^{+}\right)}{p\left(x^{+}\right)p\left(z^{+})\right)}\right]-\log N \\
    %         &=\mathbb{E}_{S}\left[\log \frac{p\left(x^{+} \mid z^{+}\right)}{p\left(x^{+}\right)}\right]-\log N \\
    %         &=\mathbb{E}_{S}\left[\log \left(\frac{p\left(x^{+} \mid z^{+}\right)}{p\left(x^{+}\right)} \frac{1}{N}\right)\right] \\
    %         &=\mathbb{E}_{S}\left[\log \left(\frac{1}{\frac{p\left(x^{+}\right)}{p\left(x^{+} \mid z^{+}\right)}} \frac{1}{N}\right)\right] \\
    %         &\geq \mathbb{E}_{S}\left[\log \frac{1}{1+\frac{p\left(x^{+}\right)}{p\left(x^{+} \mid z^{+}\right)}(N-1)}\right] \\
    %         &=\mathbb{E}_{S}\left[-\log \left(1+\frac{p\left(x^{+}\right)}{p\left(x^{+} \mid z^{+}\right)}(N-1)\right)\right] \\
    %         &=\mathbb{E}_{S}\left[-\log \left(1+\frac{p\left(x^{+}\right)}{p\left(x^{+} \mid z^{+}\right)}(N-1) \mathbb{E}_{S-\left\{x^{+}\right\}}\left[\frac{p\left(x \mid z^{+}\right)}{p(x)}\right]\right)\right] \\
    %         &=\mathbb{E}_{S}\left[-\log \left(1+\frac{p\left(x^{+}\right)}{p\left(x^{+} \mid z^{+}\right)} \sum_{j=1}^{N-1} \frac{p\left(x_{j} \mid z^{+}\right)}{p\left(x_{j}\right)}\right)\right] \\
    %         &=\mathbb{E}_{S}\left[-\log \left(1+\frac{\sum_{j=1}^{N-1} \frac{p\left(x_{j} \mid z^{+}\right)}{p\left(x_{j}\right)}}{\frac{p\left(x^{+}\right)}{p\left(x^{+} \mid z^{+}\right)}}\right)\right] \\
    %         &=\mathbb{E}_{S}\left[\log \frac{\frac{p\left(x^{+} \mid z^{+}\right)}{p\left(x^{+}\right)}}{\frac{p\left(x^{+} \mid z^{+}\right)}{p\left(x^{+}\right)}+\sum_{j=1}^{N-1} \frac{p\left(x_{j} \mid z^{+}\right)}{p\left(x_{j}\right)}}\right] \\
    %         &\approx \mathbb{E}_{S}\left[\log \frac{h_{\theta}\left(x^{+}, z^{+}\right)}{h_{\theta}\left(x^{+}, z^{+}\right)+\sum_{j=1}^{N-1} h_{\theta}\left(x_{j}, z^{+}\right)}\right] \\
    %         &=\mathbb{E}_{S}\left[\log \frac{h_{\theta}\left(x^{+}, z^{+}\right)}{\sum_{x \in S} h_{\theta}\left(x, z^{+}\right)}\right]
    %     \end{aligned}
    % \end{equation}
    % \textcolor{red}{FIX NOTATION AND EXPLAIN}
    
    \section{Inference vs Model Critics} \label{appendix:critics}
    The model (`decoder'-side) critic corresponds to increasing $I_p(x;z)$ to prevent the likelihood from forgetting the latent, as previously discussed. For the inference (`encoder'-side) critic, the same analysis holds: instead of distinguishing the model joint from the product of the prior and the model's approximate data distribution (marginal), it distinguishes the \textit{variational} joint $q_\phi (x, z)$ from the product of the empirical data distribution $p_{\mathcal{D}}(x)$ and the aggregate posterior $q_\phi(z)$ (whose samples are obtained by ancestral sampling, analogously to the samples from the model's approximate data distribution). %It thus would correspond to increasing $I_q(x;z)$ instead of $I_p(x;z)$. 
    Interestingly, adversarial variational Bayes \citep{mescheder_adversarial_2017} trains a similar critic adversarially, using this optimization to replace the ELBO. It also learns notably poor representations, %[CITE]
    which is consistent with our analysis, especially given that the ELBO terms it replaces include the mutual information penalty (recall Equation \ref{eqn:elbo2}); this connection could be interesting to consider in more depth.
    
    One disadvantage of the model critic is that it requires sampling from the model, which can be expensive for strong model networks such as autoregressive ones, which are precisely where it would have the most effect. The inference critic does not have this restriction. 
    
    \section{Experimental Protocol}\label{appendix:protocol}
    
    The protocol is reproduced from \cite{he_lagging_2019}. 
    
    Text experiments: LSTM parameters are initialized from $\mathcal{U}(-0.01,0.01)$, with $\mathcal{U}(-0.1,0.1)$ for embedding parameters. The final hidden representation produced by the inference network is used to predict the latent variable with a linear transformation. The SGD optimizer is used with an initial learning rate of $1.0$, decayed by a factor of $2$ upon a validation loss plateau for at least 2 epochs. Training ends once the learning rate has been thus decayed 5 times. No text preprocessing is performed. Dropout of $0.5$ is used on the model network for the input embeddings and the pre-linear transformation output in vocabulary space.
    
    Image experiments: the train/val/test splits are identical to those of \cite{he_lagging_2019} and \cite{kim_semi-amortized_2018}. The Adam optimizer is used with an initial learning rate of 0.001, decayed by a factor of 2 upon a validation loss plateau for at least 2 epochs. Training ends once the learning rate has been thus decayed 5 times. Images are dynamically binarized -- that is, the input pixel values are treated as parameters of Bernoulli random variables. Validation and testing are performed with fixed binarization. The model network uses a binary likelihood. The ResNet and PixelCNN are as described in \cite{he_lagging_2019}.
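
    The stopping rule shared by both sets of experiments can be summarized as pseudocode. Algorithm \ref{alg:lrdecay} is only a sketch of our reading of the protocol: the variable names are ours, and we assume that a `plateau' means the validation loss has not improved on its best value for $P = 2$ consecutive epochs, which the text above does not pin down. The reference implementation of \cite{he_lagging_2019} may differ in such details.

    \begin{algorithm}[H]
    \caption{Sketch of plateau-based learning-rate decay and termination (illustrative)}\label{alg:lrdecay}
    \KwIn{initial learning rate $\eta_0$, patience $P = 2$ epochs, number of decays $D = 5$}
    $\eta \leftarrow \eta_0$;\quad $d \leftarrow 0$;\quad $b \leftarrow \infty$;\quad $s \leftarrow 0$\;
    \While{$d < D$}{
        train for one epoch with learning rate $\eta$\;
        $\ell \leftarrow$ validation loss\;
        \eIf{$\ell < b$}{
            $b \leftarrow \ell$;\quad $s \leftarrow 0$\;
        }{
            $s \leftarrow s + 1$\;
        }
        \If{$s \geq P$}{
            $\eta \leftarrow \eta / 2$;\quad $d \leftarrow d + 1$;\quad $s \leftarrow 0$\;
        }
    }
    \end{algorithm}

    Here $b$ tracks the best validation loss so far, $s$ counts epochs without improvement, and $d$ counts decays; training stops right after the fifth decay.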
    
    \section{Mutual Information Comparison -- All Training} \label{appendix:miexpanded}
    \begin{figure}%{0.5\linewidth}
    \centering
    % \vspace{-2em}
    \includegraphics[width=0.45\linewidth]{mi_comparison_expanded.pdf}
    \caption{Comparison of mutual information across the variational family ($I_q$) for various critics versus the baseline. The endpoints differ because the termination condition of the experimental protocol depends on when a certain number of plateaus is reached.}
    \vspace{-2em}
    \label{fig:mi_comparison_expanded}
    \end{figure}
    
    
    \section{Discussion of VAE-MINE}\label{appendix:vaemine}
    Intuitively, our inference critics solve a classification task with a simple cross-entropy loss, which can be optimized with standard backpropagation. VAE-MINE adds a different term, based on MINE, to train an energy function that does not solve the same task; to optimize it, the authors resort to Taylor approximations and convex duality. This only implicitly contrasts the two distributions, while we directly train our inference critic to do so. Moreover, per \citet[Sec.~2.2]{poole_variational_2019}, their way of estimating MINE \textit{is not even a correct bound on the MI}. Even if we ignore this, they lose the critical aspect of speed: for every size-$n$ batch, their bound uses $n^2$ forward passes \citep[Sec.~3]{poole_variational_2019}, versus our $2n$ (and the base VAE's $n$). This scales poorly. (There is no code available for VAE-MINE for empirical comparison, but there is a decisive gap between their quadratic and our linear runtime.) Finally, per \citet[App.~A]{poole_variational_2019}, the bound we use has lower variance than (the correct) MINE. Our method is theoretically appealing, correct, and fast.
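
    To illustrate the gap in forward passes, take a batch size of $n = 32$, a value chosen only for this example and not taken from our experiments. The counts stated above then give
    \begin{equation*}
        \underbrace{n = 32}_{\text{base VAE}}, \qquad \underbrace{2n = 64}_{\text{ours}}, \qquad \underbrace{n^2 = 1024}_{\text{quadratic bound}}, \qquad \frac{n^2}{2n} = \frac{n}{2} = 16 ,
    \end{equation*}
    so the overhead of the quadratic bound relative to ours grows linearly with the batch size. This counts forward passes only and says nothing about the cost of each pass.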
    
    \end{appendices}
% \subsubsection*{References}
\bibliographystyle{plainnat}
\bibliography{references.bib}

\end{document}