\section{Method}
\subsection{Variational Auto-Encoder Baseline}
Formally, given a sentence $\mathbf{x}=\{x_1, x_2, \dots, x_n\}$, where $n$ is the number of tokens in $\mathbf{x}$, an auto-regressive language model learns neural network parameters $\bm{\theta}$ that maximize the log-likelihood
$\log P(\mathbf{x}; \bm{\theta}) = \sum_{i=1}^n \log P(x_i | x_1, \dots, x_{i-1}; \bm{\theta})$. A variational auto-encoder (VAE) introduces a random latent variable $\zv$ and models $P(\mathbf{x}; \bm{\theta})$ as $P(\mathbf{x}; \bm{\theta}) = \int_{\zv} P( \mathbf{x}, \zv; \bm{\theta}) d\zv = \int_{\zv} P(\zv; \bm{\theta})P( \mathbf{x} | \zv; \bm{\theta}) d\zv$.
To generate a sentence $\mathbf{x}$ with a VAE, a random vector $\zv$ is first sampled from the \emph{prior distribution} $P(\zv; \bm{\theta})$, and a \emph{decoder network} then generates $\mathbf{x}$ from the sampled $\zv$ according to the conditional density $P(\mathbf{x}|\zv; \bm{\theta})$. When $\zv$ is continuous (e.g., a multivariate Gaussian variable), a VAE can generate sentences from a continuous latent space.

Maximum likelihood training of the marginal $\log P(\mathbf{x}; \bm{\theta})$ is intractable because of the integration over $\zv$. To train the model parameters $\bm{\theta}$, a variational posterior distribution $Q(\zv|\xv; \bm{\phi})$ over the latent variable $\zv$ is introduced to approximate the true posterior $P(\zv|\xv; \bm{\theta}) \propto P(\zv; \bm{\theta}) P(\xv|\zv; \bm{\theta})$.
To ensure that a random vector $\zv$ drawn from the prior $P(\zv; \bm{\theta})$ encodes a meaningful semantic representation of an input sentence, the posterior $Q(\zv|\mathbf{x}; \bm{\phi})$ is encouraged to match $P(\zv; \bm{\theta})$ during training. In this way, the prior distribution can capture the latent semantic distribution of the input sentences. Based on $Q(\zv|\mathbf{x}; \bm{\phi})$, an evidence lower bound (ELBO)~\citep{bowman2015generating} on $\log P(\mathbf{x}; \bm{\theta})$ is defined as:
\begin{align}
\small 
&\textsc{Elbo}(\mathbf{x}; \bm{\phi}, \bm{\theta}) \label{eq:elbo} \\
&= \mathbb{E}_{\zv \sim Q(\zv|\mathbf{x}; \bm{\phi})} \Big[ \log \frac{P(\mathbf{x}, \zv; \bm{\theta})}{Q(\zv|\mathbf{x}; \bm{\phi})}\Big] \nonumber \\
&= \mathbb{E}_{\zv \sim Q(\zv|\mathbf{x}; \bm{\phi})} \Big[\log \frac{P(\mathbf{x}|\zv; \bm{\theta})P(\zv; \bm{\theta})}{Q(\zv|\mathbf{x}; \bm{\phi})}\Big] \nonumber\\
&= \underbrace{\mathbb{E}_{\zv \sim Q(\zv|\mathbf{x}; \bm{\phi})} \log P(\mathbf{x}|\zv; \bm{\theta})}_\text{reconstruction  loss} - \underbrace{\KL\Big(Q(\zv|\mathbf{x}; \bm{\phi}),  P(\zv; \bm{\theta})\Big)}_\text{regularizer}.\nonumber
\end{align}
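As a brief sanity check that Eq \ref{eq:elbo} is indeed a lower bound, the following sketch applies Jensen's inequality to the marginal likelihood. It uses only the quantities defined above and assumes that $Q(\zv|\mathbf{x}; \bm{\phi}) > 0$ wherever $P(\mathbf{x}, \zv; \bm{\theta}) > 0$:
\begin{align*}
\log P(\mathbf{x}; \bm{\theta})
&= \log \int_{\zv} Q(\zv|\mathbf{x}; \bm{\phi}) \frac{P(\mathbf{x}, \zv; \bm{\theta})}{Q(\zv|\mathbf{x}; \bm{\phi})} d\zv \\
&= \log \mathbb{E}_{\zv \sim Q(\zv|\mathbf{x}; \bm{\phi})} \Big[ \frac{P(\mathbf{x}, \zv; \bm{\theta})}{Q(\zv|\mathbf{x}; \bm{\phi})} \Big] \\
&\geq \mathbb{E}_{\zv \sim Q(\zv|\mathbf{x}; \bm{\phi})} \Big[ \log \frac{P(\mathbf{x}, \zv; \bm{\theta})}{Q(\zv|\mathbf{x}; \bm{\phi})} \Big]
= \textsc{Elbo}(\mathbf{x}; \bm{\phi}, \bm{\theta}).
\end{align*}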
The neural network parameters $\bm{\theta}$ and $\bm{\phi}$ can be trained by maximizing the ELBO in Eq \ref{eq:elbo}. The first term $\mathbb{E}_{\zv \sim Q(\zv|\mathbf{x}; \bm{\phi})} \log P(\mathbf{x}|\zv; \bm{\theta})$ can be regarded as a reconstruction loss. Latent random vectors $\zv$ are first generated according to the posterior distribution $Q(\zv|\mathbf{x}; \bm{\phi})$, which is called an \emph{encoder network} or a \emph{recognition network}. The latent random vector $\zv$ is then required to reconstruct $\mathbf{x}$ as well as possible by maximizing the log-likelihood $\log P(\mathbf{x}|\zv; \bm{\theta})$. The second term $\textsc{KL}\Big(Q(\zv|\mathbf{x}; \bm{\phi}),  P(\zv; \bm{\theta})\Big)$ is the Kullback-Leibler divergence, which regularizes the posterior distribution to be close to the prior distribution. The posterior and the prior are often assumed to belong to the same parametric family with different parameters, which eases learning. Ideally, a good model maximizes the first term and minimizes the second term. However, when the KL term approaches 0, the posterior $Q(\zv|\mathbf{x}; \bm{\phi})$ degenerates to $P(\zv; \bm{\theta})$ and posterior collapse occurs. For text generation, posterior collapse is common because the auto-regressive decoder can be strong enough to ignore $\zv$ entirely and generate $x_{i}$ solely from the preceding context $[x_1, \dots, x_{i-1}]$~\citep{Fu2019CyclicalAS}.

\subsection{Implicit VAE Baseline}
The relationship between $\log P(\xv;\bm{\theta}) $ and $\textsc{Elbo}(\xv; \bm{\phi}, \bm{\theta})$ can also be written as
\begin{equation}
\small
   \textsc{Elbo}(\xv; \bm{\phi}, \bm{\theta})  =  \log P(\xv;\bm{\theta}) - \KL(Q(\zv | \xv; \bm{\phi}), P(\zv|\xv; \bm{\theta})).
\label{eq:secondview}
\end{equation}
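Eq \ref{eq:secondview} can be verified in a few lines. The sketch below uses only the factorization $P(\xv, \zv; \bm{\theta}) = P(\zv|\xv; \bm{\theta}) P(\xv; \bm{\theta})$ and the fact that $\log P(\xv; \bm{\theta})$ does not depend on $\zv$:
\begin{align*}
\textsc{Elbo}(\xv; \bm{\phi}, \bm{\theta})
&= \mathbb{E}_{\zv \sim Q(\zv|\xv; \bm{\phi})} \Big[ \log \frac{P(\zv|\xv; \bm{\theta}) P(\xv; \bm{\theta})}{Q(\zv|\xv; \bm{\phi})} \Big] \\
&= \log P(\xv; \bm{\theta}) + \mathbb{E}_{\zv \sim Q(\zv|\xv; \bm{\phi})} \Big[ \log \frac{P(\zv|\xv; \bm{\theta})}{Q(\zv|\xv; \bm{\phi})} \Big] \\
&= \log P(\xv; \bm{\theta}) - \KL(Q(\zv|\xv; \bm{\phi}), P(\zv|\xv; \bm{\theta})).
\end{align*}
Since the KL term is non-negative, the bound is tight exactly when $Q(\zv|\xv; \bm{\phi})$ coincides with the true posterior $P(\zv|\xv; \bm{\theta})$.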
Under this view, finding a good posterior that minimizes the KL divergence between $Q(\zv | \xv; \bm{\phi})$ and $P(\zv|\xv; \bm{\theta})$ is the key to maximizing $\textsc{Elbo}(\xv; \bm{\phi}, \bm{\theta})$. $Q(\zv | \xv; \bm{\phi})$ is typically assumed to be a multivariate Gaussian distribution for computational convenience. However, Gaussians are not expressive enough to capture the rich semantic properties of natural sentences in the latent space. Implicit VAEs (iVAE) introduce a sample-based posterior representation that does not depend on an explicit density form. Instead, $Q(\zv|\xv; \bm{\phi})$ is represented by a set of samples $\{\zv_{\xv, i}\}_{i=1}^{T}$, where $T$ is the number of samples. The $i$-th sample $\zv_{\xv,i}$ is obtained by
\begin{equation}
\small 
   \bm{\xi}_i \sim \mathcal{N}(\bm{\xi}), \quad \zv_{\xv,i} = \textsc{Enc}(\xv,\bm{\xi}_i;\bm{\phi}),
\label{eq:implicitz}
\end{equation}
where $\mathcal{N}(\bm{\xi})$ is the standard Gaussian distribution and $\textsc{Enc}(\xv,\bm{\xi}_i;\bm{\phi})$ is a noise-aware encoder network that transforms the input sentence and the noise signal together into a latent sample. The sentence representation $\mathbf{h}$ of $\xv$ is generated by an LSTM encoder~\citep{hochreiter1997long}, and $\mathbf{h}$ is concatenated with $\bm{\xi}_i$ and passed through a multi-layer perceptron (MLP) to produce $\zv_{\xv,i}$.
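The description above can be summarized compactly as follows. This is only a notational sketch: the symbols $\textsc{LSTM}$ and $\textsc{MLP}$ and the concatenation operator $[\cdot\,;\cdot]$ are introduced here for illustration, and how $\mathbf{h}$ is read off the LSTM (e.g., from its final hidden state) is an assumption rather than a detail specified in the text:
\begin{align*}
\mathbf{h} &= \textsc{LSTM}(\xv), \\
\zv_{\xv,i} &= \textsc{Enc}(\xv,\bm{\xi}_i;\bm{\phi}) = \textsc{MLP}([\mathbf{h}; \bm{\xi}_i]), \quad \bm{\xi}_i \sim \mathcal{N}(\bm{\xi}).
\end{align*}
Because $\bm{\xi}_i$ enters through a nonlinear network, the induced distribution over $\zv_{\xv,i}$ need not be Gaussian, and its density is generally not available in closed form even though drawing samples remains straightforward.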

Although the sample-based posterior representation grants theoretical flexibility for the distributions of latent variables in high-dimensional spaces, the KL divergence between $Q(\zv|\xv; \bm{\phi})$ and $P(\zv; \bm{\theta})$ in the ELBO of Eq \ref{eq:elbo} becomes intractable. Following \cite{Fang_iLVM_2019_EMNLP}, we compute its dual form based on the Fenchel duality theorem~\citep{rockafellar1966extension, NIPS2018_dai} by introducing an auxiliary function $f(\xv,\zv; \psiv)$,
\begin{align}
\small
 &\KL\left(Q(\zv|\xv; \bm{\phi}), P(\zv; \bm{\theta})\right) \label{eq:dualkl} \\
=&\max_{f} \mathbb{E}_{\zv\sim Q(\zv|\xv; \bm{\phi})}f(\xv,\zv; \psiv)-\mathbb{E}_{\zv\sim P(\zv; \bm{\theta})} \exp(f(\xv,\zv; \psiv)),\nonumber
\end{align}
where $f(\xv,\zv; \psiv)$ outputs a real-valued scalar and $\psiv$ denotes the neural network parameters of $f$. The latent vector $\zv$ in $f$ is drawn either from the posterior $Q(\zv|\xv; \bm{\phi})$ or from the prior $P(\zv; \bm{\theta})$. $f(\xv,\zv; \psiv)$ is implemented as an MLP, which essentially distinguishes between $(\xv, \zv_Q)$ and $(\xv, \zv_P)$, where $\zv_Q$ and $\zv_P$ denote latent variables drawn from the posterior and the prior distribution, respectively. Using the approximate KL function above, the ELBO can be written as
\begin{align}
\small 
 \mathcal{L}_\textsc{iVAE}
&=  \mathbb{E}_{\zv\sim Q(\zv|\xv; \bm{\phi})}\log P(\xv|\zv; \bm{\theta})
  - \mathbb{E}_{\zv\sim Q(\zv|\xv; \bm{\phi})}f(\xv,\zv; \psiv) \nonumber \\
  &\quad + \mathbb{E}_{\zv\sim P(\zv; \bm{\theta})} \exp(f(\xv,\zv; \psiv)).
  \label{eq:ivae}
\end{align}
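For intuition about the dual form in Eq \ref{eq:dualkl}, it may help to see where its maximum is attained. The sketch below is a standard pointwise calculation, assuming that densities exist and that $f$ is unrestricted (in practice $f$ is an MLP, so the optimum is only approximated). Here $q$ and $p$ are shorthand, introduced only for this sketch, for $Q(\zv|\xv; \bm{\phi})$ and $P(\zv; \bm{\theta})$:
\begin{align*}
\frac{\partial}{\partial f}\big[ q f - p \exp(f) \big] = q - p \exp(f) = 0
\quad &\Rightarrow \quad f^{*} = \log \frac{q}{p}, \\
\mathbb{E}_{\zv \sim q} f^{*} - \mathbb{E}_{\zv \sim p} \exp(f^{*})
&= \KL(q, p) - 1.
\end{align*}
The optimal auxiliary function is thus the log density ratio between the posterior and the prior, and the dual objective recovers the KL divergence up to an additive constant. This constant does not depend on $\bm{\phi}$, $\bm{\theta}$, or $\psiv$ and therefore does not affect the gradients.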

Considering a document $D=\{\xv_i\}_{i=1}^n$, the objectives over the whole of $D$ for VAE and iVAE are:
\begin{align}
\small 
\mathcal{L}_\textsc{VAE} = &\mathbb{E}_{\xv \sim D }\Big[\mathbb{E}_{\zv \sim Q(\zv|\mathbf{x}; \bm{\phi})} \log P(\mathbf{x}|\zv; \bm{\theta}) \Big] \nonumber \\
&- \mathbb{E}_{\xv \sim D }\Big[\textsc{KL}(Q(\zv|\mathbf{x}; \bm{\phi}),  P(\zv; \bm{\theta}))\Big],
\label{eq:lvaed}\\
\mathcal{L}_\textsc{iVAE} = &\mathbb{E}_{\xv \sim D}\Big[ \mathbb{E}_{\zv\sim Q(\zv|\xv; \bm{\phi})}\log P(\xv|\zv; \bm{\theta})\Big] \nonumber\\
  &- \mathbb{E}_{\xv \sim D}\Big[\mathbb{E}_{\zv\sim Q(\zv|\xv; \bm{\phi})}f(\xv,\zv; \psiv) \nonumber \\
  &- \mathbb{E}_{\zv\sim P(\zv; \bm{\theta})} \exp(f(\xv,\zv; \psiv))\Big].   \label{eq:livaed}
\end{align}
\paragraph{Mutual Information Regularized iVAE}
In both Eq \ref{eq:lvaed} and Eq \ref{eq:livaed}, the KL term forces each data-dependent posterior distribution $Q(\zv|\xv; \bm{\phi})$ to independently match the same data-agnostic prior distribution $P(\zv; \bm{\theta})$. To improve the efficiency of VAEs, prior work has explored data-dependent priors~\citep{dai-etal-2021-apo,ding-gimpel-2021-flowprior} or matching the aggregated posterior $Q(\zv; \bm{\phi})$ with $P(\zv; \bm{\theta})$, where $Q(\zv; \bm{\phi}) = \int Q(\xv) Q(\zv|\xv; \bm{\phi}) d\xv $ and $Q(\xv)$ is the empirical data distribution. A data-dependent prior requires an additional network to generate a prior distribution from the input data and is not considered in this work. With the aggregated posterior $Q(\zv; \bm{\phi})$, the latent space can be better regularized because the posterior distributions of all sentences jointly match the prior. Replacing $\mathbb{E}_{\xv \sim D} \textsc{KL}(Q(\zv|\mathbf{x}; \bm{\phi}),  P(\zv; \bm{\theta}))$ with $\textsc{KL}(Q(\zv;\bm{\phi}),  P(\zv; \bm{\theta}))$, the new training objective is
\begin{align}
\small 
\mathcal{L}_{\textsc{iVAE}_\textsc{MI}} &= \mathbb{E}_{\xv \sim D}\Big[ \mathbb{E}_{\zv\sim Q(\zv|\xv; \bm{\phi})}\log P(\xv|\zv; \bm{\theta})\Big] \label{eq:livaemi}\\
  &- \mathbb{E}_{\zv\sim Q(\zv; \bm{\phi})}g(\zv; \psiv) + \mathbb{E}_{\zv \sim P(\zv; \bm{\theta})} \exp(g(\zv; \psiv)),   \nonumber
\end{align}
where $g(\zv; \psiv)$ is a function similar to $f(\xv, \zv; \psiv)$ that produces a real-valued scalar. Unlike $f$, it takes only the latent vector $\zv$ as input instead of the concatenation of the input $\xv$ and $\zv$. Samples from $Q(\zv; \bm{\phi})$ can be generated by ancestral sampling: first, $\xv$ is sampled from $D$, and then $\zv$ is sampled from $Q(\zv|\xv; \bm{\phi})$. \cite{Fang_iLVM_2019_EMNLP} show that optimizing the aggregated-posterior-based KL term amounts to maximizing the mutual information $I(\xv, \zv)$ between each input $\xv$ and its latent vector $\zv$ under the joint distribution $Q(\xv, \zv; \bm{\phi})$. This version is therefore named iVAE$_\textsc{mi}$, short for mutual information regularized iVAE.
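The connection to mutual information can be made explicit with a standard decomposition. The sketch below assumes the joint distribution $Q(\xv, \zv; \bm{\phi}) = Q(\xv) Q(\zv|\xv; \bm{\phi})$, with the empirical data distribution $Q(\xv)$ playing the role of $D$:
\begin{align*}
\mathbb{E}_{\xv \sim Q(\xv)} \KL(Q(\zv|\xv; \bm{\phi}), P(\zv; \bm{\theta}))
&= \mathbb{E}_{Q(\xv, \zv; \bm{\phi})} \Big[ \log \frac{Q(\zv|\xv; \bm{\phi})}{Q(\zv; \bm{\phi})} \Big]
 + \mathbb{E}_{Q(\zv; \bm{\phi})} \Big[ \log \frac{Q(\zv; \bm{\phi})}{P(\zv; \bm{\theta})} \Big] \\
&= I(\xv, \zv) + \KL(Q(\zv; \bm{\phi}), P(\zv; \bm{\theta})).
\end{align*}
Penalizing only the aggregated term therefore leaves $I(\xv, \zv)$ unpenalized, whereas the per-sentence KL term in Eq \ref{eq:lvaed} penalizes it as well.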
\subsection{Contrastive Learning over Latent Variables for Text Generation}
Figure \ref{fig:clvae} shows the overall structure of our proposed model, which is based on the iVAE framework. It contains three main components: an encoder, a contrastive learning module, and a decoder.
The encoder is similar to that of iVAE, except that dropout is explicitly enabled. The decoder is identical to that of iVAE. The contrastive learning module is our main design. In this section, we show how to integrate contrastive learning over latent variables with iVAE.

Contrastive learning over latent variables aims to increase the similarity among semantically close posterior samples and decrease it among semantically distant ones~\citep{hadsell2006dimensionality}. Formally, given the latent sample $\zv_i$ of the $i$-th input sentence, a positive latent sample $\zv_i^+$ that is semantically similar to $\zv_i$, and $M$ negative latent samples $\{\zv_{i,j}^-\}_{j=1}^M$ that are semantically far from $\zv_i$, the contrastive training objective for the $i$-th sentence is to maximize:
\begin{equation}
\small 
\label{eq:clobjective}
\ell_i = \log \frac{\exp(\textsc{Sim}(\mf{z}_i, \mf{z}^+_i) / \tau )}{\sum_{\mf{z}\in \{\mf{z}_i^+\} \cup\{\mf{z}_{i,j}^-\}_{j=1}^M}\exp(\textsc{Sim}(\mf{z}_i, \mf{z}) /\tau)},
\end{equation}
where $\tau$ is a temperature hyper-parameter and $\textsc{Sim}(\zv_1, \zv_2)$ is a function measuring the semantic similarity between $\zv_1$ and $\zv_2$. In particular, we use cosine similarity as the semantic similarity metric,
\begin{equation}
\small 
    \textsc{Sim}(\mathbf{z}_1, \mathbf{z}_2) = \frac{\mf{z}_1^\mr{T} \mf{z}_2}{\Vert \mf{z}_1\Vert_2 \cdot \Vert \mf{z}_2\Vert_2},
    \label{eq:sim}
\end{equation}
where $\Vert \mf{z}\Vert_2$ denotes the $\ell_2$ norm of $\zv$.
Following~\cite{Chen2020ASF} and \cite{Gao2021SimCSESC}, we consider batch-based contrastive learning with a batch of $N$ sentences. For each sentence $\xv_i$, we follow \cite{Gao2021SimCSESC} and use dropout as data augmentation. We encode the input sentence twice with dropout enabled and obtain two different views $\xv_i$ and $\xv_i^+$ of the same sentence. $\xv_i$ is then regarded as the anchor sentence and $\xv_i^+$ as the positive sample of $\xv_i$. The remaining sentences in the same batch are treated as the negative samples of $\xv_i$, so the number of negative examples is $M=N-1$. To obtain the corresponding posterior latent variables, we sample $\zv_i \sim Q(\zv |\xv_i; \bm{\phi})$ and $\zv_i^+ \sim Q(\zv |\xv_i^+; \bm{\phi})$. The posterior samples $\zv_j \sim Q(\zv |\xv_j; \bm{\phi})$ with $j \neq i $ serve as the negative samples of $\zv_i$. The overall contrastive learning loss for the whole batch is
\begin{equation}
\small
    \mathcal{L}_\textsc{cl} = \frac{1}{N} \sum_{i=1}^N \log \frac{\exp(\textsc{Sim}(\mf{z}_i, \mf{z}^+_i) / \tau )}{\sum_{\mf{z}\in \{\mf{z}_i^+\} \cup\{\mf{z}_{i,j}^-\}_{j=1}^M}\exp(\textsc{Sim}(\mf{z}_i, \mf{z}) /\tau)}.
\label{eq:clloss}
\end{equation}
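To make the scale of Eq \ref{eq:clobjective} concrete, note that dividing its numerator and denominator by $\exp(\textsc{Sim}(\zv_i, \zv_i^+)/\tau)$ gives an equivalent form that depends only on similarity gaps:
\begin{equation*}
\ell_i = -\log\Big(1 + \sum_{j=1}^{M} \exp\big( (\textsc{Sim}(\zv_i, \zv_{i,j}^-) - \textsc{Sim}(\zv_i, \zv_i^+)) / \tau \big)\Big) \leq 0.
\end{equation*}
As a purely illustrative example with made-up values, take a single negative ($M=1$), $\tau = 0.5$, $\textsc{Sim}(\zv_i, \zv_i^+) = 0.9$, and $\textsc{Sim}(\zv_i, \zv_{i,1}^-) = 0.1$. Then $\ell_i = -\log(1 + \exp(-1.6)) \approx -0.18$. If the two similarities are swapped, $\ell_i = -\log(1 + \exp(1.6)) \approx -1.78$, so maximizing $\ell_i$ rewards a positive that is closer to the anchor than the negatives.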
Adding the contrastive loss into $\mathcal{L}_\textsc{iVAE}$ and $\mathcal{L}_{\textsc{iVAE}_\textsc{MI}}$, we have
\begin{align}
\small 
 \mathcal{L}_{\textsc{clVAE}} &=  \mathcal{L}_\textsc{iVAE} + \mathcal{L}_\textsc{cl}, 
  \label{eq:clvae}
  \\
\mathcal{L}_{\textsc{clVAE}_\textsc{MI}} 
&=  \mathcal{L}_{\textsc{iVAE}_\textsc{MI}} + \mathcal{L}_\textsc{cl}.
  \label{eq:clvaemi}
\end{align}
\paragraph{Improved Dual Function} The KL divergence $\KL(Q(\zv|\xv; \bm{\phi}), P(\zv; \bm{\theta}))$ is non-negative by definition. However, the approximation by the dual function in Eq \ref{eq:dualkl} cannot ensure this property, since $\mathbb{E}_{\zv\sim Q(\zv|\xv; \bm{\phi})}f(\xv,\zv; \psiv)$ can be less than $\mathbb{E}_{\zv\sim P(\zv; \bm{\theta})} \exp(f(\xv,\zv; \psiv))$. To encourage the KL approximation to be non-negative, we resort to a new constraint $\mathbb{E}_{\zv\sim Q(\zv|\xv; \bm{\phi})}f(\xv,\zv; \psiv) > 0 > \mathbb{E}_{\zv\sim P(\zv; \bm{\theta})} f(\xv,\zv; \psiv)$. Using this constraint, we define a new approximation inspired by circle loss~\citep{Sun2020CircleLA},
\begin{align}
\small
 &\KL\left(Q(\zv|\xv; \bm{\phi}), P(\zv; \bm{\theta})\right) \label{eq:newdualkl} \\
=&\min_{f} \log \Big(1 + \exp\big(-\mathbb{E}_{\zv\sim Q(\zv|\xv; \bm{\phi})}f(\xv,\zv; \psiv)\big)\Big)\nonumber \\
&+ \log\Big(1 + \exp\big(\mathbb{E}_{\zv\sim P(\zv; \bm{\theta})} \exp(f(\xv,\zv; \psiv))\big)\Big).\nonumber
\end{align}
To minimize Eq \ref{eq:newdualkl}, $\mathbb{E}_{\zv\sim Q(\zv|\xv; \bm{\phi})}f(\xv,\zv; \psiv)$ should be forced to be much greater than 0 and $\mathbb{E}_{\zv\sim P(\zv; \bm{\theta})}f(\xv,\zv; \psiv)$ should be much less than 0.  Similarly,  $\KL(Q(\zv; \bm{\phi}), P(\zv; \bm{\theta}))$ can be approximated as, 
\begin{align}
\small
 &\KL\left(Q(\zv; \bm{\phi}), P(\zv; \bm{\theta})\right) \label{eq:newdualkl2} \\
=&\min_{g} \log \Big(1 + \exp\big(-\mathbb{E}_{\zv\sim Q(\zv; \bm{\phi})}g(\zv; \psiv)\big)\Big)\nonumber \\
&+ \log\Big(1 + \exp\big(\mathbb{E}_{\zv\sim P(\zv; \bm{\theta})} \text{exp}(g(\zv; \psiv))\big)\Big).\nonumber
\end{align}
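A quick check shows why these approximations cannot become negative. The sketch below uses only the facts that $\log(1+\exp(u)) > 0$ for every real $u$ and that $\exp(\cdot) > 0$; the symbols $a$ and $b$ are shorthand introduced here, written for $f$ but applying to $g$ in the same way:
\begin{align*}
a = \mathbb{E}_{\zv\sim Q(\zv|\xv; \bm{\phi})} f(\xv,\zv; \psiv), \quad
b &= \mathbb{E}_{\zv\sim P(\zv; \bm{\theta})} \exp(f(\xv,\zv; \psiv)) > 0, \\
\log\big(1 + \exp(-a)\big) + \log\big(1 + \exp(b)\big) &> 0 + \log 2 > 0.
\end{align*}
The first term decreases toward $0$ as $a$ grows, and the second decreases toward $\log 2$ as $b$ approaches $0$, that is, as $f$ becomes strongly negative on prior samples.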
Using the proposed KL approximation term, the full objective is to maximize 
\begin{align}
\small 
   & \mathcal{L}_{\textsc{clVAE}}^+ 
=  \mathbb{E}_{\zv\sim Q(\zv|\xv, \bm{\phi})}\text{log} P(\xv|\zv, \bm{\theta}) + \mathcal{L}_\textsc{cl} \nonumber \\
  - & \log \Big(1 + \exp\big(-\mathbb{E}_{\zv\sim Q(\zv| \xv; \bm{\phi})}f(\xv, \zv; \psiv)\big)\Big) \nonumber \\
  - &\log\Big(1 + \exp\big(\mathbb{E}_{\zv\sim P(\zv; \bm{\theta})} \text{exp}(f(\xv, \zv; \psiv))\big)\Big),
  \label{eq:clvaeplus}\\
&\mathcal{L}_{\textsc{clVAE}_\textsc{mi}}^+ 
=  \mathbb{E}_{\zv\sim Q(\zv|\xv, \bm{\phi})}\text{log} P(\xv|\zv, \bm{\theta}) + \mathcal{L}_\textsc{cl} \nonumber \\
  - & \log \Big(1 + \exp\big(-\mathbb{E}_{\zv\sim Q(\zv; \bm{\phi})}g(\zv; \psiv)\big)\Big) \nonumber \\
  - &\log\Big(1 + \exp\big(\mathbb{E}_{\zv\sim P(\zv; \bm{\theta})} \text{exp}(g(\zv; \psiv))\big)\Big).
  \label{eq:clvaemiplus}
\end{align}
\paragraph{Training Algorithm} Algorithm~\ref{alg:clvae} shows the training algorithm of our proposed method. We first sample a mini-batch of paired random Gaussian noise vectors $\bm{\xi}_i$ and $\bm{\xi}_i^+$. After obtaining a mini-batch of input sentences, we pass them through the LSTM encoder with dropout enabled to produce the latent vectors $\zv_{\xv, i}$ and $\zv_{\xv, i}^+$. Then the contrastive loss defined in Eq \ref{eq:clloss} is calculated. In addition, we sample a mini-batch of paired prior vectors $\zv_i$ and $\zv_i^+$ from $P(\zv; \bm{\theta})$. $\psiv$ in $f(\xv, \zv; \psiv)$ is updated according to Eq \ref{eq:klupdate} and is then kept fixed while the encoder parameters $\phiv$ and decoder parameters $\thetav$ are updated according to Eq \ref{eq:modelupdate}. If the dual function $g$ is used instead of $f$, Eq \ref{eq:klupdate} and Eq \ref{eq:modelupdate} are changed accordingly.
\begin{algorithm}[!t]
\SetAlgoNoLine 
\caption{The training algorithm for contrastive learning over latent variables for text generation. }
\label{alg:clvae} 
{\bf Input}: The training data set $D$ and the number of training epochs $T$\;
{\bf Model parameters}: $\bm{\theta}$, $\bm{\phi}$ and $\bm{\psi}$\;
\While{$ t < T $}{
 1. Sample a mini-batch of $\bm{\xi}_i \sim \mathcal{N}(\bm{\xi}), \bm{\xi}_i^+ \sim \mathcal{N}(\bm{\xi})$\; 
 2. Sample a mini-batch of input sentences $\xv_i\sim D$\;
 3. Generate $\zv_{\xv, i} = \textsc{Enc}(\xv_i,\bm{\xi}_i;\bm{\phi})$ and $\zv_{\xv, i}^+ = \textsc{Enc}(\xv_i,\bm{\xi}_i^+;\bm{\phi})$\;
 4. Calculate $\mathcal{L}_\textsc{cl}$ in Eq \ref{eq:clloss}\; 
 5. Sample a mini-batch of $\zv_i, \zv_i^+ \sim P(\zv;\bm{\theta})$\;
 6. Update ${\psiv}$ in $f(\xv,\zv, \psiv)$ to minimize
\begin{align}
\small 
 &\log \big(1 + \exp \big(-\sum_{i} f(\xv_i,\zv_{\xv, i}; \psiv)\big)\big)\nonumber \\
+ &\log\big(1 + \exp\big(\sum_{i} \exp(f(\xv_i,\zv_{i}; \psiv))\big)\big)\nonumber \\
+ &\log \big(1 + \exp \big(-\sum_{i} f(\xv_i,\zv_{\xv, i}^+; \psiv)\big)\big)\nonumber \\
+ &\log\big(1 + \exp\big(\sum_{i} \exp(f(\xv_i,\zv_{i}^+; \psiv))\big)\big). \label{eq:klupdate}
\end{align}
7. Update parameters $\{\phiv, \thetav\}$ to minimize
\begin{align}
\small 
& -\sum_i \log P(\xv_i |\zv_{\xv, i}; \bm{\theta}) - \sum_i \log P(\xv_i |\zv_{\xv, i}^+; \bm{\theta}) - \mathcal{L}_\textsc{cl} \nonumber \\
  & + \log \big(1 + \exp\big(-\sum_i f(\xv_i, \zv_{\xv, i}; \psiv)\big)\big) \nonumber \\
  & + \log \big(1 + \exp\big(-\sum_i f(\xv_i, \zv_{\xv, i}^+; \psiv)\big)\big). \label{eq:modelupdate}
\end{align}
8. $ t \leftarrow  t+1$\; 
}
{\bf Output}: $\bm{\theta}$, $\bm{\phi}$ and $\bm{\psi}$\;
\end{algorithm}
