\section{Introduction}

Deep latent variable models such as variational autoencoders (VAEs)~\citep{bowman2015generating} and deep energy-based models~\citep{Deng2020Residual,NEURIPS2020_pang} have been widely used for text generation applications, such as machine translation~\citep{calixto2019latent}, language modeling~\citep{Deng2020Residual}, dialogue understanding~\citep{shen-etal-2017-conditional}, and story generation~\citep{9265406,jhamtani-berg-kirkpatrick-2020-narrative}. These models follow a sequence-to-sequence~\citep{Sutskever2014SequenceTS} task setting: they first stochastically map the input sentence to a latent variable under suitable distributional assumptions and then reconstruct the whole sentence from it. By manipulating the latent variable, controllable text generation with both diversity and fluency can be achieved. In such models, the latent variable is expected to encode high-level semantic and style information of the input sequence, which can guide the generation of output sequences in different text generation tasks.
\begin{figure}[!t]
\begin{center}
\centerline{\includegraphics[width=0.52\textwidth]{figtexs/clvaemodel.pdf}}
\caption{Contrastive learning over latent variables.}
\label{fig:clvae}
\end{center}
%\vspace{-1em}
\end{figure}

In this paper, we consider VAE and its variant iVAE~\citep{Fang_iLVM_2019_EMNLP}, which addresses two limitations of simple VAEs. The first is the simple posterior, which is usually assumed to be an isotropic Gaussian and thus limits the expressive power of the latent variable. Many efforts have been made to use an enhanced posterior~\citep{Casale2018GaussianPP,Tomczak2018VAEWA}. The second is posterior collapse~\citep{bowman2015generating}, where the learned latent variable is largely ignored by the trained model. %To address this issues, previous researchers propose KL annealing, cyclic learning \citep{Fu2019CyclicalAS},  aggressive training \citep{he2019lagging}. \cite{Fang_iLVM_2019_EMNLP} have presented  \textbf{iVAE}, which can  alleviate the two issues of simple VAEs at the same time.
iVAE addresses these problems by using an implicit representation of the posterior distribution over latent variables, in which the posterior is represented by a set of samples. Mutual information maximization is additionally used to encourage correlation between the inputs and the induced latent variables.

\iffalse
In this paper, we intend to improve both VAE and iVAE using contrastive learning in the latent semantic space. Contrastive learning have been shown to be effective in computer vision, sentence representation learning, and graph representation learning by using self-supervised signals. Contrastive learning is often done on the output feature space. Ideally, VAE models focus on the instance-level reconstruction objective while contrastive learning forces the latent variable to learn high-level semantics by exploring intra-instance relationship in a batch. We hypothesize that contrastive learning can be complementary to the reconstruction objective, leading to better latent representations. 
\fi 

It has been shown that the strength of iVAE compared to vanilla VAE lies in better latent variable representations~\citep{Fang_iLVM_2019_EMNLP}. In particular, the latent variable should faithfully represent the semantics of the input while allowing diversified outputs. Intuitively, further improving these two characteristics of the latent variable can lead to stronger VAE methods. However, there is an intrinsic tradeoff between diversity and faithfulness, as it is challenging to ensure that a set of different latent representations all embody the same semantic information. To this end, both VAE and iVAE make use of the reconstruction objective, learning to reconstruct the same input from different latent variable samples. The effectiveness of this objective is limited by the number of samples that can be drawn, which can hardly match the number of possible variations in the reconstruction output. To address this issue, we consider adding direct supervision signals on the latent representation itself, by encouraging different latent variable samples of the same input to be closer in the vector space than latent variable samples of two different sentences. This loss encourages semantic faithfulness without sacrificing diversity. It can be viewed as a contrastive loss~\citep{NEURIPS2020_Khosla,LKhc2020ContrastiveRL}.
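
For concreteness, one common way to write such a loss is the InfoNCE form sketched below. It is given only as an illustration: the notation, the use of cosine similarity, and the temperature $\tau$ are assumptions made here, not necessarily the exact objective of our model. Let $z_i$ and $z_i^{+}$ denote two latent samples drawn for the $i$-th input in a batch of $N$ inputs. Then
\begin{equation}
\mathcal{L}_{\mathrm{cl}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\left(\mathrm{sim}(z_i, z_i^{+})/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(z_i, z_j^{+})/\tau\right)},
\end{equation}
where $\mathrm{sim}(u, v) = u^{\top} v / (\|u\|\,\|v\|)$ and $\tau > 0$. Minimizing this quantity increases the similarity between samples of the same input (the numerator) relative to samples of different inputs in the batch (the remaining terms of the denominator).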

Our model structure is shown in Figure~\ref{fig:clvae}. In particular, for VAE, we encode the same input sentence twice with different dropout masks and then sample the corresponding latent codes to obtain positive pairs. %They are regarded as different latent views of the input sentence and are therefore a positive training pair.  
Negative pairs are constructed by pairing the sampled latent vector with the remaining latent vectors in the same batch. Contrastive learning is used to increase the similarity of positive pairs and decrease the similarity of negative pairs.
For iVAE, we observe that the KL divergence between the implicit posterior distribution and the prior distribution can explode. The main reason is that, to calculate the KL divergence between sample-based representations, iVAE uses an approximation based on the Fenchel duality theorem~\citep{rockafellar1966extension,NIPS2018_dai}. However, this approximation cannot guarantee the non-negativity of the KL divergence. We empirically observe that contrastive learning on the latent space can ease the explosion problem. We further adapt this approximation by borrowing the idea of CircleLoss~\citep{Sun2020CircleLA} to ensure the non-negativity of the KL approximation.
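
For reference, a standard form of the Fenchel dual representation of the KL divergence is sketched below. The notation ($q$, $p$, $\nu$) is chosen here for illustration, and the exact estimator used in iVAE may differ, for example by an additive constant:
\begin{equation}
\mathrm{KL}(q \,\|\, p) = \max_{\nu} \; \mathrm{E}_{z \sim q}\left[\nu(z)\right] - \mathrm{E}_{z \sim p}\left[\exp(\nu(z))\right] + 1,
\end{equation}
with the maximum attained at $\nu^{*}(z) = \log\left(q(z)/p(z)\right)$. For any other $\nu$, the right-hand side is only a lower bound on the KL divergence. For instance, a constant function $\nu(z) = c$ with $c \neq 0$ gives $c - e^{c} + 1 < 0$. A learned $\nu$ evaluated on finitely many samples is therefore not guaranteed to produce a non-negative estimate.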


We conduct experiments on a synthetic dataset and three language modeling benchmarks. Results show that our model learns better latent representations on the synthetic dataset and achieves better perplexity scores than VAE and iVAE. For example, on the Penn Treebank language modeling benchmark, our model decreases perplexity by more than 10 points compared to iVAE. Adding contrastive learning over latent variables can alleviate posterior collapse in VAEs and avoid the explosion of the approximated KL divergence in iVAEs.
To our knowledge, we are the first to combine contrastive learning with latent variable models for natural language processing. We make our models and code publicly available at \url{https://github.com/zeeeyang/constrastive_vae}. An alternative implementation using MindSpore\footnote{\url{https://www.mindspore.cn/}}, a new deep learning computing framework,
can be found at \url{https://github.com/zeeeyang/constrastive_vae_mindspore}.

