\section{Related Work}
\textbf{Variational Auto-Encoders for Text Generation}
VAE \citep{bowman2015generating} is the first model to enable text generation from a continuous latent space, using variational inference and an isotropic Gaussian prior. Many efforts have been made to improve VAE by using more expressive prior distributions \citep{Tomczak2018VAEWA,wang-wang-2019-riemannian,ding-gimpel-2021-flowprior} and by alleviating the posterior collapse issue~\citep{bowman2015generating, higgins2017beta, he2019lagging, Fu2019CyclicalAS}, so that generation depends on the latent representations.
iVAE \citep{Fang_iLVM_2019_EMNLP} uses an implicit, sample-based representation that does not require an explicit density form for the approximate posterior, which allows more flexibility. %In iVAE, there is no tractable form for the KL term in traditional VAE loss. Instead, an approxmiation based on Fenchel duality theorem \citep{rockafellar1966extension} is used to evaluate its dual form. 
To solve the posterior collapse issue, iVAE adopts a mutual information regularization to match the aggregated posterior to the prior distribution. 
APO-VAE \citep{NIPS2018_dai} defines both the prior and the posterior of the latent variables over a Poincaré ball in hyperbolic space and also adopts the training scheme of iVAE. We do not compare with APO-VAE since it additionally adopts a data-dependent VampPrior~\citep{Tomczak2018VAEWA}. Our model is based on iVAE, with direct supervision of the latent variables through contrastive learning.
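
For reference, the posterior collapse issue discussed above can be stated against the standard VAE training objective. The symbols below are introduced here for illustration only: $x$ is a sentence, $z$ its latent variable, $p_\theta(x \mid z)$ the decoder, $q_\phi(z \mid x)$ the approximate posterior, and $p(z)$ the prior (an isotropic Gaussian in the original text VAE). The model maximizes the evidence lower bound
\[
\log p_\theta(x) \;\ge\; \mathrm{E}_{q_\phi(z \mid x)}\bigl[\log p_\theta(x \mid z)\bigr] \;-\; \mathrm{KL}\bigl(q_\phi(z \mid x) \,\Vert\, p(z)\bigr).
\]
Posterior collapse corresponds to the KL term being driven to zero, in which case $q_\phi(z \mid x)$ matches the prior for every $x$ and the decoder can ignore $z$.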
%EBM \citep{NEURIPS2020_pang} 
%%%%Pang et al. (2020a) recently proposes to learn an energy-based model (EBM) in the latent space, where the EBM serves as a prior model for the latent vector. Both the EBM prior and the generator network are learned jointly by maximum likelihood or its approximate variants. The latent space EBM has been applied to text modeling, image modeling, and molecule generation, and significantly improves over VAEs with Gaussian prior, mixture prior and other flexible priors. where the posterior inference is done with short-run MCMC sampling
%%%%%%Pang et al. [57] have shown that neural EBMs can represent expressive prior distributions. However, in this case, the prior is trained using MCMC sampling, and it has been limited to a single group of latent variables. VAEBM [80] combines the VAE decoder with an EBM defined on the pixel space and trains the model using MCMC. Additionally, VAEBM assumes that data lies in a continuous space and applies the energy function in that space. Hence, it cannot be applied to discrete data such as text or graphs. In contrast, NCP-VAE forms the energy function in the latent space and can be applied to non-continuous data. For continuous data, our model can be used along with VAEBM.

\textbf{Contrastive Learning for Sentence Representations}
Contrastive learning~\citep{Oord2018RepresentationLW, Hjelm2019LearningDR,He2020MomentumCF,Chen2020ASF} has achieved great success in self-supervised visual representation learning.  %These models typically pull embeddings from different augmented views of the same image closer, while pushing embeddings from other images apart. 
Recent work transfers this learning strategy to text, using different network architectures and augmentation methods for unsupervised sentence representation learning~\citep{zhang2020unsupervised,Giorgi2021DeCLUTRDC,Carlsson2021SemanticRW,Gao2021SimCSESC}. Among these, %constrastive predictive coding (CPC)  ~\citep{Oord2018RepresentationLW} compares the representation vector of a context with its future  response vectors. 
SimCSE~\citep{Gao2021SimCSESC}, which uses only standard dropout as minimal data augmentation, achieves state-of-the-art results and even performs on par with previous supervised counterparts. Our method uses the same data augmentation as SimCSE, applying dropout to the input batch twice to obtain two different views. Similarly, R-drop~\citep{liang2021rdrop} applies dropout twice to regularize the behaviours of the decoders in language modeling.
However, unlike SimCSE and R-drop, our method aims to better supervise the \emph{latent state}s of variational auto-encoders.
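
As a point of reference, the dropout-based contrastive objective of SimCSE can be sketched as follows; the symbols are introduced here for illustration only. For a batch of $N$ sentences, let $h_i$ and $h_i^{+}$ denote two encodings of the same sentence $x_i$ obtained with independent dropout masks, let $\mathrm{sim}(\cdot,\cdot)$ be cosine similarity, and let $\tau > 0$ be a temperature. The loss for sentence $i$ is
\[
\ell_i \;=\; -\log \frac{\exp\bigl(\mathrm{sim}(h_i, h_i^{+})/\tau\bigr)}{\sum_{j=1}^{N} \exp\bigl(\mathrm{sim}(h_i, h_j^{+})/\tau\bigr)}.
\]
Each sentence is thus pulled toward its own second view and pushed away from the views of the other sentences in the batch.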

\textbf{Contrastive Learning for Text Generation}
\cite{Dai2021DialogueRG} use contrastive predictive coding (CPC)~\citep{Oord2018RepresentationLW} for utterance-level contrastive learning between the dialogue context and the corresponding response. \cite{Su2022ACF} propose a contrastive training objective for text generation, which is used to improve MLE-based training and beam search decoding. In their method, the hidden vector of a token is contrasted with the hidden representation vectors of the remaining tokens in the same sentence, which is different from our method. \cite{lee2021contrastive} combine adversarial perturbations with contrastive learning to address the exposure bias problem in conditional sequence generation. Their contrastive learning is done in the embedding space, which also differs from our method. To our knowledge, we are the first to directly apply contrastive learning over the \emph{latent} vectors of variational auto-encoders.