\section{Experiments}
We evaluate our proposed models on two tasks: 1) language modeling and 2) a synthetic clustering setting.
\subsection{Data and Settings} 
\textbf{Language Modeling} For language modeling, we consider three datasets: the Penn Treebank \citep{marcus1993building}, Yahoo \citep{yang2017improved} and Yelp \citep{he2019lagging}. The Penn Treebank (PTB) is a common language modeling benchmark that consists of 42K sentences of varying lengths and 1 million words of Wall Street Journal material from 1989. Yahoo and Yelp are much larger than the PTB: each consists of 100K sentences, with average lengths of 78.7 and 96.0 words, respectively. Each dataset contains train, validation and test sets. The detailed data statistics are in Appendix A.1. We select the best model according to its performance on the validation set and report the results of that model on the test set.  %Table \ref{tab:stat} shows the detailed statistics of the three datasets. 

\textbf{Synthetic Clustering Data} Following \cite{Fang_iLVM_2019_EMNLP}, we design a synthetic clustering experiment to show the learning dynamics of the induced latent variable. Given a random input category $k \in \{0, \dots, K-1\}$, where $K$ is the number of categories, 2-dimensional random Gaussian noise sampled from $\mathcal{N}(0, 1)$ is combined with the one-hot representation of the input category to form the input variable $\mathbf{x}$. An encoder built from multi-layer feedforward neural networks transforms $\mathbf{x}$ into a 2-dimensional latent vector $\mathbf{z}$. A decoder is trained to reconstruct $\mathbf{x}$ from $\mathbf{z}$. Let $\mathbf{y} \in \mathbb{R}^K$ denote the reconstructed output. We use the binary cross-entropy loss, namely $L = - \mathbf{e}_k \log \sigma(\mathbf{y}_k) - \sum_{ i\neq k} (1-\mathbf{e}_i) \log (1-\sigma (\mathbf{y}_i))$, where $k$ is the input category, $\mathbf{e}$ is the one-hot vector representation of category $k$ and $\sigma$ is the sigmoid function $\sigma(x) = \frac{1}{1 + \exp(-x)} $.  %%%%%The specific network structures of encoders and decoders can be found in the Appendix XX.
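Since $\mathbf{e}$ is one-hot, $\mathbf{e}_k = 1$ and $\mathbf{e}_i = 0$ for $i \neq k$, so the standard binary cross-entropy (a negative log-likelihood, with a minus sign on both terms) reduces to a simpler form. As an illustration, assuming $K=4$ and input category $k=2$ (values chosen here only for concreteness), we have $\mathbf{e} = [0,0,1,0]$ and
\[
L = -\log \sigma(\mathbf{y}_2) - \sum_{i \in \{0,1,3\}} \log\bigl(1-\sigma(\mathbf{y}_i)\bigr),
\]
which decreases as $\sigma(\mathbf{y}_2)$ approaches 1 and the remaining $\sigma(\mathbf{y}_i)$ approach 0.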

During training, we generate a batch of one-hot vectors by randomly choosing from $[1,0,0,0]$, $[0,1,0,0]$, $[0,0,1,0]$ and $[0,0,0,1]$ when $K=4$. The batch size is 256 and we train the encoder-decoder for 30K to 80K epochs until convergence. Ideally, after training, the clusters of the latent variable $\mathbf{z}$ in two-dimensional space should correspond exactly to the input categories. To understand the training behaviour of our proposed model, we visualize $\mathbf{z}$ in 2D space.


%\subsubsection{Dialogue Generation} 


\subsection{Language Modeling Results}
\textbf{PTB Results} 
We compare our models with state-of-the-art VAE language models that do not use contrastive learning, including: 1) traditional VAE models with a KL-annealing strategy~\citep{bowman2015generating}; 2) $\beta$VAE~\citep{higgins2017beta}, which controls the penalty on KL using a small hyper-parameter $\beta$; 3) ${Sa}$VAE~\citep{kim2018semi}, which is a semi-amortized VAE; 4) ${Cyc}$VAE~\citep{Fu2019CyclicalAS}, which anneals the KL term in a cyclical way; 5) $lag$VAE~\citep{he2019lagging}, which lags the update of the decoder by aggressively updating the encoder several times; 6) $i$VAE~\citep{Fang_iLVM_2019_EMNLP}, which assumes an implicit distribution over the latent variable, as mentioned before; 7) $i$VAE$_\textsc{mi}$, an enhanced version of $i$VAE that directly considers the mutual information between the input $\mathbf{x}$ and the latent variable $\mathbf{z}$. We denote our model as \textbf{\mymodel}\ (Eq.~\ref{eq:clvae}) and its circle loss enhanced version as \textbf{\mymodelp} (Eq.~\ref{eq:clvaeplus}). \textbf{\mymodel$_\textsc{mi}$} (Eq.~\ref{eq:clvaemi}) and \textbf{\mymodelp$_\textsc{mi}$} (Eq.~\ref{eq:clvaemiplus}) are the corresponding versions enhanced by mutual information.
\input{tables/lmptb}
We evaluate the models in terms of the quality of both the generated texts and the induced latent features. To measure the generated outputs, we use negative ELBO scores ($-$ELBO) and perplexity scores (PPL); the lower they are, the better the generated outputs. Following \cite{Fang_iLVM_2019_EMNLP}, we evaluate the quality of the induced latent features in terms of $\KL(Q(\mathbf{z}|\mathbf{x}; \phiv)||P(\mathbf{z}; \thetav))$, the mutual information $I(\mathbf{x}, \mathbf{z})$ under the joint distribution $Q(\mathbf{x}, \mathbf{z}; \phiv)$, and the number of active units of $\mathbf{z}$, which is defined as $\textsc{AU}(\mathbf{z}) = \textsc{Cov}_\mathbf{x} (\mathbb{E}_{z \sim Q(\mathbf{z}|\mathbf{x}; \phiv)}[z])>0.01$. Since none of these terms can be computed analytically, they are approximated by sampling $\mathbf{z}$ 128 times from $Q(\mathbf{z}|\mathbf{x}; \phiv)$. 
%For more details, please refer to \cite{Fang_iLVM_2019_EMNLP}. 
In general, higher KL, MI and AU values indicate that the model makes fuller use of the latent features. 
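As a sketch of how the active-unit count can be computed from samples, one first estimates the posterior mean of each latent dimension with $S=128$ samples and then counts the dimensions whose mean varies across inputs. The symbols $D$ (latent dimension), $S$ (number of samples) and $\hat{\mu}_d$ (estimated posterior mean) are notation introduced here for illustration, and reading $\textsc{Cov}_\mathbf{x}$ per dimension as a variance over inputs is an assumption:
\[
\hat{\mu}_d(\mathbf{x}) = \frac{1}{S}\sum_{s=1}^{S} z_d^{(s)}, \qquad z^{(s)} \sim Q(\mathbf{z}|\mathbf{x}; \phiv),
\]
\[
\textsc{AU}(\mathbf{z}) \approx \Bigl|\bigl\{\, d \in \{1,\dots,D\} : \mathrm{Var}_{\mathbf{x}}\bigl(\hat{\mu}_d(\mathbf{x})\bigr) > 0.01 \,\bigr\}\Bigr|.
\]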

Table \ref{tab:lmptb} shows the main results on the PTB test set. Our simplest model \mymodel\ outperforms all the baselines, which shows that adding contrastive learning over latent variables greatly improves the model performance. \mymodel\ reduces ELBO and PPL by 4.9 and 10.95 points compared to $i$VAE. The KL and MI values increase by 2.05 points compared to $i$VAE. These results show that our model makes better use of the latent vector than the baselines and produces better outputs. Similar conclusions hold for the models enhanced by mutual information, which suggests that contrastive learning can be complementary to mutual information estimation.
%\mymodel$_{MI}$ gives better results compared to \mymodel. 
In particular, \mymodel$_\textsc{mi}$ gives the best KL and MI values among all the models, suggesting that it is a good choice when the learned latent features are of primary interest.  
With the proposed circle loss enhancement, ELBO and PPL improve to 79.8 and 38.04, well below those of \mymodel\ (82.7 ELBO and 43.51 PPL). Among all models, \mymodelp$_\textsc{mi}$ gives the best results, with 77.7 ELBO and 34.61 PPL.  
In terms of KL and MI, the circle loss enhancement slightly reduces the use of latent variables compared to \mymodel$_\textsc{mi}$. However, the KL and MI values remain relatively large, ranking third among all the models. Therefore, considering both the generation quality and the effective use of the latent space, the circle loss enhancement is useful. 
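For reference, the gains quoted above can be recomputed from the numbers in the text alone (Table \ref{tab:lmptb} remains the authoritative source). Relative to \mymodel\ (82.7 ELBO, 43.51 PPL), the circle loss enhancement gives
\[
82.7 - 79.8 = 2.9 \ \mathrm{(ELBO)}, \qquad 43.51 - 38.04 = 5.47 \ \mathrm{(PPL)}, \qquad \frac{5.47}{43.51} \approx 12.6\%,
\]
and \mymodelp$_\textsc{mi}$ gives
\[
82.7 - 77.7 = 5.0 \ \mathrm{(ELBO)}, \qquad 43.51 - 34.61 = 8.90 \ \mathrm{(PPL)}, \qquad \frac{8.90}{43.51} \approx 20.5\%.
\]
Likewise, the reductions of 4.9 and 10.95 points over $i$VAE imply $i$VAE scores of $82.7 + 4.9 = 87.6$ ELBO and $43.51 + 10.95 = 54.46$ PPL.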

\textbf{Yahoo and Yelp Results}
\input{tables/lmyy}
We use the best configuration of \mymodelp$_\textsc{mi}$ to conduct experiments on Yahoo and Yelp. Table \ref{tab:lmyy} shows the results. On Yahoo, our model outperforms $i$VAE$_\textsc{mi}$ in terms of both ELBO and PPL: it improves ELBO by 5.8 points and reaches a PPL of 44.61, which is 3.32 points lower than that of $i$VAE$_\textsc{mi}$. Its KL and MI values are better than those of $i$VAE and comparable to those of $i$VAE$_\textsc{mi}$. Similar observations hold on the Yelp dataset, where \mymodelp$_\textsc{mi}$ reduces ELBO and PPL by 5.1 and 1.91 points compared to $i$VAE$_\textsc{mi}$, and its KL and MI values are 8.8 and 8.2, respectively, ranking second. Since both Yahoo and Yelp are large corpora, it is challenging to greatly reduce the PPL and ELBO scores while maintaining high KL and MI values at the same time. Empirically, our model strikes a good balance between the two. 

\subsection{Analysis}
\textbf{The effect of contrastive learning over the latent variable} 
Table \ref{tab:clhz} compares contrastive learning over the encoder output $\mathbf{h}$ with contrastive learning over the latent variable $\mathbf{z}$ on the PTB test set. For simplicity, we use two basic models, namely \mymodel\ and \mymodel$_\textsc{mi}$. As Table \ref{tab:clhz} shows, for the same model, contrastive learning over $\mathbf{z}$ gives better results than contrastive learning over $\mathbf{h}$. For example, with \mymodel$_\textsc{mi}$, the model based on $\mathbf{h}$ gives 52.13 PPL, while the model based on $\mathbf{z}$ obtains a much lower 43.49 PPL. In terms of ELBO and PPL alone, the basic model \mymodel\ over $\mathbf{z}$ even performs better than the mutual information enhanced \mymodel$_\textsc{mi}$ over $\mathbf{h}$. Besides, the performance gap between \mymodel\ and \mymodel$_\textsc{mi}$ is large when contrastive learning is done on $\mathbf{h}$, whereas it narrows when contrastive learning is done on $\mathbf{z}$. These results suggest that contrastive learning should be applied to the latent variables, which produces better results than applying it directly to the encoder hidden output. 
\input{tables/clhz}
\input{tables/decoder}

\textbf{The effect on the decoder} 
To evaluate whether the decoder improves by making better use of latent features, we sample latent variables from the prior distribution $P(\mathbf{z})$ and ask the decoder to generate outputs from the sampled latent codes, following \cite{Fang_iLVM_2019_EMNLP} and \cite{kim2017adversarially}. The generated text is evaluated with KenLM~\citep{Heafield-estimate} using two metrics: forward PPL and reverse PPL. The forward PPL evaluates the generated texts with a language model trained on the PTB training corpus; a lower forward PPL indicates that the generated texts are more fluent. The reverse PPL evaluates the PTB corpus with a language model trained on the generated texts; a lower reverse PPL means that the generated texts are more representative of the PTB corpus. The underlying language model is a 5-gram KenLM model. 
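The two metrics can be written compactly as follows; the notation ($\mathcal{D}_\mathrm{train}$, $\mathcal{D}_\mathrm{gen}$, $\mathcal{D}_\mathrm{eval}$, $p_{\mathcal{D}}$ and $N$) is introduced here for illustration only, and details such as the handling of sentence-boundary tokens are omitted. Let $p_{\mathcal{D}}$ be a 5-gram language model estimated on a corpus $\mathcal{D}$. Its perplexity on a text $w_1, \dots, w_N$ is
\[
\mathrm{PPL}_{\mathcal{D}}(w_{1:N}) = \exp\Bigl(-\frac{1}{N}\sum_{n=1}^{N} \log p_{\mathcal{D}}(w_n \mid w_{n-4}, \dots, w_{n-1})\Bigr),
\]
so that, with $\mathcal{D}_\mathrm{train}$ the PTB training corpus, $\mathcal{D}_\mathrm{gen}$ the generated texts and $\mathcal{D}_\mathrm{eval}$ the PTB text being scored,
\[
\mathrm{forward\ PPL} = \mathrm{PPL}_{\mathcal{D}_\mathrm{train}}(\mathcal{D}_\mathrm{gen}), \qquad \mathrm{reverse\ PPL} = \mathrm{PPL}_{\mathcal{D}_\mathrm{gen}}(\mathcal{D}_\mathrm{eval}).
\]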

Table \ref{tab:ppl} shows the forward PPL and reverse PPL on the PTB. For reverse PPL, our \mymodelp\ performs better than all the baselines and \mymodel$_\textsc{mi}$ gives the best values, which shows that our models can better represent the true data distribution. For forward PPL, our model performs better than most baseline models, including $Sa$VAE and $\beta_{0.5}$VAE. iVAE gives the best forward PPL (116), since it imposes fewer constraints on the latent variable and thus benefits from a smoother space. Our model obtains a forward PPL of 120, which is comparable to iVAE: it sacrifices a little fluency, but it can generate more diverse outputs that appear more human-like.    

%\input{figtexs/kltrend}
%\input{figtexs/mitrend}
\input{figtexs/trend}
\textbf{The negativity of approximate KL and MI terms}
Section 3.3 mentioned that the approximated KL and MI terms can become negative when training iVAE. We demonstrate this problem empirically on the PTB language modeling task. Figure \ref{fig:kltrend} and Figure \ref{fig:mitrend} show how the KL term and the mutual information term, respectively, change over training epochs for three models: iVAE, \mymodel\ and \mymodelp$_\textsc{mi}$. As shown in Figure \ref{fig:kltrend}, the approximate KL term of iVAE suddenly becomes a large negative number at the 39th epoch. 
From Figure \ref{fig:mitrend}, we can observe that the approximate MI term is negative at the 0th, 2nd and 39th epochs. This observation shows that the negativity problem of the approximate KL and MI terms does exist in iVAE. Adding the contrastive learning module alleviates this problem to some extent: the KL and MI values of \mymodel\ are more stable than those of iVAE. However, the KL value of \mymodel\ decreases dramatically at the 10th epoch. In contrast, we observe no such phenomena for our \mymodelp$_\textsc{mi}$ model, whose KL and MI values in general increase gradually as training goes on. This evidence suggests that our proposed circle loss enhancement learns a more robust model in terms of the stability of the approximated KL and MI values. 
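As a brief reminder of why negative values signal a problem, the exact quantities cannot be negative. By Gibbs' inequality,
\[
\KL(Q(\mathbf{z}|\mathbf{x}; \phiv)||P(\mathbf{z}; \thetav)) \ge 0,
\]
and, writing $Q(\mathbf{z}; \phiv)$ for the aggregated posterior (a symbol introduced here for illustration),
\[
I(\mathbf{x}, \mathbf{z}) = \mathbb{E}_{\mathbf{x}}\bigl[\KL(Q(\mathbf{z}|\mathbf{x}; \phiv)||Q(\mathbf{z}; \phiv))\bigr] \ge 0.
\]
A negative value can therefore only arise from the approximation used to estimate these terms, not from the quantities themselves.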

%\textbf{More Analysis} More analysis on sentence interpolation and effect of different dropout rates are in A.2 and A.3.
\input{syn}
                







%\subsection{Effect of Dropout Rates}

%\subsection{Effect of Batch Size} 
