\subsection{Results on Synthetic Data}
%\input{tables/conspeed.tex}
\input{tables/trendall}
\textbf{$\mathbf{K=4}$} Table~\ref{tab:conspeed} visualizes the learned 2D latent vectors of VAE, iVAE and our model at different training epochs. At epoch $0$, the induced latent variable of VAE is approximately normally distributed and points from different clusters are mixed together. Since both iVAE and our model represent the latent variable with an implicit distribution, their latent variables show no clear distributional structure at the start of training. After $5K$ epochs, VAE makes an initial guess at the cluster of each induced latent variable and iVAE still struggles to cluster the points, whereas our model achieves a clear separation between clusters and captures the latent distribution of the data well.
%clearly separate these points and the  boundaries between different clusters are already obvious.
As training continues to $15K$ and $30K$ epochs, VAE shapes the data distribution like a sandwich, as shown in Table~2: the green and blue points are heavily mixed together, and the points within each cluster are widely spread. Meanwhile, iVAE still does not provide a clear separation between clusters. In contrast, our model starts to shrink the distance among points within the same cluster and to enlarge the distance between different clusters. In fact, our model needs only $3.5K$ epochs to converge, while both VAE and iVAE take $80K$ epochs. After $80K$ epochs, iVAE successfully divides the 2D space into the corresponding clusters and provides a better data distribution than VAE. However, the data points within the same cluster given by iVAE remain dispersed. As shown in Table~\ref{tab:conspeed}, our model converges much faster than VAE and iVAE. It also gives a more compact and coherent representation, with a smaller intra-class distance and a larger inter-class distance than VAE and iVAE. This shows that adding contrastive learning over latent spaces can better regularize the distributions of latent variables.
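
As an illustration of the compactness comparison, and not a quantity reported in this paper, the intra-class and inter-class distances could be measured as follows. The symbols $\mathcal{C}_k$ (the set of latent points with class label $k$), $\bar{\mathbf{z}}_k$ (their centroid), $D_{\mathrm{intra}}$ and $D_{\mathrm{inter}}$ are assumptions introduced only for this sketch:
\[
\bar{\mathbf{z}}_k = \frac{1}{|\mathcal{C}_k|}\sum_{i\in\mathcal{C}_k}\mathbf{z}_i, \qquad
D_{\mathrm{intra}} = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{|\mathcal{C}_k|}\sum_{i\in\mathcal{C}_k}\left\|\mathbf{z}_i-\bar{\mathbf{z}}_k\right\|_2,
\]
\[
D_{\mathrm{inter}} = \frac{2}{K(K-1)}\sum_{1\le k<l\le K}\left\|\bar{\mathbf{z}}_k-\bar{\mathbf{z}}_l\right\|_2 .
\]
Under this assumed definition, a more compact and better separated latent space corresponds to a smaller $D_{\mathrm{intra}}$ and a larger $D_{\mathrm{inter}}$.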

%\input{tables/conk8}

\textbf{$\mathbf{K>4}$} As the number of categories grows, representing the data points with a 2D latent variable becomes more difficult.
Table~\ref{tab:conk8} compares the convergence trends of iVAE and our model when the number of input categories exceeds $4$. When $K=8$, iVAE fails to converge even after $80K$ training epochs, while our model obtains decent separation boundaries after only $5K$ epochs. After $80K$ epochs, our model successfully clusters the input data into the corresponding categories. To further test the capability of our method, we set $K=16$, which is a more challenging setting.
%We also try a challenging setting with $K=16$ using our proposed model. 
In this setting, the class boundaries appear after $5K$ epochs, but they are not as clear as those for $K=8$ under the same training budget. The separation between clusters is refined as training proceeds. With $80K$ training epochs, our model manages to reconstruct the input categories from the 2D latent variable. This shows not only the discriminative nature of the learnt representations, but also the faster convergence that is a potential advantage of combining contrastive learning with latent variables, and the capacity to handle more complex situations than traditional latent variable models.


\subsection{Sentence Interpolation} 
Table~\ref{tab:interpolation} shows sentence interpolation~\citep{bowman2015generating} results for two example sentences. Similar to VAE, our model can generate sentences by interpolating latent semantic vectors. Given two sentences $\mathbf{x}_1$ and $\mathbf{x}_2$, we obtain vectors $\mathbf{z}_1$ and $\mathbf{z}_2$ by averaging samples from the corresponding posteriors $Q_\phi(\mathbf{z}|\mathbf{x}_1)$ and $Q_\phi(\mathbf{z}|\mathbf{x}_2)$. We then form a new latent vector $\mathbf{z} = \lambda \mathbf{z}_1 + (1-\lambda) \mathbf{z}_2$ that interpolates between the semantics of the two sentences, and pass $\mathbf{z}$ to the decoder to produce a sentence with mixed semantics. We vary $\lambda$ from 0 to 1 with a step size of 0.1. As shown in Table~\ref{tab:interpolation}, our model generates sentences that transition smoothly between the semantics of the two input sentences.
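
For concreteness, the interpolation procedure can be written out as follows. The number of posterior samples $S$ and the sample notation $\mathbf{z}_k^{(s)}$ are introduced here for illustration only; the value of $S$ is not specified in this section:
\[
\mathbf{z}_k = \frac{1}{S}\sum_{s=1}^{S}\mathbf{z}_k^{(s)}, \qquad \mathbf{z}_k^{(s)} \sim Q_\phi(\mathbf{z}|\mathbf{x}_k), \qquad k\in\{1,2\},
\]
\[
\mathbf{z}(\lambda) = \lambda\,\mathbf{z}_1 + (1-\lambda)\,\mathbf{z}_2, \qquad \lambda \in \{0, 0.1, \ldots, 1\}.
\]
Under this convention, $\mathbf{z}(0)=\mathbf{z}_2$ and $\mathbf{z}(1)=\mathbf{z}_1$, so the two endpoints decode from the latent vectors of the individual input sentences, and $\lambda=0.5$ gives their midpoint.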


\begin{table}[!t] 
 \caption{Interpolation of latent representation.}
    \label{tab:interpolation}
    %\vspace{-2em}
\footnotesize 
  %\centering
    \begin{tcolorbox}
      %\hspace{-5mm}
      \setlength{\tabcolsep}{2.5pt}
    \begin{tabular}{l l}
$\lambda=0$ & there was \$ N billion and more interest income \\
$\lambda=0.1$ & there was \$ N billion and more interest in 30-year \\
$\lambda=0.2$ & there was \$ N billion and more than \$ N billion \\
$\lambda=0.3$ & there was \$ N million more than two days \\
$\lambda=0.4$ & there was \$ N million more than in chicago \\
$\lambda=0.5$ & we had \$ N million in the stock market \\
$\lambda=0.6$ &  we had \$ N million in the latest period \\
$\lambda=0.7$ & we had \$ N million in the latest period \\
$\lambda=0.8$ &  we 'll N years in its latest day \\
$\lambda=0.9$ &  i went in N with a few months ago \\
$\lambda=1$ & i went in N with a few months in N \\
    \end{tabular}
    \end{tcolorbox}%
    \vspace{-2em}
\end{table}

\subsection{Effect of Different Dropout Rates}

\input{figtexs/dropout}

We use PTB to study the effect of dropout by varying the dropout rate from 0.1 to 0.9 with a step size of 0.1. Figure~\ref{fig:dropout} shows the PPL and ELBO values for different dropout rates. We find that a dropout rate of $0.3$ gives the best performance. Empirically, performance improves as the dropout rate increases from 0.1 to 0.3, remains stable for dropout rates in $[0.3, 0.6]$, and degrades for large dropout rates ($\geq 0.7$).

\subsection{Different Data Augmentation Methods}
The selection of positive and negative samples is very important for the success of contrastive learning. In this paper, we follow~\cite{Gao2021SimCSESC} and use dropout as a minimal form of data augmentation. Applying dropout twice to a recurrent encoder for the same input sentence leads to two different (but still semantically related) posterior distributions. The latent samples that represent these two implicit posterior distributions can be sufficiently different to be useful for contrastive learning in variational auto-encoders. We additionally perform experiments that use random swap to create negative samples. Given an input sentence, for each position from $i=1$ to $i=n-1$ we flip a coin with probability 0.1 to decide whether to swap the $i$-th token and the $(i+1)$-th token. In this way, the word order is perturbed. We compare the dropout-based augmentation with the swap-based augmentation. Table~\ref{tab:aug} shows that dropout is a better augmentation method for the proposed model than the swap-based method.
We leave the exploration of further data augmentation methods to future work.
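
The swap-based augmentation can be sketched as a single left-to-right pass over the sentence. We assume here that the coin flips $b_i$ are independent and that each swap acts on the sequence produced by the previous step; neither detail is stated above, and $b_i$, $\mathbf{x}^{(i)}$ and $\tau_i$ are notation introduced only for this sketch. With $\mathbf{x}^{(0)}=(x_1,\ldots,x_n)$ and $\tau_i$ the operation that exchanges the tokens at positions $i$ and $i+1$,
\[
b_i \sim \mathrm{Bernoulli}(0.1), \qquad
\mathbf{x}^{(i)} = \left\{ \begin{array}{ll} \tau_i\bigl(\mathbf{x}^{(i-1)}\bigr) & \mathrm{if}\ b_i = 1, \\ \mathbf{x}^{(i-1)} & \mathrm{otherwise,} \end{array} \right. \qquad i = 1, \ldots, n-1,
\]
and the perturbed sentence is $\tilde{\mathbf{x}} = \mathbf{x}^{(n-1)}$. The expected number of swaps is $0.1\,(n-1)$, i.e. two swaps on average for a sentence of $n=21$ tokens, so most of the original word order is preserved.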


\begin{table}[t]
\caption{Comparisons with different augmentation methods.  }
\label{tab:aug}
\vspace{-2em}
\setlength{\tabcolsep}{2.5pt}
\begin{center}
\begin{small}
\begin{sc}
\begin{tabular}{lccccc}
\toprule
 Augmentation & -ELBO\! $\downarrow$ & PPL\! $\downarrow$ & KL\! $\uparrow$  & MI\! $\uparrow$ & AU\! $\uparrow$ \\
\midrule
swap & 79.1 & 36.96 & 9.21 & 8.95  & 32 \\
dropout & \bf 77.7 & \bf 34.61 & \bf 9.94 & \bf 9.58 & \bf 32 \\
\bottomrule
\end{tabular}
\end{sc}
\end{small}
\end{center}
\vspace{-2em}
\end{table}


%\input{tables/dd}