\section{Experiment}
We evaluate our proposed models on three tasks: 1) a synthetic clustering setting; 2) language modeling; 3) dialogue generation. 

\subsection{Data and Settings} 
\subsubsection{Synthetic Clustering Data} Following \cite{Fang_iLVM_2019_EMNLP}, we design a synthetic clustering experiment to show the learning dynamics of the induced latent variable. Given a random input category $k \in \{0, \dots, K-1\}$, 2-dimensional Gaussian noise sampled from $\mathcal{N}(0, 1)$ is combined with the one-hot representation of the category to form the input variable $\mathbf{x}$. An encoder consisting of a multi-layer feedforward neural network transforms $\mathbf{x}$ into a 2-dimensional latent vector $\mathbf{z}$. A decoder is trained to reconstruct $\mathbf{x}$ from $\mathbf{z}$. Let the reconstructed output be $\mathbf{y} \in \mathbb{R}^4$ (for $K=4$). We use the binary cross-entropy loss $L = - \mathbf{e}_k \log \sigma(\mathbf{y}_k) - \sum_{ i\neq k} (1-\mathbf{e}_i) \log (1-\sigma (\mathbf{y}_i))$, where $k$ is the input category, $\mathbf{e}$ is the one-hot vector representation of category $k$, and $\sigma$ is the sigmoid function $\sigma(x) = \frac{1}{1 + \exp(-x)}$. The network structures of the encoders and decoders are given in the Appendix.
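As a quick sanity check of the standard binary cross-entropy, consider a worked example under assumptions introduced here purely for illustration: $K=4$, true category $k=0$, and an uninformative decoder output $\mathbf{y}=\mathbf{0}$, so that $\sigma(\mathbf{y}_i)=\tfrac{1}{2}$ for every $i$. The loss then evaluates to
\[
L = -\log\sigma(0) - \sum_{i \neq 0} \log\bigl(1-\sigma(0)\bigr) = 4\log 2 \approx 2.77 ,
\]
which is roughly the value to expect before training if the outputs start near zero. A decoder that recovers the category drives $\sigma(\mathbf{y}_k)\to 1$ and $\sigma(\mathbf{y}_i)\to 0$ for $i\neq k$, and hence $L \to 0$.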

During training, we generate a batch of one-hot vectors by randomly choosing from $[1,0,0,0]$, $[0,1,0,0]$, $[0,0,1,0]$ and $[0,0,0,1]$ when the number of categories is 4. The batch size is 256 and we train the encoder-decoder for 30K to 80K epochs until convergence. Ideally, after training, the clusters of the latent variable $\mathbf{z}$ in the 2-dimensional space should correspond exactly to the input categories. To understand the training behaviour of our proposed model, we visualize $\mathbf{z}$ in 2D space, as shown in Table~\ref{tab:conspeed}.

\subsubsection{Language Modeling} For language modeling, we consider three datasets: the Penn Treebank \citep{marcus1993building}, Yahoo \citep{yang2017improved} and Yelp \citep{he2019lagging}. The Penn Treebank (PTB) is a common benchmark for language modeling, which consists of 42K sentences of varying lengths and 1 million words of Wall Street Journal material from 1989. Yahoo and Yelp are much larger datasets than the PTB. Each consists of 100K sentences, with average lengths of 78.7 and 96.0 words, respectively. Each dataset contains train, validation and test sets. Table \ref{tab:stat} shows the detailed statistics of the three datasets. We select the best model according to its performance on the validation set and report the results of that model on the test set.
\input{tables/datastat}

\subsubsection{Dialogue Generation} 

\subsection{Synthetic Clustering Results}

\input{tables/conspeed.tex}
Table \ref{tab:conspeed} visualizes the learned 2D latent vectors of VAE, $i$VAE and our model at different training epochs. In the beginning (epoch$=0K$), the induced latent variable of VAE roughly follows a normal distribution and the points are mixed together. Since both $i$VAE and our model represent the latent variable using an implicit distribution, there are no clues about the distribution of the latent variable in the beginning. After $5K$ training epochs, VAE makes an initial guess at the cluster of each induced latent variable and $i$VAE is still struggling to cluster the points, whereas our model achieves a clear separation between clusters and captures the latent distribution of the data well.
%clearly separate these points and the  boundaries between different clusters are already obvious.
As training continues to 15K and 30K epochs, VAE shapes the data distribution like a sandwich: as shown in Table~\ref{tab:conspeed}, the green and blue points are widely mixed together and the data points within the same cluster are widely spread. Meanwhile, $i$VAE still cannot provide a clear separation between clusters. In contrast, our model starts to shorten the distance among points within the same cluster and to enlarge the distance between different clusters. In fact, our model only needs 3.5K epochs to converge, while both VAE and $i$VAE take 80K epochs. After 80K epochs, $i$VAE successfully divides the 2D space into the corresponding clusters and provides a better data distribution than VAE. However, the data points within the same cluster given by $i$VAE are dispersed. As shown in Table \ref{tab:conspeed}, our model converges much faster than VAE and $i$VAE. It also gives a more compact and coherent representation, with smaller intra-class distances and larger inter-class distances than VAE and $i$VAE. This shows that adding contrastive learning over latent spaces can better regularize the distributions of latent variables.

\input{tables/conk8}

When the number of categories becomes large, representing the data points with a 2D latent variable becomes more difficult.
Table \ref{tab:conk8} compares the convergence trends of $i$VAE and our model when the number of input categories exceeds $K=4$. When $K=8$, $i$VAE fails to converge even after 80K training epochs, while our model obtains decent separation boundaries after only 5K epochs. After 80K training epochs, our model successfully clusters the input data into the corresponding categories. To further test the capability of our method, we set $K=16$, which makes the task more challenging.
%We also try a challenging setting with $K=16$ using our proposed model. 
In this setting, the class boundaries appear after 5K epochs, but they are not as clear as those for $K=8$ under the same training budget. The separation between clusters is refined as training goes on. With 80K training epochs, our model manages to reconstruct the input categories from the 2D latent variable. This shows not only the discriminative nature of the learnt representations, but also the convergence speed, which is a potential advantage of combining contrastive learning with latent variables, and the capacity to handle more complex situations than traditional latent variable models.





\iffalse 
\begin{table}[t]
\caption{Total training time in hours: absolute time and relative time versus VAE.}
\label{table:lm_traintime}
\vskip 0.15in
\begin{center}
\begin{small}
\begin{sc}
\begin{tabular}{c|cc|cc|cc}
\toprule
\multirow{2}{*}{Dataset} & \multicolumn{2}{c|}{$\mathtt{PTB}$} & \multicolumn{2}{c|}{$\mathtt{Yahoo}$} & \multicolumn{2}{c}{$\mathtt{Yelp}$}\\
\cline{2-7}
 & Re.\! $\downarrow$ & Abs.\! $\downarrow$ & Re.\! $\downarrow$ & Abs.\! $\downarrow$ & Re.\! $\downarrow$ & Abs.\! $\downarrow$ \\
\midrule
VAE/$\beta$- & 1.0 & 1.3 & 1.0 & 5.3 & 1.0 & 5.7 \\
SA-VAE & 5.5 & 7.1 & 9.9 & 52.9 & 10.3 & 59.3 \\
Lag-VAE & - & - & 2.2 & 11.7 & 3.7 & 21.4 \\
iVAEs & 1.4 & 1.8 & 1.3 & 6.9 & 1.3 & 7.5 \\
\bottomrule
\end{tabular}
\end{sc}
\end{small}
\end{center}
\vskip -0.1in
\end{table}
\fi 

\subsection{Language Modeling Results}

\subsubsection{PTB Results}

\input{tables/lmptb}

We compare our models with state-of-the-art VAE language models that do not use contrastive learning, including: 1) the traditional VAE with a KL-annealing strategy~\citep{bowman2015generating}; 2) $\beta$VAE~\citep{higgins2017beta}, which controls the penalty on the KL term using a small hyper-parameter $\beta$; 3) ${Sa}$VAE~\citep{kim2018semi}, which is a semi-amortized VAE; 4) ${Cyc}$VAE~\citep{fu2019cyclical}, which anneals the KL term cyclically; 5) $lag$VAE~\citep{he2018lagging}, which lags the decoder update by aggressively updating the encoder several times; 6) $i$VAE~\citep{Fang_iLVM_2019_EMNLP}, which assumes an implicit distribution over the latent variable, as mentioned before; and 7) $i$VAE$_{MI}$, an enhanced version of $i$VAE that directly considers the mutual information between the input $\mathbf{x}$ and the latent variable $\mathbf{z}$. We denote our model as \textbf{\mymodel} and its circle-loss-enhanced version as \textbf{\mymodelp}. \textbf{\mymodel$_{MI}$} and \textbf{\mymodelp$_{MI}$} are the corresponding versions enhanced by mutual information.

We evaluate the models in terms of the quality of both the generated texts and the induced latent features. To measure the generated outputs, we use negative ELBO scores ($-$ELBO) and perplexity scores (PPL); lower is better for both. Following \cite{Fang_iLVM_2019_EMNLP}, we evaluate the quality of the induced latent features in terms of $KL(Q_\phi(\mathbf{z}|\mathbf{x})\,\|\,P(\mathbf{z}))$, the mutual information $I(\mathbf{x}, \mathbf{z})$ under the joint distribution $Q_\phi(\mathbf{x}, \mathbf{z})$, and the number of active units $AU(\mathbf{z})$, where a unit is counted as active if $\mathrm{Cov}_\mathbf{x} (\mathbb{E}_{\mathbf{z} \sim Q_\phi(\mathbf{z}|\mathbf{x})}[\mathbf{z}])>0.01$. Since none of these terms can be computed analytically, they are approximated by sampling $\mathbf{z}$ 128 times from $Q_\phi(\mathbf{z}|\mathbf{x})$.
%For more details, please refer to \cite{Fang_iLVM_2019_EMNLP}. 
In general, higher KL, MI and AU values indicate that the latent features are used more fully by the latent variable model.
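To make the active-unit criterion concrete, one common way to write it per latent dimension is sketched below. The dimension index $d$, the latent dimensionality $D$, the sample count $S$ and the indicator $\mathbf{1}[\cdot]$ are notation introduced here for illustration, and the per-dimension variance is our reading of the covariance criterion above:
\[
\hat{\mu}_d(\mathbf{x}) = \frac{1}{S}\sum_{s=1}^{S} z_d^{(s)}, \quad \mathbf{z}^{(s)} \sim Q_\phi(\mathbf{z}|\mathbf{x}), \qquad
AU(\mathbf{z}) = \sum_{d=1}^{D} \mathbf{1}\bigl[\mathrm{Var}_{\mathbf{x}}\bigl(\hat{\mu}_d(\mathbf{x})\bigr) > 0.01\bigr],
\]
with $S=128$ matching the sample count stated above. A dimension whose posterior mean barely varies across inputs carries almost no information about $\mathbf{x}$ and is therefore not counted as active.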

Table \ref{tab:lmptb} shows the main results on the PTB test set. Our simplest model, \mymodel, outperforms all the baselines, which shows that adding contrastive learning over latent variables greatly improves model performance. For instance, \mymodel\ reduces the negative ELBO and PPL by 4.9 and 10.95 points, respectively, compared to $i$VAE. The KL and MI values also increase by 2.05 points compared to $i$VAE. These results show that our model produces better results and makes better use of the latent vector than the baselines. Similar conclusions hold for the models enhanced by mutual information, which suggests that contrastive learning can be complementary to mutual information estimation.
%\mymodel$_{MI}$ gives better results compared to \mymodel. 
In particular, \mymodel$_{MI}$ gives the best KL and MI values among all the models, which makes it a good choice when the learned latent features themselves are of interest.
With the proposed circle loss enhancement, performance improves with respect to ELBO and PPL: the model obtains 79.8 ELBO and 38.04 PPL, which is much better than \mymodel\ (82.7 ELBO and 43.51 PPL). Among these variants, \mymodelp$_{MI}$ gives the best results, with 79.3 ELBO and 37.21 PPL.
In terms of KL and MI, the circle loss enhancement slightly reduces the use of latent variables compared to \mymodel$_{MI}$. However, the KL and MI values remain relatively large, ranking 3rd among all the models. Therefore, considering both the generation quality and the effective use of the latent space, we believe the circle loss enhancement is still useful.

\subsubsection{Yahoo and Yelp Results}
\input{tables/lmyy}
We use the best configuration, \mymodelp$_{MI}$, to conduct experiments on Yahoo and Yelp; Table \ref{tab:lmyy} shows the results. On Yahoo, our model outperforms $i$VAE$_{MI}$ in terms of both ELBO and PPL: it improves ELBO by 3.8 points and obtains a PPL of 45.74, which is 2.19 points lower than that of $i$VAE$_{MI}$. In terms of KL and MI, our model is better than $i$VAE and comparable to $i$VAE$_{MI}$. Similar observations hold on the Yelp dataset, where \mymodelp$_{MI}$ reduces ELBO and PPL by 3.0 and 1.11 points compared to $i$VAE$_{MI}$, and the gap between the two models in KL and MI is very small. Since both Yahoo and Yelp are large corpora, it is difficult to greatly reduce the PPL and ELBO scores while maintaining high KL and MI values at the same time; empirically, our model balances this trade-off well.


\subsection{The Effect of Contrastive Learning on Latent Variables}
\input{tables/clhz}
Table \ref{tab:clhz} compares contrastive learning over the encoder output $\mathbf{h}$ with contrastive learning over the latent variable $\mathbf{z}$ on the PTB test set. For simplicity, we use two basic models, namely \mymodel\ and \mymodel$_{MI}$. As shown in Table~\ref{tab:clhz}, for the same model, contrastive learning over $\mathbf{z}$ gives better results than contrastive learning over $\mathbf{h}$. For example, with the \mymodel$_{MI}$ model, the variant based on $\mathbf{h}$ gives 52.13 PPL, while the variant based on $\mathbf{z}$ obtains a much lower 43.49 PPL. In terms of ELBO and PPL alone, the basic model \mymodel\ over $\mathbf{z}$ even performs better than the mutual-information-enhanced \mymodel$_{MI}$ over $\mathbf{h}$. Moreover, when contrastive learning is applied to $\mathbf{h}$, the performance gap between \mymodel\ and \mymodel$_{MI}$ is large, whereas the gap narrows when contrastive learning is applied to $\mathbf{z}$. These results suggest that it is necessary to do contrastive learning over latent variables, which produces better results than doing contrastive learning directly on the encoder hidden output.

\begin{table}[t!]
\caption{Forward and reverse PPL on PTB test set.}
\label{tab:ppl}
\begin{center}
\begin{small}
\begin{sc}
\begin{tabular}{l|r|r}
\toprule
Model & forward\! $\downarrow$ & reverse\! $\downarrow$ \\
\midrule
VAE & 18,494 & 10,149 \\
$Cyc$VAE & 3,390 & 5,587 \\
AE & 672 & 2,589 \\
$\beta_{0}$VAE & 625 & 1,897 \\
$\beta_{0.5}$VAE & 939 & 4,078 \\
$Sa$VAE & 341 & 10,651 \\
$i$VAE & \textbf{116} & {1,520} \\
$i$VAE$_{\text{MI}}$ & {134} & {1,137} \\ 
\midrule 
%\mymodelp & 159 & 1,117 \\
\mymodelp$_{\text{MI}}$ & 120 & \bf 1,047 \\ 
\bottomrule
\end{tabular}
\end{sc}
\end{small}
\end{center}
\end{table}

\subsection{The Effect on the Decoder}

To evaluate whether the decoder is improved by making better use of latent features, we sample latent variables from the prior distribution $P(\mathbf{z})$ and let the decoder generate outputs from the sampled latent codes, following \cite{Fang_iLVM_2019_EMNLP} and \cite{kim2017adversarially}. The generated text is evaluated with KenLM~\citep{Heafield-estimate} using two metrics: forward PPL and reverse PPL. The forward PPL evaluates the generated texts with a language model trained on the PTB training corpus; a lower forward PPL indicates that the generated texts are more fluent. The reverse PPL evaluates the PTB corpus with a language model trained on the generated texts; a lower reverse PPL means that the generated texts are more representative of the PTB corpus. In both cases, the underlying language model is a 5-gram KenLM model.
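For reference, both metrics rely on the usual corpus-level perplexity. Written out under notation introduced here for illustration ($N$ is the number of tokens in the evaluated text, $w_i$ are its tokens, and $p_{\mathrm{LM}}$ is the 5-gram model), and up to implementation details such as sentence-boundary tokens, it reads
\[
\mathrm{PPL} = \exp\Bigl(-\frac{1}{N}\sum_{i=1}^{N}\log p_{\mathrm{LM}}(w_i \mid w_{i-4},\dots,w_{i-1})\Bigr).
\]
For the forward PPL, $p_{\mathrm{LM}}$ is estimated on the PTB training corpus and the $w_i$ are generated tokens; for the reverse PPL the roles are swapped, so that $p_{\mathrm{LM}}$ is estimated on the generated text and the $w_i$ come from the PTB corpus.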

Table \ref{tab:ppl} shows the forward PPL and reverse PPL on the PTB. For reverse PPL, our \mymodelp\ performs better than all the baselines and \mymodelp$_{MI}$ gives the best value, which shows that our models can better represent the true data distribution. For forward PPL, our model performs better than most baseline models, including $Sa$VAE and $\beta_{0.5}$VAE. $i$VAE gives the best forward PPL (116), since it imposes fewer constraints on the latent space and thus benefits from a smoother space. Our model obtains a forward PPL of 120, which is comparable to $i$VAE: it sacrifices a little fluency, but it can generate more diverse outputs, which may make the text appear more human-like.

\input{figtexs/kltrend}
\input{figtexs/mitrend}
\subsection{The Negativity of Approximate KL and MI Terms}

In Section XX, we argue that the approximated KL and MI terms can become negative when training $i$VAE. We demonstrate this problem empirically on the PTB language modeling task. Figure \ref{fig:kl} and Figure \ref{fig:mi} show how the KL term and the mutual information term, respectively, change over the training epochs for three models: $i$VAE, \mymodel\ and \mymodelp$_{MI}$. As shown in Figure \ref{fig:kl}, the approximate KL term of $i$VAE suddenly becomes a large negative number at the 39th epoch.
From Figure~\ref{fig:mi}, we observe that the approximate MI terms at the 0th, 2nd and 39th epochs are all negative. This observation shows that the negativity problem of the approximate KL and MI terms does exist in $i$VAE. Adding the contrastive learning module alleviates this problem to some extent: the KL and MI values of \mymodel\ are more stable than those of $i$VAE. However, the KL value of \mymodel\ decreases dramatically at the 10th epoch. In contrast, we observe no such phenomena for our \mymodelp$_{MI}$ model, whose KL and MI values generally increase gradually as training goes on. This evidence suggests that our proposed circle loss enhancement learns a more robust model in terms of the stability of the approximated KL and MI values.

\iffalse 
\begin{figure*}[!t]
  \caption{}
  \label{fig:klmi}
  \begin{minipage}[t]{0.5\linewidth}
    \centering
    \includegraphics[scale=0.4]{kltrend.pdf}
    \caption{Jay}
    \label{fig:side:a}
  \end{minipage}%
  \begin{minipage}[t]{0.5\linewidth}
    \centering
    \includegraphics[scale=0.4]{mitrend.pdf}
    \caption{aaa}
    \label{fig:side:b}
  \end{minipage}
\end{figure*}
\fi 

\subsection{Sentence Interpolation} 
Table \ref{tab:interpolation} shows the sentence interpolation~\citep{bowman2015generating} results for two example sentences. Similar to VAE, our model can generate sentences by interpolating the latent semantic vectors. Given two sentences $\mathbf{x}_1$ and $\mathbf{x}_2$, we generate vectors $\mathbf{z}_1$ and $\mathbf{z}_2$ by averaging samples from $Q_\phi(\mathbf{z}|\mathbf{x}_1)$ and $Q_\phi(\mathbf{z}|\mathbf{x}_2)$, respectively. Given $\mathbf{z}_1$ and $\mathbf{z}_2$, we generate a new latent vector $\mathbf{z} = \lambda \mathbf{z}_1 + (1-\lambda) \mathbf{z}_2$ by interpolating the sentence semantics of $\mathbf{z}_1$ and $\mathbf{z}_2$. Then $\mathbf{z}$ is used by the decoder to produce a sentence with mixed semantics. $\lambda$ is varied from 0 to 1 with a step size of 0.1. As shown in Table \ref{tab:interpolation}, our model generates sentences that smoothly blend the semantics of the two input sentences.
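Spelled out for the endpoints and the midpoint of this schedule (a direct consequence of the interpolation formula above, not an additional result):
\[
\lambda = 0:\ \mathbf{z} = \mathbf{z}_2, \qquad \lambda = 0.5:\ \mathbf{z} = \tfrac{1}{2}(\mathbf{z}_1 + \mathbf{z}_2), \qquad \lambda = 1:\ \mathbf{z} = \mathbf{z}_1 ,
\]
so sweeping $\lambda$ from 0 to 1 in steps of 0.1 yields 11 latent vectors that move linearly from $\mathbf{z}_2$ to $\mathbf{z}_1$, one per row of Table~\ref{tab:interpolation}.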


\begin{table}[!t] 
 \caption{Interpolation of latent representation.}
    \label{tab:interpolation}
\footnotesize 
  %\centering
    \begin{tcolorbox}
      %\hspace{-5mm}
      \setlength{\tabcolsep}{2.5pt}
    \begin{tabular}{l l}
$\lambda=0$ & there was \$ N billion and more interest income \\
$\lambda=0.1$ & there was \$ N billion and more interest in 30-year \\
$\lambda=0.2$ & there was \$ N billion and more than \$ N billion \\
$\lambda=0.3$ & there was \$ N million more than two days \\
$\lambda=0.4$ & there was \$ N million more than in chicago \\
$\lambda=0.5$ & we had \$ N million in the stock market \\
$\lambda=0.6$ &  we had \$ N million in the latest period \\
$\lambda=0.7$ & we had \$ N million in the latest period \\
$\lambda=0.8$ &  we 'll N years in its latest day \\
$\lambda=0.9$ &  i went in N with a few months ago \\
$\lambda=1$ & i went in N with a few months in N \\
    \end{tabular}
    \end{tcolorbox}%
    %\vspace{2mm}
\end{table}

\iffalse 
\begin{table}[t] \scriptsize
    %\centering
    \begin{tcolorbox}
    \hspace{-5mm}
    \begin{tabular}{l l}
        {\bf Input}: & it was super dry and had a \textcolor{red}{weird}  \textcolor{red}{taste} to the entire \textcolor{red}{slice} . \\
        {\bf ARAE}: & it was \textcolor{blue}{super nice} and \textcolor{blue}{the owner} was \textcolor{blue}{super sweet and helpful} . \\
        {\bf iVAE$_{\text{MI}}$}: & it was \textcolor{red}{super tasty} and a good size with the best in the \textcolor{red}{burgh} . \\
         \hfill  \\

        {\bf Input}: & so i only had \textcolor{red}{half} of the regular \textcolor{red}{fries and my soda} . \\
        {\bf ARAE}: & it 's the \textcolor{blue}{best to eat} and \textcolor{blue}{had a great} \textcolor{blue}{meal} .\\
        {\bf iVAE$_{\text{MI}}$}: & so i had a \textcolor{red}{huge side} and the price was great . \\
        \hfill  \vspace{0mm}  \\

        {\bf Input}: & i am just \textcolor{red}{not a fan} of this kind of \textcolor{red}{pizza} . \\
        {\bf ARAE}: & i am very \textcolor{blue}{pleased} and will definitely use \textcolor{blue}{this place} . \\
        {\bf iVAE$_{\text{MI}}$}: & i am just \textcolor{red}{a fan} of \textcolor{red}{the chicken}  \textcolor{red}{and egg roll} .

      \end{tabular}
    \end{tcolorbox}

    \begin{tcolorbox}
      \hspace{-5mm}
    \begin{tabular}{l l}
        {\bf Input}: & i have eaten the \textcolor{red}{lunch}  \textcolor{red}{buffet} and it was \textcolor{red}{outstanding} ! \\
        {\bf ARAE}: & once again , i was \textcolor{blue}{told by the wait} and was \textcolor{blue}{seated} . \\
        {\bf iVAE$_{\text{MI}}$}: & we were \textcolor{red}{not impressed} with \textcolor{red}{the buffet} there last night . \\
         \hfill  \vspace{0mm}  \\

        {\bf Input}: & my favorite food is \textcolor{red}{kung pao beef} , it is \textcolor{red}{delicious} . \\
        {\bf ARAE}: & my \textcolor{blue}{husband} was on the \textcolor{blue}{phone} , which i tried it . \\
        {\bf iVAE$_{\text{MI}}$}: & \textcolor{red}{my chicken} was n't warm , though it is \textcolor{red}{n't delicious} . \\
        \hfill  \vspace{0mm}  \\

        {\bf Input}: & overall , it was a very \textcolor{red}{positive} \textcolor{red}{dining experience} . \\
         %\vspace{1mm}
        {\bf ARAE}: & overall , it was very \textcolor{blue}{rude and}  \textcolor{blue}{unprofessional} . \\
        {\bf iVAE$_{\text{MI}}$}: & overall , it was a nightmare of \textcolor{red}{terrible experience} .

    \end{tabular}
    \end{tcolorbox}
    \caption{Sentiment transfer on $\mathtt{Yelp}$. (Up: From negative to positive, Down: From positive to negative.)}
    \label{table:sentiment_sample}
\end{table}
\fi 
                
\iffalse            
\begin{table}[t!]
\caption{Sentiment Transfer on $\mathtt{Yelp}$.}
\label{table:sentiment}
\vskip 0.15in
\begin{center}
\begin{small}
\begin{sc}
\begin{tabular}{c|c|c|c|c|c|c}
\toprule
Model & Acc\! $\uparrow$ & BLEU\! $\uparrow$ & PPL\! $\downarrow$ & RPPL\! $\downarrow$ & Flu\! $\uparrow$ & Sim\! $\uparrow$ \tabularnewline
\midrule
ARAE & \textbf{95} & 32.5 & 6.8 & 395 & 3.6 & 3.5\tabularnewline
iVAE$_{\text{MI}}$ & 92 & \textbf{36.7} & \textbf{6.2} & \textbf{285} & \textbf{3.8} & \textbf{3.9}\tabularnewline
\bottomrule
\end{tabular}
\end{sc}
\end{small}
\end{center}
\vskip -0.1in
\end{table}      
\fi 

\iffalse 
\begin{table}[t]
\caption{Dialog response generation on two datasets. }
\label{table:dialogue}
\label{table:sentiment}
\vskip 0.15in
\begin{center}
\begin{small}
\begin{sc}
\begin{tabular}{c|c|c|c|c}
\toprule
Metrics & SeqGAN & CVAE & WAE & iVAE$_{\text{MI}}$\tabularnewline
\hline
\multicolumn{5}{c}{Dataset: $\mathtt{Switchboard}$}\tabularnewline
\hline
BLEU-R\! $\uparrow$ & 0.282 & 0.295 & 0.394 & \textbf{0.427}\tabularnewline
BLEU-P\! $\uparrow$ & \textbf{0.282} & 0.258 & 0.254 & 0.254\tabularnewline
BLEU-F1\! $\uparrow$ & 0.282 & 0.275 & 0.309 & \textbf{0.319}\tabularnewline
BOW-A\! $\uparrow$ & 0.817 & 0.836 & 0.897 & \textbf{0.930}\tabularnewline
BOW-E\! $\uparrow$ & 0.515 & 0.572 & 0.627 & \textbf{0.670}\tabularnewline
BOW-G\! $\uparrow$ & 0.748 & 0.846 & 0.887 & \textbf{0.900}\tabularnewline
Intra-dist1\! $\uparrow$ & 0.705 & 0.803 & 0.713 & \textbf{0.828}\tabularnewline
Intra-dist2\! $\uparrow$ & 0.521 & 0.415 & 0.651 & \textbf{0.692}\tabularnewline
Inter-dist1\! $\uparrow$ & 0.070 & 0.112 & 0.245 & \textbf{0.391}\tabularnewline
Inter-dist2\! $\uparrow$ & 0.052 & 0.102 & 0.413 & \textbf{0.668}\tabularnewline
\hline
\multicolumn{5}{c}{Dataset: $\mathtt{Dailydialog}$}\tabularnewline
\hline
BLEU-R\! $\uparrow$ & 0.270 & 0.265 & 0.341 & \textbf{0.355}\tabularnewline
BLEU-P\! $\uparrow$ & 0.270 & 0.222 & \textbf{0.278} & 0.239\tabularnewline
BLEU-F1\! $\uparrow$ & 0.270 & 0.242 & \textbf{0.306} & 0.285\tabularnewline
BOW-A\! $\uparrow$ & 0.907 & 0.923 & 0.948 & \textbf{0.951}\tabularnewline
BOW-E\! $\uparrow$ & 0.495 & 0.543 & 0.578 & \textbf{0.609}\tabularnewline
BOW-G\! $\uparrow$ & 0.774 & 0.811 & 0.846 & \textbf{0.872}\tabularnewline
Intra-dist1\! $\uparrow$ & 0.747 & \textbf{0.938} & 0.830 & 0.897\tabularnewline
Intra-dist2\! $\uparrow$ & 0.806 & 0.973 & 0.940 & \textbf{0.975}\tabularnewline
Inter-dist1\! $\uparrow$ & 0.075 & 0.177 & 0.327 & \textbf{0.501}\tabularnewline
Inter-dist2\! $\uparrow$ & 0.081 & 0.222 & 0.583 & \textbf{0.868}\tabularnewline
\bottomrule
\end{tabular}
\end{sc}
\end{small}
\end{center}
\vskip -0.1in
\end{table}  
\fi 

\begin{table*}[t]
\setlength{\tabcolsep}{1.5pt} 
\caption{Dialog response generation on DailyDialog. }
\label{table:dialogueomd}
\label{table:sentimentmd}
\vskip 0.15in
\begin{center}
\begin{small}
\begin{sc}
\begin{tabular}{ccccccccccc}
\toprule
Metrics & BLEU-R & BLEU-P & BLEU-F1 & BOW-A & BOW-E & BOW-G & Intra-dist1 & Intra-dist2 & Inter-dist1 & Inter-dist2 \tabularnewline
% Metrics & SeqGAN & CVAE & WAE & iVAE$_{\text{MI}}$\tabularnewline
\midrule 
\iffalse 
\multicolumn{11}{c}{Dataset: $\mathtt{Switchboard}$}\tabularnewline
\hline
SeqGAN  & 0.282 & \textbf{0.282} & 0.282 & 0.817 & 0.515 & 0.748 & 0.705 & 0.521 & 0.070 & 0.052 \tabularnewline
CVAE  & 0.295 & 0.258 & 0.275 & 0.836 & 0.572 & 0.846 & 0.803 & 0.415 & 0.112 & 0.102 \tabularnewline
WAE  & 0.394 & 0.254 & 0.309 & 0.897 & 0.627 & 0.887 & 0.713 & 0.651 & 0.245 & 0.413 \tabularnewline
iVAE$_{\text{MI}}$  & \textbf{0.427} & 0.254 & \textbf{0.319} & \textbf{0.930} & \textbf{0.670} & \textbf{0.900} & \textbf{0.828} & \textbf{0.692} & \textbf{0.391} & \textbf{0.668} \tabularnewline
\hline
\multicolumn{11}{c}{Dataset: $\mathtt{Dailydialog}$}\tabularnewline
\fi 
SeqGAN  & 0.270 & 0.270 & 0.270 & 0.907 & 0.495 & 0.774 & 0.747 & 0.806 & 0.075 & 0.081 \tabularnewline
CVAE  & 0.265 & 0.222 & 0.242 & 0.923 & 0.543 & 0.811 & \textbf{0.938} & 0.973 & 0.177 & 0.222 \tabularnewline
WAE  & 0.341 & 0.278 & \textbf{0.306} & 0.948 & 0.578 & 0.846 & 0.830 & 0.940 & 0.327 & 0.583 \tabularnewline
iVAE$_{\text{MI}}$  & \textbf{0.355} & 0.239 & 0.285 & \textbf{0.951} & \textbf{0.609} & \textbf{0.872} & 0.897 & 0.975 & \textbf{0.501} & \textbf{0.868} \tabularnewline
\midrule 
Ours & 0.306 & \bf 0.286 & 0.296 & 0.936 & 0.515 & 0.824 & 0.893 & \bf 0.983 & 0.141 & 0.184 \\
\bottomrule
\end{tabular}
\end{sc}
\end{small}
\end{center}
\vskip -0.1in
\end{table*} 


\begin{table*}[t]
\caption{Dialog response generation on two datasets. }
\label{table:dialogueomd}
\label{table:sentimentmd}
\vskip 0.15in
\begin{center}
\begin{small}
\begin{sc}
\begin{tabular}{c|c|c|c|c|c|c|c|c|c|c}
\toprule
Metrics & BLEU-R & BLEU-P & BLEU-F1 & BOW-A & BOW-E & BOW-G & Intra-dist1 & Intra-dist2 & Inter-dist1 & Inter-dist2 \tabularnewline
% Metrics & SeqGAN & CVAE & WAE & iVAE$_{\text{MI}}$\tabularnewline
\hline
\multicolumn{11}{c}{Dataset: $\mathtt{Switchboard}$}\tabularnewline
\hline
SeqGAN  & 0.282 & \textbf{0.282} & 0.282 & 0.817 & 0.515 & 0.748 & 0.705 & 0.521 & 0.070 & 0.052 \tabularnewline
CVAE  & 0.295 & 0.258 & 0.275 & 0.836 & 0.572 & 0.846 & 0.803 & 0.415 & 0.112 & 0.102 \tabularnewline
WAE  & 0.394 & 0.254 & 0.309 & 0.897 & 0.627 & 0.887 & 0.713 & 0.651 & 0.245 & 0.413 \tabularnewline
iVAE$_{\text{MI}}$  & \textbf{0.427} & 0.254 & \textbf{0.319} & \textbf{0.930} & \textbf{0.670} & \textbf{0.900} & \textbf{0.828} & \textbf{0.692} & \textbf{0.391} & \textbf{0.668} \tabularnewline
\hline
\multicolumn{11}{c}{Dataset: $\mathtt{Dailydialog}$}\tabularnewline
\hline
SeqGAN  & 0.270 & 0.270 & 0.270 & 0.907 & 0.495 & 0.774 & 0.747 & 0.806 & 0.075 & 0.081 \tabularnewline
CVAE  & 0.265 & 0.222 & 0.242 & 0.923 & 0.543 & 0.811 & \textbf{0.938} & 0.973 & 0.177 & 0.222 \tabularnewline
WAE  & 0.341 & \textbf{0.278} & \textbf{0.306} & 0.948 & 0.578 & 0.846 & 0.830 & 0.940 & 0.327 & 0.583 \tabularnewline
iVAE$_{\text{MI}}$  & \textbf{0.355} & 0.239 & 0.285 & \textbf{0.951} & \textbf{0.609} & \textbf{0.872} & 0.897 & \textbf{0.975} & \textbf{0.501} & \textbf{0.868} \tabularnewline


\bottomrule
\end{tabular}
\end{sc}
\end{small}
\end{center}
\vskip -0.1in
\end{table*}   


\subsection{Effect of Dropout Rates}

\subsection{Effect of Batch Size} 
