\subsection{Data Statistics for Language Modeling}
\input{tables/datastat}
Table~\ref{tab:stat} shows the data statistics for three language modeling benchmarks.

\subsection{Sentence Interpolation} 
Table~\ref{tab:interpolation} shows sentence interpolation~\citep{bowman2015generating} results for two example sentences. As with a VAE, our model can generate sentences by interpolating between latent semantic vectors. Given two sentences $\mathbf{x}_1$ and $\mathbf{x}_2$, we obtain vectors $\mathbf{z}_1$ and $\mathbf{z}_2$ by averaging samples from $Q_\phi(\mathbf{z}|\mathbf{x}_1)$ and $Q_\phi(\mathbf{z}|\mathbf{x}_2)$, respectively. We then form a new latent vector $\mathbf{z} = \lambda \mathbf{z}_1 + (1-\lambda) \mathbf{z}_2$, which the decoder uses to produce a sentence with mixed semantics. We vary $\lambda$ from 0 to 1 with a step size of 0.1. As shown in Table~\ref{tab:interpolation}, the generated sentences transition smoothly between the semantics of the two input sentences.
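
The interpolation procedure described above can be written compactly as follows. This is only a sketch: the number of samples $K$ drawn per sentence is an assumed symbol introduced here for illustration and is not specified in the text.
\[
\mathbf{z}_i = \frac{1}{K} \sum_{k=1}^{K} \mathbf{z}_i^{(k)}, \qquad \mathbf{z}_i^{(k)} \sim Q_\phi(\mathbf{z}|\mathbf{x}_i), \qquad i \in \{1, 2\},
\]
\[
\mathbf{z}(\lambda) = \lambda \mathbf{z}_1 + (1-\lambda) \mathbf{z}_2, \qquad \lambda \in \{0, 0.1, \ldots, 1\}.
\]
Under this form, the endpoints recover the two averaged latent vectors, $\mathbf{z}(0) = \mathbf{z}_2$ and $\mathbf{z}(1) = \mathbf{z}_1$, while $\mathbf{z}(0.5) = \frac{1}{2}(\mathbf{z}_1 + \mathbf{z}_2)$ is their midpoint.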


\begin{table}[!t] 
 \caption{Interpolation of latent representation.}
    \label{tab:interpolation}
    %\vspace{-2em}
\footnotesize 
  %\centering
    \begin{tcolorbox}
      %\hspace{-5mm}
      \setlength{\tabcolsep}{2.5pt}
    \begin{tabular}{l l}
$\lambda=0$ & there was \$ N billion and more interest income \\
$\lambda=0.1$ & there was \$ N billion and more interest in 30-year \\
$\lambda=0.2$ & there was \$ N billion and more than \$ N billion \\
$\lambda=0.3$ & there was \$ N million more than two days \\
$\lambda=0.4$ & there was \$ N million more than in chicago \\
$\lambda=0.5$ & we had \$ N million in the stock market \\
$\lambda=0.6$ &  we had \$ N million in the latest period \\
$\lambda=0.7$ & we had \$ N million in the latest period \\
$\lambda=0.8$ &  we 'll N years in its latest day \\
$\lambda=0.9$ &  i went in N with a few months ago \\
$\lambda=1$ & i went in N with a few months in N \\
    \end{tabular}
    \end{tcolorbox}%
    \vspace{-2em}
\end{table}

\subsection{Effect of Different Dropout Rates}

\input{figtexs/dropout}

We use PTB to study the effect of dropout by varying the dropout rate from 0.1 to 0.9 with a step size of 0.1. Figure~\ref{fig:dropout} shows the PPL and ELBO values for each dropout rate. We find that a dropout rate of 0.3 gives the best performance. Empirically, performance improves as the dropout rate increases from 0.1 to 0.3, remains stable for rates in the range $[0.3, 0.6]$, and starts to degrade for large dropout rates ($\geq 0.7$).

\input{tables/dd}