\section{Method}
\input{figtexs/modelrel}
%\includegraphics{fig2}
Figure~\ref{fig:modelrel} shows the models investigated in this paper; the blue boxes denote our proposed models. We introduce the VAE baseline in Sec 3.1. In Sec 3.2, we describe iVAE, which equips a VAE with a sample-based posterior distribution, and a variant of iVAE (iVAE$_\textsc{mi}$) that additionally considers the mutual information between the input data and the latent variable. In Sec 3.3, we first present two models, \mymodel\ and \mymodel$_\textsc{mi}$, which integrate contrastive learning over latent variables for text generation into iVAE and iVAE$_\textsc{mi}$, respectively. We then discuss \mymodelp\ and \mymodelp$_\textsc{mi}$, which address the explosion problem of the approximated KL divergence with a circle-loss-inspired enhancement.

\subsection{Variational Auto-Encoder Baseline}
\label{sec:vae}
Formally, given a sentence $\mathbf{x}=\{x_1, x_2, \dots, x_T\}$, where $T$ is the length of $\mathbf{x}$, an auto-regressive language model learns the parameters $\bm{\theta}$ of a neural network that maximize the log-likelihood
$\log P(\mathbf{x}; \bm{\theta}) = \sum_{i=1}^T \log P(x_i | x_1, \dots, x_{i-1}; \bm{\theta})$. A variational auto-encoder (VAE) introduces a random latent variable $\zv$ and models $P(\mathbf{x}; \bm{\theta}) = \int_{\zv} P(\zv; \bm{\theta})P( \mathbf{x} | \zv; \bm{\theta}) d\zv$.
To generate a sentence $\mathbf{x}$, a $\zv$ is first sampled from the \emph{prior distribution} $P(\zv; \bm{\theta})$ and then a \emph{decoder network} is used to produce $\mathbf{x}$ from $\zv$ according to $P(\mathbf{x}|\zv; \bm{\theta})$. In this paper, $\zv$ is continuous, so $\log P(\mathbf{x}; \bm{\theta})$ becomes intractable due to the integration over $\zv$. To train $\bm{\theta}$, a variational posterior distribution $Q(\zv|\xv; \bm{\phi})$ over $\zv$ is introduced to approximate the true posterior $P(\zv|\xv; \bm{\theta})$.
To ensure that a $\zv$ drawn from the prior $P(\zv; \bm{\theta})$ can encode a meaningful semantic representation of the input sentence, $Q(\zv|\mathbf{x}; \bm{\phi})$ is encouraged to match $P(\zv; \bm{\theta})$ during training. Based on $Q(\zv|\mathbf{x}; \bm{\phi})$, an evidence lower bound (ELBO)~\citep{bowman2015generating} on $\log P(\mathbf{x}; \bm{\theta})$ is defined as:
\begin{align}
\small 
&\textsc{Elbo}(\mathbf{x}; \bm{\phi}, \bm{\theta}) \label{eq:elbo} \\
%&= \mathbb{E}_{\zv \sim Q(\zv|\mathbf{x}; \bm{\phi})} \Big[ \log \frac{P(\mathbf{x}, \zv; \bm{\theta})}{Q(\zv; \bm{\phi})}\Big] \nonumber \\
%&= \mathbb{E}_{\zv \sim Q(\zv|\mathbf{x}; \bm{\phi})} \Big[\log \frac{P(\mathbf{x}|\zv; \bm{\theta})P(\zv; \bm{\theta})}{Q(\zv; \bm{\phi})}\Big] \nonumber\\
&= \underbrace{\mathbb{E}_{\zv \sim Q(\zv|\mathbf{x}; \bm{\phi})} \log P(\mathbf{x}|\zv; \bm{\theta})}_\text{reconstruction  loss} - \underbrace{\KL\Big(Q(\zv|\mathbf{x}; \bm{\phi}),  P(\zv; \bm{\theta})\Big)}_\text{regularizer}.\nonumber
\end{align}
$\bm{\theta}$ and $\bm{\phi}$ are jointly trained by maximizing the ELBO in Eq \ref{eq:elbo}. The first term is a reconstruction term: $\zv$ is first sampled from $Q(\zv|\mathbf{x}; \bm{\phi})$, which is called an \emph{encoder network} or a \emph{recognition network}, and is then required to reconstruct $\mathbf{x}$ by maximizing $\log P(\mathbf{x}|\zv; \bm{\theta})$. The second term of Eq \ref{eq:elbo} is the Kullback-Leibler (KL) divergence, which regularizes the posterior distribution to be close to the prior distribution. The posterior and the prior are often assumed to belong to the same parametric family with different parameters, which eases the learning burden. Ideally, a good model maximizes the first term and minimizes the second term. When the KL term tends to 0, $Q(\zv|\mathbf{x}; \bm{\phi})$ degenerates to $P(\zv; \bm{\theta})$ and posterior collapse occurs. This is common in text generation because a strong auto-regressive decoder can ignore $\zv$ entirely and generate $x_{i}$ solely from $[x_1, \dots, x_{i-1}]$~\citep{Fu2019CyclicalAS}.
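
The gap between the log-likelihood and Eq \ref{eq:elbo} can be made explicit. The identity below is the standard decomposition for latent-variable models, written in the notation of this section; it is included only as a reminder and assumes that the true posterior $P(\zv|\mathbf{x}; \bm{\theta})$ admits a density:
\begin{align}
&\log P(\mathbf{x}; \bm{\theta}) - \textsc{Elbo}(\mathbf{x}; \bm{\phi}, \bm{\theta}) \nonumber \\
&= \KL\Big(Q(\zv|\mathbf{x}; \bm{\phi}), P(\zv|\mathbf{x}; \bm{\theta})\Big) \geq 0. \nonumber
\end{align}
Under this assumption, the bound is tight exactly when the variational posterior matches the true posterior.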
\subsection{Implicit VAE Baseline (iVAE)}
\label{sec:ivae}
 $Q(\zv | \xv; \bm{\phi})$ is typically assumed to be a multivariate Gaussian distribution for computational convenience. However, a Gaussian may not be expressive enough to capture the rich semantics of natural sentences in the latent space. Implicit VAEs (iVAE) introduce a sample-based posterior representation that does not depend on an explicit density form: $Q(\zv|\xv; \bm{\phi})$ is represented by a set of samples $\{\zv_{\xv, i}\}_{i=1}^{S}$, where $S$ is the sample size. The $i$-th sample is given by
\begin{equation}
\small 
   \bm{\xi}_i \sim \mathcal{N}(\bm{\xi}), \quad \zv_{\xv,i} = \textsc{Enc}(\xv,\bm{\xi}_i;\bm{\phi}),
\label{eq:implicitz}
\end{equation}
where $\mathcal{N}(\bm{\xi})$ is the standard Gaussian and $\textsc{Enc}(\xv,\bm{\xi}_i;\bm{\phi})$ is a noise-aware encoder network. The sentence representation $\mathbf{h}_T$ of $\xv$ is generated by an LSTM encoder~\citep{hochreiter1997long}; $\mathbf{h}_T$ is then concatenated with $\bm{\xi}_i$ and passed through a multi-layer perceptron (MLP) to produce $\zv_{\xv,i}$.

%Although  the sample based posterior representation grants theoretical flexibility for the distributions of latent variables in high-dimensional spaces, the KL divergence between $Q(\zv|\xv; \bm{\phi})$ and $P(\zv; \bm{\theta})$ defined in ELBO of Eq \ref{eq:elbo} is intractable now. 
The KL divergence in Eq \ref{eq:elbo} becomes intractable under the sample-based posterior representation, since $Q(\zv|\xv; \bm{\phi})$ no longer has an explicit density.
Following \cite{Fang_iLVM_2019_EMNLP}, we instead compute its dual form, obtained from the Fenchel duality theorem~\citep{rockafellar1966extension, NIPS2018_dai} by introducing an auxiliary function $f(\xv,\zv; \psiv)$,
\begin{align}
\small
 &\KL\left(Q(\zv|\xv;\bm{\phi}), P(\zv; \bm{\theta})\right) \label{eq:dualkl} \\
=&\text{max}_{f} \mathbb{E}_{\zv\sim Q(\zv|\xv; \bm{\phi})}f(\xv,\zv; \psiv)-\mathbb{E}_{\zv\sim P(\zv; \bm{\theta})} \text{exp}(f(\xv,\zv; \psiv)),\nonumber
\end{align}
where $f$ outputs a real value and $\psiv$ denotes the parameters of $f$. %$\zv$ in $f$ is either drawn from $Q(\zv|\xv; \bm{\phi})$ or $P(\zv; \bm{\theta})$.  
$f$ is implemented as an MLP that distinguishes between $(\xv, \zv_Q)$ and $(\xv, \zv_P)$, where $\zv_Q$ and $\zv_P$ denote latent samples drawn from the posterior and the prior distributions, respectively. Using Eq \ref{eq:dualkl}, the ELBO can be written as
\begin{align}
\small 
 \mathcal{L}_{\text{iVAE}}
&=  \mathbb{E}_{\zv\sim Q(\zv|\xv;\bm{\phi})}\log P(\xv|\zv; \bm{\theta}) \nonumber \\
  &- \mathbb{E}_{\zv\sim Q(\zv|\xv; \bm{\phi})}f(\xv,\zv; \psiv) \nonumber \\
  &+ \mathbb{E}_{\zv\sim P(\zv; \bm{\theta})} \exp(f(\xv,\zv; \psiv)).
  \label{eq:ivae}
\end{align}
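As a quick check on Eq \ref{eq:dualkl}, the inner maximization can be solved pointwise. The sketch below assumes that $f$ is unrestricted (not limited to an MLP family) and that both distributions admit densities; $f^{*}$ is notation introduced here for the maximizer, and $Q$ and $P$ abbreviate $Q(\zv|\xv;\bm{\phi})$ and $P(\zv;\bm{\theta})$. Setting the derivative of $Q f - P \exp(f)$ with respect to $f$ to zero gives
\begin{align}
&f^{*}(\xv,\zv) = \log \frac{Q(\zv|\xv;\bm{\phi})}{P(\zv;\bm{\theta})}, \nonumber \\
&\mathbb{E}_{\zv\sim Q}f^{*}(\xv,\zv) - \mathbb{E}_{\zv\sim P}\exp(f^{*}(\xv,\zv)) \nonumber \\
&= \KL\big(Q, P\big) - 1, \nonumber
\end{align}
since $\mathbb{E}_{\zv\sim P}[Q/P] = 1$. Under these assumptions, the dual objective as written recovers the KL divergence up to an additive constant, which does not affect the gradients with respect to $\bm{\phi}$, and the optimal $f$ acts as a log density-ratio estimator between posterior and prior samples. With a finite-capacity $f$ and finitely many samples, the estimate can deviate from this value.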

Considering a dataset $D=\{\xv_i\}_{i=1}^n$, the training objectives of VAE and iVAE over the whole dataset are:
\begin{align}
\small 
\mathcal{L}_\textsc{VAE} = &\mathbb{E}_{\xv \sim D }\Big[\mathbb{E}_{\zv \sim Q(\zv|\mathbf{x}; \bm{\phi})} \log P(\mathbf{x}|\zv; \bm{\theta}) \Big] \nonumber \\
&- \mathbb{E}_{\xv \sim D }\Big[\textsc{KL}(Q(\zv|\mathbf{x}; \bm{\phi}),  P(\zv; \bm{\theta}))\Big],
\label{eq:lvaed}\\
\mathcal{L}_\textsc{iVAE} = &\mathbb{E}_{\xv \sim D}\Big[ \mathbb{E}_{\zv\sim Q(\zv|\xv, \bm{\phi})}\text{log} P(\xv|\zv, \bm{\theta})\Big] \nonumber\\
  &- \mathbb{E}_{\xv \sim D}\Big[\mathbb{E}_{\zv\sim Q(\zv|\xv, \bm{\phi})}f(\xv,\zv, \psiv) \nonumber \\
  &+ \mathbb{E}_{\zv\sim P(\zv; \bm{\theta})} \text{exp}(f(\xv,\zv, \psiv))\Big].   \label{eq:livaed}
\end{align}

\textbf{Mutual Information Regularized iVAE (iVAE$_\textsc{MI}$)}
In Eq~\ref{eq:elbo}, the KL term forces the data-dependent posterior $Q(\zv|\xv; \bm{\phi})$ of every sentence to match the same data-agnostic prior distribution $P(\zv; \bm{\theta})$. A variant of iVAE instead matches the aggregated posterior $Q(\zv; \bm{\phi})$ to the prior $P(\zv; \bm{\theta})$, where $Q(\zv; \bm{\phi}) = \int Q(\xv) Q(\zv|\xv; \bm{\phi}) d\xv $ and $Q(\xv)$ is the empirical data distribution. With the aggregated posterior $Q(\zv; \bm{\phi})$, the latent space can be better regularized, since the posterior distributions of all sentences jointly match the prior.
%To improve the efficiency of VAE, some efforts have been made to explore data-dependent priors~\citep{dai-etal-2021-apo,ding-gimpel-2021-flowprior} or matching the aggregated posterior $Q(\zv; \bm{\phi})$ with $P(\zv; \bm{\theta})$, where $Q(\zv; \bm{\phi}) = \int Q(\xv) Q(\zv|\xv; \bm{\phi}) d\xv $ and $Q(\xv)$ is the empirical data distribution. T
%he data-dependent prior requires an additional network to generate a prior distribution from input data, which is not considered in this work.  
Given a dataset $D=\{\xv_i\}_{i=1}^n$,  replacing the expected KL term  $\mathbb{E}_{\xv \sim D} \textsc{KL}(Q(\zv|\mathbf{x}; \bm{\phi}),  P(\zv; \bm{\theta}))$ with $\textsc{KL}(Q(\zv;\bm{\phi}),  P(\zv; \bm{\theta}))$, the new training objective is, 
\begin{align}
\small 
\mathcal{L}_\textsc{iVAE$_\textsc{mi}$} &= \mathbb{E}_{\xv \sim D}\Big[ \mathbb{E}_{\zv\sim Q(\zv|\xv; \bm{\phi})}\text{log} P(\xv|\zv; \bm{\theta})\Big] \label{eq:livaemid}\\
  &- \mathbb{E}_{\zv\sim Q(\zv; \bm{\phi})}g(\zv, \psiv) + \mathbb{E}_{\zv \sim P(\zv; \bm{\theta})} \text{exp}(g(\zv, \psiv)),   \nonumber
\end{align}
where $g$, like $f$, is a function that produces a real value. The dual form of $\textsc{KL}(Q(\zv;\bm{\phi}),  P(\zv; \bm{\theta}))$ is $\max_g \mathbb{E}_{\zv\sim Q(\zv; \bm{\phi})}g(\zv, \psiv) - \mathbb{E}_{\zv \sim P(\zv; \bm{\theta})}\text{exp}(g(\zv, \psiv))$. Unlike $f$, $g$ only takes $\zv$ as input instead of the concatenation of $\xv$ and $\zv$. Samples from $Q(\zv; \bm{\phi})$ can be generated by ancestral sampling: $\xv$ is first sampled from $D$, and $\zv$ is then sampled from $Q(\zv|\xv; \bm{\phi})$. \cite{Fang_iLVM_2019_EMNLP} show that optimizing the aggregated-posterior-based KL term amounts to maximizing the mutual information $I(\xv, \zv)$ under the joint distribution $Q(\xv, \zv; \bm{\phi})$. We take the above model (\textbf{iVAE$_\textsc{mi}$}) as our main baseline.
%Thus, this version is also named as iVAE$_\textsc{mi}$, short for mutual information regularized iVAE. 
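
The effect of this replacement can be seen from a standard decomposition of the expected KL term, sketched here in the notation of this section with $Q(\xv, \zv; \bm{\phi}) = Q(\xv)Q(\zv|\xv; \bm{\phi})$; the symbol $I_Q(\xv, \zv)$ is introduced here for the mutual information under this joint distribution:
\begin{align}
&\mathbb{E}_{\xv \sim D}\Big[\textsc{KL}(Q(\zv|\xv; \bm{\phi}), P(\zv; \bm{\theta}))\Big] \nonumber \\
&= I_Q(\xv, \zv) + \textsc{KL}(Q(\zv; \bm{\phi}), P(\zv; \bm{\theta})). \nonumber
\end{align}
Penalizing only the second term on the right-hand side therefore no longer penalizes the mutual information between $\xv$ and $\zv$, which is consistent with the mutual-information interpretation above.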


\subsection{Contrastive Learning over Latent Variables for Text Generation}
As shown in Figure \ref{fig:clvae}, the overall structure of our proposed model is based on the iVAE framework. It contains three main components: an encoder, a contrastive learning module, and a decoder.
The encoder follows iVAE, except that dropout is enabled, and the decoder is identical to that of iVAE. The contrastive learning module is our main design. In this section, we show how to integrate contrastive learning over latent variables with iVAE.

%Contrastive learning over latent variables aims to increase the semantic similarities among semantically close posterior samples and decrease them among semantically distant samples~\citep{hadsell2006dimensionality}. 
Formally, given the latent sample $\zv_i$ of the $i$-th input sentence, its positive latent sample $\zv_p = \{\zv_i^+\}$, which is semantically similar to $\zv_i$, and its $M$ negative latent samples $\zv_n=\{\zv_{i,j}^-\}_{j=1}^M$, which are semantically far from $\zv_i$, the contrastive training objective for the $i$-th sentence is to maximize:
\begin{equation}
\small 
\label{eq:clobjective}
\ell_i = \log \frac{\exp(\textsc{Sim}(\mf{z}_i, \mf{z}^+_i) / \tau )}{\sum_{\mf{z}\in \{\zv_p \cup \zv_n\}}\exp(\textsc{Sim}(\mf{z}_i, \mf{z}) /\tau)},
\end{equation}
where $\tau$ is a temperature hyper-parameter and $\textsc{Sim}(\zv_1, \zv_2)$ measures the semantic similarity between $\zv_1$ and $\zv_2$. Specifically, we use cosine similarity as the similarity metric,
\begin{equation}
\small 
    \textsc{Sim}(\mathbf{z}_1, \mathbf{z}_2) = \frac{\mf{z}_1^\mr{T} \mf{z}_2}{\Vert \mf{z}_1\Vert_2 \cdot \Vert \mf{z}_2\Vert_2},
    \label{eq:sim}
\end{equation}
where $\Vert \mf{z}\Vert_2$ denotes the Euclidean ($\ell_2$) norm of $\zv$.
Following~\cite{Chen2020ASF} and \cite{Gao2021SimCSESC}, we consider batch-based contrastive learning. For each sentence $\xv_i$, we follow \cite{Gao2021SimCSESC} and use dropout as data augmentation: we encode the input sentence twice with dropout enabled and obtain two different views $\xv_i$ and $\xv_i^+$ of the same sentence. $\xv_i$ is regarded as the anchor sentence and $\xv_i^+$ as its positive sample. The remaining sentences in the same batch are treated as the negative samples of $\xv_i$, so the number of negative examples is $M=N-1$, where $N$ is the batch size. To obtain the corresponding posterior latent variables, we sample $\zv_i \sim Q(\zv |\xv_i; \bm{\phi})$ and $\zv_i^+ \sim Q(\zv |\xv_i^+; \bm{\phi})$. The posterior samples $\zv_j \sim Q(\zv |\xv_j; \bm{\phi})$ with $j \neq i$ serve as the negative samples of $\zv_i$. Denoting $\mathbf{u}_i = \{\mf{z}_i^+\} \cup\{\mf{z}_{i,j}^-\}_{j=1}^M$, the overall contrastive learning loss for the whole batch is
\begin{equation}
\small
    \mathcal{L}_\textsc{cl} = \frac{1}{N} \sum_{i=1}^N \log \frac{\exp(\textsc{Sim}(\mf{z}_i, \mf{z}^+_i) / \tau )}{\sum_{\mf{z}\in \mathbf{u}_i}\exp(\textsc{Sim}(\mf{z}_i, \mf{z}) /\tau)}.
\label{eq:clloss}
\end{equation}
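To make the scale of the per-sentence term in Eq \ref{eq:clobjective} concrete, it can be rewritten in terms of similarity gaps. The symbols $s^+$ and $s^-_j$ below are shorthand introduced here for $\textsc{Sim}(\zv_i, \zv_i^+)$ and $\textsc{Sim}(\zv_i, \zv_{i,j}^-)$, and the numbers are purely illustrative:
\begin{align}
\ell_i &= -\log\Big(1 + \sum_{j=1}^{M} \exp\big((s^-_j - s^+)/\tau\big)\Big) \leq 0. \nonumber
\end{align}
For instance, with a single negative ($M=1$), $s^+=0.9$, $s^-_1=0.1$ and $\tau=0.5$, we obtain $\ell_i = -\log(1+e^{-1.6}) \approx -0.18$. Because cosine similarity lies in $[-1, 1]$, each exponent is bounded by $2/\tau$ in magnitude, so a smaller $\tau$ sharpens the contrast between the positive and the negative samples.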
Adding the contrastive loss into $\mathcal{L}_\textsc{iVAE}$ and $\mathcal{L}_{\textsc{iVAE}_\textsc{MI}}$, we have
\begin{align}
\small 
 \mathcal{L}_{\textsc{clVAE}} &=  \mathcal{L}_\textsc{iVAE} + \mathcal{L}_\textsc{cl}, 
  \label{eq:clvae}
  \\
\mathcal{L}_{\textsc{clVAE}_\textsc{MI}} 
&=  \mathcal{L}_{\textsc{iVAE}_\textsc{MI}} + \mathcal{L}_\textsc{cl}.
  \label{eq:clvaemi}
\end{align}
\paragraph{Improved Dual Function} $\KL(Q(\zv|\xv; \bm{\phi}), P(\zv; \bm{\theta}))$ is non-negative by definition. However, the approximation by the dual function in Eq \ref{eq:dualkl} cannot guarantee this property, since $\mathbb{E}_{\zv\sim Q(\zv|\xv; \bm{\phi})}f(\xv,\zv; \psiv)$ can be smaller than $\mathbb{E}_{\zv\sim P(\zv; \bm{\theta})} \text{exp}(f(\xv,\zv; \psiv))$. To encourage the KL approximation to be non-negative, we impose a new constraint $\mathbb{E}_{\zv\sim Q(\zv|\xv; \bm{\phi})}f(\xv,\zv; \psiv) > 0 > \mathbb{E}_{\zv\sim P(\zv; \bm{\theta})} f(\xv,\zv; \psiv)$. Using this constraint, we define a new approximation inspired by circle loss~\citep{Sun2020CircleLA},
\begin{align}
\small
 &\KL\left(Q(\zv|\xv;\bm{\phi}), P(\zv; \bm{\theta})\right) \label{eq:newdualkl} \\
=&\min_{f} \log \Big(1 + \exp\big(-\mathbb{E}_{\zv\sim Q(\zv|\xv; \bm{\phi})}f(\xv,\zv; \psiv)\big)\Big)\nonumber \\
&+ \log\Big(1 + \exp\big(\mathbb{E}_{\zv\sim P(\zv; \bm{\theta})} \text{exp}(f(\xv,\zv; \psiv))\big)\Big).\nonumber
\end{align}
Minimizing Eq \ref{eq:newdualkl} pushes $\mathbb{E}_{\zv\sim Q(\zv|\xv; \bm{\phi})}f(\xv,\zv; \psiv)$ well above 0 and $\mathbb{E}_{\zv\sim P(\zv; \bm{\theta})}f(\xv,\zv; \psiv)$ well below 0. Similarly, $\KL(Q(\zv; \bm{\phi}), P(\zv; \bm{\theta}))$ can be approximated as,
\begin{align}
\small
 &\KL\left(Q(\zv; \bm{\phi}), P(\zv; \bm{\theta})\right) \label{eq:newdualkl2} \\
=&\min_{g} \log \Big(1 + \exp\big(-\mathbb{E}_{\zv\sim Q(\zv; \bm{\phi})}g(\zv; \psiv)\big)\Big)\nonumber \\
&+ \log\Big(1 + \exp\big(\mathbb{E}_{\zv\sim P(\zv; \bm{\theta})} \text{exp}(g(\zv; \psiv))\big)\Big).\nonumber
\end{align}
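A short check of the range of these surrogates uses only the fact that $\exp(\cdot) > 0$. The shorthand $a$ and $b$ is introduced here for the two expectations in Eq \ref{eq:newdualkl}, i.e. $a = \mathbb{E}_{\zv\sim Q(\zv|\xv; \bm{\phi})}f(\xv,\zv; \psiv)$ and $b = \mathbb{E}_{\zv\sim P(\zv; \bm{\theta})}\exp(f(\xv,\zv; \psiv))$:
\begin{align}
&\log\big(1 + \exp(-a)\big) > 0, \nonumber \\
&\log\big(1 + \exp(b)\big) > \log 2 \quad \text{since } b > 0. \nonumber
\end{align}
Both terms are therefore strictly positive for any $f$, so the surrogate cannot become negative, unlike the estimate from Eq \ref{eq:dualkl}. The same argument applies to Eq \ref{eq:newdualkl2} with $g$ in place of $f$.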
Using the proposed KL approximation term, the full objective is to maximize 
\begin{align}
\small 
   & \mathcal{L}_{\textsc{clVAE}}^+ 
=  \mathbb{E}_{\zv\sim Q(\zv|\xv, \bm{\phi})}\text{log} P(\xv|\zv, \bm{\theta}) + \mathcal{L}_\textsc{cl} \nonumber \\
  - & \log \Big(1 + \exp\big(-\mathbb{E}_{\zv\sim Q(\zv| \xv; \bm{\phi})}f(\xv, \zv; \psiv)\big)\Big) \nonumber \\
  - &\log\Big(1 + \exp\big(\mathbb{E}_{\zv\sim P(\zv; \bm{\theta})} \text{exp}(f(\xv, \zv; \psiv))\big)\Big),
  \label{eq:clvaeplus}\\
&\mathcal{L}_{\textsc{clVAE}_\textsc{mi}}^+ 
=  \mathbb{E}_{\zv\sim Q(\zv|\xv, \bm{\phi})}\text{log} P(\xv|\zv, \bm{\theta}) + \mathcal{L}_\textsc{cl} \nonumber \\
  - & \log \Big(1 + \exp\big(-\mathbb{E}_{\zv\sim Q(\zv; \bm{\phi})}g(\zv; \psiv)\big)\Big) \nonumber \\
  - &\log\Big(1 + \exp\big(\mathbb{E}_{\zv\sim P(\zv; \bm{\theta})} \text{exp}(g(\zv; \psiv))\big)\Big).
  \label{eq:clvaemiplus}
\end{align}
\textbf{Training Algorithm} Algorithm~\ref{alg:clvae} shows the training algorithm of our proposed method. We first sample a mini-batch of paired random Gaussian noise vectors $\bm{\xi}_i$ and $\bm{\xi}_i^+$. After obtaining a mini-batch of input sentences, we pass them through the LSTM encoder with dropout enabled to produce the latent vectors $\zv_{\xv, i}$ and $\zv_{\xv, i}^+$. Then the contrastive loss defined in Eq \ref{eq:clloss} is calculated, and a mini-batch of paired prior vectors is sampled from $P(\zv; \bm{\theta})$. $\psiv$ in $f(\xv, \zv; \psiv)$ is updated according to Eq \ref{eq:klupdate}. Here we further minimize the difference between the dual-function values of the same input by defining a squared loss $L_\textsc{sq}$, which is given by
\begin{equation}
\small
\label{eq:sq}
    L_\textsc{sq} = \sum_{i} \big(f(\xv_i,\zv_{\xv, i}; \psiv) - f(\xv_i,\zv_{\xv, i}^+; \psiv)\big)^2.
\end{equation}
%Given the same input, the values of the dual function of different posterior samples are forced to reside in a small region.
Given the same input and different $\zv$, the values of the function $f$ are thus encouraged to lie in a small region.
$\psiv$ is fixed afterwards. The encoder parameters $\phiv$ and decoder parameters $\thetav$ are updated according to Eq \ref{eq:modelupdate}. If the dual function $g$ is used instead of $f$, a loss similar to Eq \ref{eq:sq} can be defined using $g$, and Eq \ref{eq:klupdate} and Eq \ref{eq:modelupdate} change accordingly.

\begin{algorithm}[!t]
\SetAlgoNoLine 
%\toprule
\caption{The training algorithm (a single step SGD) for contrastive learning over latent variables for text generation. }
\label{alg:clvae} 
{\bf Input}: The training data set $D$ and  the training epochs $T$\;
{\bf Model parameters}: $\bm{\theta}$, $\bm{\phi}$ and $\bm{\psi}$\;
\While{$ t < T $}{
 1. Sample a mini-batch of $\bm{\xi}_i \sim \mathcal{N}(\bm{\xi}), \bm{\xi}_i^+ \sim \mathcal{N}(\bm{\xi})$\; 
 2. Sample a mini-batch of input sentences $\xv_i\sim D$\;
 3. Generate $\zv_{\xv, i} = \textsc{Enc}(\xv_i,\bm{\xi}_i;\bm{\phi})$ and $\zv_{\xv, i}^+ = \textsc{Enc}(\xv_i,\bm{\xi}_i^+;\bm{\phi})$\;
 4. Calculate $\mathcal{L}_\textsc{cl}$ in Eq \ref{eq:clloss}\; 
 5. Sample a mini-batch of $\zv_i, \zv_i^+ \sim P(\zv;\bm{\theta})$\;
 6. Update ${\psiv}$ in $f(\xv,\zv, \psiv)$ to minimize
 \vspace{-7pt}
  \small 
 {\setlength\belowdisplayskip{-5pt}
\begin{align}
 &L_\textsc{sq} + \log \big(1 +  \exp (\sum_{i} - f(\xv_i,\zv_{\xv, i}; \psiv))\big)\nonumber \\[-2mm]
+ &\log\big(1 + \exp(\sum_{i} \text{exp}(f(\xv_i,\zv_{i}; \psiv)))\big)\nonumber \\[-1.3mm]
+ &\log \big(1 +  \exp (\sum_{i} - f(\xv_i,\zv_{\xv, i}^+; \psiv))\big)\nonumber \\[-1.3mm]
+ &\log\big(1 + \exp(\sum_{i} \text{exp}(f(\xv_i,\zv_{i}^+; \psiv)))\big). \label{eq:klupdate}
\end{align}
}\\[1.2mm]
7. Update parameters $\{\phiv, \thetav\}$ to minimize
 \vspace{-7pt}
  \small 
{\setlength\belowdisplayskip{-7pt}
%\setlength\abovedisplayskip{-7pt}
\begin{align}
\small
  & -\sum_i \text{log} P(\xv_i |\zv_{\xv, i}; \bm{\theta}) - \sum_i \text{log} P(\xv_i |\zv_{\xv, i}^+; \bm{\theta}) - \mathcal{L}_\textsc{cl} \nonumber \\[-0.5mm]
 & + \log \big(1 + \exp\big(-\sum_i f(\xv_i, \zv_{\xv, i}; \psiv))\big) \nonumber\\[-0.5mm]
  & + \log \big(1 + \exp\big(-\sum_i f(\xv_i, \zv_{\xv, i}^+; \psiv))\big). \label{eq:modelupdate}
\end{align}
}\\
8. $ t \leftarrow  t+1$\; }
{\bf Output}: $\bm{\theta}$, $\bm{\phi}$ and $\bm{\psi}$\;
%\bottomrule
%\vspace{2em}
\end{algorithm}