
\subsection{Cost Function and Knowledge Distillation}

To define the cost function $c\left(G\left(\bx_{i}\right),\mathbb{Q}_{m}^{S}\right)$
in Eq. (\ref{eq:OT_Q}), we build a classifier $h^{S}$ over the latent
space, and rely on its output to compute the cost values. This classifier
is first trained using the labeled source dataset $\mathbb{D}^{S}=\left\{ \left(\bx_{i}^{S},y_{i}^{S}\right)\right\} _{i=1}^{N_{S}}$
by minimizing the empirical loss:

\begin{equation}
\mathscr{\mathcal{L}}^{src}=\frac{1}{N_{S}}\sum_{i=1}^{N_{S}}CE\left(\sigma\left(h^{S}\left(G\left(\bx_{i}^{S}\right)\right)\right),y_{i}^{S}\right),\label{eq:l_src}
\end{equation}

where $\sigma$ denotes the softmax function and $CE$ the cross-entropy
loss. Recall that, given a target example $\bx_{i}$, the cost
$c\left(G\left(\bx_{i}\right),\mathbb{Q}_{m}^{S}\right)$ measures
how well $G\left(\bx_{i}\right)$ matches the class-conditional
distribution $\mathbb{Q}_{m}^{S}$. We therefore define
$c\left(G\left(\bx_{i}\right),\mathbb{Q}_{m}^{S}\right)=-\log\sigma_{m}\left(h^{S}\left(G\left(\bx_{i}\right)\right)\right)$,
where $\sigma_{m}\left(h^{S}\left(G\left(\bx_{i}\right)\right)\right)$
is the probability, predicted by the classifier $h^{S}$, that $\bx_{i}$
belongs to class $m$. The cost is thus small when $h^{S}$ confidently
assigns $\bx_{i}$ to class $m$ and large otherwise.
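
As a small illustration with hypothetical numbers (not taken from
our experiments), assume $M=3$ classes, natural logarithms, and a
target example whose predicted probabilities are $\sigma\left(h^{S}\left(G\left(\bx_{i}\right)\right)\right)=\left(0.7,0.2,0.1\right)$.
The corresponding costs would then be approximately
\begin{align*}
c\left(G\left(\bx_{i}\right),\mathbb{Q}_{1}^{S}\right) & =-\log0.7\approx0.357,\\
c\left(G\left(\bx_{i}\right),\mathbb{Q}_{2}^{S}\right) & =-\log0.2\approx1.609,\\
c\left(G\left(\bx_{i}\right),\mathbb{Q}_{3}^{S}\right) & =-\log0.1\approx2.303,
\end{align*}
so transporting the mass of this example to class $1$ is the cheapest
option under this cost.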

However, $h^{S}$ is trained only on the labeled source domain and
can generalize poorly to the target domain due to the data and label
shifts. Therefore, instead of relying on a single classifier to work
well on both domains, we leverage knowledge distillation
\citep{hinton2015distilling,tian2020Contrastive,tejankar2021isd}
with a two-network architecture consisting of a teacher $h^{T}$ and
a student $h^{S}$. The teacher $h^{T}$ aims to be an expert on the
target domain, while the student $h^{S}$, which classifies accurately
on the source domain, learns to generalize to the target domain by
distilling knowledge from its teacher. As the generalization ability
of $h^{S}$ improves, the cost $c\left(G\left(\bx_{i}\right),\mathbb{Q}_{m}^{S}\right)$
used to solve the OP in Eq. (\ref{eq:ws_latent}) becomes more accurate.
Inspired by \citet{hinton2015distilling}, we perform knowledge distillation
from the teacher $h^{T}$ to the student $h^{S}$ on the target domain
by minimizing a distillation loss $\mathscr{\mathcal{L}}^{dl}$ defined
with a temperature-scaled softmax function:

\begin{equation}
\mathscr{\mathcal{L}}^{dl}=\frac{1}{N_{T}}\sum_{i=1}^{N_{T}}CE\left(\sigma\left(\frac{h^{S}\left(G\left(\bx_{i}\right)\right)}{\tau}\right),\sigma\left(\frac{h^{T}\left(G\left(\bx_{i}\right)\right)}{\tau}\right)\right),\label{eq:distill_loss}
\end{equation}

where $\tau$ is a temperature parameter. When $\tau>1$, the predictions
of both the teacher and the student become softer, which allows the
student to capture the ``dark knowledge'' \citep{hinton2015distilling}
of the teacher and to mimic the teacher's behaviour more effectively.
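
To illustrate the softening effect, consider the hypothetical logits
$z=\left(2,1,0\right)$, chosen only for exposition. Then
\begin{align*}
\sigma\left(z/\tau\right) & \approx\left(0.665,0.245,0.090\right)\quad\text{for }\tau=1,\\
\sigma\left(z/\tau\right) & \approx\left(0.506,0.307,0.186\right)\quad\text{for }\tau=2,
\end{align*}
so a larger temperature moves probability mass from the top class
to the remaining classes while preserving their ranking, which makes
the relative similarities between classes more visible to the student.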

The student $h^{S}$ is thus trained on the source domain via
Eq. (\ref{eq:l_src}) and can generalize to the target domain via
Eq. (\ref{eq:distill_loss}). For this generalization to be good,
the teacher $h^{T}$ must itself classify well on the target domain.
To this end, we propose minimizing a cross-entropy loss between the
teacher's predictions and pseudo labels computed from the optimal
transportation matrix $A^{*}$ obtained by solving Eq. (\ref{eq:ws_latent}):

\begin{equation}
\mathscr{\mathcal{L}}^{pl}=\frac{1}{n_{T}}\sum_{i=1}^{n_{T}}CE\left(\sigma\left(h^{T}\left(G\left(\bx_{i}\right)\right)\right),\hat{y}_{i}^{T}\right),\label{eq:pseudo_loss}
\end{equation}

where $\hat{y}_{i}^{T}$ is the pseudo label of the unlabeled target
sample $\bx_{i}$. Note that only the subset of target samples with
high-confidence pseudo labels is used (i.e., $n_{T}<N_{T}$). In the
next section, we discuss how to compute these pseudo labels and present
our framework.

\subsection{Pseudo-label Selection and Our Framework}

We now introduce the strategy to produce pseudo labels for unlabeled
target samples. Directly solving the OP in Eq. (\ref{eq:ws_latent})
is computationally expensive. Hence, we instead minimize an entropic
regularized version:
\begin{figure}
\begin{centering}
\includegraphics[width=0.78\columnwidth]{cycle_process_v2}
\par\end{centering}
\caption{The proposed cycle class consistency framework.\label{fig:cycle_diagram}}
\end{figure}
\begin{figure*}
\begin{centering}
\includegraphics[width=0.85\textwidth]{COOK_overall_v2}
\par\end{centering}
\caption{The overall architecture of our proposed method, where $G$ is a weight-sharing
generator that maps the source and target data into the latent space.
The teacher $h^{T}$ and the student $h^{S}$ act in a cyclic process,
as described in Figure \ref{fig:cycle_diagram}, in which we apply
pseudo labelling and knowledge distillation and enforce the clustering
assumption: (a) minimizing the pseudo labelling loss $\mathscr{\mathcal{L}}_{w}^{pl}$
encourages target samples to move towards the corresponding source
class-conditional distributions; (b) minimizing the distillation loss
$\mathscr{\mathcal{L}}^{dl}$ pushes the target samples closer to
the source samples through the distillation between the predictions
of the teacher and student classifiers. In addition, minimizing $\mathcal{L}^{clus}$
accelerates the transport of target samples, strengthens clustering,
improves local smoothness, and improves the generalization ability
of $h^{S}$ on the target domain, so that pseudo labels can be selected
with high confidence.\label{fig:The-architecture-of-cook}}
\end{figure*}
\begin{align}
\mathcal{W}_{c,\bpi}^{\epsilon}\left(\mathbb{Q}^{T},\mathcal{Q}^{S}\right) & =\min_{A}\biggl\{\sum_{i=1}^{N_{T}}\sum_{m=1}^{M}a_{im}c\left(G\left(\bx_{i}\right),\mathbb{Q}_{m}^{S}\right)\nonumber \\
-\epsilon H(A): & \sum_{m=1}^{M}a_{im}=\frac{1}{N_{T}},\sum_{i=1}^{N_{T}}a_{im}=\pi_{m}\biggr\},\label{eq:ws_latent_entropic}
\end{align}

where $H(A)\coloneqq-\sum_{i=1}^{N_{T}}\sum_{m=1}^{M}a_{im}\log a_{im}$
denotes the entropy of the transportation matrix $A$, and $\epsilon$
is the regularization rate. During training, we use the Sinkhorn algorithm
\citep{cuturi2013Sinkhorn} to solve this OP and obtain $A^{*}$ for
every mini-batch. By the row constraint, the solution of Eq. (\ref{eq:ws_latent_entropic})
satisfies $\sum_{m=1}^{M}a_{im}^{*}=\frac{1}{N_{T}}$, or in other
words, $N_{T}\sum_{m=1}^{M}a_{im}^{*}=1$. Hence, we can define the
pseudo label of a given target sample $\bx_{i}$ as $\hat{y}_{i}^{T}\coloneqq N_{T}a_{i}^{*}$,
where $a_{i}^{*}$ denotes the $i$-th row of $A^{*}$. It is a valid
probability vector since $\sum_{m=1}^{M}\hat{y}_{im}^{T}=N_{T}\sum_{m=1}^{M}a_{im}^{*}=1$.
This definition of $\hat{y}_{i}^{T}$ is then used for minimizing
$\mathscr{\mathcal{L}}^{pl}$ in Eq. (\ref{eq:pseudo_loss}).
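
For reference, the Sinkhorn iteration \citep{cuturi2013Sinkhorn}
for a problem of the form in Eq. (\ref{eq:ws_latent_entropic}) can
be sketched as follows. The symbols $K$, $u$, $v$ and the iteration
index $t$ are introduced here only for illustration, and the initialization
and stopping rule are assumptions rather than a description of our
exact implementation. Let $K_{im}\coloneqq\exp\left(-c\left(G\left(\bx_{i}\right),\mathbb{Q}_{m}^{S}\right)/\epsilon\right)$.
Starting from $v_{m}^{(0)}=1$ for all $m$, one alternates the updates
\begin{align*}
u_{i}^{(t+1)} & =\frac{1/N_{T}}{\sum_{m=1}^{M}K_{im}v_{m}^{(t)}},\\
v_{m}^{(t+1)} & =\frac{\pi_{m}}{\sum_{i=1}^{N_{T}}K_{im}u_{i}^{(t+1)}},
\end{align*}
until both marginal constraints hold up to a tolerance, and then sets
$a_{im}^{*}=u_{i}K_{im}v_{m}$. With the cost defined as $-\log\sigma_{m}\left(h^{S}\left(G\left(\bx_{i}\right)\right)\right)$,
the kernel reduces to $K_{im}=\sigma_{m}\left(h^{S}\left(G\left(\bx_{i}\right)\right)\right)^{1/\epsilon}$,
so $\epsilon$ plays the role of a temperature applied to the student's
predictions before the marginals are balanced.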

One problem with choosing $\hat{y}_{i}^{T}\coloneqq N_{T}a_{i}^{*}$
is that the performance of the teacher $h^{T}$ can degrade if some
pseudo labels are incorrect, especially at the beginning of training,
due to the data and label shifts between the source and target domains.
This issue also affects the distillation process, since we aim to
build a teacher $h^{T}$ that classifies well on the target domain
and transfers some of its knowledge (e.g., its ``dark knowledge'')
to the student $h^{S}$. To avoid this problem, inspired by \citet{yang2021casting},
we propose an entropy-based selection method that keeps only highly
confident pseudo labels (i.e., pseudo labels whose entropies are below
a threshold). The loss in Eq. (\ref{eq:pseudo_loss}) is accordingly
reweighted by the weights $w_{i}$:

\begin{equation}
\mathscr{\mathcal{L}}_{w}^{pl}=\frac{1}{n_{T}}\sum_{i=1}^{n_{T}}w_{i}CE\left(\sigma\left(h^{T}\left(G\left(\bx_{i}\right)\right)\right),\hat{y}_{i}^{T}\right),\label{eq:pseudo_loss_with_w}
\end{equation}
where $w_{i}=\mathbb{I}_{\left\{ H\left(\hat{y}_{i}^{T}\right)<H_{\rho}\right\} }$,
with $\mathbb{I}_{C}$ denoting the indicator function of a statement
$C$ (i.e., $\mathbb{I}_{C}$ equals 1 if $C$ is true and 0 otherwise),
$H\left(\hat{y}_{i}^{T}\right)\coloneqq-\sum_{m=1}^{M}\hat{y}_{im}^{T}\log\hat{y}_{im}^{T}$
is the entropy of the pseudo label $\hat{y}_{i}^{T}$ of the target
example $\bx_{i}$, and the threshold $H_{\rho}$ denotes the $\rho$-th
percentile of the values $H\left(\hat{y}_{i}^{T}\right)$ over the
target samples.
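
For intuition, consider two hypothetical pseudo labels over $M=3$
classes (illustrative values only), with the entropy computed over
the normalized entries $\hat{y}_{im}^{T}$ using natural logarithms:
\begin{align*}
H\left(\left(0.90,0.05,0.05\right)\right) & \approx0.394,\\
H\left(\left(0.40,0.30,0.30\right)\right) & \approx1.089.
\end{align*}
For any threshold $H_{\rho}$ lying between these two values, the
first sample receives $w_{i}=1$ and contributes to $\mathscr{\mathcal{L}}_{w}^{pl}$,
whereas the second, whose entropy is close to the maximum $\log3\approx1.099$,
receives $w_{i}=0$ and is ignored.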

Additionally, when training COOK, at each iteration we sample a mini-batch
of target examples and consider $\mathbb{Q}^{T}$ as the distribution
of the latent representations corresponding to this mini-batch. Therefore,
$N_{T}$ in Eq. (\ref{eq:ws_latent_entropic}) is replaced by the
batch size, and the threshold $H_{\rho}$ denotes the $\rho$-th percentile
of $H\left(\hat{y}_{i}^{T}\right)$ within the mini-batch.

Finally, we present our framework in Figure \ref{fig:cycle_diagram},
which consists of three main steps: (i) the teacher is encouraged
to become an expert on the target domain using the pseudo labeling
technique; (ii) the teacher transfers its knowledge to the student
via a distillation process, which helps the student generalize well
on the target domain; and (iii) the predicted probabilities of the
student classifier are used to minimize $\mathcal{W}_{c,\bpi}^{\epsilon}\left(\mathbb{Q}^{T},\mathcal{Q}^{S}\right)$
with the Sinkhorn algorithm, which yields the optimal transportation
matrix $A^{*}$ from which the pseudo labels are computed. The pseudo
labels with low entropies are then selected to train the teacher in
the first step. This process forms a closed cycle in which target
samples are confidently moved towards the corresponding source class-conditional
distributions $\mathbb{Q}_{m}^{S}$ under the consistent cyclic guidance
of distributional optimal transport and knowledge distillation, which
motivates our proposed COOK.

\subsection{Training Procedure of COOK}

To strengthen $h^{S}$ so that it provides better predictions and
accelerates the matching of target samples $\bx^{T}$ to the source
class-conditional distributions $\mathbb{Q}_{m}^{S}$, we enforce
the clustering assumption on $h^{S}$. Following domain adaptation
works that apply the clustering assumption \citep{shu2018a,kumar2018co},
we employ Virtual Adversarial Training (VAT) \citep{VAT} in conjunction
with minimizing the entropy \citep{grandvalet2015entropy} of the
prediction of $h^{S}\left(G\left(\bx^{T}\right)\right)$. VAT is an
effective technique to improve the local distribution robustness
\citep{thanh2022particle,hoang2022global}. First, given a target
sample $\bx$, we choose a perturbed sample $\bx'$ in a small neighborhood
of $\bx$ for which the prediction of the student classifier $h^{S}$
differs most from its prediction for $\bx$. Then $h^{S}$ is enforced
to give the same prediction for $\bx$ and $\bx'$. As a result, the
decision boundary of $h^{S}$ is pushed away from the target sample
$\bx$, which improves the generalization ability of $h^{S}$ on the
target domain. The resulting clustering loss is:

\begin{equation}
\mathcal{L}^{clus}=\mathcal{L}^{ent}+\mathcal{L}^{vat},\label{eq:clus_loss}
\end{equation}
where $H$ denotes the entropy and
\begin{align*}
\mathcal{L}^{ent} & =\mathbb{E}_{\mathbb{P}^{T}}\left[H\left(\sigma\left(h^{S}\left(G\left(\bx\right)\right)\right)\right)\right],\\
\mathcal{L}^{vat} & =\mathbb{E}_{\mathbb{P}^{T}}\bigg[\max_{\bx':\norm{\bx'-\bx}<\theta}D_{KL}\bigg(\sigma\left(h^{S}\left(G\left(\bx\right)\right)\right),\\
 & \qquad\sigma\left(h^{S}\left(G\left(\bx'\right)\right)\right)\bigg)\bigg],
\end{align*}
where $D_{KL}$ denotes the Kullback-Leibler divergence and $\theta$
is a hyperparameter set to a very small positive number.
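
To see how the entropy term relates to the clustering assumption,
consider a hypothetical binary case ($M=2$, natural logarithms, values
chosen only for illustration):
\begin{align*}
H\left(\left(0.50,0.50\right)\right) & =\log2\approx0.693,\\
H\left(\left(0.99,0.01\right)\right) & \approx0.056.
\end{align*}
The entropy is largest for a target sample lying on the decision boundary
of $h^{S}$ and nearly vanishes for a confidently classified one, so
minimizing $\mathcal{L}^{ent}$ discourages the decision boundary from
passing through regions populated by target samples.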

The final optimization problem of our COOK for finding $h^{S},h^{T}$
and $G$ is as follows: 

\begin{equation}
\min_{h^{S},h^{T},G}\left\{ \mathscr{\mathcal{L}}^{src}+\alpha\mathscr{\mathcal{L}}^{dl}+\beta\mathscr{\mathcal{L}}_{w}^{pl}+\gamma\mathcal{L}^{clus}\right\} ,\label{eq:final_obj}
\end{equation}
where $\alpha,\beta,\gamma>0$ are trade-off parameters. The cyclic
process in Figure \ref{fig:cycle_diagram} is carried out by simultaneously
updating $h^{S},h^{T}$ and $G$ during training. The key steps of
COOK are summarized in Algorithm \ref{alg:algorithm}, and the overall
architecture and the motivation of the component losses are depicted
in Figure \ref{fig:The-architecture-of-cook}.

\begin{algorithm}[h]
\begin{algorithmic}[1]

\REQUIRE  A source batch $\mathcal{B}^{S}=\left\{ \left(\bx_{i}^{S},y_{i}^{S}\right)\right\} _{i=1}^{b}$,
a target batch $\mathcal{B}^{T}=\left\{ \bx_{j}^{T}\right\} _{j=1}^{b}$ ($b$
denotes the batch size).

\ENSURE Classifiers $h^{S*},h^{T*}$, generator $G^{*}$.

\FOR {number of training iterations}

\STATE Solve the OP in Eq. (\ref{eq:ws_latent_entropic}) using the
Sinkhorn algorithm to find $A^{*}$.

\STATE Compute $\hat{y}_{i}^{T}$ in Eq. (\ref{eq:pseudo_loss_with_w})
based on $A^{*}$.

\STATE Compute $w_{i}$ in Eq. (\ref{eq:pseudo_loss_with_w}) based
on $H_{\rho}$.

\STATE Update $h^{S},h^{T}$ and $G$ according to Eq. (\ref{eq:final_obj}).

\ENDFOR

\end{algorithmic}

\caption{Pseudocode for training our proposed COOK.\label{alg:algorithm}}
\end{algorithm}

