Unsupervised domain adaptation (UDA) allows us to transfer knowledge
from a model trained on a source domain with labels to a target domain
without any labels. To cope with structural data more efficiently
and effectively, deep domain adaptation (DDA) \citep{Ganin2015} has
been proposed and widely studied \citep{van2019dan,van2020dualdan,phung2021on}.
To tackle the data shift issue and learn domain-invariant features,
DDA aims to bridge the distribution gap between the source and target
domains in a latent space using a feature extractor. Guided by this
principle, most existing DDA methods minimize a
divergence between the source and target distributions in the latent
space. Popular choices of divergence include the Jensen-Shannon (JS)
divergence \citep{Ganin2015,TzengHDS15,shu2018a}, the maximum mean
discrepancy (MMD) distance \citep{gretton2007kernel,long2015}, and
the Wasserstein (WS) distance \citep{shen2018ws,chenyu2019swd,le2021labelshift}.

Recently, optimal transport (OT) \citep{villani2008optimal,santambrogio2015optimal},
a powerful tool in mathematics with rich and rigorous theories, has
been widely applied in deep domain adaptation \citep{courty2017optimal,courty2017joint,damodaran2018deepjdot,RedkoCFT19,chenyu2019swd,yujia2019onscalable,xu2020reliable,tuan2021tidot,tuan2021most,le2021lamda,va2021stem}.
From the conceptual perspective, OT-based methods encourage the target
samples to move towards the source samples by minimizing a transportation
cost. However, because the transportation cost is usually defined over
pairs of target and source samples without using the label information
of the source samples, the target samples are moved towards the source
domain without regard to its class regions, and the label shift issue
\citep{tachet2020domain} remains unresolved. Although early OT-based
methods have attempted to solve this problem \citep{courty2017optimal,damodaran2018deepjdot},
their performance still falls short of the state-of-the-art methods.

In this paper, we propose a novel distributional OT that enables the
incorporation of the source label information when engaging and matching
target and source samples. Specifically, in the source domain we consider
that one label is associated with a conditional distribution over
all the samples conditioned on that label. Next, we define a distribution
over these conditional distributions of all the labels in the source
domain. In the target domain where there are no labels, we also consider
a distribution over all the target samples. With the two distributions
for the source and target domains respectively, we formulate the DA
problem as the computation of the OT distance between the two distributions.
The OT transport plan tells us how a target sample
is related to the source samples while taking the source domain
labels into account. The challenge here is how to define the cost function, which
indicates the transport cost of OT between a target sample and a source
class-conditional distribution. To tackle this challenge, we propose
a cycle class consistency framework that leverages
knowledge distillation (KD), which has recently achieved strong
results \citep{tian2020Contrastive,zhao2020MDDA,tejankar2021isd,fend2021KD3A}.
We name our proposed approach \textbf{\emph{C}}\emph{ycle Class C}\textbf{\emph{O}}\emph{nsistency
with }\textbf{\emph{O}}\emph{ptimal Transport and }\textbf{\emph{K}}\emph{nowledge
Distillation for Unsupervised Domain Adaptation} (COOK). In summary,
our contributions in this paper include:
\begin{itemize}
\item We propose a novel distributional OT that seeks the optimal matching
between the target and source examples while taking the source
label information into account, thereby reducing both label shift and
data shift, two challenging problems in UDA.
\item We connect KD and OT to further improve the performance of class-aware
UDA methods by proposing a cycle class consistency framework in which
the teacher and student networks cooperate in a distillation
process and help reduce the mismatch between the target distribution
and the source class-conditional distributions.
\item We conduct experiments to compare our proposed COOK with the existing
standard UDA methods, especially class-aware UDA methods (e.g., RADA
\citep{wang2019classaware} and CAN \citep{kang2019can}), and OT-based
UDA methods (e.g., DeepJDOT \citep{damodaran2018deepjdot}, ETD \citep{li2020enhanceOT},
and RWOT \citep{xu2020reliable}). The experimental results show that
our proposed method surpasses these baselines on benchmark datasets
including \emph{Office-31}, \emph{Office-Home}, and \emph{ImageCLEF-DA}.
\end{itemize}
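
As an illustrative sketch of the distributional OT described above,
consider the following discrete form. The notation is assumed here
for exposition only and is not the formal definition of our method:
$\mathcal{P}_{c}^{S}$ denotes the source class-conditional distribution
of label $c\in\{1,\dots,C\}$, $\pi_{c}$ an assumed weight (e.g., the
class proportion) placed on $\mathcal{P}_{c}^{S}$, $x_{1}^{T},\dots,x_{N}^{T}$
the target samples with assumed uniform weights $1/N$, and $d(x_{j}^{T},\mathcal{P}_{c}^{S})$
a cost between a target sample and a class-conditional distribution.
Under these assumptions, the OT problem between the two distributions
could be written as
\[
\min_{\gamma\geq0}\;\sum_{c=1}^{C}\sum_{j=1}^{N}\gamma_{cj}\,d\left(x_{j}^{T},\mathcal{P}_{c}^{S}\right)\quad\mathrm{s.t.}\quad\sum_{j=1}^{N}\gamma_{cj}=\pi_{c},\quad\sum_{c=1}^{C}\gamma_{cj}=\frac{1}{N},
\]
where $\gamma_{cj}$ is the mass transported between the target sample
$x_{j}^{T}$ and the class-conditional distribution of label $c$.
In this sketch, a large value of $\gamma_{cj}$ relative to the other
entries for the same $j$ suggests that $x_{j}^{T}$ is matched to class $c$,
which is how the transport plan can carry the source label information.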

