In this section, we conduct experiments on three benchmark datasets,
\emph{Office-31}, \emph{Office-Home}, and \emph{ImageCLEF-DA}, to compare
our method with existing baselines, especially OT-based and class-aware
UDA methods.

\subsection{Datasets}

\textbf{Office-31} \citep{saenko20office31} is a widely used public
benchmark for UDA. It consists of three domains, Amazon (\textbf{A}),
Webcam (\textbf{W}) and DSLR (\textbf{D}), which share 31 classes
and contain 4,110 images in total.

\noindent \textbf{Office-Home} \citep{venkateswara2017deephasing}
is a more challenging dataset for UDA that contains images from four
domains, namely Artistic (\textbf{Ar}), Clip Art (\textbf{Cl}), Product
(\textbf{Pr}) and Real-world (\textbf{Re}). It consists of 15,588
images across 65 object categories found in office and home scenes.

\noindent \textbf{ImageCLEF-DA} \citep{caputo2014ImageCLEF} consists
of three domains, Caltech-256 (\textbf{C}), ImageNet ILSVRC 2012 (\textbf{I}),
and Pascal VOC 2012 (\textbf{P}), each of which has 12 classes with
50 images per class.

\subsection{Implementation Details}

In the experiments on the \emph{Office-31}, \emph{Office-Home} and
\emph{ImageCLEF-DA} datasets, we use $2048$-dimensional features
extracted from ResNet-50 \citep{he2016resnet}. The generator includes
a fully connected layer that outputs $256$ dimensions. The student
and teacher networks share the same architecture, each consisting
of a fully connected layer.

Two hyperparameters contribute substantially to model performance,
namely the temperature $\tau$ in Eq. (\ref{eq:distill_loss}) and
the percentile $\rho$ in Eq. (\ref{eq:pseudo_loss_with_w}). Following
the ablation study, we choose $\tau=10.0$ to make knowledge distillation
from the teacher to the student effective. The percentile $\rho$
reflects how well the student $h^{S}$ generalizes on the target domain.
We empirically find that $\rho=20$, i.e., the $20$-th percentile
of $H\left(\hat{y}_{i}^{T}\right)$, is appropriate for selecting
high-confidence pseudo labels. Additionally, setting $\epsilon\leq0.1$
yields better performance, and we set $\epsilon=0.1$. We also set
the trade-off parameters to $\alpha=\beta=1.0$ and $\gamma=0.1$,
as suggested by the ablation studies.
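
As a sketch of the selection rule described above, assume that $H$
denotes the Shannon entropy of the prediction $\hat{y}_{i}^{T}$ over
$K$ classes. The symbols $K$, $q_{\rho}$ and $w_{i}$ are introduced
here only for illustration and may differ from the notation of Eq.
(\ref{eq:pseudo_loss_with_w}). Under this assumption, a target sample
could be retained when its entropy does not exceed the $\rho$-th
percentile $q_{\rho}$ of the entropies of the target samples considered
(e.g., a mini-batch):
\[
H\left(\hat{y}_{i}^{T}\right)=-\sum_{k=1}^{K}\hat{y}_{i,k}^{T}\log\hat{y}_{i,k}^{T},\qquad w_{i}=\mathbf{1}\left[H\left(\hat{y}_{i}^{T}\right)\leq q_{\rho}\right].
\]
With $\rho=20$, roughly the $20\%$ of target samples with the lowest
predictive entropy would receive pseudo labels under this reading.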

We use the Adam optimizer \citep{kingma2014adam} ($\beta_{1}=0.5,\beta_{2}=0.999$)
with Polyak averaging \citep{polyak1992acce}. The learning rate is
set to $10^{-4}$ for \emph{Office-31} and \emph{Office-Home}, and
to $5\times10^{-5}$ for \emph{ImageCLEF-DA}. For the baselines, we
report the results given in the original papers. Note that in all
experiments we only train the feature extractor; the performance of
COOK can be further improved by fine-tuning the ResNet-50 backbone.
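
For reference, the classical form of Polyak averaging keeps a running
mean of the parameter iterates $\theta_{1},\dots,\theta_{t}$ produced
by the optimizer. The notation $\theta_{t}$ and $\bar{\theta}_{t}$
is introduced here for illustration, and the description above does
not state whether this uniform average or an exponential moving average
(a common variant in deep learning) is used:
\[
\bar{\theta}_{t}=\frac{1}{t}\sum_{k=1}^{t}\theta_{k}=\bar{\theta}_{t-1}+\frac{1}{t}\left(\theta_{t}-\bar{\theta}_{t-1}\right).
\]
In such schemes, the averaged parameters $\bar{\theta}_{t}$ are typically
the ones used at evaluation time.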

\subsection{Results and Discussion}

We compare COOK with the standard ResNet-50 baseline \citep{he2016resnet}
and with existing methods including DAN \citep{long2015}, DANN \citep{Ganin2015},
RTN \citep{long2016rtn}, iCAN \citep{zhang2018ican}, CDAN-E \citep{long2018cdan},
CDAN-BSP \citep{pmlr-v97-chen19i}, CDAN-T \citep{Wang2019CDAN},
TPN \citep{pan2019tpn}, rRevGrad+CAT \citep{deng2019cluster}, CADA-P
\citep{kurmi2019attending}, and SymNets \citep{zhang2019symnets}.
We pay particular attention to class-aware DA and OT-based methods,
namely RADA \citep{wang2019classaware}, CAN \citep{kang2019can},
DeepJDOT \citep{damodaran2018deepjdot}, ETD \citep{li2020enhanceOT},
and RWOT \citep{xu2020reliable}.

The results on \emph{Office-31} are reported in Table \ref{tab:results-office-31}.
Overall, our method exceeds $95\%$ accuracy on four of the six transfer
tasks. Except for the transfer tasks \textbf{D}$\rightarrow$\textbf{W}
and \textbf{W}$\rightarrow$\textbf{D}, our model outperforms the
other methods on almost all adaptation tasks and obtains $94.1\%$
on average, which is $3.3$ percentage points above the runner-up
RWOT ($90.8\%$) and $3.5$ points above CAN ($90.6\%$). It is worth
noting that COOK outperforms the baselines by a large margin on the
challenging tasks \textbf{D}$\rightarrow$\textbf{A} ($10.7$ points
over CAN) and \textbf{W}$\rightarrow$\textbf{A} ($8.3$ points over
RWOT and $9.2$ points over CAN), in which the image backgrounds of
the two domains are highly dissimilar.
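
The figures quoted above can be checked directly against Table \ref{tab:results-office-31}.
The following arithmetic uses only the entries of that table, with
the baseline used for each comparison named in parentheses:
\[
\begin{array}{rcll}
(95.1+96.2+98.3+99.9+88.7+86.2)/6 & = & 564.4/6\approx94.1 & \mbox{(COOK average)}\\
94.1-90.8 & = & 3.3 & \mbox{(average, vs. RWOT)}\\
94.1-90.6 & = & 3.5 & \mbox{(average, vs. CAN)}\\
88.7-78.0 & = & 10.7 & \mbox{(D$\rightarrow$A, vs. CAN)}\\
86.2-77.9 & = & 8.3 & \mbox{(W$\rightarrow$A, vs. RWOT)}\\
86.2-77.0 & = & 9.2 & \mbox{(W$\rightarrow$A, vs. CAN)}
\end{array}
\]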

We present the results on \emph{Office-Home} in Table \ref{tab:office-home}.
On this dataset, COOK achieves the best accuracy on $7$ of the $12$
transfer tasks and the best average accuracy, a $2.8$-point improvement
over the strongest baselines ($70.4\%$ vs. $67.6\%$). More specifically,
our model shows remarkable improvements on the more challenging adaptation
tasks, namely \textbf{Ar}$\rightarrow$\textbf{Pr} ($3.6$ points),
\textbf{Cl}$\rightarrow$\textbf{Pr} ($7.6$ points), and \textbf{Cl}$\rightarrow$\textbf{Re}
($4.1$ points).

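We further evaluate COOK on \emph{ImageCLEF-DA} and report the classification
accuracy in Table \ref{tab:result_clef-da}. COOK achieves the highest
average accuracy, $90.7\%$, compared with $89.7\%$ for ETD and $90.3\%$
for RWOT. According to the table, it obtains the best result on \textbf{P}$\rightarrow$\textbf{I}
and \textbf{C}$\rightarrow$\textbf{I}, and stays within $1.4$ points
of the best baseline on the remaining four tasks.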

\noindent 
\begin{table}
\centering{}\caption{Mean accuracy (\%) on Office-31 for unsupervised domain adaptation
(ResNet-50).\label{tab:results-office-31}}
\resizebox{1.0\columnwidth}{!}{%%
\begin{tabular}{cccccccc}
\hline 
Method & A$\rightarrow$W & A$\rightarrow$D & D$\rightarrow$W & W$\rightarrow$D & D$\rightarrow$A & W$\rightarrow$A & Avg\tabularnewline
\hline 
ResNet-50 & 68.4 & 68.9 & 96.7 & 99.3 & 62.5 & 60.7 & 76.1\tabularnewline
DAN & 80.5 & 78.6 & 97.1 & 99.6 & 63.6 & 62.8 & 80.4\tabularnewline
DANN & 82.0 & 79.7 & 96.9 & 99.1 & 68.2 & 67.4 & 82.2\tabularnewline
RTN & 84.5 & 77.5 & 96.8 & 99.4 & 66.2 & 64.8 & 81.6\tabularnewline
iCAN & 92.5 & 90.1 & 98.8 & \textbf{100.0} & 72.1 & 69.9 & 87.2\tabularnewline
CDAN-E & 94.1 & 92.9 & 98.6 & \textbf{100.0} & 71.0 & 69.3 & 87.7\tabularnewline
CDAN-BSP & 93.3 & 93.0 & 98.2 & \textbf{100.0} & 73.6 & 72.6 & 88.5\tabularnewline
CDAN-T & \textbf{95.7} & 94.0 & 98.7 & \textbf{100.0} & 73.4 & 74.2 & 89.3\tabularnewline
TPN & 91.2 & 89.9 & 97.7 & 99.5 & 70.5 & 73.5 & 87.1\tabularnewline
rRevGrad+CAT & 94.4 & 90.8 & 98.0 & \textbf{100.0} & 72.2 & 70.2 & 87.6\tabularnewline
SymNets & 90.8 & 93.9 & 98.8 & \textbf{100.0} & 74.6 & 72.5 & 88.4\tabularnewline
DeepJDOT & 88.9 & 88.2 & 98.5 & 99.6 & 72.1 & 70.1 & 86.2\tabularnewline
ETD & 92.1 & 88.0 & \textbf{100.0} & \textbf{100.0} & 71.0 & 69.3 & 86.2\tabularnewline
RWOT & 95.1 & 94.5 & 99.5 & \textbf{100.0} & 77.5 & 77.9 & 90.8\tabularnewline
RADA & 91.5 & 90.7 & 98.9 & \textbf{100.0} & 71.5 & 71.3 & 87.3\tabularnewline
CAN & 94.5 & 95.0 & 99.1 & 99.8 & 78.0 & 77.0 & 90.6\tabularnewline
\hline 
\textbf{COOK} & 95.1 & \textbf{96.2} & 98.3 & 99.9 & \textbf{88.7} & \textbf{86.2} & \textbf{94.1}\tabularnewline
\hline 
\end{tabular}}
\end{table}
\begin{table*}
\centering{}\caption{Mean accuracy (\%) on Office-Home for unsupervised domain adaptation
(ResNet-50).\label{tab:office-home}}
\resizebox{0.92\textwidth}{!}{%%
\begin{tabular}{cccccccccccccc}
\hline 
Method & Ar$\rightarrow$Cl & Ar$\rightarrow$Pr & Ar$\rightarrow$Re & Cl$\rightarrow$Ar & Cl$\rightarrow$Pr & Cl$\rightarrow$Re & Pr$\rightarrow$Ar & Pr$\rightarrow$Cl & Pr$\rightarrow$Re & Re$\rightarrow$Ar & Re$\rightarrow$Cl & Re$\rightarrow$Pr & Avg\tabularnewline
\hline 
ResNet-50 & 34.9 & 50.0 & 58.0 & 37.4 & 41.9 & 46.2 & 38.5 & 31.2 & 60.4 & 53.9 & 41.2 & 59.9 & 46.1\tabularnewline
DAN & 43.6 & 57.0 & 67.9 & 45.8 & 56.5 & 60.4 & 44.0 & 43.6 & 67.7 & 63.1 & 51.5 & 74.3 & 56.3\tabularnewline
DANN & 45.6 & 59.3 & 70.1 & 47.0 & 58.5 & 60.9 & 46.1 & 43.7 & 68.5 & 63.2 & 51.8 & 76.8 & 57.6\tabularnewline
SymNets & 47.7 & 72.9 & 78.5 & 64.2 & 71.3 & 74.2 & 64.2 & 48.8 & 79.5 & \textbf{74.5} & 52.6 & 82.7 & 67.6\tabularnewline
CDAN-E & 50.7 & 70.6 & 76.0 & 57.6 & 70.0 & 70.0 & 57.4 & 50.9 & 77.3 & 70.9 & 56.7 & 81.6 & 65.8\tabularnewline
CDAN-BSP & 52.0 & 68.6 & 76.1 & 58.0 & 70.3 & 70.2 & 58.6 & 50.2 & 77.6 & 72.2 & \textbf{59.3} & 81.9 & 66.3\tabularnewline
CDAN-T & 50.2 & 71.4 & 77.4 & 59.3 & 72.7 & 73.1 & 61.0 & \textbf{53.1} & 79.5 & 71.9 & 59.0 & 82.9 & 67.6\tabularnewline
DeepJDOT & 48.2 & 69.2 & 74.5 & 58.5 & 69.1 & 71.1 & 56.3 & 46.0 & 76.5 & 68.0 & 52.7 & 80.9 & 64.3\tabularnewline
ETD & 51.3 & 71.9 & \textbf{85.7} & 57.6 & 69.2 & 73.7 & 57.8 & 51.2 & 79.3 & 70.2 & 57.5 & 82.1 & 67.3\tabularnewline
RWOT & \textbf{55.2} & 72.5 & 78.0 & 63.5 & 72.5 & 75.1 & 60.2 & 48.5 & 78.9 & 69.8 & 54.8 & 82.5 & 67.6\tabularnewline
\hline 
\textbf{COOK} & 53.0 & \textbf{76.5} & 81.8 & \textbf{65.5} & \textbf{80.3} & \textbf{79.2} & \textbf{64.5} & 51.8 & \textbf{82.4} & 71.3 & 54.2 & \textbf{83.9} & \textbf{70.4}\tabularnewline
\hline 
\end{tabular}}
\end{table*}
\begin{table}
\centering{}\caption{Mean accuracy (\%) on ImageCLEF-DA for unsupervised domain adaptation
(ResNet-50).\label{tab:result_clef-da}}
\resizebox{0.92\columnwidth}{!}{%%
\begin{tabular}{cccccccc}
\hline 
Method & I$\rightarrow$P & P$\rightarrow$I & I$\rightarrow$C & C$\rightarrow$I & C$\rightarrow$P & P$\rightarrow$C & Avg\tabularnewline
\hline 
RTN & 75.6 & 86.8 & 95.3 & 86.9 & 72.7 & 92.2 & 84.9\tabularnewline
iCAN & 79.5 & 89.7 & 94.7 & 89.9 & 78.5 & 92.0 & 87.4\tabularnewline
CDAN-E & 77.7 & 90.7 & 97.7 & 91.3 & 74.2 & 94.3 & 87.7\tabularnewline
CDAN-T & 78.3 & 90.8 & 96.7 & 92.3 & 78.0 & 94.8 & 88.5\tabularnewline
SymNets & 80.2 & 93.6 & 97.0 & 93.4 & 78.7 & 96.4 & 89.9\tabularnewline
CADA-P & 78.0 & 90.5 & 96.7 & 92.0 & 77.2 & 95.5 & 88.3\tabularnewline
DeepJDOT & 77.7 & 90.6 & 95.1 & 88.5 & 75.3 & 94.3 & 86.9\tabularnewline
ETD & 81.0 & 91.7 & 97.9 & 93.3 & \textbf{79.5} & 95.0 & 89.7\tabularnewline
RWOT & \textbf{81.5} & 93.1 & \textbf{98.0} & 92.8 & 79.3 & \textbf{96.8} & 90.3\tabularnewline
RADA & 79.2 & 92.4 & 97.5 & 91.1 & 76.6 & 95.3 & 88.7\tabularnewline
\hline 
\textbf{COOK} & 80.1 & \textbf{95.5} & 97.0 & \textbf{95.9} & 79.1 & 96.3 & \textbf{90.7}\tabularnewline
\hline 
\end{tabular}}
\end{table}
\begin{table}
\centering{}\caption{Accuracy (\%) of ablation study on ImageCLEF-DA.\label{tab:effect_loss}}
\resizebox{0.85\columnwidth}{!}{\centering\setlength{\tabcolsep}{2pt}%{\small{}}%
\begin{tabular}{ccccccccccc}
\hline 
$\mathscr{\mathcal{L}}^{src}$ & $\mathscr{\mathcal{L}}_{w}^{pl}$ & $\mathscr{\mathcal{L}}^{dl}$ & $\mathcal{L}^{clus}$ & I$\rightarrow$P & P$\rightarrow$I & I$\rightarrow$C & C$\rightarrow$I & C$\rightarrow$P & P$\rightarrow$C & Avg\tabularnewline
\hline 
{\small{}\checkmark} & {\small{}\checkmark} &  &  & 75.9 & 86.1 & 93.9 & 89.0 & 74.4 & 87.4 & 84.5\tabularnewline
{\small{}\checkmark} & {\small{}\checkmark} & {\small{}\checkmark} &  & 76.4 & 86.6 & 93.9 & 89.9 & 76.0 & 91.2 & 85.7\tabularnewline
{\small{}\checkmark} & {\small{}\checkmark} &  & {\small{}\checkmark} & 78.6 & 91.0 & 95.9 & 92.9 & 77.9 & 95.9 & 88.7\tabularnewline
{\small{}\checkmark} &  & {\small{}\checkmark} & {\small{}\checkmark} & 78.5 & 90.9 & 96.7 & 93.3 & \textbf{79.6} & 95.0 & 89.0\tabularnewline
{\small{}\checkmark} & {\small{}\checkmark} & {\small{}\checkmark} & {\small{}\checkmark} & \textbf{80.1} & \textbf{95.5} & \textbf{97.0} & \textbf{95.9} & 79.1 & \textbf{96.3} & \textbf{90.7}\tabularnewline
\hline 
\end{tabular}}
\end{table}
\begin{table}
\noindent \centering{}\caption{Results (\%) on different training strategies.\label{tab:training-phase}}
\resizebox{0.86\columnwidth}{!}{\centering\setlength{\tabcolsep}{2pt}%%
\begin{tabular}{cccccccc}
\hline 
\multirow{1}{*}{Methods} & A$\rightarrow$W & A$\rightarrow$D & D$\rightarrow$W & W$\rightarrow$D & D$\rightarrow$A & W$\rightarrow$A & Avg\tabularnewline
\hline 
Without KD & 93.8 & 94.6 & 97.8 & 99.2 & 86.4 & 86.1 & 93.0\tabularnewline
With KD & \textbf{95.1} & \textbf{96.2} & \textbf{98.3} & \textbf{99.9} & \textbf{88.7} & \textbf{86.2} & \textbf{94.1}\tabularnewline
\hline 
\end{tabular}}
\end{table}
\begin{figure*}
\noindent \centering{}\subfloat[Effect of the percentile $\rho$.\label{fig:abla_percentile}]{\centering{}\includegraphics[width=0.26\textwidth]{abla_percentile}}\hspace{8mm}\subfloat[Effect of varying $\tau$.\label{fig:abla_tau}]{\centering{}\includegraphics[width=0.26\textwidth]{abla_temp}}\hspace{8mm}\subfloat[Values of $\mathcal{W}_{c,\protect\bpi}^{\epsilon}\left(\mathbb{Q}^{T},\mathcal{Q}^{S}\right)$.\label{fig:abla_ws}]{\centering{}\includegraphics[width=0.26\textwidth]{abla_ws}}\vspace{-2mm}
\caption{Ablation studies of our proposed method on the transfer task \textbf{A}\textrightarrow \textbf{W}.\label{fig:abla_qua_qua}}
\vspace{-3mm}
\end{figure*}
\begin{figure*}[t]
\noindent \centering{}\subfloat[$\alpha$\label{fig:alpha}]{\centering{}\includegraphics[width=0.26\textwidth]{abla_alpha}}\hspace{8mm}\subfloat[$\beta$\label{fig:beta}]{\centering{}\includegraphics[width=0.26\textwidth]{abla_beta}}\hspace{8mm}\subfloat[$\gamma$\label{fig:gamma}]{\centering{}\includegraphics[width=0.26\textwidth]{abla_gamma}}\caption{Analysis of hyperparameter sensitivity of $\alpha,\beta$ and $\gamma$
on the transfer tasks \textbf{P}$\rightarrow$\textbf{I} and \textbf{A}$\rightarrow$\textbf{D}.\label{fig:para-tuning}}
\end{figure*}


\subsection{Analysis}

\subsubsection{Hyperparameter Sensitivity and Quantitative Evaluation}

We evaluate the hyperparameter sensitivity of COOK and report quantitative
results in Figure \ref{fig:abla_qua_qua}. Figure \ref{fig:abla_percentile}
shows how the model performance degrades as $\rho$ in Eq. (\ref{eq:pseudo_loss_with_w})
is varied; COOK works well for $\rho$ between $10$ and $40$. Based
on this investigation, we pick $\rho=20$ in our experiments. Similarly,
Figure \ref{fig:abla_tau} shows the results for different values
of $\tau$. We search $\tau$ over the grid $\left\{ 1.0,10.0,25.0,50.0,100.0\right\} $
and find that $\tau=10.0$ achieves the best performance for knowledge
distillation. Furthermore, we track the Wasserstein distance $\mathcal{W}_{c,\bpi}^{\epsilon}\left(\mathbb{Q}^{T},\mathcal{Q}^{S}\right)$
in Figure \ref{fig:abla_ws}, which decreases during training. This
result indicates that target samples are successfully transported
to their corresponding source class-conditional distributions.
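
To illustrate why the choice of $\tau$ matters, assume that the distillation
loss in Eq. (\ref{eq:distill_loss}) relies on the standard temperature-scaled
softmax. This form, together with the logits $z_{k}$ and the class
count $K$, is an assumption made here for illustration:
\[
p_{k}^{(\tau)}=\frac{\exp\left(z_{k}/\tau\right)}{\sum_{j=1}^{K}\exp\left(z_{j}/\tau\right)}.
\]
For $\tau=1$ this is the ordinary softmax, and as $\tau\rightarrow\infty$
it approaches the uniform distribution $1/K$, so a larger $\tau$
yields softer teacher targets. For instance, with two-class logits
$z=(2,0)$, $\tau=1$ gives $p_{1}^{(1)}=1/\left(1+e^{-2}\right)\approx0.881$,
whereas $\tau=10$ gives $p_{1}^{(10)}=1/\left(1+e^{-0.2}\right)\approx0.550$.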

We further evaluate the effects of the trade-off parameters $\alpha,\beta,\gamma$
on model performance by varying their values. Figure \ref{fig:para-tuning}
shows the results when we search $\alpha,\beta$ and $\gamma$ over
the grid $\left\{ 0.001,0.01,0.1,1.0,5.0,10.0\right\} $ and report
the test accuracy on two transfer tasks, \textbf{P}$\rightarrow$\textbf{I}
(\emph{ImageCLEF-DA}) and \textbf{A}$\rightarrow$\textbf{D} (\emph{Office-31}).
The results show that the model yields stable performance when $\alpha,\beta,\gamma$
range from $0.001$ to $1.0$. We find that COOK achieves high performance
when $\alpha=\beta=1.0$ and $\gamma=0.1$, hence we use these values
in most of our experiments.

\noindent 
\begin{figure*}[t]
\subfloat[ResNet\label{fig:ResNet-AW}]{\includegraphics[width=0.47\columnwidth]{A_W_ResNet50}}\hfill{}\subfloat[COOK\label{fig:COOK-AW}]{\includegraphics[width=0.47\columnwidth]{A_W_COOK}}\hfill{}\subfloat[ResNet\label{fig:ResNet-PC}]{\includegraphics[width=0.47\columnwidth]{P_C_ResNet50}}\hfill{}\subfloat[COOK\label{fig:COOK-PC}]{\includegraphics[width=0.47\columnwidth]{P_C_COOK}}\caption{The t-SNE visualization of \textbf{A}\textrightarrow \textbf{W} (Figure
a, b) and \textbf{P}\textrightarrow \textbf{C} (Figure c, d) tasks
with label and domain information. Each color denotes a class while
the circle and cross markers represent the source and target data
respectively.\label{fig:t-sne}}
\vspace{-3mm}
\end{figure*}


\subsubsection{Effect of Losses}

\label{subsec:Effect-of-Losses}We investigate the effectiveness of
the pseudo-labeling loss $\mathscr{\mathcal{L}}_{w}^{pl}$, the distillation
loss $\mathscr{\mathcal{L}}^{dl}$, and the clustering assumption
loss $\mathcal{L}^{clus}$ in Eq. (\ref{eq:final_obj}). The results
in Table \ref{tab:effect_loss} show that every component loss contributes
to the model performance, since each participates in the cyclic process
and helps match target samples to their corresponding source regions.
Notably, the best average accuracy ($90.7\%$) is obtained when all
component losses are active during training.
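
The contribution of each loss can be read off the Avg column of Table
\ref{tab:effect_loss} by comparing the full model with the variant
that omits that loss. The differences below use only the table entries:
\[
\begin{array}{lrcl}
\mbox{without }\mathcal{L}^{clus}: & 90.7-85.7 & = & 5.0\\
\mbox{without }\mathcal{L}^{dl}: & 90.7-88.7 & = & 2.0\\
\mbox{without }\mathcal{L}_{w}^{pl}: & 90.7-89.0 & = & 1.7
\end{array}
\]
On this dataset, removing the clustering assumption loss therefore
costs the most average accuracy, followed by the distillation loss
and the pseudo-labeling loss.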

\subsubsection{Effect of Knowledge Distillation}

\label{subsec:Training-strategy}

We further examine the contribution of KD to our proposed method under
two settings: \emph{Without KD} and \emph{With KD}. In the \emph{Without
KD} setting, $h^{S}$ and $h^{T}$ are weight-sharing networks, and
the model is trained on the final optimization problem with $\mathscr{\mathcal{L}}^{dl}=0$.
We compare this setting with our full COOK architecture (i.e., \emph{With
KD}) and report the accuracy in Table \ref{tab:training-phase}.
COOK with KD outperforms the variant without KD by $1.1$ percentage
points on average ($94.1\%$ vs. $93.0\%$), which demonstrates the
effectiveness of KD in our framework.
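
The per-task gains of \emph{With KD} over \emph{Without KD} follow
directly from the rows of Table \ref{tab:training-phase}:
\[
\begin{array}{lrcl}
\mbox{A}\rightarrow\mbox{W}: & 95.1-93.8 & = & 1.3\\
\mbox{A}\rightarrow\mbox{D}: & 96.2-94.6 & = & 1.6\\
\mbox{D}\rightarrow\mbox{W}: & 98.3-97.8 & = & 0.5\\
\mbox{W}\rightarrow\mbox{D}: & 99.9-99.2 & = & 0.7\\
\mbox{D}\rightarrow\mbox{A}: & 88.7-86.4 & = & 2.3\\
\mbox{W}\rightarrow\mbox{A}: & 86.2-86.1 & = & 0.1
\end{array}
\]
The mean of these gains is $6.5/6\approx1.1$, matching the difference
between the two reported averages, with the largest gain on \textbf{D}$\rightarrow$\textbf{A}.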

\subsubsection{Feature Visualization}

We select the transfer tasks \textbf{A}\textrightarrow \textbf{W}
(\emph{Office-31}) and \textbf{P}\textrightarrow \textbf{C} (\emph{ImageCLEF-DA})
and visualize their representations in the latent space using \emph{t}-SNE
\citep{vanDerMaaten2008}. The visualizations in Figures \ref{fig:ResNet-AW}
and \ref{fig:ResNet-PC} show that, with features from the ResNet-50
backbone alone, there is still a mismatch between the source and target
distributions due to data and label shifts. In contrast, COOK (see
Figures \ref{fig:COOK-AW} and \ref{fig:COOK-PC}) is trained to transport
target samples to source samples, which closes this gap and achieves
better alignment between the target and source samples.
