
We denote our distance metric-based approach as \textbf{CS+CCS} and adversarial training-based approach as \textbf{CS-adv}, and %compare it with other baseline methods on both synthetic and real-world datasets. 
evaluate their performance on both synthetic and real-world datasets.
Our experimental evaluation is divided into three parts: 1) the advantages of the CS divergence over other popular distance measures such as MMD and KL divergence, in terms of statistical power (Sec.~\ref{sec:statistic}) and practical UDA performance (Sec.~\ref{sec:compare-mmd});
2) the superior performance of CS-adv relative to other state-of-the-art (SOTA) methods (Sec.~\ref{sec:adv-results});
3) the flexibility of the CCS divergence as an injective module (Sec.~\ref{sec:injective}).

%1). statistic test (Sec.~\ref{sec:statistic}); 2). results of discrepancy minimization (Sec.\ref{sec:compare-mmd}); 3). conditional adversarial training results (Sec.~\ref{sec:adv-results}); 4). the results of CCS divergence as an injective module (Sec.~\ref{sec:injective}). }

%We evaluate and analyze CS-adv in both toy and real-world datasets and compare it with other baseline methods. In addition, to comprehensively understand the advantages of CS and CCS divergences, we also conduct experimental analysis without adversarial training, purely with two proposed divergences, on both synthetic and toy digits datasets. 

\textbf{Datasets} \quad We use four datasets in our experiments. 1) \textbf{Digits} The Digits dataset~\citep{long2018conditional} consists of two parts, \textbf{MNIST} and \textbf{USPS}, which lead to two adaptation tasks, M$\rightarrow$U and U$\rightarrow$M. 2) \textbf{Office-Home} The Office-Home dataset~\citep{venkateswara2017deep} has four domains: Art (Ar), Clipart (Cl), Product (Pr), and Real-World (Rw), which results in $12$ domain adaptation tasks. The overall dataset has 15,500 images with 65 classes. 3) \textbf{Office-31} The Office-31 dataset~\citep{saenko2010adapting} has three domains: Amazon (A), Webcam (W), and DSLR (D), which results in $6$ tasks. The overall Office-31 dataset contains 4,652 images with 31 categories. {4) \textbf{VisDA17} We use the large-scale real-world dataset VisDA17~\citep{peng2017visda} for a fair comparison with the KL-based approach~\citep{nguyen2021kl}}. 

% toy dataset Office-Caltech-10~\cite{fernando2014subspace} to comprehensively compare with other metrics, e.g. MMD, and

%In alignment with the protocol established by \cite{long2018conditional}, we utilise two digit datasets - \textbf{MNIST} and \textbf{USPS} - to carry out two adaptation tasks ($M \rightarrow U$ and $U \rightarrow M$). The MNIST dataset comprises 60,000 training images and 10,000 test images. On the other hand, the USPS dataset contains 7,291 images for training and 2,007 images for testing.

%The MNIST dataset comprises 60,000 training images and 10,000 test images, while the USPS dataset contains 7,291 images for training and 2,007 images for testing

%  as a toy example to demonstrate the effectiveness of CS and CCS components against MMD and joint distribution MMD (JPMMD).

% \paragraph{Office-Home} The Office-Home dataset~\cite{venkateswara2017deep} has four domains: Art (Ar), Clipart (Cl), Product
% (Pr) and Real-World (Rw), which results in $12$ domain adaptation tasks. The overall data has 15,500 images with 65 classes. 

% \paragraph{Office-31} The Office-31 dataset~\cite{saenko2010adapting} has three domains: Amazon (A), Webcam (W) and DSLR (D), which results in $6$ tasks. The overall data contains 4,652 images with 31 categories.

% \begin{figure}[hbpt]
% %\hfill
% \begin{minipage}[b]{.45\linewidth} 
% \centering 
%      \begin{subfigure}[b]{0.45\textwidth}
%          \centering
%          \includegraphics[width=\textwidth]{Figures/power_mean.pdf}
%          %\caption{Test set prediction accuracy (MSE) with respect to training epochs}
%          %\label{fig:y equals x}
%      \end{subfigure}
%      %\hfill
%      \begin{subfigure}[b]{0.45\textwidth}
%          \centering
%          \includegraphics[width=\textwidth]{Figures/power_variance.pdf}
%         % \caption{Information plane}
%          %\label{fig:three sin x}
%      \end{subfigure}
%         \caption{Percent ($1$ indicates $100\%$) of correctly rejecting $H_0$ hypothesis with different values of $\mu$ and $\sigma^2$.}
%         \label{fig:marginal_power}
% \end{minipage}% 
% \hfill
% \begin{minipage}[b]{.5\linewidth} 
% \centering 
% \resizebox{0.6\textwidth}{!}{
% \begin{tabular}{lcccccc}
% \toprule
%  & \multicolumn{3}{c}{Cond. CS} & \multicolumn{3}{c}{Cond. MMD} \\
% \cmidrule(r){2-4} \cmidrule(r){5-7}
%  & (a) & (b) & (c) & (a) & (b) & (c) \\
% \midrule
% (a) & 0.05 & 1 & 1 & 0.04 & 0.02 & 0.04 \\
% (b) & 1 & 0.08 & 0.92 & 0.04 & 0.04 & 0.04 \\
% (c) & 1 & 0.91 & 0.10 & 0.12 & 0.06 & 0.06 \\
% \bottomrule
% \end{tabular}}
% \captionof{table}{Percent of correctly rejecting $H_0$ hypothesis for conditional CS and class conditional MMD. An ideal result is a full-one matrix with $0$ on the diagonal.} \label{tab:conditional_power}
% \end{minipage} 
% 
% \end{figure}


% \begin{figure}[hbpt]
% %\hfill
% \centering 
%      \begin{subfigure}[b]{0.21\textwidth}
%          \centering
%          \includegraphics[width=\textwidth]{Figures/power_mean.pdf}
%          %\caption{Test set prediction accuracy (MSE) with respect to training epochs}
%          %\label{fig:y equals x}
%      \end{subfigure}
%      \begin{subfigure}[b]{0.21\textwidth}
%          \centering
%          \includegraphics[width=\textwidth]{Figures/power_variance.pdf}
%         % \caption{Information plane}
%          %\label{fig:three sin x}
%      \end{subfigure}
%         \caption{Percent ($1$ indicates $100\%$) of correctly rejecting $H_0$ hypothesis with different values of $\mu$ and $\sigma^2$.}
%         \label{fig:marginal_power}
% \end{figure}





\subsection{Comparisons among (conditional) CS divergence, MMD and KL divergence}\label{sec:compare_MMD_KL}

In this section, we first demonstrate that our conditional CS (CCS) divergence is statistically more powerful than competing measures at discriminating two conditional distributions $p^s(y|\mathbf{x})$ and $p^t(y|\mathbf{x})$. %Then, we conduct experimental comparisons between the proposed CS+CCS and other distance metrics, including the classical MMD-based methods and the recent KL-based method~\citep{nguyen2021kl}.}
Then, we compare our CS+CCS with both MMD-based approaches and the recently developed KL-based approach~\citep{nguyen2021kl}. Note that all approaches are implemented without adversarial training; they differ only in the distance metric used to match $p(\mathbf{z})$ (or $p(y|\mathbf{z})$). 

%Before evaluating the performance of our method on real-world large-scale datasets, we first demonstrate that our conditional CS divergence is statistically more powerful to discriminate two conditional distributions $p^s(y|\mathbf{x})$ and $p^t(y|\mathbf{x})$. 


%we first justify the relationship between CS divergence and MMD. We hypothesize that: 1) CS divergence and MMD have similar power to discriminate two marginal distributions, due to their highly similar empirical estimators that only differ by a logarithm in each term (i.e., Eq.~(\ref{eq.cs_est}) with respect to Eq.~(\ref{eq:mmd_est})); 2) conditional CS divergence (i.e., $D_{\text{CCS}}(p^s(y|\mathbf{x});p^t(y|\mathbf{x}))$) is much more powerful than existing MMD-based strategies in domain adaptation (i.e., the class conditional MMD, $\sum_{i=1}^K\text{MMD}(p^s(\mathbf{x}|y=c_i);p^t(\mathbf{x}|y=c_i))$) to discriminate two (different) conditional distributions.






\subsubsection{Statistical Test}
\label{sec:statistic}
%To test the first hypothesis, we follow~\cite{gretton2012kernel} and conduct two-sample tests on two Gaussians with $100$ samples in each distribution and dimensionality $d=100$. In the first case, the distributions have unit variance but different values of mean $\mu$ ($20$ values of $\mu$ that logarithmically spaces from $0.01$ to number $e$). In the second case, samples were drawn from distributions $\mathcal{N}(0, I)$ and $\mathcal{N}(0, \sigma^2I)$ with different values of variance $\sigma^2$ ($20$ values of $\sigma^2$ logarithmically spaces from $10^0.01$ to $1.5$). We apply the permutation test to distinguish two distributions and set the significance level for all tests as $\alpha = 0.05$. The results are plotted in Fig.~\ref{fig:marginal_power}. MMD and CS divergence have almost the same percentages to correctly reject the $H_0$ hypothesis (i.e., two Gaussians are the same).

%\input{Tables/syn_mmd_cs}
For comparison, we evaluate both the conditional KL divergence, estimated with the $k$-NN estimator~\citep{wang2009divergence} ($k=3$), and the conditional MMD. In domain adaptation, $\text{MMD}(p^s(y|\mathbf{x});p^t(y|\mathbf{x}))$ is rarely evaluated explicitly because it is difficult to estimate. A much more popular strategy is to evaluate the class conditional MMD, i.e., $\sum_{i=1}^K\text{MMD}(p^s(\mathbf{x}|y=c_i);p^t(\mathbf{x}|y=c_i))$, where $c_i$ denotes the $i$-th class.
%, which implicitly assumes that $p^s(y)=p^t(y)$ (i.e., no label shift). 
For completeness, we test both strategies and measure $\text{MMD}(p^s(y|\mathbf{x});p^t(y|\mathbf{x}))$ with the estimator in~\citep{ren2016conditional}.


We follow~\citep{zheng2000consistent} and generate $3$ sets of data that have distinct conditional distributions: (a) $t=1+\sum^d_{i=1}x_i+ \epsilon$, where $d$ is the dimension of the explanatory variable $\mathbf{x}$ and $\epsilon$ is standard normal noise; the labeling rule is $y=1$ if $t\geq 0$, otherwise $y=0$. (b) $t=1+\sum^d_{i=1}\log(x_i)+ \epsilon$; the labeling rule is again $y=1$ if $t\geq 0$, otherwise $y=0$. (c) $t=1+\sum^d_{i=1}x_i+ \epsilon$; the labeling rule becomes $y=1$ if $t\geq 1$, otherwise $y=0$. For each set, the input $\mathbf{x}$ is drawn from the same Gaussian distribution, i.e., $p(\mathbf{x})$ remains the same, but $p(y|\mathbf{x})$ differs.
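
To make the difference between the three sets explicit, the conditional distributions can be written in closed form. The following is a short derivation from the labeling rules above, assuming that $\epsilon\sim\mathcal{N}(0,1)$ is independent of $\mathbf{x}$; the symbol $\Phi$ (the standard normal CDF) is introduced here only for illustration:
\[
\begin{aligned}
\text{(a)}\quad & p(y=1|\mathbf{x}) = \Pr\Big(\epsilon \geq -1-\sum^d_{i=1}x_i\Big) = \Phi\Big(1+\sum^d_{i=1}x_i\Big), \\
\text{(b)}\quad & p(y=1|\mathbf{x}) = \Pr\Big(\epsilon \geq -1-\sum^d_{i=1}\log(x_i)\Big) = \Phi\Big(1+\sum^d_{i=1}\log(x_i)\Big), \\
\text{(c)}\quad & p(y=1|\mathbf{x}) = \Pr\Big(\epsilon \geq -\sum^d_{i=1}x_i\Big) = \Phi\Big(\sum^d_{i=1}x_i\Big).
\end{aligned}
\]
Under these assumptions, sets (a) and (c) differ only by a unit shift in the argument of $\Phi$, whereas set (b) changes the functional form. This is consistent with the pair (a)--(c) being the hardest to distinguish in Table~\ref{tab:conditional_power}.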

%In this sense, we have in set (a), $p(y|\mathbf{x})\sim \mathcal{N}(y-\sum^d_{i=1}x_i-1, 1) $; in set (b), $p(y|\mathbf{x})\sim \mathcal{N}(y-\sum^d_{i=1}x_i^2-1, 1) $; in set (c), $p(y|\mathbf{x})\sim \text{Logistic}(y-\sum^d_{i=1}x_i^2-2, 1) $. 
We generate $200$ samples from each set and set $d=10$. We apply a permutation test with significance level $\alpha=0.05$ to test the null hypothesis $H_0$ that two sets of data share a common conditional distribution against the alternative hypothesis $H_1$ that their conditional distributions differ. The results in Table~\ref{tab:conditional_power} suggest that our CCS is the most powerful of the compared statistics. Details about the permutation test and data visualization are provided in the Appendix (Section~\ref{subsec:details-test}). % of the supplementary material.
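
As a rough sketch of how such a permutation test is typically carried out (the exact procedure is the one described in the Appendix; the symbols $B$, $\hat{D}_{\text{obs}}$, and $\hat{D}_b$ are introduced here only for illustration), let $\hat{D}_{\text{obs}}$ be the divergence estimated on the two original sets, and let $\hat{D}_b$ be the divergence re-estimated after the $b$-th random reassignment of the pooled samples to the two sets, $b=1,\dots,B$. A standard permutation $p$-value is then
\[
\hat{p} = \frac{1+\sum_{b=1}^{B}\mathbf{1}\big[\hat{D}_b \geq \hat{D}_{\text{obs}}\big]}{1+B},
\]
and $H_0$ is rejected whenever $\hat{p}\leq\alpha$. The reassignment simulates the null hypothesis, so a large observed divergence relative to the permuted ones is evidence against $H_0$.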


More specifically, by stating that a test statistic is more (statistically) powerful, we mean that it has a greater probability in finite samples of correctly rejecting the null hypothesis (i.e., two conditional distributions are equal) in favor of the alternative hypothesis (i.e., the two conditional distributions differ).
That is, the statistic has a smaller Type-II error.
Table~\ref{tab:conditional_power} shows that our CCS method correctly distinguishes distribution (a) from distribution (c) with an empirical probability of 0.72, while the corresponding probabilities for class CMMD, CMMD, and CKL are notably lower, at most 0.20, 0.01, and 0.25, respectively.

The main diagonal elements of the table report the empirical Type-I error (i.e., the probability of falsely rejecting the null hypothesis). Given our chosen significance level of $\alpha = 0.05$, one would ideally expect these elements to be close to 0.05. The results indicate that all methods exhibit similar size control. Hence, we conclude that the CCS approach offers higher statistical power with good size control. 
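
In symbols, the two quantities discussed above are
\[
\text{size} = \Pr(\text{reject } H_0 \mid H_0 \text{ is true}), \qquad
\text{power} = \Pr(\text{reject } H_0 \mid H_1 \text{ is true}) = 1 - \Pr(\text{Type-II error}).
\]
As a sketch, assuming that each entry of Table~\ref{tab:conditional_power} is a rejection frequency over $R$ independent repetitions (the symbol $R$ is illustrative notation and its value is not specified here), the entry is estimated as
\[
\frac{1}{R}\sum_{r=1}^{R}\mathbf{1}\big[\hat{p}_r \leq \alpha\big],
\]
where $\hat{p}_r$ is the $p$-value of the test in the $r$-th repetition. Off-diagonal entries then estimate power, and diagonal entries estimate size.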

\begin{table}[htbp]
    \centering
    \resizebox{0.48\textwidth}{!}{
    \begin{tabular}{lccc|ccc|ccc|ccc}
    \toprule
     & \multicolumn{3}{c}{CCS} & \multicolumn{3}{c}{class CMMD} & \multicolumn{3}{c}{CMMD} &\multicolumn{3}{c}{CKL} \\
    \cmidrule(r){2-4} \cmidrule(r){5-7} \cmidrule(r){8-10} \cmidrule(r){11-13}
     & (a) & (b) & (c) & (a) & (b) & (c) & (a) & (b) & (c) & (a) & (b) & (c) \\
    \midrule
    (a) & 0.06 & 1 & 0.72 & 0.05 & 1 & 0.17 & 0.02 & 0 & 0.01 & 0.01 & 1 & 0.25 \\
    (b) & 1 & 0.08 & 1 & 1 & 0.03 & 1 & 0 & 0.05 & 0 & 1 & 0.07 & 1 \\
    (c) & 0.72 & 1 & 0.07 & 0.20 & 1 & 0.06 & 0 & 0 & 0.04 & 0.25 & 1 & 0.07 \\
    \bottomrule
    \end{tabular}}
    \caption{Fraction of tests rejecting $H_0$ for the conditional CS in Eq.~(\ref{eq:conditional_CS_est}) (CCS), the class conditional MMD (class CMMD), the conditional MMD with the estimator in~\citep{ren2016conditional} (CMMD), and the conditional KL with the $k$-NN estimator~\citep{wang2009divergence} (CKL). An ideal result has ones off the diagonal and values close to $\alpha=0.05$ on the main diagonal.}
    \label{tab:conditional_power}
\end{table}


% \subsubsection{Baselines}
% Our primary baseline for comparison is DANN~\cite{ganin2016domain}. We have also benchmarked our method against recent techniques such as CDAN~\cite{long2018conditional}.
% Additionally, we draw comparisons with the $f$-divergence-based domain adversarial learning method, f-DAL~\cite{acuna2021f}. f-DAL also added a Sampling-Based Alignment~\cite{jiang2020implicit} module to address the label shift problem. This is specifically effective in the Office-Home dataset, so we also compare f-DAL+Alignment in this dataset. Our experimental results show that our way of modeling has a better performance. We have also considered CKB~\cite{luo2021conditional} as an alternative baseline, as it aligns with the conditional distribution $p(z|y)$.
%\subsubsection{Ablation Study}
\subsubsection{{The Performance of CS+CCS}}
\label{sec:compare-mmd}


% \begin{figure} 
% \centering 
%      \begin{subfigure}[b]{0.3\textwidth}
%          \includegraphics[width=\textwidth]{Figures/compare_mnist.png}
%          \caption{The ablation study of the \textbf{CS} and \textbf{CCS} components in MNIST to USPS task, comparing with MMD and joint distribution MMD (JPMMD).}
%          \label{Fig.abl-mnist}
%      \end{subfigure}
%      \hspace{0.05\textwidth} % Add horizontal space
%      \begin{subfigure}[b]{0.3\textwidth}
%          \includegraphics[width=\textwidth]{Figures/compare_mnist_fdal.png}
%          \caption{The ablation study of \textbf{f-DAL with CCS} divergence in MNIST to USPS task. CCS divergence improves the performance of f-DAL consistently. }
%          \label{fig:f-dal-ccs}
%      \end{subfigure}
% \end{figure}


\textbf{Comparison with MMD} \quad We demonstrate the advantages of CS and CCS over MMD in practical UDA tasks. To this end, we conduct an ablation study on the Digits \textbf{M}$\rightarrow$\textbf{U} task. In this example, we match the marginal and conditional distributions with plain CS and CCS divergences without any adversarial training techniques. We use LeNet~\citep{lecun1998gradient} as the feature extractor and a nonlinear classifier with two fully connected layers and ReLU activation. 
%three fully connected layers with ReLu activate function and Dropout (0.5) as the classifier.  

\begin{figure}[htbp]
%
  \centering
  \includegraphics[width=0.36\textwidth,trim=0.95cm 0.95cm 0.8cm 0.85cm, clip]{Figures/compare_mnist.png}
  \caption{Ablation study of the \textbf{CS} and \textbf{CCS} components on the MNIST-to-USPS task, compared with MMD and joint probability MMD (JPMMD).}
  \label{Fig.abl-mnist}
\end{figure}

%\textbf{CS vs. MMD} 
We compare the CS, CCS, and CCS+CS (i.e., CS+CCS) divergences with MMD and joint probability MMD (JPMMD)~\citep{zhang2020discriminative}, which approximates $D(p^s(\mathbf{z},y);p^t(\mathbf{z},y))$ with $\mu_1 D(p^s(\mathbf{z});p^t(\mathbf{z})) + \mu_2 D(p^s(\mathbf{z}|y);p^t(\mathbf{z}|y))$ (rather than $D(p^s(y|\mathbf{z});p^t(y|\mathbf{z}))$). As shown in Fig.~\ref{Fig.abl-mnist}, all our CS divergence-based adaptations are consistently better than MMD and JPMMD, which suggests that our CCS divergence is better at modeling conditional alignment. Moreover, CCS adapts better than the CS divergence alone, and combining the two (CCS+CS) performs best. To better understand the learned representations on the two domains, we provide t-SNE~\citep{van2008visualizing} visualizations in Section~\ref{subsec:add-abl} of the Appendix. We also attempted a comparison with the conditional MMD in Eq.~(\ref{eq:conditional_MMD}); however, training failed because the required matrix inversion is numerically unstable. The same issue can be observed in our statistical test in Table~\ref{tab:conditional_power}.


%\paragraph{Comparison with KL} In addition, we compare with KL~\cite{nguyen2021kl}. However, we found KL is relatively sensitive to the network architecture choice, especially for classifiers. With a nonlinear classifier, the method is hard to converge. In contrast, the proposed CS divergence can be easily adapted to different frameworks. To have a fair comparison with KL, we adapt the proposed CS and CCS divergences into KL~\cite{nguyen2021kl}'s implementation and compare the performance without adversarial training (e.g. in a metric learning-based fashion as KL). The result is shown in Table~\ref{tab:kl_ccs_abl}. Our CS-CCS method outperforms KL consistently in both the toy Digits dataset and the large-scale VisDA17 dataset. Note, that in the VisDA17 dataset, the reproduced result of KL is lower than the result in the paper. We use batch size 128 instead of 256 in the default setting due to the memory constraint. However, within the same environment and hyperparameters, CS-CCS has a better performance.

\textbf{Comparison with KL} \quad Furthermore, we compare with the KL-based approach~\citep{nguyen2021kl}. We observe that its performance is very sensitive to the choice of network architecture. Moreover, with a nonlinear classifier, the model hardly converges. This point is also acknowledged by the authors\footnote{\url{https://shorturl.at/abgM0}, lines $172$--$175$.}.
%\textcolor{pink}{This may be attributed to the estimation of KL relies on the variational inference, which minimizes the upper bound instead of KL itself.} 
{The instability of the KL divergence can also be attributed to the term $\log\left(\frac{p(x)}{q(x)}\right)$, which is likely to explode when $q(x) \rightarrow 0$}.
In contrast, our CS divergence can easily be adapted to different frameworks. To ensure fairness, we incorporate our CS and CCS divergences into the framework of~\citep{nguyen2021kl}, keeping the architecture and hyperparameters identical. 
%We then evaluate the performance in a non-adversarial training setting, akin to the metric learning approach of KL. 
The comparative results are presented in Table~\ref{tab:kl_ccs_abl}. The proposed CS+CCS surpasses KL on both the Digits and VisDA17 datasets. Note that our reproduced results for KL on the VisDA17 dataset are lower than the reported scores. This may be attributed to the different computing environments and our adoption of a batch size of 128, as opposed to the original 256, due to memory limitations. Furthermore, we show that 
%KL can be improved by integrating with CCS divergence (KL+CCS ).}
the performance of KL~\citep{nguyen2021kl} can be further improved by integrating our CCS regularization (KL+CCS). This result indicates that the assumption made by~\citep{nguyen2021kl} that $p^s(y|\mathbf{z})$ and $p^t(y|\mathbf{z})$ are sufficiently close may be too stringent. 
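
The sensitivity of the log-ratio term in the KL divergence can be illustrated with a small worked example. This is only a sketch on a two-point discrete distribution, assuming the squared-ratio form of the CS divergence, $D_{\text{CS}}(p;q) = -\log\big((\sum_i p_i q_i)^2 / (\sum_i p_i^2 \sum_i q_i^2)\big)$; the specific $p$, $q$, and $\delta$ are chosen purely for illustration. Let $p=(\tfrac{1}{2},\tfrac{1}{2})$ and $q=(1-\delta,\delta)$ with $0<\delta<1$. Then
\[
\begin{aligned}
D_{\text{KL}}(p;q) &= \tfrac{1}{2}\log\frac{1/2}{1-\delta} + \tfrac{1}{2}\log\frac{1/2}{\delta} \;\rightarrow\; \infty \quad (\delta \rightarrow 0), \\
D_{\text{CS}}(p;q) &= -\log\frac{(1/2)^2}{\tfrac{1}{2}\big((1-\delta)^2+\delta^2\big)} = \log\Big(2\big((1-\delta)^2+\delta^2\big)\Big) \;\rightarrow\; \log 2 \quad (\delta \rightarrow 0).
\end{aligned}
\]
In this example, the KL divergence grows without bound as $q$ loses mass on one point, while the CS divergence stays finite; both vanish at $\delta=\tfrac{1}{2}$, where $p=q$.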
%Nevertheless, within the same environment and hyperparameters, CS+CCS demonstrates superior performance than KL.
% Nguyen~\textit{et al.}'s~
% where CS+CCS indicates training with CS and CCS divergences

\begin{table}[htbp]
\centering
\resizebox{0.4\textwidth}{!}{
\begin{tabular}{lccc}
\toprule
Method & M$\rightarrow$U & U$\rightarrow$M & VisDA17 \\
\midrule
KL (Reproduced)     & 98.2 & 97.2 &    52.3    \\
\midrule 
KL+CCS & \textbf{98.4} & 97.3 & 64.1 \\
CS+CCS & 98.3 & \textbf{97.9} & \textbf{64.5} \\
\bottomrule
\end{tabular}}
\caption{Results on Digits and VisDA17 datasets. KL+CCS integrates CCS divergence with KL. CS+CCS replaces KL with CS and CCS divergences.}
\label{tab:kl_ccs_abl}
\end{table}



% \begin{figure} [htbp]
% %\hfill
% \centering 
%      \begin{subfigure}[b]{0.2\textwidth}
%          \centering
%          \includegraphics[width=\textwidth]{Figures/TSNE_25_no_adaptation.png}
%          \caption{No adaptation}
%          \label{fig:tsne-no-ad}
%      \end{subfigure}
%      \begin{subfigure}[b]{0.2\textwidth}
%          \centering
%          \includegraphics[width=\textwidth]{Figures/TSNE_25_cs.png}
%          \caption{CS divergence}
%          \label{fig:tsne-cs}
%      \end{subfigure}
%      %\hfill
%      \begin{subfigure}[b]{0.2\textwidth}
%          \centering
%          \includegraphics[width=\textwidth]{Figures/TSNE_25_ccs.png}
%         \caption{CCS divergence}
%          \label{fig:tsne-ccs}
%      \end{subfigure}
%      \begin{subfigure}[b]{0.2\textwidth}
%          \centering
%          \includegraphics[width=\textwidth]{Figures/TSNE_25_cs_ccs.png}
%         \caption{CCS+CS}
%          \label{fig:tsne-cs-ccs}
%      \end{subfigure}
%         \caption{t-SNE visualization of feature trained without adaptation (\ref{fig:tsne-no-ad}), with CS divergence (\ref{fig:tsne-cs}, with CCS divergence (\ref{fig:tsne-ccs}), and with both CCS and CS divergences (\ref{fig:tsne-cs-ccs}).}
%         \label{fig:tsne-overall}
%         
% \end{figure}

% \textbf{Visualization} \quad In order to better understand the adaptation ability of CS and CCS divergence, we use t-SNE~\cite{van2008visualizing} to visualize the feature trained without adaptation (Fig.~\ref{fig:tsne-no-ad}), with CS divergence (Fig.~\ref{fig:tsne-cs}, with CCS divergence (Fig.~\ref{fig:tsne-ccs}), and with both CCS and CS divergences (Fig.~\ref{fig:tsne-cs-ccs}). Fig~\ref{fig:tsne-overall} shows the aligned quality on \textbf{M}$\rightarrow$\textbf{U} task. As shown in Fig~\ref{fig:tsne-overall}, CS divergence has a worse performance on inter-class separability, while CCS divergence can alleviate this issue. This can also be observed in Fig.~\ref{fig:tsne-cs-ccs}, where CCS divergence is added on top of CS divergence and leads to better separability compared with Fig.~\ref{fig:tsne-cs}. Hence, modeling the conditional distribution alignment is necessary and the proposed CCS divergence has an advantage. 



\input{Tables/officehome}
%\input{Tables/office31}

% \begin{figure*}[htbp]
%    \centering
%    \begin{minipage}{0.65\textwidth} % adjust the width to your needs

\begin{table*}[htbp]
\centering
        \resizebox{0.8\textwidth}{!}{%
\begin{tabular}{lccccccc}
\toprule
Method & A$\rightarrow$W & D$\rightarrow$W & W$\rightarrow$D & A$\rightarrow$D & D$\rightarrow$A & W$\rightarrow$A & Avg \\
\midrule
ResNet~\citep{he2016deep} & 68.4$\pm$0.2 & 96.7$\pm$0.1 & 99.3$\pm$0.1 & 68.9$\pm$0.2 & 62.5$\pm$0.3 & 60.7$\pm$0.3 & 76.1 \\
DANN~\citep{ganin2016domain} & 82.0$\pm$0.4 & 96.9$\pm$0.2 & 99.1$\pm$0.1 & 79.7$\pm$0.4 & 68.2$\pm$0.4 & 67.4$\pm$0.5 & 82.2 \\
JAN~\citep{long2017deep} & 85.4$\pm$0.3 & 97.4$\pm$0.2 & 99.8$\pm$0.2 & 84.7$\pm$0.3 & 68.6$\pm$0.3 & 70.0$\pm$0.4 & 84.3 \\
GTA~\citep{sankaranarayanan2018generate} & 89.5$\pm$0.5 & 97.9$\pm$0.3 & 99.8$\pm$0.4 & 87.7$\pm$0.5 & 72.8$\pm$0.3 & 71.4$\pm$0.4 & 86.5 \\
MCD~\citep{saito2018maximum} & 88.6$\pm$0.2 & 98.5$\pm$0.1 & \textbf{100.0}$\pm$0.0 & 92.2$\pm$0.2 & 69.5$\pm$0.1 & 69.7$\pm$0.3 & 86.5 \\
MDD~\citep{zhang2019bridging} & 94.5$\pm$0.3 & 98.4$\pm$0.1 & \textbf{100.0}$\pm$0.0 & 93.5$\pm$0.2 & 74.6$\pm$0.3 & 72.2$\pm$0.1 & 88.9 \\
KL~\citep{nguyen2021kl} & 87.9$\pm$0.4 & \textbf{99.0}$\pm$0.2 & \textbf{100.0}$\pm$0.0 & 85.6$\pm$0.6 & 70.1$\pm$1.1 & 69.3$\pm$0.7 & 85.3 \\
CDAN~\citep{long2018conditional} & 94.1$\pm$0.1 & 98.6$\pm$0.1 & \textbf{100.0}$\pm$0.0 & 92.9$\pm$0.2 & 71.0$\pm$0.3 & 69.3$\pm$0.3 & 87.7 \\
f-DAL~\citep{acuna2021f} & \textbf{95.4}$\pm$0.7 & 98.8$\pm$0.1 & \textbf{100.0}$\pm$0.0 & 93.8$\pm$0.4 & 74.9$\pm$1.5 & 74.2$\pm$0.5 & 89.5 \\
%f-DAL+Alignment~\cite{acuna2021f} & 93.4$\pm$0.4 & \textbf{99.0}$\pm$0.1 & \textbf{100.0}$\pm$.0 & 94.8$\pm$0.6 & 73.6$\pm$0.2 & 74.6$\pm$0.4 & 89.2 \\
\midrule
CS-adv (Ours) & 95.1$\pm$0.6 & 98.8$\pm$0.1 & 99.7$\pm$0.1 & \textbf{94.0}$\pm$0.5 & \textbf{76.2}$\pm$0.3 & \textbf{76.4}$\pm$0.4 & \textbf{90.0} \\
\bottomrule
\end{tabular}%
}
\caption{Comparative results (Accuracy \%) of different methods on \textbf{Office-31}.}
\label{tab:results-office31}
\end{table*}
%    \end{minipage}

%
%\end{figure*}







% \begin{figure}
%     \centering
%     \includegraphics[width=0.4\textwidth]{Figures/compare_mnist_fdal.png}
%     \caption{\textbf{Integrating CCS with f-DAL} in M$\rightarrow$U task.}
%     \label{fig:f-dal-ccs}
% \end{figure}
    
    


%\subsection{Experimental Results}
\subsection{The Performance of CS-adv}
\label{sec:adv-results}

In this section, we demonstrate how CS-adv can achieve competitive results against other SOTA methods. 

\textbf{Implementation Details} \quad
Our experiments were carried out using the PyTorch framework~\citep{paszke2019pytorch} on an NVIDIA GeForce RTX 3090 GPU. We use the SGD optimizer with a batch size of $32$. For the Office-Home and Office-31 datasets, we resize the images to $224 \times 224 \times 3$. We use the ResNet-50 model~\citep{he2016deep}, pretrained on the ImageNet dataset~\citep{deng2009imagenet}, as the feature extractor $f$. For the classifiers $g_1$ and $g_2$, we use two fully connected layers with Leaky-ReLU activation functions (similar to~\citep{acuna2021f}). For the hyperparameters, we set $\lambda$ and $\beta$ to $1$, and $\gamma$ to $0.1$. 
For the Digits datasets, we follow the implementation in~\citep{long2018conditional,acuna2021f}, using LeNet~\citep{lecun1998gradient} as the backbone feature extractor $f$. The two classifiers, $g_1$ and $g_2$, have identical structures, each comprising two linear layers with ReLU activation functions. {In our implementation, we normalize $\mathbf{z}$ and $\hat{y}$ and set the kernel size to $\sigma=1$, which is a common heuristic~\citep{greenfeld2020robust}. }%in all our experiments. 
%We use kernel size $\sigma=1$ in our experiments. 
%non-linearities and Dropout (0.5) in the final layer. 


\textbf{Baselines} \quad We compare the proposed conditional adversarial training method with several state-of-the-art
domain adaptation approaches on real-world datasets; the results are presented in Tables~\ref{tab:office-home} and \ref{tab:results-office31}. 
%We benchmark our method against DANN~\cite{ganin2016domain} and CDAN~\cite{long2018conditional}, which are two classical adversarial training Domain Adaptation methods.
We compare our method with three classical adversarial training methods for domain adaptation, namely DANN~\citep{ganin2016domain}, CDAN~\citep{long2018conditional}, {and MDD~\citep{zhang2019bridging}}.
We additionally compare our method with the $f$-divergence-based domain adversarial learning method, f-DAL~\citep{acuna2021f}, since it also introduces a new family of divergences, the $f$-divergences, into an adversarial training framework. 
f-DAL-Alignment is a variant of f-DAL that adds a Sampling-Based Alignment~\citep{jiang2020implicit} module to handle label shift. 
We further compare with KL~\citep{nguyen2021kl} because of its similar motivation of offering a tighter generalization bound.
We also compare our approach with Wasserstein distance-based (optimal transport) methods such as DEEPJDOT~\citep{damodaran2018deepjdot} and JUMBOT~\citep{fatras2021unbalanced}, which belong to another line of related research 
(aligning the joint distribution $p(\mathbf{z}, y)$, but assuming $\mathbf{z}$ and $y$ are independent).


%. These methods also seek to align the joint distribution $p(\mathbf{z}, y)$, but assume $\mathbf{z}$ and $y$ are independent. 
%Another line of related research we draw comparisons with involves Optimal Transport-based methods such as DEEPJDOT~\cite{damodaran2018deepjdot} and JUMBOT~\cite{fatras2021unbalanced}. These methods also seek to align the joint distribution $p(\mathbf{z}, \mathbf{y})$, but assume $\mathbf{z}$ and $\mathbf{y}$ are independent. 
%However, DEEPJDOT and JUMBOT match $p(\mathbf{z})$ and $p(\mathbf{y})$ separately, which assumes $\mathbf{z}$ and $\mathbf{y}$ are independent. 

%We have also considered CKB~\cite{luo2021conditional} as an alternative baseline, as it aligns with the conditional distribution $p(\mathbf{z}|y)$.
%  Our experimental results show that our way of modeling has a better performance.
% Another line of research works we compare with is Optimal Transport-based methods, DEEPJDOT~\cite{damodaran2018deepjdot} and JUMBOT~\cite{fatras2021unbalanced} due to they also aim to match the joint distribution $p(\mathbf{z}, \mathbf{y})$. However, DEEPJDOT and JUMBOT match $p(\mathbf{z})$ and $p(\mathbf{y})$ separately, which assumes $\mathbf{z}$ and $\mathbf{y}$ are independent. 


%\paragraph{Results on Real-World Datasets}

\textbf{Results} \quad
From the results on Office-Home in Table~\ref{tab:office-home}, we observe that the proposed method outperforms the other methods on most adaptation tasks and achieves the best average accuracy of $71.2\%$. We also observe that on tasks where the alignment for label shift brings a large improvement for f-DAL (Ar$\rightarrow$Pr, Cl$\rightarrow$Rw), our method yields similar or better performance. This suggests that the proposed CCS divergence can alleviate the label shift problem. The results on Office-31 are shown in Table~\ref{tab:results-office31}, where the proposed method again achieves the best average accuracy ($90.0\%$). More results on Digits can be found in the Appendix (Section~\ref{subsec:compare-fdal-digits}). 


% \input{Tables/abl-fdl-ccs}
% In Table~\ref{tab:abl-fdal-ccs}, we present the comparison between the proposed CS-adv method and other methods. It shows that the proposed method has the best performance on the best adaptation tasks. 

%\paragraph{Results on Digits Dataset} 




% \subsection{Ablation Study}
% \textbf{Visualization} \quad In order to better understand the adaptation ability of CS and CCS divergence, we use t-SNE~\cite{van2008visualizing} to visualize the feature trained without adaptation (Fig.~\ref{fig:tsne-no-ad}), with CS divergence (Fig.~\ref{fig:tsne-cs}, with CCS divergence (Fig.~\ref{fig:tsne-ccs}), and with both CCS and CS divergences (Fig.~\ref{fig:tsne-cs-ccs}). Fig~\ref{fig:tsne-overall} shows the aligned quality on \textbf{M}$\rightarrow$\textbf{U} task. As shown in Fig~\ref{fig:tsne-overall}, CS divergence has a worse performance on inter-class separability, while CCS divergence can alleviate this issue. This can also be observed in Fig.~\ref{fig:tsne-cs-ccs}, where CCS divergence is added on top of CS divergence and leads to better separability compared with Fig.~\ref{fig:tsne-cs}. Hence, modeling the conditional distribution alignment is necessary and the proposed CCS divergence has an advantage in this. 

% \begin{figure} [htbp]
% %\hfill
% \centering 
%      \begin{subfigure}[b]{0.2\textwidth}
%          \centering
%          \includegraphics[width=\textwidth]{Figures/TSNE_25_no_adaptation.png}
%          \caption{No adaptation}
%          \label{fig:tsne-no-ad}
%      \end{subfigure}
%      \begin{subfigure}[b]{0.2\textwidth}
%          \centering
%          \includegraphics[width=\textwidth]{Figures/TSNE_25_cs.png}
%          \caption{CS divergence}
%          \label{fig:tsne-cs}
%      \end{subfigure}
%      %\hfill
%      \begin{subfigure}[b]{0.2\textwidth}
%          \centering
%          \includegraphics[width=\textwidth]{Figures/TSNE_25_ccs.png}
%         \caption{CCS divergence}
%          \label{fig:tsne-ccs}
%      \end{subfigure}
%      \begin{subfigure}[b]{0.2\textwidth}
%          \centering
%          \includegraphics[width=\textwidth]{Figures/TSNE_25_cs_ccs.png}
%         \caption{CCS+CS}
%          \label{fig:tsne-cs-ccs}
%      \end{subfigure}
%         \caption{t-SNE visualization of feature trained without adaptation (\ref{fig:tsne-no-ad}), with CS divergence (\ref{fig:tsne-cs}, with CCS divergence (\ref{fig:tsne-ccs}), and with both CCS and CS divergences (\ref{fig:tsne-cs-ccs}). \textcolor{blue}{Blue} indicates the source domain, and \textcolor{red}{red} refers to the target domain.}
%         \label{fig:tsne-overall}
%         
% \end{figure}




%%%%%%%%%%%%%%%%%% original f-DAL CCS %%%%%%%%%%%%%%
% \begin{figure}[htbp]
% %
%   \centering
%   \includegraphics[width=0.4\textwidth]{Figures/compare_mnist_fdal.png}
%   \caption{The ablation study of \textbf{f-DAL with CCS} divergence in MNIST to USPS task. CCS divergence improves the performance of f-DAL consistently.}
%   %
%   \label{fig:f-dal-ccs}
%   %
% \end{figure}
%%%%%%%%%%%%%%%%%% original f-DAL CCS %%%%%%%%%%%%%%
% \begin{table}[htbp]
% \centering
% \resizebox{0.4\textwidth}{!}{
% \begin{tabular}{lccc}
% \toprule
% Method & M$\rightarrow$U & U$\rightarrow$M & VisDA17 \\
% \midrule
% KL (Reproduced)     & 98.2 & 97.2 &    48.3    \\
% \midrule 
% KL-CCS & \textbf{98.4} & \textbf{97.3} & \textbf{64.1}    \\
% CS+CCS & \textbf{98.3} & \textbf{97.9} &   \textbf{64.5} \\
% \bottomrule
% \end{tabular}}
% \caption{Results on Digits and VisDA17 datasets. KL-CCS integrates CCS divergence with KL. CS+CCS replaces KL with CS and CCS divergences.}
% \label{tab:kl_ccs_abl}
% \end{table}

\subsection{CCS as an Injective Module}
\label{sec:injective}
%\subsection{CCS as an Injective Module}

We finally provide two examples to demonstrate that the CCS divergence alone (i.e., Eq.~(\ref{eq:conditional_CS_est})) can be used as a plug-in module in existing methods. 

\textbf{CCS in f-DAL} \quad We choose f-DAL~\citep{acuna2021f} as the first base model and simply integrate our CCS divergence into its adversarial training loss (f-DAL-CCS) {without hyperparameter tuning}. As shown in Fig.~\ref{fig:f-dal-ccs}, CCS improves f-DAL consistently, which suggests the benefit of aligning the conditional distributions with CCS. 

\begin{figure}[htbp]
%
  \centering
  \includegraphics[width=0.36\textwidth]{Figures/compare_mnist_fdal.png}
  \caption{\textbf{Integrating CCS with f-DAL} in M$\rightarrow$U task.} %CCS divergence improves the performance of f-DAL consistently.}
  \label{fig:f-dal-ccs}
\end{figure}

%\textbf{CCS in KL} \quad Moreover, we integrate our CCS divergence into KL~\citep{nguyen2021kl} framework (KL-CCS) and test it on both the toy Digtis dataset and the large-scale real-world dataset, VisDA17 (Table~\ref{tab:kl_ccs_abl}). It shows that our CCS module consistently improves the KL. 

%\textbf{CCS in KL} \quad Moreover, we integrate our CCS divergence into KL~\citep{nguyen2021kl} framework (KL-CCS)  and test it on both the toy Digtis dataset and the large-scale real-world dataset, VisDA17 (Table~\ref{tab:kl_ccs_abl}). It shows that our CCS module consistently improves the KL. 
{\textbf{CCS with kSHOT} \quad Recent UDA approaches such as kSHOT~\citep{sun2022prior} use additional prior knowledge, such as the target class distribution, and thereby achieve superior performance compared to most methods that rely solely on adversarial training. Nevertheless,
integrating the CCS divergence consistently improves the performance of kSHOT across tasks on Office-Home, by $0.4$ percent on average. Details can be found in Section~\ref{subsec:add-abl} in the Appendix.}


% Morden UDA use more sphostic method, in this example, we show boost. use words to describe 


%1) the conditional misalignment cannot be ignored, and 2) CCS divergence is able to address this issue. 
