
% \begin{itemize}
%     \item Intro UDA problem
%     \item intro general methods, moment matching (MMD), wasstertine, and two types of adversial training. 
%     \item Intro contiditional distribution is important and previous methods don't have a good way to model this
%     \item intro CS and CCS divergence 
%     \item our contribution
% \end{itemize}
%  However, the performance of these models relies heavily on the availability of large-scale annotated data.

Deep learning has achieved outstanding performance across a variety of vision tasks, {including image classification~\citep{he2016deep} and semantic segmentation~\citep{ronneberger2015u}}. Typically, it is assumed that the training and test data are drawn from the same distribution. In practice, this assumption is often violated due to factors such as changes in lighting conditions, viewpoints, and the appearance of objects. The resulting discrepancy between the training (source) and test (target) distributions is referred to as domain shift~\citep{mansour2008domain,yosinski2014transferable}, and it can significantly degrade the generalization capability of the learned model.

Domain adaptation aims to mitigate the effects of domain shift by leveraging the knowledge acquired from one or multiple source domains to improve the model's performance in different, but related, target domains. Most previous methods learn a domain-invariant feature representation $\mathbf{z}$ that has the same marginal distribution $p(\mathbf{z})$ across domains.
{Usually, this is achieved either by minimizing a divergence measure, such as the Maximum Mean Discrepancy (MMD)~\citep{gretton2012kernel,zhang2020discriminative}, the Kullback-Leibler (KL) divergence~\citep{nguyen2021kl}, or the {Wasserstein distance, which arises from optimal transport}~\citep{damodaran2018deepjdot,fatras2021unbalanced}, or by adopting optimization strategies such as adversarial training~\citep{ganin2016domain,long2015learning,saito2018maximum,zhang2019bridging,du2021cross}}.
However, these approaches implicitly assume that the conditional distribution $p(y|\mathbf{z})$ remains the same, {which is generally an overly optimistic assumption (cf. Fig.~1 in \citealp{zhao2019learning})}.




Moreover, due to the high dimensionality and continuous nature of the representation $\mathbf{z}$, estimating the discrepancy of $p(y|\mathbf{z})$ between the two domains is challenging. %Previously, the most common way is, instead, to match $p(\mathbf{z}|y)$ (e.g., \cite{luo2021conditional,zhang2020discriminative}), which equals to $\sum_i p(\mathbf{z}|y=c_i)p(y=c_i)$. 
The above-mentioned measures, including MMD, the KL divergence, and the Wasserstein distance, have been considered for modeling the discrepancy of $p(y|\mathbf{z})$. However, earlier approaches either resort to matching $p(\mathbf{z}|y)$ instead (e.g., \citep{luo2021conditional,zhang2020discriminative}) or na\"ively assume that this discrepancy is sufficiently small~\citep{nguyen2021kl}.
Note that, when $p(\mathbf{z})$ is already aligned, aligning $p(y|\mathbf{z})$ solely by matching $p(\mathbf{z}|y)$ implicitly assumes that the label marginal $p(y)$ is invariant across domains, i.e., that there is no label shift.
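To see why, write $p^s$ and $p^t$ for the source and target distributions (a superscript notation adopted here only for illustration). Bayes' rule gives $p(y|\mathbf{z}) = p(\mathbf{z}|y)\,p(y)/p(\mathbf{z})$ in each domain, so if both $p(\mathbf{z})$ and $p(\mathbf{z}|y)$ are matched, then wherever the densities are positive,
\[
\frac{p^s(y|\mathbf{z})}{p^t(y|\mathbf{z})}
= \frac{p^s(\mathbf{z}|y)\,p^s(y)}{p^t(\mathbf{z}|y)\,p^t(y)} \cdot \frac{p^t(\mathbf{z})}{p^s(\mathbf{z})}
= \frac{p^s(y)}{p^t(y)},
\]
which equals one only when $p^s(y) = p^t(y)$.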
% Note that, aligning $p(y|\mathbf{z})$ by just matching $p(\mathbf{z}|y)$ implicitly assumes that $p(y)$ is invariant, provided that $p(\mathbf{z})$ has been aligned.


%In earlier studies, the predominant approach to tackle this challenge has been to match $p(\mathbf{z}|y)$ instead (e.g., \citep{luo2021conditional,zhang2020discriminative}), which equals to $\sum_i p(\mathbf{z}|y=c_i)p(y=c_i)$, provided that $y$ is discrete.
%\sj{However, aligning $p(y|\mathbf{z})$ by just matching $p(\mathbf{z}|y)$ implicitly assumes that $p(y)$ is invariant (when $p(\mathbf{z})$ is aligned as well)~\citep{zhao2020domain}. }

%\wy{Previous attempts with KL and MMD on aligning $p(y|\mathbf{z})$ fail. } %This motives us to 

%As suggested in~\cite{zhao2020domain}, the two can only be equivalent if $p(y)$ is invariant (i.e., no label shift). Hence, it still remains an open problem how to align $p(y|\mathbf{z})$. 
% $p_s(\mathbf{x})=p_t(y)$
%As suggested in \cite{zhao2020domain}, the two can only be equivalent under certain conditions. Hence, accurately estimating the conditional divergence is still an unsolved problem. 
% Cauchy-Schwarz divergence is inspired by the Cauchy-Schwarz inequality to measure two probability density functions. 

%To tackle this issue, we introduce Cauchy-Schwarz (CS) divergence~\cite{principe2000learning,principe2000information,yu2023conditional} to the problem of unsupervised domain adaptation (UDA). The conditional Cauchy-Schwarz (CCS) divergence is able to handle the  distributions of continuous variables in a tractable way. Hence, we use CS and CCS divergences to measure the the discrepancy of $p(\mathbf{z})$ and $p(y|\mathbf{z})$ between two domains, respectively. Firstly, we demonstrate that we are able to directly match the joint distribution ($p(\mathbf{z}, y)=p(\mathbf{z})p(y|\mathbf{z})$) in an efficient and tractable way with CS and CCS divergences. By aligning $p(y|\mathbf{z})$ between two domains, our method naturally avoid the problems that exist in other domain adaptation methods, like label shift. 

The above issue motivates us to introduce the Cauchy-Schwarz (CS) divergence~\citep{principe2000learning,principe2000information} to the problem of unsupervised domain adaptation (UDA), in which the target domain is unlabeled. 
%
%The CS divergence originates from the signal processing community in $00$s, which has received re-emerging attention in recent deep learning applications such as representation learning~\citep{tran2022cauchy} and sequential decision making~\citep{yu2023conditional}. 
%
First, we demonstrate that the CS divergence can be used to explicitly align $p(y|\mathbf{z})$ between the source and target domains.
%\wy{leading to the decomposition of $p(\mathbf{z},y)=p(y|\mathbf{z})p(\mathbf{z})$, which avoids the assumption of ``no label shift" ($p^s(y) \neq p^t(y)$, but $p^s(z|y) =  p^t(z|y)$)} in previous literature. 
%Secondly, relying on CS divergence, we obtain a tighter generalization error bound by extending the result of Nguyen~\textit{et al.}~\cite{nguyen2021kl}. 
Second, using the CS divergence, we establish a generalization bound that is tighter than the one obtained with the commonly adopted KL divergence. This offers theoretical support for the improved performance of our overall model compared to other state-of-the-art (SOTA) approaches.
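For reference, the CS divergence between two densities $p$ and $q$ is commonly defined as
\[
D_{\mathrm{CS}}(p;q) = -\log\left( \frac{\left(\int p(\mathbf{x})\,q(\mathbf{x})\,d\mathbf{x}\right)^{2}}{\int p^{2}(\mathbf{x})\,d\mathbf{x}\,\int q^{2}(\mathbf{x})\,d\mathbf{x}} \right),
\]
which is non-negative by the Cauchy-Schwarz inequality, symmetric in its two arguments, and zero if and only if $p = q$ almost everywhere. The symbol $D_{\mathrm{CS}}$ and the generic variable $\mathbf{x}$ are notational choices made here for illustration.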
%The new bound is easy to optimize and more general. Finally, we integrate CS divergence into a popular bi-classifier adversarial training framework~\cite{saito2018maximum,lee2019sliced,du2021cross} and achieve compelling performance. 

%We additionally demonstrate that our CS divergence can be easily used as a flexible plug-in module to improve the performance of other state-of-the-art UDA methods.

% Unlike classic generalization error bound~\cite{ben2010theory} that only considers the binary classification setup and is hard to optimize in practice, our bound applies to general domain adaptation tasks including multi-class classification and regression. 

Our contributions can be summarized as follows: 
\begin{enumerate}
    \item To the best of our knowledge, this is the first attempt to introduce CS divergence to UDA for aligning $p(y|\mathbf{z})$.
    \item We show that the CS divergence enables a tighter generalization bound for UDA than the popular KL divergence. Unlike the classic generalization error bound~\citep{ben2010theory}, which only applies to the binary classification setup and is hard to optimize in practice, our bound applies to general domain adaptation tasks, including multi-class classification and regression.
    \item We provide a simple, non-parametric approach to estimating the CS divergence for both $p(\mathbf{z})$ and $p(y|\mathbf{z})$ between source and target domains, without relying on any distributional assumptions.
    %we show that the CS divergence for both $p(\mathbf{z})$ and $p(y|\mathbf{z})$ between source and target domains can be elegantly estimated with closed-form expressions in a non-parametric way without any parametric distributional assumption;
    %\item We integrate CS divergence into a popular bi-classifier adversarial training framework~\cite{saito2018maximum} and achieve competitive performance. We also illustrate how the CS divergence can be smoothly integrated as a flexible plug-in module to improve other domain adaptation methods.
    \item We show that the CS divergence can be conveniently used in both distance-metric-based and adversarial-training-based UDA frameworks. It can also be integrated as a flexible plug-in module to improve modern UDA approaches.
\end{enumerate}
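
To give a rough sense of what a non-parametric estimate of the marginal term can look like, the following sketch plugs kernel density estimates into the CS divergence. It assumes a Gaussian kernel $\kappa$ whose width is a free parameter, source samples $\{\mathbf{z}_i^s\}_{i=1}^{M}$, and target samples $\{\mathbf{z}_j^t\}_{j=1}^{N}$; this notation is introduced here for illustration only, and the conditional term for $p(y|\mathbf{z})$ is not covered by the sketch:
\[
\widehat{D}_{\mathrm{CS}}
= \log\left( \frac{1}{M^{2}} \sum_{i,j=1}^{M} \kappa(\mathbf{z}_i^s, \mathbf{z}_j^s) \right)
+ \log\left( \frac{1}{N^{2}} \sum_{i,j=1}^{N} \kappa(\mathbf{z}_i^t, \mathbf{z}_j^t) \right)
- 2 \log\left( \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \kappa(\mathbf{z}_i^s, \mathbf{z}_j^t) \right).
\]
The expression depends only on pairwise kernel evaluations between samples, so no parametric form for either distribution is needed.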

%\textcolor{red}{Mark: check the consistency of the symbols used.}
% In this paper, we revisit the domain-invariant feature representation learning methods. Most of existing methods assume that the marginal distribution P(X) changes while the conditional distribution P(Y |X) stays stable across domains. Therefore, significant effort has been made in learning a feature representation F(X) that has invariant P(F(X)), either by traditional moment matching [25] or modern adversarial training [15, 14]. To ensure the universality of F(X) and also make it discriminative, a joint classification model is trained on all the source domains and can be used for prediction in new datasets. However, the stability of P(Y |X) is often violated in real applications, leading to sub-optimal solutions. Li et al. [14] proposed to learn invariant class conditional distribution (P(F(X)jY )) by doing adversarial training for each class. However, the method becomes less effective as the number of classes increases.
%However, such methods either consider matching the marginal distribution, $D(P(Z_S) | P(Z_T))$ or matching the conditional distribution $D(P(Y_X|Z_S) | P(Y|TZ_T))$ via $D(P(Z_S|Y_S) | P(Z_T|Y_T))$. Directly matching the conditional distribution is still a challenging problem. Due to the nature of domain adaptation, the classifier adaptation cannot be easily ignored. 


% As neural network architectures continue to advance (He et al., 2016; Vaswani et al., 2017), machine learning algorithms have achieved unprecedented performance in various tasks such as object classification, object detection, and natural language processing. However, these models predominantly focus on independent and identically distributed (i.i.d.) data points, an assumption that often fails in real-world scenarios. When the i.i.d. assumption is violated and the target domain's distribution differs from the source domain, a learner trained on the source data using empirical risk minimization may struggle during testing, as it does not account for the distribution shift. 




