% \begin{itemize}
%     \item UDA related work, mainly MMD, and adversarial training
%     \item CS related work
% \end{itemize}

\paragraph{Domain Adaptation} 
Prior work has predominantly focused on aligning the marginal distribution $p(\mathbf{z})$, either by minimizing a valid distance metric or through adversarial training. 

The most popular distance metric is the Maximum Mean Discrepancy (MMD)~\citep{gretton2012kernel}, which has been widely used in domain adaptation tasks~\citep{pan2010domain,ding2018graph,long2015learning}. Similar to MMD, CORAL~\citep{sun2016deep} matches the first two moments of the distributions. Other distance metrics, such as the Wasserstein distance~\citep{shen2018wasserstein}, manifold matching~\citep{wang2018visual}, optimal transport~\citep{nicolas2017ot}, and margin disparity discrepancy (MDD)~\citep{zhang2019bridging}, have also been used to match the marginal distributions. Another line of methods uses adversarial training~\citep{ganin2016domain,saito2018maximum} to match $p(\mathbf{z})$ through min-max optimization. 
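As a concrete instance of marginal matching, the squared MMD between the source and target marginals $p^s(\mathbf{z})$ and $p^t(\mathbf{z})$ can be written in terms of kernel expectations. The kernel symbol $\kappa$ and the primed variables below are notation assumed here for illustration only:
\begin{equation}
\text{MMD}^2(p^s,p^t) = \mathbb{E}_{\mathbf{z},\mathbf{z}'\sim p^s}\left[\kappa(\mathbf{z},\mathbf{z}')\right] + \mathbb{E}_{\mathbf{z},\mathbf{z}'\sim p^t}\left[\kappa(\mathbf{z},\mathbf{z}')\right] - 2\,\mathbb{E}_{\mathbf{z}\sim p^s,\,\mathbf{z}'\sim p^t}\left[\kappa(\mathbf{z},\mathbf{z}')\right].
\end{equation}
This quantity depends only on $p^s(\mathbf{z})$ and $p^t(\mathbf{z})$ and carries no information about the labels $y$.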

However, such methods only consider matching marginal distributions. 
%Gretton~\textit{et al.}~\cite{gretton2012kernel} uses Mean Maximum Discrepancy (MMD). CORAL~\cite{sun2016deep} matches the two moments of the distribution which is not sufficient to align more complex distributions. Other distance metrics, such as Wasserstein distance~\cite{shen2018wasserstein}, manifold matching~\textit{et al.}~\cite{wang2018visual}, optimal transport~\cite{nicolas2017ot} has also been used in matching the marginal distributions. Some other methods utilize the adversarial training~\cite{ganin2016domain,saito2018maximum} for matching $p(z)$ in a minimax training strategy. However, such methods only consider matching the marginal distribution. 
Aligning the conditional distributions of the source ($p^s(y|\mathbf{z})$) and target ($p^t(y|\mathbf{z})$) domains is considerably more challenging, because $\mathbf{z}$ is continuous and high-dimensional and the ground-truth label $y$ in the target domain is unknown in the UDA setting. To address this issue, several approaches have been developed, such as class-conditional MMD~\citep{zhang2020discriminative,ge2023unsupervised} and the conditional kernel Bures (CKB) metric~\citep{zhang2020discriminative}. It is important to note that the term ``conditional'' here refers to matching $p(\mathbf{z}|y) = \sum_i p(y=c_i) p(\mathbf{z}|y=c_i)$. This formulation has two major limitations: 1) it implicitly assumes that $p(y)$ is invariant across domains (see Section~\ref{sec:bound} for a detailed discussion); 2) it may scale poorly when the number of classes is large, e.g., more than $1{,}000$ classes, as is common in vision tasks. Optimal transport has also been used to match the joint distribution $p(\mathbf{z},y)$~\citep{courty2017joint, damodaran2018deepjdot,fatras2021unbalanced}, where the transportation cost is typically a weighted combination of costs in the feature and label spaces.
In contrast, we match $p(\mathbf{z},y)$ by explicitly modeling both $p(\mathbf{z})$ and $p(y|\mathbf{z})$, following the decomposition $p(\mathbf{z},y)=p(y|\mathbf{z})p(\mathbf{z})$. 


% Several attempts have been made to solve this issue. Zhang~\textit{et al.}~\cite{zhang2020discriminative} uses conditional MMD to match $p(\mathbf{z}|y)$, and Luo~\textit{et al.}~\cite{luo2021conditional} introduces kernel Brues, a new distance metric with Gaussian assumption to match $p(\mathbf{z}|y)$. It is important to note that the term ``conditional'' here refers to the ``class conditional distribution'', i.e., $p(\mathbf{z}|y) = \sum p(y=c_i) p(\mathbf{z}|y=c_i)$.

%joint distribution alignment~\cite{long2013transfer}.
% joint Adaptation Network (JAN) [23] builds a joint distribution alignment model via the features from different hidden layers. Conditional Domain Adversarial Network (CDAN) [22] extends the Domain Adversarial Neural Network (DANN) [9] by exploring a multilinear map to describe the conditional variables in adversarial training

% However, the analyses in these studies predominantly focus on binary classification problems, which limits their applicability. Furthermore, these bounds are based on the $\mathcal{H}$-divergence, which is hard to measure.

%\paragraph{Generalization Bound for Distribution Shift} 

% However, \citep{zhao2019learning} and \citep{nguyen2021domain} argue that the alignment of solely the marginal distribution $p(\mathbf{z})$ is insufficient. One has to consider the conditional relation $p(y|\mathbf{z})$ when dealing with domain shift problems, which inspires our paper.

% \citep{cortes2019adaptation} use discrepancy minimization algorithm and solve a semi-definite programming (SDP) problem.


%\wy{In addition, Y-disc~\citep{medina2015learning} shows a tighter bound than KL. However, it has not been used in the deep learning context, which is beyond our scope. }
 % Acuna~\textit{et al.}
 
%that aligning the marginal distribution $p(\mathbf{z})$ is insufficient, and alignment should also consider the conditional distribution $p(y|\mathbf{z})$ for domain shift problems, %which served as an inspiration for our work.

% However, such bounds are mainly based on the assumption of aligning the marginal distribution. 
\paragraph{Generalization Error Bound}  A tight generalization error bound coupled with a valid discrepancy measure plays a fundamental role in designing modern UDA approaches. Early studies explored generalization bounds for UDA in binary classification with the aid of the $\mathcal{H}\triangle\mathcal{H}$-divergence~\citep{ben2010theory,mansour2009domain}.
Later, \citet{cortes2011domain} extended the result to the regression setting, and \citet{medina2015learning,mohri2012new} provided a tighter bound for online learning by introducing the $\mathcal{Y}$-discrepancy. 
\citet{cortes2019adaptation} use a discrepancy minimization algorithm and solve a semi-definite programming (SDP) problem.
Recently, \citet{acuna2021f} refined the previous bounds and generalized them to the multi-class classification setting with the $f$-divergence, whereas \citet{richard2021unsupervised} consider multi-source domain adaptation for regression with the hypothesis-discrepancy. 
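For reference, the classical bound of \citet{ben2010theory} for a hypothesis $h$ in a class $\mathcal{H}$ takes the following form. The symbols $\epsilon_s$, $\epsilon_t$, $\mathcal{D}_s$, $\mathcal{D}_t$, and $\lambda^{*}$ are notation assumed here for illustration and may differ from that used in Section~\ref{sec:bound}:
\begin{equation}
\epsilon_t(h) \leq \epsilon_s(h) + \frac{1}{2} d_{\mathcal{H}\triangle\mathcal{H}}(\mathcal{D}_s,\mathcal{D}_t) + \lambda^{*}, \qquad \lambda^{*} = \min_{h'\in\mathcal{H}} \left[\epsilon_s(h') + \epsilon_t(h')\right],
\end{equation}
where $\epsilon_s$ and $\epsilon_t$ denote the source and target risks, $\mathcal{D}_s$ and $\mathcal{D}_t$ the source and target input distributions, and $\lambda^{*}$ the combined risk of the ideal joint hypothesis. The target risk is thus controlled by the source risk, a discrepancy between the two domains, and a term that is small only when a single hypothesis performs well on both domains.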

\paragraph{Cauchy-Schwarz Divergence} 
%The CS divergence originates from the signal processing community in $00$s, which has received re-emerging attention in recent deep learning applications such as representation learning~\citep{tran2022cauchy} and sequential decision making~\citep{yu2023conditional}. 
Motivated by the well-known Cauchy-Schwarz (CS) inequality for square-integrable functions:
%\vspace{-0.5cm}
\begin{equation} 
\left(\int p(\mathbf{x})q(\mathbf{x})d\mathbf{x} \right)^2 \leq \int p(\mathbf{x})^2d\mathbf{x} \int q(\mathbf{x})^2d\mathbf{x},
%\vspace{-0.5cm}
\label{eq.cs_inequ}
\end{equation}
with equality if and only if $p(\mathbf{x})$ and $q(\mathbf{x})$ are linearly dependent, the CS divergence~\citep{principe2000information,principe2000learning} defines the distance between two probability density functions by measuring the gap between the left-hand and right-hand sides of Eq.~(\ref{eq.cs_inequ}) using the logarithm of their ratio:
%\vspace{-0.5cm}
\begin{equation} 
D_{\text{CS}}(p;q)=-\log \Bigg(\frac{(\int p(\mathbf{x})q(\mathbf{x})d\mathbf{x})^2}{\int p(\mathbf{x})^2d\mathbf{x} \int q(\mathbf{x})^2d\mathbf{x}} \Bigg).
\label{eq.cs_divergence_main}
\end{equation}
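Expanding the logarithm in Eq.~(\ref{eq.cs_divergence_main}) gives an equivalent three-term form:
\begin{equation}
D_{\text{CS}}(p;q) = -2\log \left(\int p(\mathbf{x})q(\mathbf{x})d\mathbf{x}\right) + \log \left(\int p(\mathbf{x})^2d\mathbf{x}\right) + \log \left(\int q(\mathbf{x})^2d\mathbf{x}\right).
\end{equation}
By Eq.~(\ref{eq.cs_inequ}), $D_{\text{CS}}(p;q)\geq 0$; it is symmetric in $p$ and $q$; and, since both densities integrate to one, it vanishes if and only if $p=q$ almost everywhere.

As an illustrative sketch of how this form can be evaluated from data (the samples, kernel, and bandwidth below are notation assumed here for illustration only), suppose we observe samples $\{\mathbf{x}_i^p\}_{i=1}^{N}$ from $p$ and $\{\mathbf{x}_j^q\}_{j=1}^{M}$ from $q$, and estimate both densities by kernel density estimation with a Gaussian kernel $G_{\sigma}$ of width $\sigma$. Because the convolution of two Gaussians of width $\sigma$ is a Gaussian of width $\sqrt{2}\sigma$, each integral reduces to a double sum over samples, which yields the plug-in estimate
\begin{equation}
\begin{aligned}
\widehat{D}_{\text{CS}}(p;q) = & -2\log \left(\frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M} G_{\sqrt{2}\sigma}(\mathbf{x}_i^p-\mathbf{x}_j^q)\right) \\
& + \log \left(\frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sqrt{2}\sigma}(\mathbf{x}_i^p-\mathbf{x}_j^p)\right)
+ \log \left(\frac{1}{M^2}\sum_{i=1}^{M}\sum_{j=1}^{M} G_{\sqrt{2}\sigma}(\mathbf{x}_i^q-\mathbf{x}_j^q)\right).
\end{aligned}
\end{equation}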

% Moreover, it is a symmetric metric with $0\leq D_{CS} < \infty$

%The CS divergence is symmetric and has closed-form expression for mixture-of-Gaussians (MoG)~\cite{kampa2011closed}, both properties do not hold for the popular KL divergence. 
%\vspace{-0.5cm}
Eq.~(\ref{eq.cs_inequ}) also applies to two conditional distributions $p(y|\mathbf{x})$ and $q(y|\mathbf{x})$. The resulting conditional Cauchy-Schwarz (CCS) divergence is defined as~\citep{yu2023conditional}:
\begin{equation} 
\begin{aligned}
& D_{\text{CS}}(p(y|\mathbf{x});q(y|\mathbf{x})) =  -2\log \left(\iint_{\mathcal{X}, \mathcal{Y}} p(y|\mathbf{x})q(y|\mathbf{x})d\mathbf{x}dy \right)  \\
& + \log \left(\iint_{\mathcal{X}, \mathcal{Y}} p^2(y|\mathbf{x})d\mathbf{x}dy \right) + \log \left(\iint_{\mathcal{X}, \mathcal{Y}}  q^2(y|\mathbf{x})d\mathbf{x}dy \right)  \\
 &  = -2\log \left(\iint_{\mathcal{X}, \mathcal{Y}} \frac{p(\mathbf{x},y)q(\mathbf{x},y)}{p(\mathbf{x})q(\mathbf{x})} d\mathbf{x}dy \right)   \\
 & + \log \left(\iint_{\mathcal{X}, \mathcal{Y}}  \frac{p^2(\mathbf{x},y)}{p^2(\mathbf{x})} d\mathbf{x}dy \right)
  + \log \left(\iint_{\mathcal{X}, \mathcal{Y}}  \frac{q^2(\mathbf{x},y)}{q^2(\mathbf{x})} d\mathbf{x}dy \right).
\end{aligned}
\label{eq.ccs_divergence_main}
\end{equation}
Here, the second equality uses $p(y|\mathbf{x}) = p(\mathbf{x},y)/p(\mathbf{x})$ (and likewise for $q$), which expresses the CCS divergence in terms of joint and marginal densities.

So far, owing to its favorable properties (e.g., a closed-form expression for mixtures of Gaussians~\citep{kampa2011closed}), the CS divergence has been successfully applied to deep clustering~\citep{trosten2021reconsidering}, disentangled representation learning~\citep{tran2022cauchy}, and point-set registration~\citep{giraldo2017group}, to name a few. However, several questions remain open. For example, how does the CS divergence relate to MMD or the KL divergence? Can these relations help improve domain adaptation and generalization? How can these notions be used to construct practical domain adaptation models that achieve superior performance? We answer these questions in this paper.

%  hyperspectral image classification~\citep{wei2023multiscale}

%To the best of our knowledge, this is also the first attempt that introduces the notion of CS divergence to the problem of domain adaptation. %\wy{to the best, move to introduction contribution part?}