\subsection{Preliminary Knowledge of UDA}

%In this paper, we focus on the unsupervised domain adaptation (UDA) problem. In UDA, it assumes that $M$ labeled samples $\mathcal{D}^s = {(x_i^s, y_i^s)}_{i=1}^{M}$ from the source domain and $N$ unlabeled samples $\mathcal{D}^t = {(x_j^t)}_{j=1}^{N}$ from the target domain are given. $\mathbf{x}\in\mathcal{R}^d_x$ is the observation samples, and $y\in \{1,...,K\}$ is the label where $d_x$ is the dimension of $x$, and $K$ is the number of classes. Typically, the goal of UDA is to find a hypothesis function $h=f\circ g: \mathcal{X} \rightarrow \mathcal{Y}$ such that the risk on the target domain $R_t(h)$ is minimized. $f$ is the feature extractor that maps the observations into the latent space, $f: \mathcal{X} \rightarrow \mathcal{Z}$, where $\mathbf{z} \in \mathcal{R}^d$ is the feature representation with $d$ dimension. Also, $g: \mathbf{z} \rightarrow y$ is the classifier based on the features. We denote the prediction by $\hat{y}=g(\mathbf{z})$.

In this paper, we focus on the unsupervised domain adaptation (UDA) problem. In UDA, we assume that there are $M$ labeled samples $\mathcal{D}^s = \{(\mathbf{x}_i^s, y_i^s)\}_{i=1}^{M}$ from the source domain and $N$ unlabeled samples $\mathcal{D}^t = \{\mathbf{x}_j^t\}_{j=1}^{N}$ from the target domain. Here, $\mathbf{x}\in\mathbb{R}^{d_x}$ denotes a $d_x$-dimensional observation, and $y\in \{1,\ldots,K\}$ is its label among $K$ classes. Further, we denote the probability density functions of the source and target domains by $p^s$ and $p^t$, respectively. The primary goal of UDA is to find a hypothesis function $h=g\circ f: \mathbf{x} \mapsto y$ such that the risk on the target domain is minimized. Here, $f: \mathbf{x} \mapsto \mathbf{z}$ is the feature extractor that maps the observations into a latent space, where $\mathbf{z} \in \mathbb{R}^{d_z}$ is the $d_z$-dimensional feature representation, and $g: \mathbf{z} \mapsto y$ is the classifier. We denote the prediction by $\hat{y}=g(\mathbf{z})$.

% typically, h = f o g -> z  x\in R^dx -> R^dz, g: z->y 
% x, z , mathbf
% K - number of classes advance 

Most previous methods aim to learn a domain-invariant feature representation $p(\mathbf{z}|\mathbf{x})$ with the feature extractor $f$ under the assumption that $p^s(\mathbf{x}) \neq p^t(\mathbf{x})$, i.e., that the distribution of the target domain is shifted from that of the source domain. Hence, it is intuitive to align the marginal feature distribution $p(\mathbf{z})$ by minimizing a divergence $D(p^s(\mathbf{z});p^t(\mathbf{z}))$. However, this ignores the shift in the conditional distribution $p(y|\mathbf{z})$, which involves the label shift and classifier adaptation.
%Upon deriving the representation $\mathbf{z}$, the classifier $g$ is trained to predict $y$ via an approximated distribution ${p}(y|\mathbf{z})$. Throughout the training process, the representation network $f$ and the classifier $g$ are co-trained on the source domain data. 
%The goal is to effectively generalize the learned models to the target domain by aligning both $p(\mathbf{z}|\mathbf{x})$ and $p(y|\mathbf{z})$ between two domains. 
Note that in UDA, we do not have access to $y$ in the target domain. The common strategy is to use discrete pseudo labels derived from the predictions to match the class-conditional distribution $p(\mathbf{z}|y)$. However, because the CCS divergence can handle continuous variables, we instead use the prediction vector from the classifier, $\hat{y}=g(\mathbf{z})$, as the target label, which leads to $p^t(\hat{y}|\mathbf{z})$. %Our aim is to match both $p(\mathbf{z})$ and $p(y|\mathbf{z})$.

%The risk on the target domain $R_t(h)$ is defined as the expected loss over the target distribution, formally represented as $R_t(h) = \mathbb{E}_{(x, y) \sim p^t(x, y)} [l(h(x), y)]$ where $l(h(x), y)$ is the loss of the prediction $h(x)$ with respect to the true label $y$. 
%The standard representation learning framework seeks to derive a latent representation, denoted as $z$, from the input $x$ using a feature extractor $f$. Ideally, this latent representation $z$ encapsulates information pertinent to the label, and is subsequently employed by the classifier $h$ to predict the label $y$. Importantly, since the source and target domains share the same support set for $x$ and utilize the same representation mapping $p(z|x)$, they also share an identical support set for $z$, which we denote as $Z$. 

%In this paper, we explore matching the marginal distribution $p(z)$ and conditional distribution $p(y|z)$ jointly in the framework of Cauchy-Schwarz divergence. Both CS and CCS divergences are easy to estimate for any given continuous variables. 

% \textcolor{red}{lack a paragraph: to guarantee domain adaptation performance, the most popular is \cite{ben2010theory}, \mathcal{H}-\delta bound. test <= train + H\deltaH(p^s(z), p^t(z) + ), third implicitly assumes the third term is very small...., then refer majority,use L_ce + ||||, then say L_ent is widely used. L_{ent} refer more papers, more 3 papers. CS, CCS not mentioned. \\ 
% first old bound problem, \\ 
% bound 3.3 -> 3.2, refer iclr paper. old bound: does not know tight or loss. third term (usually assume very small), 3, H\delta H - hard to estimate or optimize, so we introduce \\
% add a remark, iclr. different to the old bound. 1. general to multi-class. we also 3 terms, the last two controllable. 2. previous bound not bound, not matching conditional, 3. tighter than iclr.}

% \sj{A tight generalization error bound coupled with a valid discrepancy measure plays a fundamental role in designing modern UDA approaches. Early studies have explored generalization bounds for UDA on binary classification with the aid of $\mathcal{H}\triangle\mathcal{H}$-divergence~\citep{ben2010theory,mansour2009domain}.
% Later, \citep{cortes2011domain} extend the result to regression scenario, \citep{medina2015learning,mohri2012new} provide a tighter bound in on-line learning by introducing the $\mathcal{Y}$-discrepancy.} %

% \wy{\citep{cortes2019adaptation} use discrepancy minimization algorithm and solve a semi-definite programming (SDP) problem.}
% Recently, \citep{acuna2021f} revise the previous bounds and generalize them to a multi-class classification setting with the $f$-divergence, \wy{whereas \citep{richard2021unsupervised} consider multi-source domain adaptation for regression with hypothesis-discrepancy}. 

%\citep{nguyen2021kl} propose a new generalization bound with the KL-divergence.


\begin{comment}
In order to ensure effective domain adaptation, the most widely used theoretical generalization bound is provided by \cite{ben2010theory}, and is predicated on the $\mathcal{H}\triangle\mathcal{H}$-divergence. We proceed by reviewing this bound.% Ben-David~\textit{et al.}~

%Assume we have two ground truth labeling functions, $\varphi_s$ and $\varphi_t$ for source and target domains, respectively, where $\varphi: \mathbf{x} \rightarrow y$. 
Given a hypothesis mapping $h$, the risk in the target domain (i.e., $R_{t}(h)$) can be bounded by:
% \begin{equation} 
% \begin{aligned}
% &R_{t}(h) \leq  R_s(h) + d_{\mathcal{H}\triangle\mathcal{H}}(\mathcal{D}_s, \mathcal{D}_t) + \\
% &\min\{ \mathbb{E}_{\mathcal{D}_s}[|\varphi_s(\mathbf{x}) - \varphi_t(\mathbf{x})|],  \mathbb{E}_{\mathcal{D}_t}[|\varphi_s(\mathbf{x}) - \varphi_t(\mathbf{x})|]\},
% \label{eq.h_bound}
% \end{aligned}
% \end{equation}
\begin{equation} 
\begin{aligned}
&R_{t}(h) \leq  R_s(h) + d_{\mathcal{H}\triangle\mathcal{H}}(\mathcal{D}^s, \mathcal{D}^t) + \lambda_{\mathcal{H}},
\label{eq.h_bound}
\end{aligned}
\end{equation}
where $R_s(h)$ is the empirical risk in the source domain, and $d_{\mathcal{H}\triangle\mathcal{H}}(\mathcal{D}^s, \mathcal{D}^t)$ is the $\mathcal{H}\triangle\mathcal{H}$-divergence to measure the disparity of marginal distributions between two domains. The third term $\lambda_{\mathcal{H}} = \inf_{h\in \mathcal{H}}[R_s(h)+R_t(h)]$ represents to the optimal joint risk that can be achieved by the hypotheses in $\mathcal{H}$. It
measures the conditional misalignment of two true labeling functions. 
There are several issues in Eq.~(\ref{eq.h_bound}): 1) it only applies to binary classification; 2) the third term is intractable and usually assumed to be very small or neglected~\citep{ganin2016domain,richard2021unsupervised}, which is a too strong and unrealistic assumption; 3) $\mathcal{H}\triangle\mathcal{H}$-divergence is hard to estimate. Other well-known bounds following this line of research include \cite{mansour2009domain} and  \cite{zhao2019learning}.
% is too strong and could be problematic in the label shift scenario

\end{comment}



%Hence, we introduce a new tight bound based on the Cauchy-Schwarz divergence. 

% \wy{merge with Eq.~\ref{eq.adv_cls}} In practice, the majority of works use the cross-entropy loss ($L_{\text{CE}}$) for $R_s(h)$ in the source domain. Also, the entropy loss $L_{\text{Ent}}$~\cite{grandvalet2004semi} has been widely used in the domain adaptation task (\cite{long2018conditional, luo2021conditional, du2021cross}) as an additional constraint in the target domain. $L_{\text{CE}}$ and  $L_{\text{Ent}}$ are defined as:
% \begin{equation} 
% L_{\text{CE}} =  \frac{1}{M}\sum^{M}_{i=1} -y^s_{i}\log \hat{y}^s_{i}, \qquad 
% L_{\text{Ent}} = \frac{1}{N} \sum^{N}_{i=1} -\hat{y}^t_{i}\log \hat{y}^t_{i}.
% \label{eq.ce}
% \end{equation}
% \begin{equation} 
% L_{\text{Ent}} = \frac{1}{N} \sum^{N}_{i=1} -\hat{y}^t_{i}\log \hat{y}^t_{i}.
% \label{eq.ent}
% \end{equation}
%Then, the remaining problem is how to estimate the domain discrepancy and the labeling function shift (the second and third term in Eq.~\ref{eq.h_bound}). In this work, we propose to use CS and CCS divergences for this problem. We will elaborate on the estimation of CS and CCS divergences in detail in Sec.~\ref{sec:cs-est}.
%In this work, we propose to use CS divergence $D_{CS}(p^s(\mathbf{z});p^t(\mathbf{z}))$ for matching the marginal distribution, and CCS divergence $D_{CCS}(p^s(\mathbf{\hat{y}}|\mathbf{z});p^t(\mathbf{\hat{y}}|\mathbf{z}))$ for matching the conditional distribution. We will elaborate on the estimation of CS and CCS divergences in detail in Sec.~\ref{sec:cs-est}.


\subsection{Domain Shift Generalization Bound} \label{sec:bound}

%\textcolor{red}{TBD: bound derive}
% bound 3.3 -> 3.2, refer iclr paper. old bound: does not know tight or loss. third term (usually assume very small), 3, H\delta H - hard to estimate or optimize, so we introduce \\
% add a remark, iclr. different to the old bound. 1. general to multi-class. we also 3 terms, the last two controllable. 2. previous bound not bound, not matching conditional, 3. tighter than iclr.}

%We derive the generalization bounds with CS and CCS divergences. 

% which is easier to estimate and can be generalized to multiclass classification compared with $\mathcal{H}$-divergence bound

We proceed by reviewing a recently developed KL-guided bound~\citep{nguyen2021kl}, which applies to general scenarios (including multi-class classification and regression) and makes no assumptions about the labeling mechanism (which can be probabilistic or deterministic). According to \citet{nguyen2021kl}, the loss $l_{\text{test}}$ on the test distribution (i.e., the target domain) satisfies:
\begin{equation} 
\begin{aligned}
& l_{\text{test}} = \mathbb{E}_{p^{t}(\mathbf{x},y)}[-\log \hat{p}(y|\mathbf{x})] \leq \mathbb{E}_{p^{t}(\mathbf{z},y)}[ -\log \hat{p}(y|\mathbf{z})] \\
% &= \mathbb{E}_{p_{t}(\mathbf{x},y)}[-\log \mathbb{E}_{p(\mathbf{z}|\mathbf{x})} [\hat{p}(y|\mathbf{z})]] \\
% & \leq \mathbb{E}_{p_{t}(\mathbf{x},y)}[ \mathbb{E}_{p(\mathbf{z}|\mathbf{x})} [-\log \hat{p}(y|\mathbf{z})]] \\
% & \leq \mathbb{E}_{p_{t}(\mathbf{z},y)}[ -\log \hat{p}(y|\mathbf{z})] \\
& = \int - \log \hat{p}(y|\mathbf{z}) p^s(\mathbf{z},y) d\mathbf{z}dy + \\ 
& \int -\log \hat{p}(y|\mathbf{z}) [p^t(\mathbf{z},y) - p^s(\mathbf{z},y)] d\mathbf{z}dy \\
& = l_{\text{train}} + \int -\log \hat{p}(y|\mathbf{z}) [p^t(\mathbf{z},y) - p^s(\mathbf{z},y)] d\mathbf{z}dy \\
& \leq l_{\text{train}} + \frac{M}{2} \int |p^t(\mathbf{z},y) - p^s(\mathbf{z},y)| d\mathbf{z}dy \\
& \leq l_{\text{train}} + \frac{M}{2} \sqrt{ 2 \int p^t(\mathbf{z},y) \log \frac{p^t(\mathbf{z},y)}{p^s(\mathbf{z},y)} d\mathbf{z}dy } \\
& = l_{\text{train}} + \frac{M}{\sqrt{2}} \sqrt{ D_{\text{KL}} (p^t (\mathbf{z},y); p^s(\mathbf{z},y)) } \\
& = l_{\text{train}} + \frac{M}{\sqrt{2}} \sqrt{ D_{\text{KL}} (p^t (\mathbf{z}); p^s(\mathbf{z}) ) + D_{\text{KL}} (p^t (y|\mathbf{z}); p^s(y|\mathbf{z})) },
\end{aligned}
\label{eq.kl_bound_pre}
\end{equation}
in which $l_{\text{train}}$ is the loss in the source domain. The fourth line assumes that $-\log \hat{p}(y|\mathbf{z})$ is upper bounded by a constant $M$\footnote{In classification, we can enforce this condition easily by augmenting the output softmax of the classifier so that each class probability is always at least $\exp(-M)$~\citep{nguyen2021kl}. For example, if we choose $M = 4$, then $\exp(-M) \approx 0.02$.}. The fifth line uses the well-known Pinsker's inequality~\citep{pinsker1964information}, which states that the total variation (TV) distance $D_{\text{TV}} = \frac{1}{2}\int|p(\mathbf{x})-q(\mathbf{x})|\,d\mathbf{x}$ is upper bounded in terms of the KL divergence $D_{\text{KL}} = \int p(\mathbf{x})\log\left( \frac{p(\mathbf{x})}{q(\mathbf{x})} \right) d\mathbf{x}$ as $D_{\text{TV}} \leq \sqrt{\frac{1}{2} D_{\text{KL}} }$.
The last line follows from the chain rule of the KL divergence.
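
To see where the constant in the fourth line comes from, the following short derivation may help. It is a sketch for the classification case, where $\hat{p}(y|\mathbf{z})\leq 1$ so that $0 \leq -\log \hat{p}(y|\mathbf{z}) \leq M$; the shorthands $\Delta = p^t(\mathbf{z},y) - p^s(\mathbf{z},y)$ and $\Delta_{+} = \max(\Delta, 0)$ are introduced here only for illustration:
\begin{equation*}
\begin{aligned}
\int -\log \hat{p}(y|\mathbf{z})\, \Delta \, d\mathbf{z}dy
& \leq \int -\log \hat{p}(y|\mathbf{z})\, \Delta_{+} \, d\mathbf{z}dy \\
& \leq M \int \Delta_{+} \, d\mathbf{z}dy
= \frac{M}{2} \int |\Delta| \, d\mathbf{z}dy ,
\end{aligned}
\end{equation*}
where the last equality holds because both densities integrate to one, so $\int \Delta \, d\mathbf{z}dy = 0$ and the positive and negative parts of $\Delta$ carry equal mass. As a purely hypothetical numerical illustration, $M=4$ and $D_{\text{KL}}(p^t(\mathbf{z},y);p^s(\mathbf{z},y)) = 0.5$ would give a penalty of $\frac{4}{\sqrt{2}}\sqrt{0.5} = 2$ on top of $l_{\text{train}}$.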

%They have proved that if $-\log \hat{p}(y|\mathbf{z})$ is bounded by a constant $M$, then the test loss is bounded by:

% \begin{equation} 
% \begin{aligned}
% l_{\text{test}} & \leq l_{\text{train}} + \\
% & \frac{M}{2} \sqrt{ D_{\text{KL}} (p^t (\mathbf{z}); p^s(\mathbf{z}) ) + D_{\text{KL}} (p^t (y|\mathbf{z}); p^s(y|\mathbf{z})) }
% \end{aligned}
% \label{eq.kl_bound}
% \end{equation}

%Based on this analysis, we demonstrate a tighter bound based on the CS divergence. 

Referring to the second-to-last line of Eq.~(\ref{eq.kl_bound_pre}), achieving a small test error requires matching the joint distribution $p(\mathbf{z},y)$, rather than focusing solely on the marginal distribution $p(\mathbf{z})$. This result is also consistent with \citep{zhao2019learning} and \citep{nguyen2021domain}. Our paper uses the chain rule $p(\mathbf{z},y)=p(y|\mathbf{z})p(\mathbf{z})$, which calls for aligning both $p(\mathbf{z})$ and $p(y|\mathbf{z})$. By contrast, existing literature (e.g., \citep{ge2023unsupervised,luo2021conditional,zhang2020discriminative}) often employs the alternative decomposition $p(\mathbf{z},y)=p(\mathbf{z}|y)p(y)$ but aligns only the class-conditional distribution $p(\mathbf{z}|y)$, thereby overlooking the impact of a shift in $p(y)$.

% ensuring proximity between $p^s(z,y)$ and $p^t(z,y)$



%We then show how CS divergence can be utilized to theoretically tighten this bound and practically simplify the estimation of associated information-theoretic quantities. 


Before providing a possibly tighter generalization error bound than the above-mentioned KL-guided bound, we first establish the connections between the CS divergence and both the KL divergence and the TV distance.

We begin our analysis under a Gaussian assumption, for which these connections are established in Propositions~\ref{proposition_Gaussian} and \ref{proposition_Gaussian_TV}. Note that the Gaussian assumption on the learned representations (of deep neural networks) is commonly used in vision tasks~\citep{he2015delving,ioffe2015batch}.
%Then, we extend the assumption to the general distributions without Gaussianity in Propositions~\ref{proposition_general} and \ref{proposition_general_TV}. Finally, our generalization bound is given for general distributions.

%But both propositions could be extended to general distributions without Gaussianity (under mild conditions). We refer interested readers to Sections \ref{subsec:extension-prop1} and \ref{subsec:extension-prop2} in the Appendix. 



\begin{proposition}\label{proposition_Gaussian}
For any $d$-variate Gaussian distributions $p\sim \mathcal{N}(\mu_1,\Sigma_1)$ and $q\sim \mathcal{N}(\mu_2,\Sigma_2)$, where $\Sigma_1$ and $\Sigma_2$ are positive definite, we have:
\begin{equation}\label{eq:gaussian}
D_{\mathrm{CS}}(p;q) \leq \min \big\{ D_{\mathrm{KL}}(p;q), D_{\mathrm{KL}}(q;p)\big\}.
\end{equation}
\end{proposition}

\begin{proof}
All proofs of this paper are available in Section~\ref{sec:proofs} of the Appendix.
%Sections \ref{subsec:proof_prop1} and \ref{subsec:proof-prop2} of the Appendix. 
%Sections 1 and 2 of the supplementary material.
\end{proof}
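
As an illustrative sanity check of Proposition~\ref{proposition_Gaussian} (the specific numbers are chosen here only for illustration), consider the univariate case $p=\mathcal{N}(0,\sigma_1^2)$ and $q=\mathcal{N}(0,\sigma_2^2)$ with $\lambda=\sigma_1^2/\sigma_2^2$. We assume the CS divergence takes the form $D_{\mathrm{CS}}(p;q) = -\log\big( (\int pq)^2 / (\int p^2 \int q^2)\big)$, which is the form matched by the estimator in Eq.~(\ref{eq.cs_est}). Using $\int pq = \left(2\pi(\sigma_1^2+\sigma_2^2)\right)^{-1/2}$, $\int p^2 = (4\pi\sigma_1^2)^{-1/2}$ and $\int q^2 = (4\pi\sigma_2^2)^{-1/2}$, one obtains
\begin{equation*}
\begin{aligned}
D_{\mathrm{CS}}(p;q) &= \log\left( \frac{\sigma_1^2+\sigma_2^2}{2\sigma_1\sigma_2} \right) = \frac{1}{2}\log\left( \frac{2+\lambda+1/\lambda}{4} \right), \\
D_{\mathrm{KL}}(p;q) &= \frac{1}{2}\left( \lambda - 1 - \log\lambda \right), \qquad
D_{\mathrm{KL}}(q;p) = \frac{1}{2}\left( \frac{1}{\lambda} - 1 + \log\lambda \right).
\end{aligned}
\end{equation*}
For $\lambda=4$, this gives $D_{\mathrm{CS}}(p;q) = \log(5/4) \approx 0.223$, $D_{\mathrm{KL}}(p;q) \approx 0.807$ and $D_{\mathrm{KL}}(q;p) \approx 0.318$, so that $D_{\mathrm{CS}}(p;q)$ indeed lies below both KL divergences in this example.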











\begin{proposition}\label{proposition_Gaussian_TV}
    Let $\Phi$ be the cumulative distribution function of a standard normal distribution. Let $p\sim \mathcal{N}(\mu_1, \Sigma_1)$ and $q\sim \mathcal{N}(\mu_2, \Sigma_2)$ be any $d$-dimensional Gaussian distributions. We have:
\begin{equation}
 D_{\mathrm{TV}} \leq \sqrt{D_{\mathrm{CS}} },
\end{equation}
if one of the following conditions is satisfied:
\begin{enumerate}
\item $\Sigma_1=\Sigma_2=\Sigma$ and $\frac{1}{2}\sqrt{ \delta^{\top} \Sigma^{-1} \delta } \geq 2\Phi \left(\|\Sigma^{-1/2}\delta\|_2/2\right)-1$, where $\delta = \mu_1 - \mu_2$;
\item $\sum_{i=1}^d \log\left( \frac{2+\lambda_i + 1/\lambda_i}{4} \right) \geq 4 $, where $\lambda_i$ is the $i$-th eigenvalue of $\Sigma_2^{-1}\Sigma_1$.
\end{enumerate}
\end{proposition}

%\begin{remark}
The conditions in Proposition~\ref{proposition_Gaussian_TV} are easily met, especially when $p$ and $q$ are not sufficiently similar and the dimension $d$ is large. For example, when $d=1024$ as in our ResNet50 feature extractor, it suffices to require $\frac{2+\lambda_i + 1/\lambda_i}{4} \geq \exp(4/d) \approx 1.004$ for every $i$, since each summand is then at least $4/d$, which implies $\sum_{i=1}^d \log\left( \frac{2+\lambda_i + 1/\lambda_i}{4} \right) \geq 4 $.
Note that both the CS and KL divergences are unbounded, whereas the TV distance is bounded above by $1$.



In fact, the above connections can be extended to general distributions without assuming Gaussianity, as demonstrated in Propositions~\ref{proposition_general} and \ref{proposition_general_TV}, respectively.



\begin{comment}
Accordingly, we propose the following Hypothesis.
\begin{hypothesis}\label{hypothesis}
For arbitrary distributions $p$ and $q$, we have:
\begin{equation}\label{eq:approx_relation}
D_{\text{TV}}(p;q) \leq \sqrt{D_{\text{CS}}(p;q)} \quad \text{and} \quad D_{\text{CS}}(p;q) \leq D_{\text{KL}}(p;q),
\end{equation}
\end{hypothesis}
\end{comment}


%\vspace{-0.5cm}
\begin{proposition}\label{proposition_general}
For any density functions $p:\,\mathbb{R}^d\to \mathbb{R}_{\geq 0}$ and $q:\,\mathbb{R}^d\to \mathbb{R}_{\geq 0}$, let $K$ be an integration domain over which $p$ and $q$ are Riemann integrable.
Suppose $|K|<\infty$, where $|K|$ denotes the volume. Then
% For any density functions $p$ and $q$, let $|K|$ denote the length of the integral's integration range $K$ with $|K| \gg 0$, we have:
\begin{equation}
C_1 \left[D_{\mathrm{CS}}(p;q) - \log{|K|} + 2\log C_2 \right] \leq D_{\mathrm{KL}}(p;q),
\end{equation}
where $C_1=\int_K p(\mathbf{x})\,\der \mathbf{x}$, $C_2 = { C_1 }{ \left(\int_K p^2(\mathbf{x})\,\der \mathbf{x} \int_K q^2(\mathbf{x})\,\der \mathbf{x} \right)^{-1/4} }$.  Clearly, for $K$ such that $|K\cap S| \gg 0$, where $S=\big\{\mathbf{x}:\, p(\mathbf{x})>0\big\}$, one can have $C_1 \approx 1$.
% $\approx \dfrac{ 1 }{ \left(\int_K p^2(\mathbf{x})\,\der \mathbf{x} \int_K q^2(\mathbf{x})\,\der \mathbf{x} \right)^{1/4} }$
\end{proposition}


\begin{figure} [htbp]
%\hfill
\centering 
 \includegraphics[width=0.5\textwidth]{Figures/TV_threshold_illustration.pdf}
 \caption{A graphical illustration of the sets $\mathcal{A}_{\epsilon}$ and $\mathcal{A}_{\epsilon}^{\complement}$ defined in Proposition \ref{proposition_general_TV}.}
\label{fig:TV_threshold_main}
\end{figure}

%\vspace{-0.5cm}
\begin{proposition}\label{proposition_general_TV}
For any density functions $p$ and $q$, and any $\epsilon>0$, let $\calA_{\epsilon}=\left\{\mathbf{x}:\, p(\mathbf{x})\leq \epsilon\right\}\cup \left\{\mathbf{x}:\, q(\mathbf{x})\leq \epsilon\right\}$ and $\calA_{\epsilon}^{\complement}$ be its complement (see Fig.~\ref{fig:TV_threshold_main}). Moreover, define $T_{\calA_\epsilon^{\complement}}=\sup\left\{p(\mathbf{x})q(\mathbf{x}),\, \mathbf{x}\in\calA_\epsilon^{\complement}\right\}$
and $\left|\calA_\epsilon^{\complement}\right|$ to denote the ``length'' of the set $\calA_\epsilon^{\complement}$ (strictly speaking, the Lebesgue measure of the set $\calA_\epsilon^{\complement}$). Suppose there exists an $\epsilon>0$ such that $T_{\calA_\epsilon^{\complement}}\left|\calA_\epsilon^{\complement}\right|<\infty$ and $C_3 = \int p^2(\mathbf{x})\, \der \mathbf{x} \int q^2(\mathbf{x})\, \der \mathbf{x} \geq \exp(2) \left(2\epsilon+T_{\calA_\epsilon^{\complement}}\left|\calA_\epsilon^{\complement}\right|\right)^2$, then 
\begin{equation}
D_{\mathrm{TV}}(p;q)\leq \sqrt{D_{\mathrm{CS}}(p;q)}.
\end{equation}
\end{proposition}


%\begin{remark}
%The above analysis imply that, in practice, it is very likely that:
In our context, where $p$ and $q$ may differ substantially, using $D_{\text{TV}}$ can be too restrictive to yield meaningful results. This is because $D_{\text{TV}}$ measures the largest possible difference between $p$ and $q$, and it rapidly saturates at its upper bound of $1$, particularly when the location parameters of $p$ and $q$ differ. In this case, one is no longer able to distinguish between sufficiently distinct pairs $(p,q)$ by their distance. Furthermore,
% Interestingly, our above analysis suggests:
% \begin{equation}\label{eq:divergence_relation_main}
%     D_{\text{TV}} \lesssim \sqrt{D_{\text{CS}} } \quad \text{and} \quad D_{\text{CS}} \lesssim D_{\text{KL}},
% \end{equation}
% in which $p$ and $q$ need not be Gaussian, and the symbol $\lesssim$ denotes ``less than or similar to".
% Following Eq.~(\ref{eq.kl_bound_pre}), for some $M\geq 0$, the generalization error can also be bounded in the following way:
% \begin{equation} 
% \begin{aligned}
% l_{\text{test}} & \leq l_{\text{train}} + M  D_{\text{TV}} (p^t (\mathbf{z},y); p^s(\mathbf{z},y)) \\
% & \lesssim l_{\text{train}} + M \sqrt{ D_{\text{CS}} (p^t (\mathbf{z},y); p^s(\mathbf{z},y)) } \\
% & \lesssim l_{\text{train}} + M \sqrt{ D_{\text{KL}} (p^t (\mathbf{z},y); p^s(\mathbf{z},y)) }.
% \end{aligned}
% \label{eq.bounds}
% \end{equation}
% %It is well-known that  KL-divergence can be numerically sensitive. While both measures $D_{CS}$ and $D_{KL}$ are unbounded, Eq.~(\ref{eq:ourbound}) implies that minimizing $D_{CS}$ could potentially yield more stable performance, as $D_{CS}$ is upper bounded by $D_{KL}$.
% \end{remark}
% \textcolor{blue}{
% \begin{remark}
as shown in Proposition~\ref{proposition_general}, the inequality $C_1 \left[D_{\mathrm{CS}}(p;q) - \log{|K|} + 2\log C_2 \right] \leq D_{\mathrm{KL}}(p;q)$ holds for any density functions $p$ and $q$, where $C_1 =  \int_K p(\mathbf{x})\,\mathrm{d} \mathbf{x}>0$ and $C_2 = C_1  \left(\int_K p^2(\mathbf{x})\,\mathrm{d} \mathbf{x} \int_K q^2(\mathbf{x})\,\mathrm{d} \mathbf{x} \right)^{-1/4} $. Note that $C_1$ and $C_2$ are not conditions, but two constant values that depend on the distributions themselves. Furthermore, according to Proposition~\ref{proposition_general_TV}, if $p$ and $q$ do not sufficiently overlap, as ensured by the condition $\int p^2(\mathbf{x})\, \mathrm{d} \mathbf{x} \int q^2(\mathbf{x})\, \mathrm{d} \mathbf{x} \geq \exp(2) \left(2\epsilon+T_{\calA_\epsilon^{\complement}}\left|\calA_\epsilon^{\complement}\right|\right)^2$ for some $\epsilon>0$, then $D_{\mathrm{TV}}\leq \sqrt{D_{\mathrm{CS}}}$. 

By combining these results, we have the following general error bound:
\begin{equation}
\begin{aligned}
\label{eq:general_bound}
l_{\text{test}} 
& \leq l_{\text{train}} + M  D_{\text{TV}} (p^t (\mathbf{z},y); p^s(\mathbf{z},y)) \\
& \leq l_{\text{train}} + M \sqrt{ D_{\text{CS}} (p^t (\mathbf{z},y); p^s(\mathbf{z},y)) } \\
& \leq l_{\text{train}} + \\
&M \sqrt{ C_1^{-1} D_{\text{KL}} (p^t (\mathbf{z},y); p^s(\mathbf{z},y)) + \log{|K|} - 2\log C_2}\, ,
\end{aligned}
\end{equation}
if $\int p^2(\mathbf{x})\, \mathrm{d} \mathbf{x} \int q^2(\mathbf{x})\, \mathrm{d} \mathbf{x} \geq \exp(2) \left(2\epsilon+T_{\calA_\epsilon^{\complement}}\left|\calA_\epsilon^{\complement}\right|\right)^2$ for some $\epsilon>0$.
This condition is readily satisfied in practice; see Remark~\ref{remark:exp-prop6} in the Appendix for further discussion.
%\end{remark}


 
Similar to the TV distance and most $f$-divergence measures~\citep{collet2019exact}, the CS divergence does not satisfy the chain rule, meaning that the joint CS divergence cannot be expressed as the sum of the marginal and conditional divergences. However, in practice, one can nevertheless control the joint divergence by minimizing the marginal and conditional counterparts separately. For simplicity, in this paper, we aim to minimize:
\begin{equation}\label{eq:final_objective}
    l_{\text{train}} + M \sqrt{ D_{\text{CS}} (p^t (\mathbf{z}); p^s(\mathbf{z}) ) + D_{\text{CS}} (p^t (y|\mathbf{z}); p^s(y|\mathbf{z})) }.
\end{equation}
%It makes sense to match both the marginal distribution $p(\mathbf{z})$ and the conditional distribution $p(y|\mathbf{z})$, separately. 
%This idea aligns with that of~\cite{nguyen2021domain} and~\cite{zhao2019learning}, but relies on the CS divergence, which is also tighter than the KL divergence counterpart.



%\begin{remark}\label{compare-kl}
%Same to Eq.~(\ref{eq:final_objective}), 
Although the KL-guided bound in Eq.~(\ref{eq.kl_bound_pre}) implies the necessity of minimizing the conditional divergence of $p(y|\mathbf{z})$, \citet{nguyen2021kl} neglect this term (owing to the difficulty of estimating it) and assume that $p^t (y|\mathbf{z})$ and $p^s (y|\mathbf{z})$ are sufficiently close, which may not hold in practice~\citep{zhao2020domain}.
%In Section~\ref{sec:cs-est}, we illustrate that the use of CS divergence makes the alignment of both marginal and conditional distributions tractable.
%\end{remark}

\subsection{Estimation of Cauchy-Schwarz Divergence}
\label{sec:cs-est}
%\textcolor{red}{TBD: Remarks on theoretical analysis with MMD}

Suppose we have $M$ labeled samples $\{\mathbf{x}^s_i,y^s_i\}_{i=1}^M$ from the source domain and $N$ unlabeled samples $\{\mathbf{x}^t_i\}_{i=1}^N$ from the target domain, and denote the predicted class probabilities for $\{\mathbf{x}^s_i\}_{i=1}^M$ and $\{\mathbf{x}^t_i\}_{i=1}^N$ by $\{\hat{y}_i^s\}_{i=1}^M$ and $\{\hat{y}_i^t\}_{i=1}^N$, respectively. The following two propositions provide the empirical estimators of $D_{\text{CS}}(p^s(\mathbf{z});p^t(\mathbf{z}))$ and $D_{\text{CCS}}(p^s(y|\mathbf{z});p^t(y|\mathbf{z}))$. Moreover, in the two accompanying remarks, we discuss the relationship and differences between the (conditional) CS divergence and the (conditional) MMD.

%Assume we have $M$ samples from the source domain and $N$ samples from the target domain. 

%The CS divergence can be easily estimated by using the kernel density estimator (KDE)~\cite{parzen1962estimation} with Gaussian kernel $G_{\sigma}(\cdot) = \exp (-\frac{||\cdot||^2}{2\sigma^2})$:

\begin{proposition}
[Empirical Estimator of $D_{\text{CS}}(p^s(\mathbf{z});p^t(\mathbf{z}))$~\citep{jenssen2006cauchy}]
Given extracted features from two domains $\{\mathbf{z}_i^s\}_{i=1}^M$ and 
$\{\mathbf{z}_i^t\}_{i=1}^N$, the empirical estimator of $D_{\text{CS}}(p^s(\mathbf{z});p^t(\mathbf{z}))$ is given by:
% \begin{equation}
% \label{eq.cs_est}
% \begin{aligned}
% & \widehat{D}_{\text{CS}} (p^s(\mathbf{z});p^t(\mathbf{z})) = \log\left(\frac{1}{M^2}\sum_{i,j=1}^M \kappa({\bf z}_i^s,{\bf z}_j^s)\right) +  \\
% & \log\left(\frac{1}{N^2}\sum_{i,j=1}^N \kappa({\bf z}_i^t,{\bf z}_j^t)\right)
% -2 \log\left(\frac{1}{MN}\sum_{i=1}^M \sum_{j=1}^N \kappa({\bf z}_i^s,{\bf z}_j^t)\right).
% \end{aligned}
% \end{equation}
\begin{equation}
\label{eq.cs_est}
\begin{aligned}
& \widehat{D}_{\text{CS}} (p^s(\mathbf{z});p^t(\mathbf{z})) = \log\left(\frac{1}{M^2}\sum_{i,j=1}^M \kappa({\bf z}_i^s,{\bf z}_j^s)\right) +  \\ & \log\left(\frac{1}{N^2}\sum_{i,j=1}^N \kappa({\bf z}_i^t,{\bf z}_j^t)\right)
-2 \log\left(\frac{1}{MN}\sum_{i=1}^M \sum_{j=1}^N \kappa({\bf z}_i^s,{\bf z}_j^t)\right),
\end{aligned}
\end{equation}
where $\kappa$ is a kernel function such as the Gaussian kernel $\kappa_{\sigma}(\mathbf{z},\mathbf{z}')=\exp\left(-\|\mathbf{z}-\mathbf{z}'\|_2^2/(2\sigma^2)\right)$.
\end{proposition}

\begin{remark}\label{remark_CS_MMD}
The CS divergence is closely related to the MMD~\citep{gretton2012kernel}. In fact, the empirical estimator of the ``biased'' MMD can be expressed as:
\begin{equation}\label{eq:mmd_est_main}
\begin{split}
& \widehat{\text{MMD}}^2(p^s;p^t) = \frac{1}{M^2}\sum_{i,j=1}^M \kappa({\bf z}_i^s,{\bf z}_j^s) \\
& + \frac{1}{N^2}\sum_{i,j=1}^N \kappa({\bf z}_i^t,{\bf z}_j^t) - \frac{2}{MN}\sum_{i=1}^M \sum_{j=1}^N \kappa({\bf z}_i^s,{\bf z}_j^t).
\end{split}
\end{equation}
Comparing Eq.~(\ref{eq.cs_est}) with Eq.~(\ref{eq:mmd_est_main}), we observe that the CS divergence estimator applies a logarithm to each term of the MMD estimator. Both estimators capture the within-distribution similarity minus the cross-distribution similarity, similar to the energy distance~\citep{sejdinovic2013equivalence}.
\end{remark}
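
To illustrate the difference between the two estimators, consider the simplest possible case $M=N=1$ with the Gaussian kernel $\kappa_{\sigma}$ (a toy setting chosen here only for illustration, not one used in our experiments). Since $\kappa_{\sigma}(\mathbf{z},\mathbf{z})=1$, Eq.~(\ref{eq.cs_est}) and Eq.~(\ref{eq:mmd_est_main}) reduce to
\begin{equation*}
\begin{aligned}
\widehat{D}_{\text{CS}} &= -2\log \kappa_{\sigma}(\mathbf{z}^s,\mathbf{z}^t) = \frac{\|\mathbf{z}^s-\mathbf{z}^t\|_2^2}{\sigma^2}, \\
\widehat{\text{MMD}}^2 &= 2 - 2\exp\left(-\frac{\|\mathbf{z}^s-\mathbf{z}^t\|_2^2}{2\sigma^2}\right) \leq 2 .
\end{aligned}
\end{equation*}
In this toy case, the CS estimator grows quadratically with the distance between the two features and is unbounded, whereas the MMD estimator saturates at $2$ once the features are far apart relative to $\sigma$.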


% \begin{remark}\label{remark_CS_MMD}
% The CS divergence is closely related to the maximum mean discrepancy (MMD)~\citep{gretton2012kernel}. In fact, given a characteristic kernel $\kappa(\mathbf{z},\mathbf{z}')=\langle \phi(\mathbf{z}),\phi(\mathbf{z}') \rangle_\mathcal{H}$, let us denote the (empirical) mean embedding for $\{\mathbf{z}_i^s\}_{i=1}^M$ and $\{\mathbf{z}_i^t\}_{i=1}^N$ as $\mu_s = \frac{1}{M}\sum_{i=1}^M \phi(\mathbf{z}_i^s)$ and $\mu_t = \frac{1}{N}\sum_{i=1}^n \phi(\mathbf{z}_i^t)$, the empirical estimators of CS divergence and MMD can be expressed as:
% \begin{equation}
% \begin{aligned}
% \widehat{D}_{\text{CS}} (p^s;p^t) &= -2\log \left( \frac{\langle \mu_s,\mu_t \rangle_\mathcal{H}}{\|\mu_s\|_\mathcal{H} \|\mu_t\|_\mathcal{H} } \right) \\
% &= -2\log \cos(\mu_s,\mu_t),
% \end{aligned}
% \end{equation}
% \begin{equation}\label{eq:mmd_est}
% \begin{split}
% & \widehat{\text{MMD}}^2(p^s;p^t) \\
% & = \langle \mu_s,\mu_t \rangle_\mathcal{H}^2 = \|\mu_s\|_\mathcal{H}^2 + \|\mu_t\|_\mathcal{H}^2 - 2 \langle \mu_s,\mu_t \rangle_\mathcal{H} \\
% & = \frac{1}{M^2}\sum_{i,j=1}^M \kappa({\bf z}_i^s,{\bf z}_j^s) + \frac{1}{N^2}\sum_{i,j=1}^N \kappa({\bf z}_i^t,{\bf z}_j^t) \\
% & - \frac{2}{MN}\sum_{i=1}^M \sum_{j=1}^N \kappa({\bf z}_i^s,{\bf x}_j^t).
% \end{split}
% \end{equation}
% That is, CS divergence measures the cosine similarity between $\mu_p$ and $\mu_q$ in $\mathcal{H}$, whereas MMD uses Euclidean distance. Moreover, it is interesting to find that the empirical estimator of CS divergence just adds a logarithm on each term of that of MMD.  
% \end{remark}

% By comparing Eq.~(\ref{eq.cs_est}) with Eq.~(\ref{eq:mmd_est})



\begin{proposition}[Empirical Estimator of $D_{\text{CCS}}(p^s(\hat{y}|\mathbf{z});p^t(\hat{y}|\mathbf{z}))$~\citep{yu2023conditional}] 
Given features $\mathbf{z}$ and the corresponding predictions $\hat{y}$ from two domains, $\{\mathbf{z}_i^s,\hat{y}_i^s \}_{i=1}^M$ and $\{\mathbf{z}_i^t,\hat{y}_i^t \}_{i=1}^N$, let $K^s$ and $L^s$ denote, respectively, the Gram matrices for the variable $\mathbf{z}$ and the predicted output $\hat{y}$ in the source distribution. Similarly, let $K^t$ and $L^t$ denote, respectively, the Gram matrices for the variable $\mathbf{z}$ and the predicted output $\hat{y}$ in the target distribution. Meanwhile, let $K^{st}\in \mathbb{R}^{M\times N}$ (i.e., $\left(K^{st}\right)_{ij}=\kappa(\mathbf{z}^s_i, \mathbf{z}^t_j)$) denote the Gram matrix from the source distribution to the target distribution for the input variable $\mathbf{z}$, and $L^{st}\in \mathbb{R}^{M\times N}$ the Gram matrix from the source distribution to the target distribution for the predicted output $\hat{y}$.
Similarly, let $K^{ts}\in \mathbb{R}^{N\times M}$ (i.e., $\left(K^{ts}\right)_{ij}=\kappa(\mathbf{z}^t_i, \mathbf{z}^s_j)$) denote the Gram matrix from the target distribution to the source distribution for the input variable $\mathbf{z}$, and $L^{ts}\in \mathbb{R}^{N\times M}$ the Gram matrix from the target distribution to the source distribution for the predicted output $\hat{y}$.
The empirical estimator of $D_{\text{CCS}}(p^s(\hat{y}|\mathbf{z});p^t(\hat{y}|\mathbf{z}))$ is given by:
% \begin{equation}\label{eq:conditional_CS_est}
% \begin{split}
% & \widehat{D}_{\text{CS}}(p^s(\hat{y}|\mathbf{x});p^t(\hat{y}|\mathbf{x})) \approx \log\left( \sum_{j=1}^M \left( \frac{ \sum_{i=1}^M K_{ji}^s L_{ji}^s }{ (\sum_{i=1}^M K_{ji}^s)^2 } \right) \right) + \log\left( \sum_{j=1}^N \left( \frac{ \sum_{i=1}^N K_{ji}^t L_{ji}^t }{ (\sum_{i=1}^N K_{ji}^t)^2 } \right) \right) \\
% & - \log \left( \sum_{j=1}^M \left( \frac{ \sum_{i=1}^N K_{ji}^{st} L_{ji}^{st} }{ (\sum_{i=1}^M K_{ji}^s) (\sum_{i=1}^N K_{ji}^{st}) } \right) \right) - \log \left( \sum_{j=1}^N \left( \frac{ \sum_{i=1}^M K_{ji}^{ts} L_{ji}^{ts} }{ (\sum_{i=1}^M K_{ji}^{ts}) (\sum_{i=1}^N K_{ji}^t) } \right) \right).
% \end{split}
% \end{equation}
\begin{equation}\label{eq:conditional_CS_est}
\begin{split}
& \widehat{D}_{\text{CCS}}(p^s(\hat{y}|\mathbf{z});p^t(\hat{y}|\mathbf{z})) \\
& \approx \log\left( \sum_{j=1}^M \left( \frac{ \sum_{i=1}^M K_{ji}^s L_{ji}^s }{ (\sum_{i=1}^M K_{ji}^s)^2 } \right) \right) 
 + \log\left( \sum_{j=1}^N \left( \frac{ \sum_{i=1}^N K_{ji}^t L_{ji}^t }{ (\sum_{i=1}^N K_{ji}^t)^2 } \right) \right) \\
& - \log\left( \sum_{j=1}^M \left( \frac{ \sum_{i=1}^N K_{ji}^{st} L_{ji}^{st} }{ (\sum_{i=1}^M K_{ji}^s) (\sum_{i=1}^N K_{ji}^{st}) } \right) \right) \\
& - \log\left( \sum_{j=1}^N \left( \frac{ \sum_{i=1}^M K_{ji}^{ts} L_{ji}^{ts} }{ (\sum_{i=1}^M K_{ji}^{ts}) (\sum_{i=1}^N K_{ji}^t) } \right) \right).
\end{split}
\end{equation}
\end{proposition}
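
\begin{remark}
For implementation purposes, the four terms of Eq.~(\ref{eq:conditional_CS_est}) can be restated in matrix form. The following is only a notational sketch: the all-ones vectors $\mathbf{1}_M$ and $\mathbf{1}_N$, the Hadamard product $\circ$, and the element-wise division $\oslash$ are symbols introduced here for illustration and are not part of the original statement. With the row sums $\mathbf{k}^s = K^s\mathbf{1}_M$, $\mathbf{k}^t = K^t\mathbf{1}_N$, $\mathbf{k}^{st} = K^{st}\mathbf{1}_N$ and $\mathbf{k}^{ts} = K^{ts}\mathbf{1}_M$, the estimator reads
\begin{equation*}
\begin{aligned}
& \log\left( \mathbf{1}_M^\top \left[ \left( (K^s \circ L^s)\mathbf{1}_M \right) \oslash \left( \mathbf{k}^s \circ \mathbf{k}^s \right) \right] \right)
 + \log\left( \mathbf{1}_N^\top \left[ \left( (K^t \circ L^t)\mathbf{1}_N \right) \oslash \left( \mathbf{k}^t \circ \mathbf{k}^t \right) \right] \right) \\
& - \log\left( \mathbf{1}_M^\top \left[ \left( (K^{st} \circ L^{st})\mathbf{1}_N \right) \oslash \left( \mathbf{k}^s \circ \mathbf{k}^{st} \right) \right] \right)
 - \log\left( \mathbf{1}_N^\top \left[ \left( (K^{ts} \circ L^{ts})\mathbf{1}_M \right) \oslash \left( \mathbf{k}^{ts} \circ \mathbf{k}^t \right) \right] \right).
\end{aligned}
\end{equation*}
Each term therefore involves only element-wise products and row sums of Gram matrices. If, in addition, the kernel $\kappa$ is symmetric (e.g., a Gaussian kernel, which is an assumption here), then $K^{ts} = (K^{st})^\top$, so a single cross-domain Gram matrix for $\mathbf{z}$ suffices.
\end{remark}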


\begin{remark} 
Estimating the divergence between $p^s(\hat{y}|\mathbf{z})$ and $p^t(\hat{y}|\mathbf{z})$ is a non-trivial task. An alternative choice is the conditional MMD~\citep{ren2016conditional}:
\begin{equation}\label{eq:conditional_MMD}
\begin{split}
& \widehat{D}_{\text{MMD}}(p^s(\hat{y}|\mathbf{z});p^t(\hat{y}|\mathbf{z})) = \tr(K^s (\tilde{K^s})^{-1} L^s (\tilde{K^s})^{-1}) + \\
& \tr(K^t (\tilde{K^t})^{-1} L^t (\tilde{K^t})^{-1}) -2 \tr(K^{st} (\tilde{K^t})^{-1} L^{ts} (\tilde{K^s})^{-1}),
\end{split}
\end{equation}
in which $\tr$ denotes the trace and $\tilde{K} = K+\lambda I$ for a hyperparameter $\lambda$. 
In contrast, the CCS estimator in Eq.~(\ref{eq:conditional_CS_est}) requires neither the hyperparameter $\lambda$ nor any matrix inversion, which improves computational efficiency and numerical stability. See also the experiments in Section~\ref{sec:compare_MMD_KL}. % Our experimental results in Section~\ref{sec:compare_MMD_KL} suggest that conditional CS significantly outperforms conditional MMD. 
%Moreover, Eq.~(\ref{eq:conditional_CS_est}) does not rely on any parametric distributional assumption on either $p^s(\hat{y}|\mathbf{z})$ or $p^t(\hat{y}|\mathbf{z})$. 
%which makes it more suitable to measure the mismatch between $q_\theta(\hat{y}|\mathbf{x})$ and $p(y|\mathbf{x})$ than MSE.
\end{remark}

% In our implementation, we normalize $\mathbf{z}$ and $\hat{y}$ and set kernel size $\sigma=1$, which is a common heuristic~\citep{greenfeld2020robust}. %in all our experiments. 

% l_{train} + \frac{M}{2} \sqrt{D_{CS}(p^t (z, y), p^s(z,y))} \\

%Proof. provided in the appendix.

%\subsection{Conditional Adversarial Training}
\subsection{Network Training}


%  indicating the target samples outside the support of the source domain, which adversarially force the feature exactor to minimize the discrepancy.

We demonstrate how to use CS and conditional CS (CCS) divergences in both distance metric- and adversarial training-based UDA frameworks in a convenient way.

\textbf{Distance Metric Minimization } 
%\quad In a classical distance discrepancy-based UDA framework, the overall objective consists of two parts: the training loss on the source domain, and the distribution discrepancy between domains. For the former, we adopt the cross-entropy loss $L_{\text{CE}} =  \frac{1}{M}\sum^{M}_{i=1} -y^s_{i}\log \hat{y}^s_{i}$. For the latter, we use $D_{\text{CS}}(p^s(\mathbf{z}), p^t(\mathbf{z}))$ to align the feature distribution and $D_{\text{CCS}}(p^s(\hat{y}|\mathbf{z}), p^t(\hat{y}|\mathbf{z}))$ to align the conditional distribution for classifier. The three losses are trained jointly. }
Consider a neural network $h_\theta = g\circ f$, where $\mathbf{z}=f(\mathbf{x})$ are the learned features and $g: \mathbf{z} \mapsto  y$ is a classifier. Distance metrics can be used directly to learn domain-invariant $f$ and $g$, without introducing any new modules or training schemes. 
Specifically, the objective to train $h_\theta$ consists of the training loss on the source domain and a distribution discrepancy loss on both $p(\mathbf{z})$ and $p(y|\mathbf{z})$. 
For the former, we adopt the cross-entropy loss $L_{\text{CE}} =  \frac{1}{M}\sum^{M}_{i=1} -y^s_{i}\log \hat{y}^s_{i}$. For the latter, we include both $D_{\text{CS}}(p^s(\mathbf{z});p^t(\mathbf{z}))$ and $D_{\text{CCS}}(p^s(\hat{y}|\mathbf{z}); p^t(\hat{y}|\mathbf{z}))$, estimated with Eq.~(\ref{eq.cs_est}) and Eq.~(\ref{eq:conditional_CS_est}), respectively. 
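As a sketch, the objective described above can be written as follows, where the weights $\lambda$ and $\beta$ are assumed here by analogy with Eq.~(\ref{eq.step1_loss}) and are not specified in the text for this setting:
\begin{equation*}
\min_{f, g} \; L_{\text{CE}} + \lambda D_{\text{CS}}(p^s(\mathbf{z});p^t(\mathbf{z})) + \beta D_{\text{CCS}}(p^s(\hat{y}|\mathbf{z});p^t(\hat{y}|\mathbf{z})).
\end{equation*}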

% A diagram about this training scheme is shown in Fig.~\ref{ig.distance-framework}. 

%Moreover, $D_{\text{CS}}(\mathbf{z}^s;\mathbf{z}^t)$ and $D_{\text{CCS}}(p^s(\hat{y}|\mathbf{z}); p^t(\hat{y}|\mathbf{z}))$ are estimated by Eqs.~(\ref{eq.cs_est}, \ref{eq:conditional_CS_est}). 

% \wy{
% \begin{equation} 
% \begin{aligned}
% \min_{f, g} L_{\text{cls}} + \lambda D_{\text{CS}}(p^s(\mathbf{z}), p^t(\mathbf{z})) 
%  + D_{\text{CCS}}(p^s(\hat{y}|\mathbf{z}), p^t(\hat{y}|\mathbf{z})),
% \end{aligned}
% \label{eq.metric_loss}
% \end{equation}
% }



\textbf{Conditional Adversarial Training} \quad We incorporate our CS and CCS divergences into a popular bi-classifier adversarial training framework~\citep{saito2018maximum} to attain SOTA performance. 
The bi-classifier adversarial training method uses two classifiers $g_1$ and $g_2$ as a discriminator. By maximizing the discrepancy between the two classifiers' outputs, the framework detects target samples that are outside the support of the source domain. The feature extractor (generator) is then trained to minimize this discrepancy in order to fool the discriminator, which pushes the target features inside the support of the source with respect to the decision boundary. 

% proposed to align distributions by explicitly utilizing task-specific classifiers as a discriminator. The framework maximizes the discrepancy between two classifiers’ output to detect target samples that are outside the support of the source and then minimizes the discrepancy to generate feature representations that are inside the support of the source with respect to the decision boundary. Instead of aligning manifold in feature, input, or output space by heuristic assumptions, this approach fo- cuses on directly reshaping the target data regions that indeed need to be reshaped.


%Based on the feature extractor $f$, feature $f(\mathbf{x})$ is passed to two distinct task-specific classifiers, $g_1$ and $g_2$, which predict two probabilities $p_1(y|\mathbf{x})$ and $p_2(y|\mathbf{x})$ for either the source domain or the target domain. The original bi-classifier adversarial training framework~\cite{saito2018maximum} treats the prediction probability $p(\hat{y}|\mathbf{x})$ as the conditional distribution. Then, the discrepancy between $p^t_1(\hat{y}|\mathbf{x})$ and $p^t_2(\hat{y}|\mathbf{x})$ detects the samples in the target domain that is outside the support of the source domain, serving as the adversarial loss to train the feature exactor to minimize the disagreement. 

% This integration of CS and CCS divergences with the bi-classifier framework aims to achieve better alignment and adaptability between source and target domain distributions. 

As shown in Fig.~\ref{Fig.framework}, we model the alignment in two parts: 1) the minimization of $D_{\text{CS}}(p^s(\mathbf{z});p^t(\mathbf{z}))$ for learning domain-invariant representations; 2) adversarial training on $D_{\text{CCS}}(p^t_1(\hat{y}|\mathbf{z});p^t_2(\hat{y}|\mathbf{z}))$ for conditional classifier adaptation. The training procedure consists of the following steps:

\textbf{Step 1} \quad Learn the feature extractor $f$ and the two classifiers, $g_1$ and $g_2$, jointly by minimizing the classification loss $L_{\text{cls}}$ and the discrepancy loss $D_{\text{CS}} + D_{\text{CCS}}$ between the source domain and the target domain: 
\begin{equation} 
\begin{aligned}
\min_{f, {g_1}, {g_2}} L_{\text{cls}} &+ \lambda D_{\text{CS}}(p^s(\mathbf{z}); p^t(\mathbf{z})) \\
& + \beta \sum^{2}_{n=1}D_{\text{CCS}}(p^s_n(\hat{y}|\mathbf{z}); p^t_n(\hat{y}|\mathbf{z})),
\end{aligned}
\label{eq.step1_loss}
\end{equation}
where $\lambda$, $\beta$ are weighting hyperparameters, $L_{\text{cls}}$ is the empirical risk over two classifiers in the source domain with additional entropy constraints in the target domain. Namely, %, and $\theta_f$, $\theta_{g_1}$, and $\theta_{g_2}$ are the parameters of the feature extractor and the two classifiers, respectively. 
%$L_{\text{cls}}$ is defined by
\begin{equation} 
\begin{aligned}
 L_{\text{cls}} = 
 \frac{1}{2}\sum^{2}_{n=1} (L_{\text{CE}}(g_{n}(f(\mathbf{x}^s)), y^s) + \gamma L_{\text{Ent}}(g_{n}(f(\mathbf{x}^t))) ),
\end{aligned}
\label{eq.adv_cls}
\end{equation}
where $\gamma$ is a trade-off parameter, $L_{\text{CE}}$ is the cross-entropy loss, 
%$L_{\text{CE}} =  \frac{1}{M}\sum^{M}_{i=1} -y^s_{i}\log \hat{y}^s_{i}$, 
and $L_{\text{Ent}} = \frac{1}{N} \sum^{N}_{i=1} -\hat{y}^t_{i}\log \hat{y}^t_{i}$ is the entropy loss~\citep{grandvalet2004semi,long2018conditional, luo2021conditional, du2021cross}, a widely used constraint in domain adaptation.
%Moreover, $D_{\text{CS}}(\mathbf{z}^s;\mathbf{z}^t)$ and $D_{\text{CCS}}(p^s(\hat{y}|\mathbf{z}); p^t(\hat{y}|\mathbf{z}))$ are estimated by Eqs.~(\ref{eq.cs_est}, \ref{eq:conditional_CS_est}). 

%and Eq.~(\ref{eq:conditional_CS_est}).

% \wy{merge with Eq.~\ref{eq.adv_cls}} In practice, the majority of works use the cross-entropy loss ($L_{\text{CE}}$) for $R_s(h)$ in the source domain. Also, the entropy loss $L_{\text{Ent}}$~\cite{grandvalet2004semi} has been widely used in the domain adaptation task (\cite{long2018conditional, luo2021conditional, du2021cross}) as an additional constraint in the target domain. $L_{\text{CE}}$ and  $L_{\text{Ent}}$ are defined as:
% \begin{equation} 
% L_{\text{CE}} =  \frac{1}{M}\sum^{M}_{i=1} -y^s_{i}\log \hat{y}^s_{i}, \qquad 
% L_{\text{Ent}} = \frac{1}{N} \sum^{N}_{i=1} -\hat{y}^t_{i}\log \hat{y}^t_{i}.
% \label{eq.ce}
% \end{equation}
%divergence $D_{CS}(z^s;z^t)$ between feature representations, $z^s=f(x^s)$ and $z^t=f(x^t)$, which can be estimated by Eq.~\ref{eq.cs_est}.

\begin{figure}[htbp]
\centering
%\includegraphics[scale=.5]{Figures/adv_framework.pdf}
\includegraphics[scale=.32]{Figures/da_figure2.pdf}
\caption{The framework of the proposed conditional bi-classifier adversarial learning method with CS and CCS divergences. Feature extractor $f$ is used to obtain representations $\mathbf{z}^s$ and $\mathbf{z}^t$ for the source and target domains, respectively. Two classifiers $g_1$ and $g_2$ are used as a discriminator. CS divergence directly minimizes the discrepancy of $p(\mathbf{z})$ between two domains. CCS divergence measures the disagreement between two classifiers (adversarial loss). }
\label{Fig.framework}
\end{figure}

\textbf{Step 2} \quad We fix the parameters of the feature extractor $f$ and update the classifiers $g_1$ and $g_2$. By maximizing the conditional divergence between the two classifiers on the target domain, the discriminator (classifiers) is trained to identify the target samples that are outside the support of the source domain. To maintain classification accuracy on the source domain, the classification loss is also included:
\begin{equation} 
\min_{{g_1}, {g_2}} L_{\text{cls}} - D_{\text{CCS}}(p^t_1(\hat{y}|\mathbf{z});p^t_2(\hat{y}|\mathbf{z})).
\label{eq.step_2_loss}
\end{equation}

\textbf{Step 3} \quad We fix the parameters of the two classifiers $g_1$ and $g_2$ and update the feature extractor $f$. The aim is to minimize the divergence between the probabilistic outputs of the two classifiers, so that the feature extractor learns to fool the discriminator. This can be formalized as follows:
\begin{equation} 
\min_{{f}} D_{\text{CCS}}(p^t_1(\hat{y}|\mathbf{z});p^t_2(\hat{y}|\mathbf{z})).
\label{eq.adv_step3_loss}
\end{equation}
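
For concreteness, one iteration of the three steps on a mini-batch can be sketched as parameter updates. This is only an illustration under stated assumptions: the parameter symbols $\theta_f$ and $\theta_g = (\theta_{g_1}, \theta_{g_2})$, the step size $\eta$, and the use of plain gradient descent are introduced here and are not prescribed by the method; any gradient-based optimizer could be substituted. Writing $J_1$ for the objective in Eq.~(\ref{eq.step1_loss}) and $D^t_{12} = D_{\text{CCS}}(p^t_1(\hat{y}|\mathbf{z});p^t_2(\hat{y}|\mathbf{z}))$, we have
\begin{equation*}
\begin{aligned}
\text{Step 1:}\quad & (\theta_f, \theta_g) \leftarrow (\theta_f, \theta_g) - \eta \nabla_{(\theta_f, \theta_g)} J_1, \\
\text{Step 2:}\quad & \theta_g \leftarrow \theta_g - \eta \nabla_{\theta_g} \left( L_{\text{cls}} - D^t_{12} \right) \quad (\theta_f \text{ fixed}), \\
\text{Step 3:}\quad & \theta_f \leftarrow \theta_f - \eta \nabla_{\theta_f} D^t_{12} \quad (\theta_g \text{ fixed}).
\end{aligned}
\end{equation*}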


% \begin{algorithm}
% \caption{Bi-Classifier Domain Adversarial Training Framework}
% \label{alg:bi-classifier}
% \begin{algorithmic}[1]
% \Require Source labeled data $\mathcal{D}_s$, target unlabeled data $\mathcal{D}_t$, batch size $n$, learning rate $\eta$, number of epochs $T$
% \Ensure Learned feature extractor $F$, source domain classifier $C_s$, target domain classifier $C_t$
% \State Initialize $F$, $C_s$, $C_t$
% \For{$t = 1$ to $T$}
%     \For{each minibatch $\mathcal{B}_s$, $\mathcal{B}_t$}
%         \State Compute source features $f_s = F(\mathcal{B}_s)$
%         \State Compute target features $f_t = F(\mathcal{B}_t)$
%         \State Compute source domain loss $\mathcal{L}_s = \text{Loss}(C_s(f_s), y_s)$
%         \State Compute target domain loss $\mathcal{L}_t = \text{Loss}(C_t(f_t), y_t)$
%         \State Compute total loss $\mathcal{L} = \mathcal{L}_s + \mathcal{L}_t$
%         \State Update $F$, $C_s$, $C_t$ by descending their stochastic gradient: $\nabla \mathcal{L}$
%     \EndFor
% \EndFor
% \State \Return $F$, $C_s$, $C_t$
% \end{algorithmic}
% \end{algorithm}
