
\subsection{Problem Formulation}

We consider the standard setting of unsupervised domain adaptation
in which we have a labeled dataset $\mathbb{D}^{S}=\left\{ \left(\bx_{i}^{S},y_{i}^{S}\right)\right\} _{i=1}^{N_{S}}$
from a source domain and an unlabeled dataset $\mathbb{D}^{T}=\left\{ \bx_{i}^{T}\right\} _{i=1}^{N_{T}}$
from a target domain. We assume that data examples $\bx_{i}^{S},\bx_{i}^{T}\in\mathbb{R}^{d}$
and the categorical labels $y_{i}^{S}\in\left\{ 1,2,\ldots,M\right\} $,
where $M$ is the number of classes. For notational simplicity,
we overload $\mathbb{D}^{S}$ and $\mathbb{D}^{T}$ to represent the
empirical joint distributions of the source and target domains. We
denote by $\mathbb{P}^{S}$ and $\mathbb{P}^{T}$ the data distributions
of the source and target domains, respectively. Moreover, given a class
$m$, we further denote $\mathbb{P}_{m}^{S}$ as the $m$-th class-conditional
distribution of the source domain (i.e., the distribution with the
density function $p^{S}\left(\bx\mid y=m\right)$).

\subsection{Motivation}

For our proposed approach, we consider an OT distance between two discrete
distributions. The first one is the discrete distribution whose atoms
are the target examples $\bx^{T}$ (i.e., $\bx_{i}^{1}=\bx_{i}^{T}$
in Eq. (\ref{eq:total_cost})), while the second one is the discrete
distribution whose atoms are the source class-conditional distributions
$\mathbb{P}_{m}^{S}$ (i.e., $\bx_{j}^{2}=\mathbb{P}_{m}^{S}$ in
Eq. (\ref{eq:total_cost})). The cost $c\left(\bx_{i}^{T},\mathbb{P}_{m}^{S}\right)$
is defined as the negative log likelihood $-\log p_{m}^{S}\left(\bx_{i}^{T}\right)=-\log p^{S}\left(\bx_{i}^{T}\mid y=m\right)$.
Hence, if a target sample $\bx_{i}^{T}$ is more likely to be a sample
from $\mathbb{P}_{m}^{S}$, the log likelihood $\log p_{m}^{S}\left(\bx_{i}^{T}\right)$
is higher, meaning that the cost $c\left(\bx_{i}^{T},\mathbb{P}_{m}^{S}\right)=-\log p_{m}^{S}\left(\bx_{i}^{T}\right)$
becomes smaller. As shown in Figure \ref{fig:matching}, by examining
the OT distance between the two aforementioned distributions, we aim to
find the best match between a given target sample $\bx_{i}^{T}$ and
a source class-conditional distribution $\mathbb{P}_{m}^{S}$.

\begin{figure}[th]
\begin{centering}
\includegraphics[width=0.57\columnwidth]{OT_distance}
\par\end{centering}
\caption{We consider the OT distance between two distributions: the first one
has atoms as the target examples $\protect\bx^{T}$ and the second
one has atoms as the class-conditional distributions $\mathbb{P}_{m}^{S}$.
The cost function is $c(\protect\bx_{i}^{T},\mathbb{P}_{m}^{S})=-\log p_{m}^{S}\left(\protect\bx_{i}^{T}\right)=-\log p^{S}\left(\protect\bx_{i}^{T}\mid y=m\right)$.
\label{fig:matching}}
\end{figure}


\subsection{Distributional Optimal Transport}

\label{subsec:Distributional-Optimal-Transport}

We define $\mathcal{P}^{S}=\sum_{m=1}^{M}\pi_{m}\delta_{\mathbb{P}_{m}^{S}}$,
where $\delta$ is the Dirac delta distribution and the mixing proportions satisfy
$\bpi\in\simplex_{M}:=\left\{ \balpha\in\mathbb{R}^{M}:\balpha\geq\bzero\,\text{and}\,\norm{\balpha}_{1}=1\right\} $.
Thus, $\mathcal{P}^{S}$ is a discrete distribution over distributions
that takes the value $\mathbb{P}_{m}^{S}$ with probability $\pi_{m}$. As mentioned
in the motivation section, we examine an OT distance between $\mathbb{P}^{T}$
and $\mathcal{P}^{S}$ with the aim of matching target examples to the
source class-conditional distributions, so that each target example is
guided to match the source class-conditional distribution
corresponding to its ground-truth label.

In the sequel, we inspect an OT distance between $\mathbb{P}^{T}$
and $\mathcal{P}^{S}$ in which we define the cost $c\left(\bx_{i},\mathbb{P}_{m}^{S}\right)$
to match a target sample $\bx_{i}$ to $\mathbb{P}_{m}^{S}$ as $-\log p_{m}^{S}\left(\bx_{i}\right)$.
Let $A=\left[a_{im}\right]\in\mathbb{R}^{N_{T}\times M}$
denote the transportation matrix, wherein $a_{im}$ represents the probability
of matching (transporting) $\bx_{i}$ to $\mathbb{P}_{m}^{S}$. The OT
distance between $\mathbb{P}^{T}$ and $\mathcal{P}^{S}$ w.r.t. the
cost function $c$ and the mixing proportion $\bpi$ is defined as:
\begin{align}
\mathcal{W}_{c,\bpi}\left(\mathbb{P}^{T},\mathcal{P}^{S}\right) & =\min_{A}\biggl\{\sum_{i=1}^{N_{T}}\sum_{m=1}^{M}a_{im}c\left(\bx_{i},\mathbb{P}_{m}^{S}\right):\nonumber \\
 & \,\,\,\,\,\,\,\,\sum_{m=1}^{M}a_{im}=\frac{1}{N_{T}},\sum_{i=1}^{N_{T}}a_{im}=\pi_{m}\biggr\}.\label{eq:ws_data}
\end{align}
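
As a small illustration of Eq. (\ref{eq:ws_data}), consider a toy instance
whose numbers are hypothetical and chosen only for exposition: $N_{T}=2$,
$M=2$, $\bpi=\left(\frac{1}{2},\frac{1}{2}\right)$, and costs
$c\left(\bx_{1},\mathbb{P}_{1}^{S}\right)=1$, $c\left(\bx_{1},\mathbb{P}_{2}^{S}\right)=4$,
$c\left(\bx_{2},\mathbb{P}_{1}^{S}\right)=2$, $c\left(\bx_{2},\mathbb{P}_{2}^{S}\right)=3$.
Assuming $a_{im}\geq0$, the two marginal constraints leave a single
degree of freedom $t\in\left[0,\frac{1}{2}\right]$:
\begin{align*}
A\left(t\right) & =\begin{bmatrix}t & \frac{1}{2}-t\\
\frac{1}{2}-t & t
\end{bmatrix},\\
\sum_{i=1}^{2}\sum_{m=1}^{2}a_{im}c\left(\bx_{i},\mathbb{P}_{m}^{S}\right) & =t\cdot1+\left(\tfrac{1}{2}-t\right)\cdot4+\left(\tfrac{1}{2}-t\right)\cdot2+t\cdot3=3-2t.
\end{align*}
The objective is minimized at $t=\frac{1}{2}$, which gives an OT distance
of $2$: $\bx_{1}$ is matched to $\mathbb{P}_{1}^{S}$ and $\bx_{2}$
to $\mathbb{P}_{2}^{S}$. Note that $\bx_{2}$ taken alone would prefer
$\mathbb{P}_{1}^{S}$ (cost $2$ versus $3$); it is the column constraint
$\sum_{i}a_{im}=\pi_{m}$ that fixes how much mass each class receives.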

Similar to other DA works \citep{pan2008transferlearning,TzengHDS15,long2017JAN},
we employ a feature extractor $G$ to map both source and target examples
to a latent space. We denote $\mathbb{Q}^{S},\mathbb{Q}^{T},\mathbb{Q}_{m}^{S}$,
and $\mathcal{Q}^{S}$ as the corresponding distributions over the
latent space induced by $\mathbb{P}^{S},\mathbb{P}^{T},\mathbb{P}_{m}^{S}$,
and $\mathcal{P}^{S}$ via the feature extractor $G$. The OT distance
in Eq. (\ref{eq:ws_data}) is rewritten as:
\begin{align}
\mathcal{W}_{c,\bpi}\left(\mathbb{Q}^{T},\mathcal{Q}^{S}\right) & =\min_{A}\biggl\{\sum_{i=1}^{N_{T}}\sum_{m=1}^{M}a_{im}c\left(G\left(\bx_{i}\right),\mathbb{Q}_{m}^{S}\right):\nonumber \\
 & \,\,\,\,\,\,\,\,\sum_{m=1}^{M}a_{im}=\frac{1}{N_{T}},\sum_{i=1}^{N_{T}}a_{im}=\pi_{m}\biggr\}.\label{eq:ws_latent}
\end{align}

To encourage the target examples $G\left(\bx_{i}\right)$ to move
towards proper class regions of the source domain, we propose solving
the following optimization problem (OP):
\begin{equation}
\min_{G,\bpi}\mathcal{W}_{c,\bpi}\left(\mathbb{Q}^{T},\mathcal{Q}^{S}\right).\label{eq:OT_Q}
\end{equation}

With $c\left(G\left(\bx_{i}\right),\mathbb{Q}_{m}^{S}\right)=-\log p_{m}^{S}\left(\bx_{i}\right)$,
minimizing the OT distance in Eq. (\ref{eq:OT_Q}) encourages each
target example $G\left(\bx_{i}\right)$ to move towards a distribution $\mathbb{Q}_{k}^{S}\,(1\leq k\leq M)$
under which it has a high likelihood, and encourages $\ba_{i}=\left[a_{im}\right]_{m}$
to be close to the corresponding scaled one-hot vector $\frac{1}{N_{T}}\bone_{k}$.
Here, $\bone_{k}$ denotes the one-hot vector whose $k$-th
element is one.
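
A short derivation makes this one-hot structure explicit. It is a
sketch for a fixed $G$, under the assumption (left implicit in Eq.
(\ref{eq:ws_latent})) that $a_{im}\geq0$; the index $k_{i}$ below
is introduced only for this illustration. Since $\bpi$ is itself optimized
in Eq. (\ref{eq:OT_Q}), any nonnegative $A$ whose rows sum to $\frac{1}{N_{T}}$
is feasible by choosing $\pi_{m}=\sum_{i=1}^{N_{T}}a_{im}$, so the
problem decouples over the target examples:
\begin{align*}
\min_{\bpi}\mathcal{W}_{c,\bpi}\left(\mathbb{Q}^{T},\mathcal{Q}^{S}\right) & =\sum_{i=1}^{N_{T}}\min_{\substack{\ba_{i}\geq\bzero\\
\sum_{m}a_{im}=\frac{1}{N_{T}}
}
}\sum_{m=1}^{M}a_{im}c\left(G\left(\bx_{i}\right),\mathbb{Q}_{m}^{S}\right)\\
 & =\frac{1}{N_{T}}\sum_{i=1}^{N_{T}}\min_{1\leq m\leq M}c\left(G\left(\bx_{i}\right),\mathbb{Q}_{m}^{S}\right).
\end{align*}
The minimum is attained at $\ba_{i}=\frac{1}{N_{T}}\bone_{k_{i}}$
with $k_{i}\in\arg\min_{m}c\left(G\left(\bx_{i}\right),\mathbb{Q}_{m}^{S}\right)$,
and the corresponding $\pi_{m}$ equals the fraction of target examples
assigned to class $m$.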
