
\section{Methods}
\label{sec:methods}

\begin{figure}[t]
    \floatconts
    {fig:model_pipeline_moco}
    {\caption{The proposed \acf{SRA} architecture for a given input image $x$. The terms $\mathcal{L}_{\mathrm{IND}}$ and $\mathcal{L}_{\mathrm{CRD}}$ denote the in-domain and cross-domain losses, respectively.}}
    {\includegraphics[width=0.99\textwidth]{media/pipeline.pdf}}
\end{figure}

In our unsupervised domain adaptation setting, we have access to a small set of labeled source data sampled from a source domain distribution and a set of unlabeled target data sampled from a target distribution. The goal is to learn a hypothesis function (here, a classifier) on the source domain that generalizes well to the target domain. To this end, we propose a novel self-supervised cross-domain adaptation setting, which is described in more detail below. \figureref{fig:model_pipeline_moco} gives an overview of the proposed network architecture.

Our model builds upon two networks $f_{\Phi}$ and $f_{\Psi}$ that compute the query embedding $z$ and the key embedding $z^{\prime}$ from the input representations $\hat{x}$ and $\hat{x}^{\prime}$, respectively.
Each branch consists of a residual encoder and two fully connected layers based on the \ac{SOTA} architecture proposed in~\citet{chen2020improved}. 
To generate $\hat{x}$ and $\hat{x}^{\prime}$, a random image $x$ is drawn from either the source $\mathcal{D}_s$ or the target $\mathcal{D}_t$ domain and transformed with two random data augmentations selected from $\mathcal{T}_x$ to create a matching pair. The key embeddings $z^{\prime}$ are used to maintain a queue $\mathcal{Q} = \{q_i\}_{i=1}^{\lvert \mathcal{Q} \rvert}$ of negative samples in a first-in, first-out fashion.
The queue provides a large number of negative examples, which alleviates the need for large batches \cite{chen2020simple} or memory banks \cite{kim2020cross}.
Moreover, $f_{\Psi}$ is updated using a momentum approach that combines its weights with those of $f_{\Phi}$.
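As a sketch, assuming the momentum encoder formulation used in~\citet{chen2020improved} with a momentum coefficient $m \in [0, 1)$ (the symbol $m$ is introduced here for illustration only), the update can be written as
\begin{equation*}
    \Psi \leftarrow m \, \Psi + (1 - m) \, \Phi ,
\end{equation*}
where only $f_{\Phi}$ receives gradients through back-propagation and $m$ is typically chosen close to $1$.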
This approach ensures that $f_{\Psi}$ generates a slowly-shifting embedding. Motivated by~\citet{ge2020self,kim2020cross}, we extend the domain adaptation learning procedure to our model definition and task. Hence, we split the loss into two distinct tasks, namely in-domain $\mathcal{L}_{\mathrm{IND}}$ and cross-domain $\mathcal{L}_{\mathrm{CRD}}$ representation learning. The overall objective $\mathcal{L}_{\mathrm{SRA}} = \mathcal{L}_{\mathrm{IND}} + \mathcal{L}_{\mathrm{CRD}}$ is the sum of both terms, which are described in more detail below.


\subsection{In-domain Loss} 
The first objective $\mathcal{L}_{\mathrm{IND}}$ aims at learning the individual feature distributions of the source and the target domains.
At this stage, we want to keep the two domains independent, as their alignment is optimized separately by the cross-domain loss.
For each vector $z$, there is a paired embedding $z^{\prime}$ that is generated from the same tissue image and therefore is, by definition, similar.
The contrastive loss, as expressed in \equationref{eq:p_ind,eq:l_ind}, is therefore used to constrain the representation of the embedding space for each domain separately. 

\begin{equation}
    p^{\mathrm{IND}}_i(\mathcal{Q}) = \frac{\exp(z_i^\top z^{\prime}_i/\tau)}{\exp(z_i^\top z^{\prime}_i/\tau) + \sum_{l \in \mathcal{Q}} \exp(z_i^{\top} q_l/\tau)}.
    \label{eq:p_ind}
\end{equation}

\begin{equation}
    \mathcal{L}_{\mathrm{IND}} = \frac{-1}{\lvert \mathcal{D}_s \rvert +\lvert \mathcal{D}_t \rvert} \left( \sum_{i \in \mathcal{D}_s} \log{\left[ p^{\mathrm{IND}}_i(\mathcal{Q}_s)\right]}
    + \sum_{i \in \mathcal{D}_t} \log{\left[p^{\mathrm{IND}}_i(\mathcal{Q}_t)\right]} \right).
\label{eq:l_ind}
\end{equation}

We denote by $\mathcal{Q}_{s}, \mathcal{Q}_{t} \subset \mathcal{Q}$ the sets of indexed queue samples that are drawn from the corresponding domains $\mathcal{D}_{s}, \mathcal{D}_{t}$, and by $\tau \in \mathbb{R}$ the temperature. The temperature is typically small ($<1$), which sharpens the similarity distribution and helps the model make confident predictions.
For all images of each dataset $\mathcal{D}_{s}, \mathcal{D}_{t}$, we want to minimize the distance between $z$ and $z^{\prime}$ while maximizing the distance to the previously generated negative samples from the corresponding set $\mathcal{Q}_{s}, \mathcal{Q}_{t}$. The queue samples are considered reliable negative candidates as they are generated by $f_{\Psi}$, whose weights vary slowly due to its momentum update procedure.
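
To illustrate the effect of the temperature in \equationref{eq:p_ind}, consider a hypothetical query with unit-norm embeddings (an assumption made here so that all similarities lie in $[-1, 1]$), a positive similarity $z_i^\top z^{\prime}_i = 0.8$, and a queue of only two negatives with similarities $0.1$ and $-0.3$. These values are chosen purely for illustration. With $\tau = 0.2$ and $\tau = 1$ we obtain
\begin{equation*}
    p^{\mathrm{IND}}_i \Big\rvert_{\tau = 0.2} = \frac{e^{4}}{e^{4} + e^{0.5} + e^{-1.5}} \approx 0.97
    \quad\mathrm{and}\quad
    p^{\mathrm{IND}}_i \Big\rvert_{\tau = 1} = \frac{e^{0.8}}{e^{0.8} + e^{0.1} + e^{-0.3}} \approx 0.55,
\end{equation*}
so the smaller temperature yields a much sharper distribution, and the corresponding loss contribution $-\log p^{\mathrm{IND}}_i$ drops from approximately $0.60$ to $0.03$.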

\subsection{Cross-domain Loss} 
We can see the cross-domain matching task as the generation of features that are discriminative for both sets.
In other words, if we embed a random sample drawn from $\mathcal{D}_s$, we expect to find a limited number of candidates in $\mathcal{D}_t$ whose representations contain information similar to that of our initial query.
Based on this logic, we compute the similarities between a query sample $z_i$ drawn from one set (for example $\mathcal{D}_s$) and the stored queue samples from the other set (for example $\mathcal{Q}_t$), together with the entropy of the resulting similarity distribution:
\begin{equation}
    H^{\mathrm{CRD}}_i(\mathcal{Q}) = - \sum_{j \in \mathcal{Q}} p^{\mathrm{CRD}}_{i, j}(\mathcal{Q}) \log{\left[p^{\mathrm{CRD}}_{i, j}(\mathcal{Q})\right]}
    \quad\mathrm{and}\quad
    p^{\mathrm{CRD}}_{i, j}(\mathcal{Q}) = \frac{\exp(z_i^\top q_j/\tau)}{\sum_{l \in \mathcal{Q}} \exp(z_i^\top q_l/\tau)}.
    \label{eq:h_crd}
\end{equation}
Low entropy means that the selected query from one domain matches a limited number of keys from the other domain. The loss therefore aims to minimize the average entropy of the similarity distributions, assisting the model in making confident predictions:
\begin{equation}
     \mathcal{L}_{\mathrm{CRD}} = \frac{1}{\lvert \mathcal{D}_s \rvert + \lvert \mathcal{D}_t \rvert} \left[ \sum_{i \in \mathcal{D}_s} H^{\mathrm{CRD}}_i(\mathcal{Q}_t) + \sum_{i \in \mathcal{D}_t} H^{\mathrm{CRD}}_i(\mathcal{Q}_s) \right].
     \label{eq:l_crd}    
\end{equation}
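As a sanity check on \equationref{eq:h_crd}, note that $0 \leq H^{\mathrm{CRD}}_i(\mathcal{Q}) \leq \log \lvert \mathcal{Q} \rvert$, where the upper bound is reached for a uniform similarity distribution. For a hypothetical queue of four keys (values chosen for illustration only, using the natural logarithm), we have
\begin{equation*}
    p = (0.25, 0.25, 0.25, 0.25) \;\Rightarrow\; H = \log 4 \approx 1.39
    \quad\mathrm{and}\quad
    p = (0.97, 0.01, 0.01, 0.01) \;\Rightarrow\; H \approx 0.17,
\end{equation*}
so a query that matches essentially a single key contributes far less to \equationref{eq:l_crd} than a query that matches all keys equally.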
%Consequently, the entropy minimization assists the model in making confident predictions.


\subsection{Easy-to-hard (E2H) Learning} 
At the start of the learning process, the relationship between samples and their entropy is unreliable, as the model weights are initialized randomly and therefore do not guarantee meaningful feature descriptors.
Additionally, being able to find matching samples for all input queries across datasets is a strong assumption. 
In clinical applications, we often rely on open-source datasets with a limited number of classes to annotate complex tissue databases.
For example, tissues coming from specific cancer subtypes, such as mucinous \ac{CRC}, might not be present in a public dataset while being potentially frequent in daily diagnostics. 
As a result, optimizing \equationref{eq:h_crd} for all samples leads to a performance drop, as the loss tries to find cross-domain candidates even when there are none to be found.

To tackle this issue, we introduce an easy-to-hard learning scheme. 
We start with easy (low entropy) samples and progressively include harder (high entropy) samples as the training progresses. 
We replace the summation over $\mathcal{D}_{s}, \mathcal{D}_{t}$ in \equationref{eq:l_crd} with a summation over the corresponding sets of candidates $\mathcal{R}_{s}, \mathcal{R}_{t}$ defined in \equationref{eq:r}, where ``reverse top-$r$'' refers to the fraction $r$ of samples with the lowest entropy. The ratio $0 \leq r \leq 1$ is gradually increased during training using a step function, where $s_{w}$ and $s_{h}$ denote the width and height of the step, respectively, and $N_\mathrm{epochs}$ is the total number of training epochs.

\begin{equation}
    \mathcal{R}_{s/t} = \{i \in \mathcal{D}_{s/t} \mid H_i^\mathrm{CRD}(\mathcal{Q}_{t/s}) \text{ is reverse top-$r$} \}
    \quad\mathrm{and}\quad
    r = \Big\lfloor \frac{\text{epoch}}{N_\mathrm{epochs} \cdot s_{w}} \Big\rfloor \cdot s_{h}.
    \label{eq:r}    
\end{equation}
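
For instance, with the hypothetical values $N_\mathrm{epochs} = 200$, $s_{w} = 0.25$, and $s_{h} = 0.25$ (chosen for illustration only, with epochs counted from zero), \equationref{eq:r} gives
\begin{equation*}
    r = \Big\lfloor \frac{\text{epoch}}{50} \Big\rfloor \cdot 0.25 =
    \begin{cases}
        0    & \text{if } 0 \leq \text{epoch} < 50, \\
        0.25 & \text{if } 50 \leq \text{epoch} < 100, \\
        0.5  & \text{if } 100 \leq \text{epoch} < 150, \\
        0.75 & \text{if } 150 \leq \text{epoch} < 200,
    \end{cases}
\end{equation*}
so, under these assumed values, no cross-domain candidates are selected during the first quarter of training, and the selected fraction grows by $s_{h}$ every $s_{w} \cdot N_\mathrm{epochs}$ epochs.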

