\appendix

\section{Experimental Setup}
\label{app:experimental-setup}

In this section, we present the experimental setup. The model is first trained in an unsupervised fashion; in a second step, a linear classifier is trained on top, as in \citet{chen2020simple}. The feature extractors, $f_{\Phi}$ and $f_{\Psi}$, each consist of a ResNet18~\cite{he2016deep} followed by two fully connected layers (projection head) with \acp{ReLU} and output dimensions of $D_{\textrm{FC1}}=512$ and $D_{\textrm{FC2}}=128$, respectively.
We update the weights $\theta_{\Phi}$ of $f_{\Phi}$ using standard backpropagation and the weights $\theta_{\Psi}$ of $f_{\Psi}$ using the momentum update in \equationref{eq:momentum_update}. We use a momentum coefficient of $m=0.999$, as in \citet{he2020momentum}.

\begin{equation}
    \theta_{\Psi} \leftarrow m \theta_{\Psi} + (1-m)\theta_{\Phi}
    \label{eq:momentum_update}
\end{equation}
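
To give some intuition for this choice of $m$, the update in \equationref{eq:momentum_update} can be unrolled over $t$ iterations. As a sketch, and using a superscript $(k)$ to index training iterations (a notation introduced here for illustration only), we obtain
\[
    \theta_{\Psi}^{(t)} = m^{t}\,\theta_{\Psi}^{(0)} + (1-m)\sum_{k=1}^{t} m^{t-k}\,\theta_{\Phi}^{(k)},
\]
that is, $\theta_{\Psi}$ is an exponential moving average of the past values of $\theta_{\Phi}$. With $m=0.999$, the weight of a past iterate decays with a time constant of roughly $1/(1-m)=1000$ iterations; for instance, the contribution of the initial weights after $1000$ updates is $m^{1000}\approx 0.37$.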

The model is trained from scratch for $N_\mathrm{epochs}=200$ epochs using the \ac{SGD} optimizer ($\text{momentum}=0.9$, $\text{weight decay}=10^{-4}$), a learning rate $\lambda=10^{-2}$ and a batch size of $B=128$. 
For the similarity learning and easy-to-hard training, we set $\tau=0.2$, $s_w=0.25$ and $s_h=0.2$.
We apply random cropping, grayscale conversion, horizontal/vertical flipping, and color jittering as data augmentations $\mathcal{T}_x$.
%The pseudo code for \ac{SRA} in display in Algorithm~\ref{alg:sra}.
Algorithm~\ref{alg:sra} presents the pseudo-code of our \ac{SRA} method.
For a fair comparison, we also use a ResNet18 backbone for the presented baselines. Classification performance is evaluated with a linear layer placed on top of the frozen feature extractor and trained for $N_{\textrm{epochs}}=100$ epochs using the \ac{SGD} optimizer ($\text{momentum}=0.9$, $\text{weight decay}=0$), a batch size of $B=128$, and a learning rate of $\lambda=10$.

\input{table/sra_algo}

At each epoch, we sample $50{,}000$ examples with replacement from each of the source and target datasets to create a set $\mathcal{D}$ of $N=100{,}000$ samples. We use $70\%$ of \ac{K16} for the unsupervised domain adaptation training. The remaining $30\%$ of the examples are used to test the linear classifier trained on top of the self-supervised model. We repeat this procedure 10 times to obtain statistically meaningful results.
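
As a rough illustration of what these settings imply, consider the $5{,}000$ patches of \ac{K16} (Appendix~\ref{app:datasets}) and the batch size $B=128$. The following sketch assumes that the $50{,}000$ draws from \ac{K16} are taken from its $70\%$ training split and that an incomplete final batch is dropped; neither assumption is stated explicitly above.
\[
\begin{array}{rcl}
    0.7 \times 5{,}000 & = & 3{,}500 \quad \mbox{(training patches)},\\
    5{,}000 - 3{,}500 & = & 1{,}500 \quad \mbox{(test patches)},\\
    50{,}000 / 3{,}500 & \approx & 14.3 \quad \mbox{(average draws per training patch and epoch)},\\
    \lfloor 100{,}000 / 128 \rfloor & = & 781 \quad \mbox{(full batches per epoch)}.
\end{array}
\]
Under these assumptions, $200$ epochs correspond to approximately $781 \times 200 = 156{,}200$ training iterations.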

  
\section{Detailed Overview of the Datasets}
\label{app:datasets}
 
In this study, we use two publicly available datasets as well as an in-house cohort that contain patches of different tissue types found in the human gastrointestinal tract and that are extracted from \ac{HE}-stained \acp{WSI}. \figureref{fig:examples-datasets} shows the occurrence and relationship of different tissue types across the datasets. Note that labels are not available for the in-house dataset. 
% The displayed crops are cherry-picked for the comparison purpose.

\noindent \textbf{\acf{K16} Dataset}:
The dataset~\cite{kather2016multi} contains $5{,}000$ patches ($150\times150$ pixels, $74\,\mu\mathrm{m}\times74\,\mu\mathrm{m}$) from multiple \ac{HE} \acp{WSI}. There are eight classes of tissue phenotypes, namely tumor epithelium, simple stroma (homogeneous composition, includes tumor stroma, extra-tumoral stroma, and smooth muscle), complex stroma (stroma containing single tumor cells and/or few immune cells), immune cells (including immune cell conglomerates and sub-mucosal lymphoid follicles), debris (including necrosis, erythrocytes, and mucus), normal mucosal glands, adipose tissue, and background (no tissue). The dataset is balanced with 625 patches per class.

\noindent \textbf{\acf{K19} Dataset}:
The dataset~\cite{kather2019predicting} consists of patches depicting nine different tissue types: cancer-associated stroma, epithelium, normal colon mucosa, adipose tissue, lymphocytes, mucus, smooth muscle, debris, and background. Each class is roughly equally represented in the dataset. In total, there are $100{,}000$ patches ($224\times224$ pixels) in the training set.

\noindent \textbf{In-house Dataset}:
Our cohort is composed of 665 \ac{HE}-stained \acp{WSI} from our local \ac{CRC} patient cohort. The slides originate from 378 unique patients diagnosed with adenocarcinoma and were scanned at a resolution of 0.248 MPP (microns per pixel, $40\times$). The \acp{WSI} are subsampled to reduce the computational complexity of the proposed approach. From each \ac{WSI}, we uniformly sample 300 regions ($448\times448$ pixels, $111\,\mu\mathrm{m}\times111\,\mu\mathrm{m}$) from the foreground masks, creating a dataset with a total of $199{,}500$ unique, unlabeled patches. We assume that these randomly selected samples are a good estimate of the tissue complexity and heterogeneity of our cohort.
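
These figures can be checked with a short calculation, assuming that the scanning resolution is expressed in micrometers per pixel:
\[
\begin{array}{rcl}
    448 \times 0.248\,\mu\mathrm{m} & \approx & 111.1\,\mu\mathrm{m} \quad \mbox{(patch side length)},\\
    665 \times 300 & = & 199{,}500 \quad \mbox{(total number of patches)}.
\end{array}
\]
For comparison, a \ac{K16} patch covers $74\,\mu\mathrm{m}$ with $150$ pixels, that is, about $74/150\approx 0.49\,\mu\mathrm{m}$ per pixel, so the in-house patches are sampled at roughly twice the resolution of \ac{K16}.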

\noindent \textbf{Inconsistencies between \ac{K16} and \ac{K19}}:
An expert pathologist reviewed all three datasets to identify any potential discrepancies between the class definitions.
We have identified the following issues:
\begin{itemize}
    \item \emph{Complex stroma:} The class is not represented in \ac{K19}. However, a few occurrences of complex stroma are present in both the tumor and stroma classes. Other samples are hard to distinguish from regular stroma without context information.
    \item \emph{Stroma:} In \ac{K16}, the stroma class is a composition of stroma and smooth muscle. When performing domain adaptation, we consider the classes stroma and smooth muscle in \ac{K19} as a single stroma class to match the definition of \ac{K16}.
    \item \emph{Debris:} Similar to stroma in \ac{K16}, the debris class is a mixture of multiple types of tissue. We observe examples of mucin, debris/necrosis, and loose tissue. For domain adaptation, we merge mucin into debris in \ac{K19}. Note that collagenous tissue and blood are not present in \ac{K19}, which is an additional example of open-set domain adaptation.
\end{itemize}
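
The resulting relabeling of \ac{K19} can be summarized as a map $g$ from \ac{K19} labels to \ac{K16} labels. The following is a sketch that uses the abbreviations of \figureref{fig:examples-datasets} and assumes that all classes not mentioned above are matched one-to-one; the symbol $g$ is introduced here for illustration only.
\[
    g(y) = \left\{
    \begin{array}{ll}
        \mathrm{STR} & \mbox{if } y \in \{\mathrm{STR}, \mathrm{MUS}\},\\
        \mathrm{DEB} & \mbox{if } y \in \{\mathrm{DEB}, \mathrm{MUC}\},\\
        y & \mbox{otherwise.}
    \end{array}
    \right.
\]
Under this assumption, the nine \ac{K19} classes reduce to seven after merging, which cover seven of the eight \ac{K16} classes; COMP has no counterpart in \ac{K19}.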

\begin{figure}[t]
\centering
  \includegraphics[width=.99\textwidth]{media/datasets.pdf}
  \caption{Example images of the different tissue types present in the used datasets and their association. We use the following abbreviations: TUM: tumor epithelium, STR: simple stroma, COMP: complex stroma, LYM: lymphocytes, NORM: normal mucosal glands, DEB: debris/necrosis, MUS: muscle, MUC: mucus, ADI: adipose tissue, BACK: background. Examples from the in-house dataset are manually picked for comparison but are not labeled.}
  \label{fig:examples-datasets}
\end{figure}

\section{Self-supervision and the Importance of the Queue}
\label{app:queue}

In this section, we compare the performance of different self-supervised methods with that of the standard supervised learning approach for different amounts of available data. The results are presented in \tableref{tbl:clspercentage}. We report the performance of single-domain classification on \ac{K16} and \ac{K19}. The supervised approach uses ImageNet pre-trained weights. Self-supervised baselines are trained from scratch. For the classification results, we freeze the weights, add a linear classifier on top, and train it until convergence. For SupContrast~\cite{khosla2020supervised}, we jointly train the representation and the classifier as described in the original paper.

We observe that MoCoV2~\cite{chen2020improved} outperforms the two other \ac{SOTA} approaches. On \ac{K16}, the model gains up to $10\%$ in F1-score over the other self-supervised baselines. In addition, MoCoV2 gives results that are competitive with the supervised baseline initialized with ImageNet weights. This suggests that MoCoV2 is able to learn efficiently from unlabeled data to create relevant feature spaces. We attribute this mainly to the combination of the momentum encoder and the queue, which gives access to a large number of negative samples.
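
For reference, the role of the queue can be made explicit with the contrastive objective of \citet{he2020momentum}. As a sketch in their notation (the symbols $q$, $k_{+}$, $k_{i}$, and $K$ are borrowed from that work for illustration), the loss for an encoded query $q$ with positive key $k_{+}$ is
\[
    \mathcal{L}_{q} = -\log \frac{\exp(q\cdot k_{+}/\tau)}{\sum_{i=0}^{K}\exp(q\cdot k_{i}/\tau)},
\]
where the sum runs over the positive key and the $K$ negative keys stored in the queue, and $\tau$ is a temperature. The queue allows $K$ to be chosen independently of the batch size, while the momentum encoder keeps the queued keys consistent across iterations.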

\input{table/tab_percentage_cls}


\section{Ablation Study}
\label{app:ablation}


We present the ablation study of our approach in \tableref{tbl:cross-domain-ablation}. 
We denote $\mathcal{L}_{\mathrm{IND}}$ as the in-domain loss, $\mathcal{L}_{\mathrm{CRD}}$ as the cross-domain loss, and  \ac{E2H} as the easy-to-hard learning scheme. For the baseline (no differentiation between in-domain and cross-domain), we consider the model where the training set $\mathcal{D}$ is the merged source and target domain data as in \cite{he2020momentum}. 
%We can observe the instability of the $\mathcal{L}_{\mathrm{CRD}}$ alone. 
%If we do not impose domain representation the model converges toward incorrect solutions where random sets of samples are matched between the source and target datasets. 
Adding only $\mathcal{L}_{\mathrm{CRD}}$ to the loss yields an unstable model: because we do not impose a domain representation, the model converges toward incorrect solutions in which random sets of samples are matched between the source and target datasets.
$\mathcal{L}_{\mathrm{IND}}$ achieves relatively good performance but fails to generalize knowledge to classes where the texture differs (for example, background).
The introduction of the \ac{E2H} procedure greatly improves the classification performance on the debris and tumor classes while maintaining good performance on the other classes.

\figureref{app:query} highlights the usefulness of the \ac{E2H} scheme. Some tissue types might not have relevant candidates in the other set (open-set scenario).
The example shown in the figure is complex stroma (COMP), which is present in \ac{K16} but not in \ac{K19}. Without \ac{E2H} learning, the model tries to find matching candidates at any cost, even if no suitable ones exist.
This results in the occurrence of a subset of samples that have a near-perfect similarity to the query sample (top-right distribution plot, marked in red). 
Keeping the hyperparameter $r$ (\equationref{eq:r}) at a low level prevents the model from learning degenerate solutions (bottom-right distribution plot, marked in red). The same behavior is observed in other open-set tissue classes (e.g., the absence of blood vessels and collagen in debris).

\begin{figure}[t]
\centering
  \includegraphics[width=0.99\textwidth]{media/query-e2h.pdf}
  \caption{Effect on similarity distribution with (bottom) and without (top) \ac{E2H}. Without \ac{E2H} the model tries to optimize similarity for all queries at any cost and creates out-of-distribution samples (red). With \ac{E2H}, the unpaired examples are still attached to the distribution (red).}
  \label{app:query}
\end{figure}

\input{table/tab_sra_ablation}


\begin{figure}[ht]
\centering
  \includegraphics[width=\textwidth]{media/tsne_full.pdf}
  \caption{The t-SNE projections of the source (\acl{K19}) and target (\acl{K16}) domain embeddings. We show the alignment of the embedding spaces between the source and target domain for all presented models as well as the classes. The classes of \acl{K19} are merged and relabeled according to the \acl{K16} definition. The standard supervised approach is depicted in (a). We compare our approach (i) to other domain adaptation methods (b--h). Our approach (i) qualitatively shows the best alignment between the source and target domains.}
  \label{fig:sup_tsne_full}
\end{figure}

\section{t-SNE Projections}
\label{app:tsne}

In this section, we present results complementary to those in Section~\ref{subsec:crossdomain_cls}. The embeddings for all baselines and the presented approach are displayed in \figureref{fig:sup_tsne_full}. We show the alignment between the source (\ac{K19}) and target (\ac{K16}) embeddings domain-wise as well as class-wise.
% Comment to avoid empty last page ?
