\documentclass{midl} % Include author names
%\documentclass[anon]{midl} % Anonymized submission

% The following packages will be automatically loaded:
% jmlr, amsmath, amssymb, natbib, graphicx, url, algorithm2e
% ifoddpage, relsize and probably more
% make sure they are installed with your latex distribution
%\documentclass[xcolor=table]{beamer}
\usepackage{mwe} % to get dummy images
\usepackage{bm}
\usepackage{amsmath}
\usepackage{dsfont}
\usepackage{natbib}
\usepackage{graphicx}
\usepackage{comment}
\usepackage{floatrow}
\usepackage{hyperref}
% Table float box with bottom caption, box width adjusted to content
\newfloatcommand{capbtabbox}{table}[][\FBwidth]

%\jmlrvolume{-- Under Review}
\jmlryear{2021}
\jmlrworkshop{Full Paper -- MIDL 2021}
%\editors{Under Review for MIDL 2021}

\title[A regularization term for slide correlation reduction]{A regularization term for slide correlation reduction in whole slide image analysis with deep learning}

 % Use \Name{Author Name} to specify the name.
 % If the surname contains spaces, enclose the surname
 % in braces, e.g. \Name{John {Smith Jones}} similarly
 % if the name has a "von" part, e.g \Name{Jane {de Winter}}.
 % If the first letter in the forenames is a diacritic
 % enclose the diacritic in braces, e.g. \Name{{\'E}louise Smith}

 % Two authors with the same address
 % \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\and
 %  \Name{Author Name2} \Email{xyz@sample.edu}\\
 %  \addr Address}

 % Three or more authors with the same address:
 % \midlauthor{\Name{Author Name1} \Email{an1@sample.edu}\\
 %  \Name{Author Name2} \Email{an2@sample.edu}\\
 %  \Name{Author Name3} \Email{an3@sample.edu}\\
 %  \addr Address}


% Authors with different addresses:
% \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\\
% \addr Address 1
% \AND
% \Name{Author Name2} \Email{xyz@sample.edu}\\
% \addr Address 2
% }

%\footnotetext[1]{Contributed equally}

% More complicate cases, e.g. with dual affiliations and joint authorship
\midlauthor{\Name{Hongrun Zhang \nametag{$^{1}$}} \Email{hongrun.zhang@liverpool.ac.uk}\\
\Name{Yanda Meng \nametag{$^{1}$}} \Email{yanda.meng@liverpool.ac.uk} \\
\Name{Xuesheng Qian \nametag{$^{4}$}} \Email{xuesheng.qian@intellicloud.ai} \\
\Name{Xiaoyun Yang \nametag{$^{5}$}} \Email{xyang@remarkholdings.com} \\
\Name{Sarah E. Coupland\thanks{Corresponding authors} \nametag{$^{2,3}$}} \Email{s.e.coupland@liverpool.ac.uk} \\
\Name{Yalin Zheng\footnotemark[1] \nametag{$^{1}$}} \Email{yalin.zheng@liverpool.ac.uk} \\
%\addr $^{1}$ Department of Eye and Vision Science, Institute of Life Course and Medical Sciences, University of Liverpool \\
%\addr $^{2}$ Liverpool Ocular Oncology Research Group, Department of Molecular and Clinical Cancer Medicine, Institute of Systems, Molecular and Integrative Biology, University of Liverpool \\
\addr $^{1}$ Department of Eye and Vision Science, University of Liverpool, Liverpool, UK \\
\addr $^{2}$ Liverpool Ocular Oncology Research Group, University of Liverpool, Liverpool, UK \\
\addr $^{3}$ Liverpool Clinical Laboratories, Liverpool University Hospitals NHS Foundation Trust, Liverpool, UK \\
\addr $^{4}$ Chinese Academy of Sciences (CAS) IntelliCloud Technology Co., Ltd., Shanghai, China \\
\addr $^{5}$ Remark Holdings, London, UK
%\addr $^{2}$ Address 2 \AND
%\Name{Author Name2\midlotherjointauthor\nametag{$^{1}$}} \Email{xyz@sample.edu}
}
%\vspace*{-15mm}
\begin{document}

\maketitle

%\vspace*{-15mm}

\begin{abstract}
Deep learning models for the automatic analysis of histopathology whole slide images (WSIs) usually operate on smaller patches cropped from the WSIs, because a whole WSI is often too large to be fed to a model directly. However, the many patches cropped from the same WSI share common slide features and are therefore strongly correlated, so a trained model tends to associate slide-specific characteristics with the diagnosis, which degrades its generalization capability. Current approaches to alleviating this issue include data pre-processing (stain normalization or color augmentation) and adversarial learning, both of which add computational complexity. As an alternative, we propose a new regularization term, added to the standard loss function, that reduces the correlation of patches from the same WSI. It is intuitive and easy to implement, and introduces less computational overhead than existing approaches. Experimental results show that the proposed regularization term enhances the generalization capability of learning models and consequently yields better performance. The code is available at \url{https://github.com/hrzhang1123/SlideCorrelationReduction}.



\end{abstract}

\begin{keywords}
deep learning, histopathology, whole slide image, over-fitting, uveal melanoma.
\end{keywords}

\section{Introduction}

There has been increasing interest in using pathological whole slide images (WSIs) in the field of digital pathology \cite{litjens2016deep,bandi2018detection,zanjani2018cancer}. However, WSIs are characterized by their tremendously large sizes, ranging from 10k$\times$10k pixels to even 100k$\times$100k pixels. Given this, when a deep learning-based algorithm is applied to the automatic analysis of histopathology WSIs, the atomic entities to be directly processed are smaller cropped patches \cite{hou2016patch}, and thousands of patches can be obtained from a single WSI. In contrast to the large number of patches, the number of WSIs available to train a learning-based model is often comparatively small (e.g., $<50$ in many cases). Meanwhile, patches from the same slide share significant features in terms of appearance or morphology (Figure~\ref{pair_patch}). The cumulative effect of these factors is that a learning-based model tends to link the diagnosis to slide-specific features that are not diagnostically relevant. This leads to severe over-fitting and undermines the model's generalization capability, especially when the number of training slides is small.

Three commonly used approaches exist to mitigate this issue: stain normalization \cite{magee2009colour, macenko2009method}, color augmentation \cite{tellez2019quantifying}, and slide domain adversarial (SDA) training \cite{lafarge2017domain, ganin2016domain}. Stain normalization and color augmentation serve as pre-processing steps, the former to unify the staining appearance of patches from different slides and the latter to introduce color variation into patches as a form of data augmentation. Both require a comparatively large amount of time to pre-process an image before it is fed into a model. SDA uses adversarial training to extract patch features that are agnostic to the slide from which the patch comes. However, SDA requires extra network modules and consequently more computational resources.

In this paper, we propose an alternative approach that directly reduces the correlations between the high-level representations of patches from the same slide. It is intuitive and can be implemented simply by adding a regularization term to standard loss functions during training. Compared with the above three approaches, it requires no extra image pre-processing and introduces no extra network modules. In addition, since the proposed regularization term has only one hyper-parameter to tune, it is easier to search for the optimal configuration. In summary, the contributions of this paper are (1) a new regularization term that directly reduces correlations between patches from the same slide, in order to alleviate over-fitting and enhance the generalization capability of learning models; and (2) an empirical validation on a WSI dataset of uveal melanoma that demonstrates improved performance.

In what follows, we refer to the method trained with the proposed regularization term as slide correlation reduction (SCR).


\begin{figure}[ht]
	\centering
	\subfigure[]{
		\begin{minipage}[b]{0.11\linewidth}
			\includegraphics[width=1\linewidth]{image/H13-02843_HE_13_sw114_sh86_iw73_ih73.png}\vspace{4pt}
			\includegraphics[width=1\linewidth]{image/H13-02843_HE_14_sw114_sh86_iw73_ih74.png}\vspace{4pt}
		\end{minipage}}
		\subfigure[]{
			\begin{minipage}[b]{0.11\linewidth}
				\includegraphics[width=1\linewidth]{image/H13-13973_HE_173_sw103_sh54_iw32_ih30.png}\vspace{4pt}
				\includegraphics[width=1\linewidth]{image/H13-13973_HE_174_sw103_sh54_iw32_ih31.png}\vspace{4pt}
			\end{minipage}}
		\subfigure[]{
			\begin{minipage}[b]{0.11\linewidth}
				\includegraphics[width=1\linewidth]{image/H13-14506_HE_334_sw105_sh99_iw47_ih66.png}\vspace{4pt}
				\includegraphics[width=1\linewidth]{image/H13-14506_HE_335_sw105_sh99_iw47_ih67.png}\vspace{4pt}
			\end{minipage}}
		\subfigure[]{
			\begin{minipage}[b]{0.11\linewidth}
				\includegraphics[width=1\linewidth]{image/H14-05043_HE_12_sw101_sh89_iw14_ih33.png}\vspace{4pt}
				\includegraphics[width=1\linewidth]{image/H14-05043_HE_13_sw101_sh89_iw14_ih34.png}\vspace{4pt}
			\end{minipage}}
		\subfigure[]{
			\begin{minipage}[b]{0.11\linewidth}
				\includegraphics[width=1\linewidth]{image/H14-06625_HE_14_sw105_sh85_iw88_ih41.png}\vspace{4pt}
				\includegraphics[width=1\linewidth]{image/H14-06625_HE_15_sw105_sh85_iw88_ih42.png}\vspace{4pt}
			\end{minipage}}
		\subfigure[]{
			\begin{minipage}[b]{0.11\linewidth}
				\includegraphics[width=1\linewidth]{image/H14-26340_HE_34_sw117_sh83_iw37_ih19.png}\vspace{4pt}
				\includegraphics[width=1\linewidth]{image/H14-26340_HE_35_sw117_sh83_iw38_ih14.png}\vspace{4pt}
			\end{minipage}}
		\subfigure[]{
			\begin{minipage}[b]{0.11\linewidth}
				\includegraphics[width=1\linewidth]{image/H15-07576_HE_32_sw103_sh83_iw22_ih58.png}\vspace{4pt}
				\includegraphics[width=1\linewidth]{image/H15-07576_HE_33_sw103_sh83_iw22_ih59.png}\vspace{4pt}
			\end{minipage}}
		\subfigure[]{
			\begin{minipage}[b]{0.11\linewidth}
				\includegraphics[width=1\linewidth]{image/H15-12193_HE_56_sw107_sh86_iw25_ih34.png}\vspace{4pt}
				\includegraphics[width=1\linewidth]{image/H15-12193_HE_57_sw107_sh86_iw25_ih35.png}\vspace{4pt}
			\end{minipage}}
					\caption{Eight pairs of patches from eight H\&E slides of uveal melanoma.}
					\label{pair_patch}
%\vspace*{-5mm}
\end{figure}
%\vspace*{-5mm}

\section{Dataset and Method}

\subsection{Dataset Description}
The proposed method is verified on the task of predicting the nuclear \textit{BAP1} (\textit{nBAP1}) immunohistochemical expression (positive or negative) of a section of a uveal melanoma \cite{zhang2020piloting} on the basis of haematoxylin-and-eosin (H\&E) stained slides only. Uveal melanoma (UM) is the most common primary intraocular malignancy in adults, and a high proportion of patients develop metastases to the liver, which are unfortunately incurable at present. Should the metastases be detected early, UM patients can undergo liver surgery to prolong survival \cite{gomez2014liverpool,marshall2013mri}. A mutated \textit{BAP1} gene is strongly associated with highly metastatic UM. Whilst this mutation can be determined using genetic analyses, immunohistology can be applied as a surrogate marker, whereby strong nuclear protein staining indicates that the \textit{BAP1} gene is intact, and loss of nuclear staining is related to mutant \textit{BAP1} \cite{kalirai2014lack, farquhar2018patterns}.

This task is a specific case of a group of applications that aim to predict gene expression and mutations from histopathology slides using deep learning, a rapidly growing field in recent years \cite{chen2020classification, schmauch2020deep, sun2019prediction, coudray2018classification}; these applications share similar problem contexts and methodologies. Gene mutations usually result in the same alterations in cellular morphology across the tissue region, so all the patches extracted from a WSI are usually assigned the same label. In such a case, the over-fitting issue tends to be more severe, because a large number of patches from the same slide share similar features and have the same patch label.



In total, 184 cases of enucleated eyes were taken from the pathology archives of {the Royal Liverpool University Hospital}, with each case including one tumour-representative slide scanned at 40$\times$ magnification. We randomly selected 140 slides (66 \textit{BAP1} positive and 74 \textit{BAP1} negative) as the training set and 44 slides (16 \textit{BAP1} positive and 28 \textit{BAP1} negative) as the test set. For each slide, patches of 1024$\times$1024 pixels were tiled from the tumor regions, and each patch was labeled with the \textit{nBAP1} status of the corresponding slide. In total, there were 99,778 patches for training and 30,693 patches for testing.

In the Appendix, we have provided the validation results on the Camelyon16 \cite{bejnordi2017diagnostic} dataset with the task to detect lymph node metastases in women with breast cancer.

\subsection{Correlations between patches from the same slide}

As shown in Figure~\ref{pair_patch}, patches from the same slide are highly similar in appearance and morphology.

We conducted two experiments on the 140 training slides to demonstrate empirically the high correlations between patches cropped from the same slide. The patches cropped from the 140 slides were split into a training set and a validation set in two different ways. In the first experiment, we mixed the patches from different slides and randomly split them into the training set and the validation set; patches from the same slide could therefore appear in both sets. Figure~\ref{fig_curve}(a) shows the training and validation performance over 10 epochs in terms of area under the curve (AUC). In the second experiment, we split the dataset at the slide level to avoid the information leakage of the first experiment; that is, all the patches from a slide were placed either in the training set or in the validation set, never in both. Figure~\ref{fig_curve}(b) shows the corresponding training and validation performance. When patches from the same slide were split randomly between the two sets, the validation performance tracked the training performance over the epochs and reached high AUC values (Figure~\ref{fig_curve}(a)). In contrast, when all patches from a slide were confined to one set, the validation performance was significantly worse than the training performance (Figure~\ref{fig_curve}(b)). Together, these results imply that patches from the same slide are strongly correlated and that the trained model learnt the shared slide features as diagnostic features.



\begin{figure}[ht] 
	\centering
	\subfigure[]{
		\centering
		\begin{minipage}[b]{0.4\linewidth}
			\includegraphics[width=1\linewidth]{image/mix.eps}\vspace{4pt}
		\end{minipage}}
		\subfigure[]{
			\begin{minipage}[b]{0.4\linewidth}
				\includegraphics[width=1\linewidth]{image/unmix.eps}\vspace{4pt}
			\end{minipage}}
			%\vspace*{-7mm}
			\caption{Area under the curve (AUC) on the training set and validation set over 10 training epochs. (a) Patches from different slides are randomly split into the training set and validation set. (b) All the patches from a slide are placed in either the training set or the validation set. ResNet-18 is used.}
			\label{fig_curve}
%\vspace*{-5mm}
\end{figure}

\subsection{The regularization term for slide correlation reduction}

Consider a batch of $N$ patches extracted from different slides. The $i$-th patch is denoted as $a_i$ ($i=1,2,\ldots,N$), and its slide index $s_i$ indicates that the patch is from slide $s_i$; if patches $a_i$ and $a_j$ are from the same slide, then $s_i=s_j$. A feature vector $\bm{f}_i \in \mathds{R}^{D \times 1}$, extracted by a convolutional neural network, serves as the representation of patch $a_i$, where $D$ is the dimension of the vector. The extracted features can be used for various downstream tasks. For the task considered in this paper, predicting the \textit{nBAP1} status of uveal melanoma, the extracted features are fed into a classifier that outputs the probabilities of the corresponding patch being \textit{BAP1} positive and \textit{BAP1} negative, denoted as $\bm{p}_i \in {[0,1]}^2$. The loss function used to train the network on the batch is formulated as

\begin{equation}
  \mathcal{L} =\frac{1}{N} \sum_{i=1}^{N}\mathcal{C}(\bm{p}_i, l_i),
\end{equation}

\noindent where $l_i$ is the ground-truth label for patch $a_i$ and $\mathcal{C}$ is a criterion function that measures the distance between the prediction $\bm{p}_i$ and the ground-truth label $l_i$. Cross-entropy is the most common criterion function for classification.

The contribution of this paper is a regularization term, added to $\mathcal{L}$, that quantifies the correlations of patches from the same slide:

\begin{equation} \label{overal_loss}
	\mathcal{L} =\frac{1}{N} \sum_{i=1}^{N}\mathcal{C}(\bm{p}_i, l_i) + \beta \mathcal{L}_{\textrm{cr}}(\hat{\bm{F}}, \bm{s}),
\end{equation}

\noindent where $\beta$ is a positive weight, $\hat{\bm{F}} \in \mathds{R}^{D \times N}$ is the matrix that stacks the normalized feature vectors in the batch, i.e., the $i$-th column of $\hat{\bm{F}}$ is $\hat{\bm{f}}_i = \mathcal{N}(\bm{f}_i)$, with $\mathcal{N}$ being the operation that normalizes the elements of $\bm{f}_i$ to lie between $-1$ and $1$, and $\bm{s} \in \mathds{R}^{N}$ is the vector of slide indices, i.e., the $i$-th element of $\bm{s}$ is $s_i$. The proposed regularization term is defined as


\begin{equation} \label{Lcr}
\begin{aligned}
\mathcal{L}_{\textrm{cr}} (\hat{\bm{F}}, \bm{s}) = & \frac{1}{D}\sum_{1 \leq i < j \leq N} \hat{\bm{f}}_i^{T} \hat{\bm{f}}_j \ \textrm{I}(i,j) \\
= & \frac{1}{D}\sum_{1 \leq i < j \leq N} \bm{u}_i^{T} \hat{\bm{F}}^{T} \hat{\bm{F}} \ \bm{u}_j \ \textrm{I}(i,j),
\end{aligned}
\end{equation}




\noindent where $\bm{u}_i \in \mathds{R}^{N \times 1}$ is the one-hot vector whose $i$-th element is one and whose other elements are all zero, $T$ denotes the matrix transpose, and $\textrm{I}(i,j)$ is the indicator function,

\begin{equation}
\textrm{I}(i,j) = \left\{
\begin{aligned}
1&, \ \mbox{if} \ s_i = s_j \ \mbox{ (i.e., from the same slide)} \\
0&, \ \mbox{otherwise}
\end{aligned}
\right.
\end{equation}

\noindent Equation (\ref{Lcr}) can be simplified into a matrix operation form as,

\begin{equation} \label{final_Lcr}
\mathcal{L}_{\textrm{cr}} (\hat{\bm{F}}, \bm{s}) = \frac{1}{D} \ \mathcal{S} \big(  \hat{\bm{F}}^{T} \hat{\bm{F}} \odot \bm{M} \big),
\end{equation}

\noindent where $\mathcal{S}$ is the operation that sums all the elements of a matrix, $\odot$ is the element-wise product, and $\bm{M} \in \mathds{R}^{N \times N}$ is a strictly upper-triangular matrix, i.e., $M_{i,j}=0$ for $i \geq j$, whose element in the $i$-th row and $j$-th column ($i < j$) is defined as

\begin{equation}
M_{i,j} = \left\{
\begin{aligned}
1&,  \ \mbox{if} \ s_i = s_j \\
0&, \ \mbox{otherwise}
\end{aligned}
\right.
\end{equation}

\noindent The matrix $\hat{\bm{F}}^{T} \hat{\bm{F}}$ is the Gram matrix \cite{horn2012matrix} of the normalized features: the element in its $i$-th row and $j$-th column is the inner product of $\hat{\bm{f}}_i$ and $\hat{\bm{f}}_j$, which is used here as the measure of their correlation, with a higher value suggesting that the corresponding pair of features is more correlated. $\bm{M}$ serves to select the target correlation values in $\hat{\bm{F}}^{T} \hat{\bm{F}}$.
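
\noindent As a small illustration (a hypothetical batch chosen here for exposition, not one taken from the experiments), consider $N=4$ patches with slide indices $\bm{s} = (1,1,1,2)^{T}$. Because $\bm{M}$ keeps only the pairs $i<j$ with $s_i=s_j$, the mask and the resulting value of Equation (\ref{final_Lcr}) are

\begin{equation*}
\bm{M} =
\begin{bmatrix}
0 & 1 & 1 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0
\end{bmatrix},
\qquad
\mathcal{L}_{\textrm{cr}} (\hat{\bm{F}}, \bm{s}) = \frac{1}{D} \big( \hat{\bm{f}}_1^{T} \hat{\bm{f}}_2 + \hat{\bm{f}}_1^{T} \hat{\bm{f}}_3 + \hat{\bm{f}}_2^{T} \hat{\bm{f}}_3 \big).
\end{equation*}

\noindent The fourth patch is the only one from its slide in this batch, so it does not contribute to the regularization term; the three same-slide pairs are exactly the terms selected by $\textrm{I}(i,j)$ in Equation (\ref{Lcr}).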


 
\subsection{Interpretation}
The two terms in Equation (\ref{overal_loss}) work, in some sense, in an adversarial way. On the one hand, minimizing the first term amounts to searching for feature subspaces that are diagnostically discriminative. However, the learnt subspaces inevitably incorporate slide-specific feature spaces that are not diagnostically relevant and therefore not informative for discriminating between categories, which hampers the generalization of the trained model. The situation is even worse when the number of slides is particularly small. On the other hand, the proposed regularization term (the second term in Equation (\ref{overal_loss})) drives the learnt features away from the subspace characterized by each individual slide. Note that the subspace of an individual slide's features also overlaps with the diagnostic subspace, so a proper weight $\beta$ is required to balance the trade-off.
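
One way to make this intuition concrete is to differentiate Equation (\ref{Lcr}) with respect to a single normalized feature. The following is an illustrative derivation that treats $\hat{\bm{f}}_k$ as a free variable (i.e., it ignores the normalization $\mathcal{N}$ and the backbone network, which is a simplifying assumption):

\begin{equation*}
\frac{\partial \mathcal{L}_{\textrm{cr}}}{\partial \hat{\bm{f}}_k} = \frac{1}{D} \sum_{j \neq k,\ s_j = s_k} \hat{\bm{f}}_j .
\end{equation*}

\noindent Under this simplification, a gradient descent step on the regularization term moves $\hat{\bm{f}}_k$ against the sum of the other features from the same slide in the batch, scaled by $\beta$, and leaves it unchanged when no other patch in the batch comes from slide $s_k$.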




\begin{figure*}[ht]
	\centering
	\subfigure[\scriptsize{Without SCR. ResNet-18}]{
		\begin{minipage}[b]{0.235\linewidth}
			\includegraphics[width=1\linewidth]{image/SNE/cw0.png}\vspace{4pt}
		\end{minipage}}
		\subfigure[\scriptsize{Without SCR. ResNet-50}]{
			\begin{minipage}[b]{0.235\linewidth}
				\includegraphics[width=1\linewidth]{image/SNE/resnet50_cw0.png}\vspace{4pt}
			\end{minipage}}
	\subfigure[\scriptsize{With SCR. ResNet-18}]{
		\begin{minipage}[b]{0.235\linewidth}
			\includegraphics[width=1\linewidth]{image/SNE/cw05.png}\vspace{4pt}
		\end{minipage}}
		\subfigure[\scriptsize{With SCR. ResNet-50}]{
			\begin{minipage}[b]{0.235\linewidth}
				\includegraphics[width=1\linewidth]{image/SNE/resnet50_cw05.png}\vspace{4pt}
			\end{minipage}}
			\caption{t-SNE distribution of the learnt features of patches from 10 slides, without and with slide correlation reduction (SCR). For each slide, 100 patches are considered, and the same color refers to patches from the same slide. Features close to each other in the 2-dimensional space are merged into a larger ellipsoid.}
			\label{fig_sne}
			%\vspace*{-5mm}
		\end{figure*}


%\vspace*{-10mm}
\section{Experiments}

\subsection{Configurations}

Five deep learning architectures were used as the backbone feature extractors in the experiments, namely ResNet-18, ResNet-50 \cite{he2016deep}, DenseNet-121 \cite{huang2017densely}, AlexNet \cite{krizhevsky2012imagenet} and VGG-16 \cite{simonyan2014very}, all pre-trained on the ImageNet dataset \cite{deng2009imagenet}.

Patches were split into the training set and the test set at the slide level. All the patches were resized to 256$\times$256 pixels, and random rotation was used for data augmentation during training. Each model was trained for 10 epochs with an initial learning rate of $5\times10^{-4}$, reduced to $1\times10^{-4}$ after epoch 5. Stochastic gradient descent (SGD) was used as the optimizer with a weight decay of $1\times10^{-4}$. We report slide-level performance, obtained by averaging the predictions of all patches in a slide. Each reported value is the mean over 5 independent experiments. The weight $\beta$ was set to 0.2. For all the experiments, 0.5 was adopted as the threshold to compute all performance metrics except AUC.
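
The slide-level aggregation can be written compactly as follows, where the symbols $\mathcal{A}_s$, $p_i^{+}$ and $\bar{p}_s$ are notation introduced here for illustration only. Let $\mathcal{A}_s$ be the set of patches from slide $s$ and $p_i^{+}$ the predicted probability of the positive class for patch $a_i$ (which of the two classes is treated as positive is an assumption of this sketch). Then

\begin{equation*}
\bar{p}_s = \frac{1}{|\mathcal{A}_s|} \sum_{a_i \in \mathcal{A}_s} p_i^{+},
\end{equation*}

\noindent and the slide is predicted as positive when $\bar{p}_s$ exceeds the threshold of 0.5, whereas AUC is computed from the scores $\bar{p}_s$ directly.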

\subsection{Performance}

Table~\ref{performance_0} compares the baseline (neither SDA nor SCR applied), SDA \cite{lafarge2017domain} and the proposed SCR with different backbone architectures as the feature extractors. For AUC, the most informative performance metric, the proposed method gives the highest value with every backbone architecture. The margin over the baseline is large in most cases, reaching 0.088 with ResNet-50 (0.889 vs 0.801), although it is small with VGG-16 (0.893 vs 0.891). For the other performance metrics, which depend on the 0.5 threshold, the proposed method gives the best or close to the best recall with every backbone, while the results for accuracy, specificity and F1 are mixed.

In Table~\ref{performance_1}, stain normalization (SN) and color jitter (CJ) were used as additional pre-processing approaches (SN with the baseline method; CJ with the baseline method, SDA, and the proposed SCR). CJ served as a form of data augmentation and was implemented by multiplying the brightness, contrast, saturation, and hue of an image by a random coefficient, drawn each time uniformly between 0.8 and 1.2. For the baseline method, CJ gives a higher AUC than SN with every backbone except VGG-16 (0.863 vs 0.870). When CJ is combined with the proposed SCR, the performance improves further: except with DenseNet-121, where the baseline with CJ is slightly better (0.955 vs 0.947), the proposed method achieves the highest AUC of the three methods. In particular, with VGG-16 as the backbone, the proposed method with CJ achieves the highest AUC of all (0.968).




\begin{table}[ht]
	\footnotesize
	\caption{Performance of the baseline method, slide domain adversarial (SDA) and the proposed regularization term of slide correlation reduction (SCR). The subscripts are the standard deviation values. The best AUC values are in bold.}
	\centering
	\label{performance_0}
	\footnotesize
	\begin{tabular}{c|c|cccc|c}
		\hline \hline
		Network     & Method   & Accuracy     & Recall       & Specificity  & F1           & AUC          \\ \hline
& Baseline & 0.649\textsubscript{0.011} & 0.875\textsubscript{0.001} & 0.521\textsubscript{0.017} & 0.645\textsubscript{0.007} & 0.776\textsubscript{0.015} \\
		Resnet18    & SDA      & 0.600\textsubscript{0.027} & 0.812\textsubscript{0.001} & 0.478\textsubscript{0.042} & 0.596\textsubscript{0.016} & 0.772\textsubscript{0.015} \\
		& SCR      & 0.622\textsubscript{0.031} & 0.887\textsubscript{0.024} & 0.471\textsubscript{0.057} & 0.631\textsubscript{0.017} & \textbf{0.834}\textsubscript{0.002} \\ \hline
		& Baseline & 0.519\textsubscript{0.007} & 0.955\textsubscript{0.028} & 0.270\textsubscript{0.026} & 0.591\textsubscript{0.005} & 0.801\textsubscript{0.021} \\
		Resnet50    & SDA      & 0.545\textsubscript{0.038} & 0.574\textsubscript{0.025} & 0.528\textsubscript{0.052} & 0.479\textsubscript{0.028} & 0.650\textsubscript{0.018} \\
		& SCR      & 0.584\textsubscript{0.031} & 0.937\textsubscript{0.001} & 0.382\textsubscript{0.049} & 0.621\textsubscript{0.018} & \textbf{0.889}\textsubscript{0.003} \\ \hline
		& Baseline & 0.813\textsubscript{0.029} & 0.887\textsubscript{0.053} & 0.771\textsubscript{0.069} & 0.776\textsubscript{0.021} & 0.889\textsubscript{0.004} \\
		AlexNet     & SDA      & 0.895\textsubscript{0.018} & 0.899\textsubscript{0.030} & 0.892\textsubscript{0.039} & 0.862\textsubscript{0.018} & 0.916\textsubscript{0.001} \\
		& SCR      & 0.795\textsubscript{0.062} & 0.912\textsubscript{0.050} & 0.728\textsubscript{0.120} & 0.769\textsubscript{0.049} & \textbf{0.920}\textsubscript{0.006} \\ \hline
		& Baseline & 0.836\textsubscript{0.009} & 0.737\textsubscript{0.025} & 0.892\textsubscript{0.001} & 0.766\textsubscript{0.016} & 0.880\textsubscript{0.005} \\
		DenseNet121 & SDA      & 0.859\textsubscript{0.026} & 0.812\textsubscript{0.068} & 0.885\textsubscript{0.014} & 0.806\textsubscript{0.043} & 0.881\textsubscript{0.003} \\
		& SCR      & 0.822\textsubscript{0.017} & 0.887\textsubscript{0.025} & 0.785\textsubscript{0.039} & 0.784\textsubscript{0.012} & \textbf{0.918}\textsubscript{0.005} \\ \hline
		& Baseline & 0.695\textsubscript{0.027} & 0.875\textsubscript{0.001} & 0.592\textsubscript{0.042} & 0.676\textsubscript{0.019} & 0.891\textsubscript{0.007} \\
		VGG16       & SDA      & 0.672\textsubscript{0.030} & 0.75\textsubscript{0.001}  & 0.628\textsubscript{0.048} & 0.625\textsubscript{0.021} & 0.809\textsubscript{0.002} \\
		& SCR      & 0.695\textsubscript{0.018} & 0.937\textsubscript{0.001} & 0.557\textsubscript{0.028} & 0.691\textsubscript{0.012} & \textbf{0.893}\textsubscript{0.010} \\ \hline \hline
	\end{tabular}
	%\vspace*{-5mm}
\end{table}
%\vspace*{-8mm}




\subsection{Feature distribution}
Figure~\ref{fig_sne} presents the feature distributions obtained by mapping the high-dimensional features to two dimensions using t-SNE \cite{van2008visualizing}. Without SCR, the learnt features of patches from the same slide lie closer to each other in the feature space and tend to cluster. Such clustering is more pronounced with larger networks such as ResNet-50, which have higher learning capacity. This suggests that slide-specific features are inevitably encoded in the learnt representations of the patches. In contrast, with the proposed SCR, the learnt features of patches from the same slide are distributed more evenly over the feature space and provide weaker spatial cues that they are from the same slide. This implies that the slide-specific features shared by patches of the same slide have, to some extent, been removed from the learnt features.

\begin{table}[ht]
	\footnotesize
	\caption{Performance of the baseline method, slide domain adversarial (SDA) and the proposed slide correlation reduction (SCR), with stain normalization (SN) and color jitter (CJ) serving as the extra pre-processing methods. The subscripts are the standard deviation values. The best AUC values are in bold.}
	\label{performance_1}
	\centering
	\footnotesize
	\begin{tabular}{c|c|cccc|c}
		\hline \hline
		Network     & Method      & Accuracy     & Recall        & Specificity  & F1            & AUC           \\ \hline
		& Baseline+SN & 0.850\textsubscript{0.011} & 0.774\textsubscript{0.030}  & 0.892\textsubscript{0.001} & 0.789\textsubscript{0.018}  & 0.890\textsubscript{0.002}  \\
		Resnet18    & Baseline+CJ & 0.873\textsubscript{0.016} & 0.892\textsubscript{0.028}  & 0.862\textsubscript{0.012} & 0.8367\textsubscript{0.021} & 0.927\textsubscript{0.002}  \\
		& SDA+CJ      & 0.899\textsubscript{0.011} & 0.937\textsubscript{0.001}  & 0.878\textsubscript{0.017} & 0.872\textsubscript{0.012}  & 0.9334\textsubscript{0.002} \\
		& SCR+CJ      & 0.889\textsubscript{0.022} & 0.928\textsubscript{0.021}  & 0.867\textsubscript{0.031} & 0.859\textsubscript{0.026}  & \textbf{0.951}\textsubscript{0.005}  \\ \hline
		& Baseline+SN & 0.854\textsubscript{0.023} & 0.800\textsubscript{0.025}  & 0.885\textsubscript{0.026} & 0.800\textsubscript{0.029}  & 0.899\textsubscript{0.006}  \\
		Resnet50    & Baseline+CJ & 0.777\textsubscript{0.009} & 0.937\textsubscript{0.002}  & 0.685\textsubscript{0.014} & 0.753\textsubscript{0.007}  & 0.936\textsubscript{0.005}  \\
		& SDA+CJ      & 0.809\textsubscript{0.018} & 0.862\textsubscript{0.025}  & 0.778\textsubscript{0.034} & 0.766\textsubscript{0.016}  & 0.919\textsubscript{0.008}  \\
		& SCR+CJ      & 0.845\textsubscript{0.017} & 0.937\textsubscript{0.001}  & 0.792\textsubscript{0.026} & 0.815\textsubscript{0.016}  & \textbf{0.953}\textsubscript{0.002}  \\ \hline
		& Baseline+SN & 0.831\textsubscript{0.023} & 0.800\textsubscript{0.072}  & 0.850\textsubscript{0.014} & 0.774\textsubscript{0.040}  & 0.887\textsubscript{0.004}  \\
		AlexNet     & Baseline+CJ & 0.850\textsubscript{0.011} & 0.812\textsubscript{0.001}  & 0.871\textsubscript{0.017} & 0.797\textsubscript{0.012}  & 0.911\textsubscript{0.003}  \\
		& SDA+CJ      & 0.799\textsubscript{0.030} & 0.875\textsubscript{0.001}  & 0.757\textsubscript{0.047} & 0.761\textsubscript{0.027}  & 0.915\textsubscript{0.003}  \\
		& SCR+CJ      & 0.859\textsubscript{0.022} & 0.875\textsubscript{0.039}  & 0.850\textsubscript{0.052} & 0.819\textsubscript{0.019}  & \textbf{0.932}\textsubscript{0.002}  \\ \hline
		& Baseline+SN & 0.768\textsubscript{0.009} & 0.600\textsubscript{0.030}  & 0.864\textsubscript{0.014} & 0.652\textsubscript{0.018}  & 0.877\textsubscript{0.002}  \\
		DenseNet121 & Baseline+CJ & 0.836\textsubscript{0.009} & 0.862\textsubscript{0.025}  & 0.821\textsubscript{0.001} & 0.792\textsubscript{0.014}  & \textbf{0.955}\textsubscript{0.003}  \\
		& SDA+CJ      & 0.836\textsubscript{0.009} & 0.875\textsubscript{0.001}  & 0.814\textsubscript{0.014} & 0.795\textsubscript{0.008}  & 0.929\textsubscript{0.002}  \\
		& SCR+CJ      & 0.863\textsubscript{0.014} & 0.850\textsubscript{0.030}  & 0.871\textsubscript{0.017} & 0.819\textsubscript{0.019}  & 0.947\textsubscript{0.004}  \\ \hline
		& Baseline+SN & 0.795\textsubscript{0.020} & 0.612\textsubscript{0.061}  & 0.899\textsubscript{0.014} & 0.683\textsubscript{0.041}  & 0.870\textsubscript{0.007}  \\
		VGG16       & Baseline+CJ & 0.745\textsubscript{0.037} & 0.762\textsubscript{0.027}  & 0.735\textsubscript{0.059} & 0.685\textsubscript{0.031}  & 0.863\textsubscript{0.010}  \\
		& SDA+CJ      & 0.777\textsubscript{0.033} & 0.812\textsubscript{0.055}  & 0.757\textsubscript{0.057} & 0.726\textsubscript{0.034}  & 0.874\textsubscript{0.017}  \\
		& SCR+CJ      & 0.872\textsubscript{0.011} & 0.837\textsubscript{0.030}  & 0.892\textsubscript{0.001} & 0.826\textsubscript{0.017}  & \textbf{0.968}\textsubscript{0.003}  \\ \hline \hline
	\end{tabular}
	%\vspace*{-5mm}
\end{table}





























%\vspace*{-5mm}







\section{Conclusion}
In this paper, we proposed an intuitive and easy-to-implement regularization term, slide correlation reduction (SCR), which is added to the standard loss function to reduce the correlation between patches from the same slide and thereby improve the generalization capability of deep learning models. We applied this approach to the analysis of histopathology WSIs for the prediction of \textit{nBAP1} status, where it improved performance over existing approaches and was compatible with, and effective for, a variety of network architectures. We expect SCR to extend to other applications in which such correlation is a concern.




% Acknowledgments---Will not appear in anonymized version
\midlacknowledgments{We thank a large number of people associated with this project, including the uveal melanoma patients. Zhang H. thanks Chinese Academy of Sciences IntelliCloud Technology Co., Ltd. for the industry studentship.}


\bibliography{Zhang21}





\clearpage

\appendix
\section{Ablation Study}


We selected the configuration with SCR and color jitter as pre-processing, which achieved the best overall performance (see Table~\ref{performance_1} in the main paper), to explore how performance varies with the value of $\beta$ in Eq.~(\ref{overal_loss}). Figure~\ref{fig_beta} shows that performance peaks around $\beta=0.2$ and decreases slightly as $\beta$ increases further. Nevertheless, over a wide range of $\beta$ values, performance is better than without the SCR regularization term ($\beta=0$).

To further demonstrate that it is the reduction in slide correlation that improves the generalization capability of a deep learning model, we conducted experiments in which slide correlations were enhanced rather than reduced, simply by changing the plus sign to a minus sign in Eq.~(\ref{overal_loss}). Table~\ref{tab_enhance} presents the corresponding AUC values. Enhancing slide correlations (denoted as SCE) yields lower AUC values than reducing them for every architecture, and for AlexNet and VGG16 the AUC values are even lower than those of the baseline.
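
As a reading aid, the two variants can be written side by side. The form below is a sketch that assumes Eq.~(\ref{overal_loss}) adds the slide correlation term to the classification loss with weight $\beta$; the symbols $\mathcal{L}_{\mathrm{cls}}$ and $\mathcal{L}_{\mathrm{corr}}$ are shorthand introduced here for illustration and may differ from the notation of the main paper:
\begin{align*}
\text{SCR:}\quad & \mathcal{L} = \mathcal{L}_{\mathrm{cls}} + \beta\,\mathcal{L}_{\mathrm{corr}}, \\
\text{SCE:}\quad & \mathcal{L} = \mathcal{L}_{\mathrm{cls}} - \beta\,\mathcal{L}_{\mathrm{corr}}, \qquad \beta > 0.
\end{align*}
Under this assumption, minimizing the SCE objective rewards larger correlation between patches of the same slide, whereas setting $\beta = 0$ in either form recovers the baseline.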

 


\begin{figure}[ht]
	\begin{floatrow}
		\ffigbox{%
				\includegraphics[width=5cm]{image/beta.eps}  \vspace*{-6mm}
		}{%
		\caption{ \footnotesize{Performances of slide correlation reduction using VGG16 with different values of $\beta$.}}%
		\label{fig_beta}
	}
	\capbtabbox{% 
	\centering
	\footnotesize
	\begin{tabular}{c|ccc} 
		\hline
		Method      & Baseline & SCE   & SCR   \\ \hline
		ResNet18    & 0.776    & 0.788 & 0.834 \\
		ResNet50    & 0.801    & 0.803 & 0.889 \\
		AlexNet     & 0.889    & 0.852 & 0.920 \\
		DenseNet121 & 0.880    & 0.885 & 0.918 \\
		VGG16       & 0.891    & 0.841 & 0.893 \\ \hline
	\end{tabular} %\vspace*{1mm}
	}{%
	\caption{ \footnotesize{AUC values of the baseline, slide correlation enhancement (SCE) and slide correlation reduction (SCR) methods.}} \label{tab_enhance}%  
} 
\end{floatrow}
\end{figure}







\section{Comparisons of the AUC values on training set and test set}

Table~\ref{tab_training_test_performance} presents the AUC values on the training set and the test set. Without SCR, the baseline method achieves extremely high AUC values on the training set, higher than those obtained with SCR for every architecture. In contrast, on the test set the baseline is inferior to SCR for four of the five architectures (all except DenseNet121). This suggests that training with the proposed SCR alleviates over-fitting to some extent, and consequently the trained model generalizes better.


\begin{table}[ht]
	\footnotesize
	\caption{AUC values on the training set and test set with and without the slide correlation reduction (SCR) method. Both are with color jitter for pre-processing.}
	\label{tab_training_test_performance}
	\centering
	\footnotesize
	\begin{tabular}{c|cc|cc}
		\hline
		& \multicolumn{2}{c|}{without SCR} & \multicolumn{2}{c}{with SCR} \\ \hline
		Method      & Training                  & Test                   & Training                & Test                 \\ \hline
		ResNet18    & 0.981                     & 0.927                  & 0.956                   & 0.951                \\
		ResNet50    & 0.995                     & 0.936                  & 0.981                   & 0.953                \\
		AlexNet     & 0.962                     & 0.911                  & 0.937                   & 0.932                \\
		DenseNet121 & 0.993                     & 0.955                  & 0.977                   & 0.947                \\
		VGG16       & 0.992                     & 0.863                  & 0.986                   & 0.968                \\ \hline
	\end{tabular}
	\vspace*{-5mm}
\end{table}
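
The degree of over-fitting can be made concrete with the gap between training and test AUC, $\Delta = \mathrm{AUC}_{\mathrm{train}} - \mathrm{AUC}_{\mathrm{test}}$, a quantity introduced here only as a worked reading of Table~\ref{tab_training_test_performance}. Writing $\Delta_{0}$ for the gap without SCR and $\Delta_{\mathrm{SCR}}$ for the gap with SCR (both subscripts are illustrative notation), the table entries give
\begin{align*}
\text{ResNet18:}\quad \Delta_{0} &= 0.981 - 0.927 = 0.054, & \Delta_{\mathrm{SCR}} &= 0.956 - 0.951 = 0.005, \\
\text{ResNet50:}\quad \Delta_{0} &= 0.995 - 0.936 = 0.059, & \Delta_{\mathrm{SCR}} &= 0.981 - 0.953 = 0.028, \\
\text{AlexNet:}\quad \Delta_{0} &= 0.962 - 0.911 = 0.051, & \Delta_{\mathrm{SCR}} &= 0.937 - 0.932 = 0.005, \\
\text{DenseNet121:}\quad \Delta_{0} &= 0.993 - 0.955 = 0.038, & \Delta_{\mathrm{SCR}} &= 0.977 - 0.947 = 0.030, \\
\text{VGG16:}\quad \Delta_{0} &= 0.992 - 0.863 = 0.129, & \Delta_{\mathrm{SCR}} &= 0.986 - 0.968 = 0.018.
\end{align*}
The gap is smaller with SCR for all five architectures, including DenseNet121, for which the test AUC itself is slightly lower with SCR.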



\section{Validation on the Camelyon16 dataset}

The Camelyon16 dataset \cite{bejnordi2017diagnostic} contains 270 WSIs (160 normal and 110 tumor) for training and 130 WSIs (81 normal and 49 tumor) for testing. Following \cite{li2018cancer}, we used the first 140 normal slides and the first 100 tumor slides for training, and the remaining slides for validation. From the training slides, 50,000 normal patches and 50,000 tumor patches were extracted (100,000 training patches in total). From the validation slides, 10,000 normal patches and 10,000 tumor patches were extracted. All patches were extracted at $40\times$ magnification with a size of $256 \times 256$ pixels. Random cropping to $224 \times 224$ pixels and random rotation/flipping were used as data augmentation. The networks were trained for 15 epochs with a constant learning rate of 0.001. For more details, please refer to the released code. Table~\ref{table_camelyon} presents the patch-level AUC values on the validation set. The proposed SCR achieves higher AUC values than both the baseline method and the slide domain adversary (SDA) for the two backbone networks (ResNet18 and ResNet50), with and without color jitter (CJ) as data augmentation.
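
The split sizes implied by this description can be checked with a short calculation. It assumes that ``the remaining slides'' refers to the rest of the 270 training WSIs, which is an interpretation rather than a statement taken from the released code:
\begin{align*}
\text{training slides:}\quad & 140 + 100 = 240, \\
\text{validation slides:}\quad & (160 - 140) + (110 - 100) = 20 + 10 = 30, \\
\text{training patches:}\quad & 2 \times 50{,}000 = 100{,}000, \\
\text{validation patches:}\quad & 2 \times 10{,}000 = 20{,}000.
\end{align*}
Under this reading, both patch sets are balanced between the normal and tumor classes, while the validation patches are drawn from far fewer slides than the training patches.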




\begin{table}[ht]
\centering
\begin{tabular}{c|cccc}
\hline \hline
         & ResNet18 & ResNet18 (CJ) & ResNet50 & ResNet50 (CJ) \\ \hline 
Baseline & 0.910\textsubscript{0.002}    & 0.922\textsubscript{0.003}    & 0.909\textsubscript{0.005}    & 0.926\textsubscript{0.003}        \\
SDA      & 0.906\textsubscript{0.006}    & 0.923\textsubscript{0.002}    & 0.918\textsubscript{0.003}    & 0.929\textsubscript{0.003}        \\
SCR      & 0.923\textsubscript{0.002}    & 0.934\textsubscript{0.001}    & 0.922\textsubscript{0.004}    & 0.931\textsubscript{0.002}        \\ \hline \hline
\end{tabular}
\caption{Patch-level AUC values of the baseline method, SDA and the proposed SCR on the validation set of Camelyon16. CJ: color jitter.}
\label{table_camelyon}
\end{table}




























\end{document}
