\documentclass{midl} % Include author names
%\documentclass[anon]{midl} % Anonymized submission

% The following packages will be automatically loaded:
% jmlr, amsmath, amssymb, natbib, graphicx, url, algorithm2e
% ifoddpage, relsize and probably more
% make sure they are installed with your latex distribution

\usepackage{mwe} % to get dummy images

% Header for extended abstracts
\jmlrproceedings{MIDL}{Medical Imaging with Deep Learning}
\jmlrpages{}
\jmlryear{2021}

% to be uncommented for submissions under review
%\jmlrworkshop{Short Paper -- MIDL 2021 submission}
%\jmlrvolume{-- Under Review}
%\editors{Under Review for MIDL 2021}

\title[Comparison of CNN models on a multi-scanner database]{Comparison of CNN models on a multi-scanner database in colon cancer histology}

 % Use \Name{Author Name} to specify the name.
 % If the surname contains spaces, enclose the surname
 % in braces, e.g. \Name{John {Smith Jones}} similarly
 % if the name has a "von" part, e.g \Name{Jane {de Winter}}.
 % If the first letter in the forenames is a diacritic
 % enclose the diacritic in braces, e.g. \Name{{\'E}louise Smith}

 % Two authors with the same address
 % \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\and
 %  \Name{Author Name2} \Email{xyz@sample.edu}\\
 %  \addr Address}

 % Three or more authors with the same address:
 % \midlauthor{\Name{Author Name1} \Email{an1@sample.edu}\\
 %  \Name{Author Name2} \Email{an2@sample.edu}\\
 %  \Name{Author Name3} \Email{an3@sample.edu}\\
 %  \addr Address}


% Authors with different addresses:
% \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\\
% \addr Address 1
% \AND
% \Name{Author Name2} \Email{xyz@sample.edu}\\
% \addr Address 2
% }

%\footnotetext[1]{Contributed equally}

% More complicated cases, e.g. with dual affiliations and joint authorship
\midlauthor{\Name{Petr Kuritcyn\midlotherjointauthor\nametag{$^{1}$}} \Email{petr.kuritcyn@iis.fraunhofer.de}\\
	\Name{Michaela Benz\midlotherjointauthor\nametag{$^{1}$}} \Email{michaela.benz@iis.fraunhofer.de}\\
	\Name{Jakob Dexl\midlotherjointauthor\nametag{$^{1}$}} \Email{dexljb@iis.fraunhofer.de}\\
	\Name{Volker Bruns\midlotherjointauthor\nametag{$^{1}$}} \Email{volker.bruns@iis.fraunhofer.de}\\
\addr $^{1}$ Fraunhofer Institute for Integrated Circuits IIS\AND
\Name{Arndt Hartmann\midlotherjointauthor\nametag{$^{2}$}} \Email{Arndt.Hartmann@uk-erlangen.de}\\
\Name{Carol I. Geppert\nametag{$^{2}$}} \Email{Carol.Geppert@uk-erlangen.de}\\
\addr $^{2}$ Institute of Pathology, University Hospital Erlangen}


\begin{document}

\maketitle

\begin{abstract}
One of the most important challenges for computer-aided analysis in digital pathology is the development of robust deep neural networks that can cope with variations in color and resolution of digitized whole-slide images (WSIs). Color augmentation during training has been shown to help models generalize better to heterogeneous data. In this work, we compare several state-of-the-art models on a multi-scanner database comprising slides that were each digitized with six different scanners. All networks are trained on data from a single scanner using a combination of color and blur augmentation. All models show similar tendencies across the scanner-specific test datasets but differ in overall classification accuracy. Differences in training and inference time, however, are more pronounced: on a mid-range GPU, the fastest model (QuickNet) performs inference 13 times faster than the slowest one (EfficientNet B4). There is also a trade-off between speed and accuracy: the slower networks are more stable across scanners and achieve the best overall performance. EfficientNet B0 offers a good compromise between accuracy and inference time.


\end{abstract}

\begin{keywords}
Histopathology; Data Augmentation; Tissue Classification; CNN
\end{keywords}

\section{Introduction}

Slide digitization is a crucial step in computational pathology and may introduce significant variations in color and resolution when different scanners are used. In our previous work \cite{Kuritcyn:InProceedings:2021}, we addressed this challenge by applying different augmentation techniques during training and evaluated them on our multi-scanner database of 30 slides, each digitized with six different scanners. Another challenge in digital pathology is the huge size of whole-slide images, which can comprise several gigapixels and lead to long computation times. Therefore, in this work, we compare different state-of-the-art networks (three versions of EfficientNet: B0, B3 and B4 \cite{Tan:InProceedings:2019}, Xception, an adapted version of Xception, Inception, ResNet, DenseNet, MobileNet and QuickNet) in terms of their robustness to scanner variations and their inference time.


\section{Materials and Methods}

Our dataset consists of 161 hematoxylin and eosin (H\&E)-stained colon tissue sections with manual annotations of seven tissue classes (e.g., tumor cells and mucosa). The training and validation data are derived from 122 WSIs acquired with a 3DHISTECH MIDI scanner, yielding 2,173,515 labeled image patches of $224 \times 224$ pixels for training and 719,000 patches for validation. A disjoint set of 30 slides was scanned with six different scanners, and the annotations were transferred automatically, resulting in scanner-specific test datasets that each comprise more than 500,000 image patches. The scanner resolutions range from 0.17 to 0.35~$\mu$m/pixel, and the scans show significant color variations. A more detailed description of the datasets is given in \cite{Kuritcyn:InProceedings:2021}.
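
As a rough illustration of this resolution range, assuming patches are cut at each scanner's native resolution (any rescaling to a common pixel size is not restated here), the edge of a $224$-pixel patch spans
\begin{equation*}
224 \times 0.17\,\mu\mathrm{m} \approx 38.1\,\mu\mathrm{m}
\quad\text{to}\quad
224 \times 0.35\,\mu\mathrm{m} = 78.4\,\mu\mathrm{m},
\end{equation*}
i.e.\ under this assumption the tissue area covered by one patch would differ by a factor of about $0.35/0.17 \approx 2.06$ in each dimension between the finest and the coarsest scanner.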

 
Training was carried out on an NVIDIA Tesla P100 using the TensorFlow framework. The batch size was set to 105 for all models except EfficientNet B3, B4 and QuickNet, for which it was reduced to 35 due to GPU memory constraints. All models were trained with the Adam optimizer using an initial learning rate of 0.001 and exponential decay. Each network was trained three times, and the test results were averaged (see \figureref{fig:performance}). We introduced color variation into the training data using a combination of hue, saturation and H\&E color augmentations \cite{Tellez:Article:2019}. Additionally, we applied a blur augmentation to account for out-of-focus regions present in some WSIs. No geometric augmentations were applied. Besides the standard Xception model, we also trained the adapted version (Xception adapt) described in \cite{Kuritcyn:InProceedings:2021}.
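
The two stochastic components of this training setup can be sketched as follows. This is a sketch of the general formulations as we understand them, not a restatement of the exact implementation; the perturbation strength $\sigma$, the decay rate $\gamma$ and the decay interval $K$ are placeholders whose values are not reported here. In the H\&E augmentation of \cite{Tellez:Article:2019}, each RGB pixel is mapped to stain concentrations $s_c$ by color deconvolution, each stain channel is perturbed independently, and the result is mapped back to RGB:
\begin{equation*}
s_c' = \alpha_c\, s_c + \beta_c, \qquad
\alpha_c \sim \mathcal{U}(1-\sigma,\, 1+\sigma), \quad
\beta_c \sim \mathcal{U}(-\sigma,\, \sigma), \quad
c \in \{\mathrm{H}, \mathrm{E}, \mathrm{D}\}.
\end{equation*}
An exponential learning-rate decay of the kind provided by TensorFlow's \texttt{ExponentialDecay} schedule (whether a staircase variant was used is not stated) sets the learning rate at training step $k$ to
\begin{equation*}
\eta_k = \eta_0\, \gamma^{k/K}, \qquad \eta_0 = 0.001, \quad 0 < \gamma < 1.
\end{equation*}
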
Inference times were measured on a mid-range NVIDIA GeForce GTX 1060 GPU with 6 GB of memory using the TensorFlow 2.3 C API, with a batch size of 30 and averaged over 5275 batches. Training and inference times are shown in \figureref{fig:time}.
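
For orientation, assuming all batches are full, the timing runs cover $30 \times 5275 = 158{,}250$ image patches, so the mean inference time per patch plotted in \figureref{fig:time} (right) presumably corresponds to
\begin{equation*}
\bar{t}_{\text{patch}} = \frac{T_{\text{total}}}{30 \times 5275} = \frac{T_{\text{total}}}{158{,}250},
\end{equation*}
where $T_{\text{total}}$ (notation introduced only for this illustration) is the summed inference time of all batches. Under this reading, the speed-up factor of 13 between QuickNet and EfficientNet B4 stated in the abstract is the ratio $\bar{t}_{\text{patch}}^{\,\text{B4}} / \bar{t}_{\text{patch}}^{\,\text{QuickNet}}$.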

\setlength{\abovecaptionskip}{10pt plus 1pt minus 1pt}
\begin{figure}[htbp]
	% Caption and label go in the first argument and the figure contents
	% go in the second argument
	\floatconts
	{fig:performance}
	{\vspace*{-5mm}\caption{Average classification accuracy on the different scanner test datasets. All models were trained on the 3DHISTECH MIDI scanner (Original).}\vspace*{-5mm}}
	{\includegraphics[width=0.98\linewidth
		, height=5.8cm
		]{bar_des_final_ratio_scale.pdf}}
\end{figure}


\begin{figure}[htbp]
	% Caption and label go in the first argument and the figure contents
	% go in the second argument
	\floatconts
	{fig:time}
	{\vspace*{-5mm}\caption{Mean training time until early stopping (left). Mean classification accuracy over all datasets plotted against the mean inference time per image patch (right).}\vspace*{-5mm }}
	{\includegraphics[width=1.0\linewidth, height = 5.3cm]{time_final_inch_scale.pdf}}
\end{figure}

\section{Results and Conclusion}

Most models achieve accuracies of around 90\% on all datasets except iSTIX, and their standard deviation is likewise lower on the other datasets than on iSTIX. The iSTIX images are of poorer quality owing to the manual scanning process. Compared to our previous work, the additional blur augmentation markedly increases the accuracy on the iSTIX dataset from 0.621 to 0.761 (Xception adapt). The three EfficientNet models achieve the highest mean accuracies over all datasets. On ImageNet, however, Xception outperforms EfficientNet B0 \cite{Tan:InProceedings:2019}, which suggests that ImageNet rankings do not transfer directly to the domain of histopathology. Which model offers the best trade-off between accuracy and inference speed depends on how these two attributes are weighted. To obtain a single score that weights accuracy and speed equally, we normalize both dimensions to the range from 0 (slowest/least accurate) to 1 (fastest/most accurate) and average the two values for each model. EfficientNet B0 achieves the best overall score (0.73), while all other models score between 0.42 (Inception v3) and 0.65 (MobileNet).
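
Written out, and assuming a linear min--max scaling of the mean accuracy $\bar{a}_m$ over all datasets and of the mean inference time per patch $\bar{t}_m$ (rather than, e.g., of throughput), the score of model $m$ among the compared models $j$ reads
\begin{equation*}
S_m = \frac{1}{2}\left(
\frac{\bar{a}_m - \min_j \bar{a}_j}{\max_j \bar{a}_j - \min_j \bar{a}_j}
+
\frac{\max_j \bar{t}_j - \bar{t}_m}{\max_j \bar{t}_j - \min_j \bar{t}_j}
\right),
\end{equation*}
so that the most accurate model receives 1 in the first term, the fastest model receives 1 in the second term, and a hypothetical model that is both would reach $S_m = 1$.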

% Acknowledgments---Will not appear in anonymized version
\midlacknowledgments{This work was supported by the Bavarian Ministry of Economic Affairs, Regional Development and Energy through the Center for Analytics -- Data -- Applications (ADA-Center) within ``BAYERN DIGITAL II'' and by the BMBF (16FMD01K, 16FMD02 and 16FMD03).}
\bibliography{midl-samplebibliography}
\end{document}