\documentclass{midl} % Include author names
%\documentclass[anon]{midl} % Anonymized submission

% The following packages will be automatically loaded:
% jmlr, amsmath, amssymb, natbib, graphicx, url, algorithm2e
% ifoddpage, relsize and probably more
% make sure they are installed with your latex distribution

% figures and tables
\usepackage{svg}
\usepackage{booktabs}
\usepackage{adjustbox}
\usepackage{wrapfig}
\usepackage{multicol}
\usepackage{multirow}

\jmlryear{2024}
\jmlrworkshop{Full Paper -- MIDL 2024}
\jmlrvolume{-- nnn}
\editors{Accepted for publication at MIDL 2024}
%\title[Cell detection and classification]{Cell nuclei detection and classification at scale with transformers}
\title[Cell detection transformers]{Cell-DETR: Efficient cell detection and classification in WSIs with transformers}

 % Use \Name{Author Name} to specify the name.
 % If the surname contains spaces, enclose the surname
 % in braces, e.g. \Name{John {Smith Jones}} similarly
 % if the name has a "von" part, e.g \Name{Jane {de Winter}}.
 % If the first letter in the forenames is a diacritic
 % enclose the diacritic in braces, e.g. \Name{{\'E}louise Smith}

 % Two authors with the same address
 % \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\and
 %  \Name{Author Name2} \Email{xyz@sample.edu}\\
 %  \addr Address}

 % Three or more authors with the same address:
 % \midlauthor{\Name{Author Name1} \Email{an1@sample.edu}\\
 %  \Name{Author Name2} \Email{an2@sample.edu}\\
 %  \Name{Author Name3} \Email{an3@sample.edu}\\
 %  \addr Address}


% Authors with different addresses:
% \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\\
% \addr Address 1
% \AND
% \Name{Author Name2} \Email{xyz@sample.edu}\\
% \addr Address 2
% }

%\footnotetext[1]{Contributed equally}

% More complicate cases, e.g. with dual affiliations and joint authorship
\midlauthor{\Name{Oscar Pina\nametag{$^{1}$}} \Email{oscar.pina@upc.edu}\\
\addr $^{1}$ Universitat Politècnica de Catalunya - BarcelonaTech (UPC) \AND
\Name{Eduard Dorca\nametag{$^{2}$}} \Email{edorca@hospitalbellvitge.cat}\\
\addr $^{2}$ Hospital Universitari de Bellvitge (HUB) \AND
\Name{Verónica Vilaplana\nametag{$^{1}$}} \Email{veronica.vilaplana@upc.edu}\\
}

\begin{document}

\maketitle

\begin{abstract}
Understanding cell interactions and subpopulation distribution is crucial for pathologists to support their diagnoses. This cell information is traditionally extracted with segmentation methods, which pose significant computational challenges when processing Whole Slide Images (WSIs) due to their gigapixel size. Nonetheless, the clinically relevant tasks are nuclei detection and classification rather than segmentation. In this manuscript, we undertake a comprehensive exploration of the applicability of detection transformers for cell detection and classification (Cell-DETR). Not only do we demonstrate the effectiveness of the method by achieving state-of-the-art performance on well-established benchmarks, but we also develop a pipeline to tackle these tasks on WSIs at scale to enable the development of downstream applications. We show its efficiency and feasibility by reporting a $3.4\times$ faster inference time on a dataset featuring large WSIs. By addressing the challenges associated with large-scale cell detection, our work contributes valuable insights that pave the way for the development of scalable diagnosis pipelines based on cell-level information.

%By addressing the challenges associated with large-scale cell detection, our work contributes valuable insights that pave the way for improved diagnostics and downstream applications leveraging cell-level information.

%Understanding cell interactions and subpopulation distribution is crucial for pathologists to support their diagnoses. This cell information is traditionally extracted from cell segmentation methods, which poses significant computational challenges, making the processing of Whole Slide Images (WSIs) a significant challenge due to their immense size. Nonetheless, the clinically relevant tasks are cell nuclei detection and classification, rather than segmentation. In this manuscript, we undertake a comprehensive exploration of the applicability of detection transformers for cell detection and classification. We also develop a robust pipeline designed to tackle these tasks on WSIs at scale to enable the development of downstream applications. Our study not only demonstrates the effectiveness of the proposed method in achieving state-of-the-art performance on well-established benchmarks for both cell detection and classification, but also sheds light on the efficiency of the approach by reporting inference times on a dataset featuring large WSIs. By addressing the challenges associated with large-scale cell detection and classification, our work contributes valuable insights that pave the way for improved diagnostics and downstream applications in digital pathology.

\end{abstract}

\begin{keywords}
Object detection, cell detection, transformers
\end{keywords}

\section{Introduction}
\label{sec:intro}

The integration of deep learning methods into digital pathology image analysis is reshaping medical practices, offering unprecedented opportunities for enhanced diagnostics. These techniques span from analyzing individual cells to examining Whole Slide Images (WSIs). Pathologists often rely on the composition of diverse cell subtypes, and other biological entities such as glands, in order to support their diagnoses, making the precise identification of cell nuclei imperative for effective computer-aided diagnosis applications. Indeed, applications leveraging cell information are gaining considerable attention \cite{pati2022hierarchical, jaume2021quantifying}.

%While cell segmentation and classification are well-explored tasks in digital pathology \cite{graham2019hover, hörst2023cellvit}, they pose significant computational demands, particularly due to the high resolution of output segmentation masks. Additionally, these methods often involve expensive post-processing steps to convert predicted masks into final results. For instance, \cite{graham2019hover} combines nuclear pixel, horizontal distance, and vertical distance maps to obtain instance segmentation. The giga-size of WSIs introduces additional computational challenges, restricting the application of cell segmentation and classification to smaller images and hindering the development of computer-aided diagnoses based on cell information.

%The task of cell detection and classification typically diverges from traditional detection approaches and is commonly addressed as a segmentation task. Given the small size of cell nuclei and the potential overlap, detecting them poses a significant challenge. Furthermore, the classification may hinge on nuanced details of the nuclei, necessitating high resolutions to produce robust outcomes. On the other hand, segmentation methods operate at higher resolutions, allowing for the detection of finer details but at the expense of increased computational and memory resource consumption.

%Nevertheless, the truly clinically relevant task lies in cell detection rather than segmentation, offering a less computationally demanding avenue. In this work, we delve into the nuances and challenges associated with cell detection and classification on WSIs, presenting novel insights and methodologies to enhance the efficiency and accuracy of this critical aspect of digital pathology image analysis. 

Cell segmentation and classification are well-explored tasks in digital pathology \cite{graham2019hover, hörst2023cellvit}, supported by various datasets for related research \cite{gamper2020pannuke, graham2019hover}. However, the clinically relevant objectives are cell instance detection and classification rather than segmentation. Segmentation is commonly used as a surrogate for detection because of the challenges posed by the size, morphology, and density of cell nuclei. Given their small size and frequent overlap, direct detection is a complex task. Additionally, accurate classification often relies on subtle image details, necessitating high resolutions for robust outcomes. Segmentation methods excel in capturing those small details, contributing to improved results.

This boost in accuracy comes at a cost: significant computational demands during both training and inference. The dense output format of segmentation masks increases the memory and computational resources required to calculate pixel-level predictions and the training loss. Moreover, it involves expensive post-processing steps during inference to produce the final predictions. The size of WSIs, often reaching gigapixel dimensions (e.g. $100{,}000\times100{,}000=10^{10}$ pixels), adds an extra layer of complexity to implementing a computer-aided diagnosis pipeline, requiring the partition into smaller patches and subsequent processing. This challenge is particularly pronounced for segmentation methods, given their dense pixel-level output maps, making them impractical for real-world applications on WSIs. Instead, their application is often limited to smaller tiles, hindering the development of computer-aided diagnoses that require comprehensive cell information.

In this study, we delve into the challenges of cell nuclei detection and classification, treating it as a traditional object detection task. We explore the opportunities and challenges associated with this approach, presenting novel insights and methodologies geared towards enhancing the efficiency and accuracy of this critical facet of digital pathology image analysis and extending their application to WSIs.


%For the development of large-scale cell nuclei detection and classification methods, we harness the DEtection TRansformer (DETR) \cite{carion2020end, zhu2020deformable} model. Originally proposed for object detection with transformers, DETR's applicability to cell-level digital pathology has been explored in related work \cite{obeid2022nucdetr, Huang_2023_ICCV}. In contrast to previous works, our research focuses on the direct application of transformer-based models for cell nuclei detection and classification within real-world medical scenarios. Rather than introducing elaborate and sophisticated architectures, we systematically explore various design components of DETR-like models, aiming for practical and robust solutions. Our primary objective is to facilitate the seamless integration of these models into clinical daily workflows, emphasizing the development of a pipeline that not only yields robust results but also ensures swift and efficient inference on WSIs. By prioritizing applicability and performance in medical settings, we strive to contribute valuable insights and advancements that can directly benefit clinical practitioners and enhance the efficiency of pathological image analysis.

To develop large-scale methods for detecting and classifying cell nuclei, we utilize the DEtection TRansformer (DETR) model \cite{carion2020end, zhu2020deformable}. Although earlier studies have explored DETRs for cell detection and classification in digital pathology \cite{obeid2022nucdetr, Huang_2023_ICCV}, we prioritize the practicality and robustness of cell detection transformers (Cell-DETR) rather than designing sophisticated auxiliary architectures as in prior methodologies. Our key objective is to provide the necessary tools to integrate Cell-DETR into daily clinical workflows, first by obtaining more reliable and robust models, and second by addressing the challenges that arise when applying them to real-world, large-scale scenarios beyond the traditional patch-based datasets.

Our contributions are twofold. First, to enhance the reliability of Cell-DETR, we explore different design components of DETR models and achieve state-of-the-art performance in both cell detection and classification tasks on popular benchmarks. Second, we derive a specialized pipeline for efficient inference on WSIs, achieving a $3.4\times$ speed-up in inference time compared to other methods. This enhancement significantly expedites the application of Cell-DETR models to WSIs, making them well-suited for large-scale digital pathology tasks with cell-level information.

This manuscript is structured as follows: In Section \ref{sec:related}, we provide the essential background information and related work to our topic. Section \ref{sec:methods} outlines our methodologies, including details on datasets, augmentations, architecture and the inference pipeline designed for WSIs. The evaluation of our models for cell detection and classification, alongside the measurement of inference time on WSIs, is presented in Section \ref{sec:results}. Finally, our conclusions are summarized in Section \ref{sec:conclusions}.

\section{Background and related work}
\label{sec:related}

\subsection{Cell segmentation and classification}
\label{sec:related_cell}

HoVer-Net \cite{graham2019hover} introduces an innovative U-Net-like architecture featuring three decoder branches: nuclear pixel (NP), horizontal-vertical (HV), and nuclear classification (NC). These branches play distinct roles in predicting the probability of a pixel belonging to a nucleus, the horizontal and vertical distances to the nucleus's center of mass, and the pixel label, respectively. A postprocessing step is required to merge the outputs of the NP and HV branches to generate the final segmentation mask.

A recent extension of this work, CellViT \cite{hörst2023cellvit}, takes a step further by replacing the convolutional encoder with a Vision Transformer (ViT) \cite{dosovitskiy2020vit}, achieving state-of-the-art performance in cell detection and classification. This transition to a transformer-based architecture showcases the adaptability and effectiveness of transformer models in the domain of medical image analysis \cite{you2022class}.


\subsection{Object detection with Transformers}
\label{sec:related_detr}
The Detection Transformer (DETR) \cite{carion2020end} presents an end-to-end approach to object detection, utilizing transformers and bipartite matching to eliminate the necessity for manual post-processing steps. The model consists of a backbone that extracts hidden features from an input image, a transformer encoder that enhances these features through self-attention, and a transformer decoder that, given the encoded image information, outputs bounding box predictions for a set of input queries, which are learnable parameters of the model. The model is trained with a set-based bipartite matching loss to ensure the uniqueness of predictions.

DETR exhibits certain limitations, particularly in its ability to detect small objects due to the global nature of self-attention. To address this limitation, Deformable-DETR \cite{zhu2020deformable} incorporates a multi-scale deformable attention operation that confines the attention of each token to a specific subset of points. These points are determined by predicting multi-scale offsets from the central token position to other tokens across all scales. Importantly, this offset prediction is co-trained with the other components of the model, which improves the detection of smaller objects. This approach also brings additional advantages, including multi-scale representations and faster computation.


%The nature of decoder queries has been a subject of study. While in the original DETR, they were primarily positional queries, research has extended them to serve as both content and positional queries. Deformable-DETR \cite{zhu2020deformable} proposes a two-stage approach. Here, the queries are derived as foreground proposals from the encoder's output, making them dependent on the specifics of each image. Interestingly, other works also model the queries directly as bounding box that are embedded into a higher dimensional space \cite{liu2022dabdetr}.

\subsection{Cell detection and classification with transformers}
\label{sec:related_celldetr}
The exploration of cell detection and classification using transformers has been a subject of previous research, as evident in related works such as NucDETR \cite{obeid2022nucdetr} and ACFormer \cite{Huang_2023_ICCV}. NucDETR \cite{obeid2022nucdetr} stands out as the pioneering work that introduced the application of DETR for cell detection. However, it did not delve into the task of nuclei classification.

On the other hand, ACFormer \cite{Huang_2023_ICCV} presents a sophisticated mechanism featuring an adaptive transformer. This transformer proposes affine transformations for a given input image to be used as data augmentation. The method incorporates a local-global network architecture and a self-distillation mechanism, where the local network receives the affine-transformed images as input whereas the global network is fed with the original image. The outputs of the global network serve as targets for the local network. While intriguing, the proposed transformations and the advantages of the local-global strategy remain unclear, suggesting an excessively complex approach that may compromise the practicality and reliability of the model.
%This approach showcases an intricate yet effective strategy for improving the performance of cell detection and classification using transformer-based models.

\section{Methods}
\label{sec:methods}

\subsection{Datasets}
\label{sec:methods_data}
\paragraph{PanNuke} 
The PanNuke dataset \cite{gamper2020pannuke} comprises 7,904 patches, each sized $256 \times 256$px, extracted from WSIs in The Cancer Genome Atlas (TCGA) dataset, representing 19 diverse tissue types at $40\times$ magnification. Within this dataset, there are 189,744 labeled nuclei categorized into five clinically significant classes: neoplastic, inflammatory, connective, necrosis, and epithelial.

\paragraph{CoNSeP} 
The CoNSeP dataset \cite{graham2019hover} includes 41 tiles of $1000 \times 1000$px extracted from H\&E-stained colorectal adenocarcinoma WSIs at $40\times$ magnification. Notably diverse, the dataset encompasses various regions such as stromal, glandular, muscular, collagen, adipose, and tumorous areas. It also features a range of nuclei derived from different cell types, which are grouped into inflammatory, epithelial, spindle-shaped and miscellaneous \cite{graham2019hover}.

\paragraph{Camelyon16}
The Camelyon16 dataset \cite{bejnordi2017diagnostic} consists of 400 H\&E-stained Whole Slide Images (WSIs) of lymph node sections scanned at $40\times$ magnification. Each WSI is accompanied by annotations highlighting tumor and normal regions. The slides have average dimensions of $189{,}832 \times 95{,}590$px, of which approximately $29\%$ corresponds to tissue area. Specifically, an average of 1384 tissue tiles, each sized $2048\times2048$ pixels, is extracted per slide. Although it lacks cell-level annotations for quantifying detection and classification performance, this dataset is pivotal for evaluating the inference efficiency of our models. Leveraging the Camelyon16 dataset allows us to assess the practicality of Cell-DETR on a scale that closely mimics the challenges encountered in clinical settings, demonstrating its scalability in handling large-scale pathology images.

%\color{red}\paragraph{WSIs}
%An additional in-house dataset consisting of 80 Whole Slide Images (WSIs) derived from raw endometrial hysterectomy samples is included in our work. In the dataset preparation phase, we implement a preprocessing step to accurately identify tissue regions within the slides. Subsequently, patches of size 1024 x 1024 pixels are extracted, resulting in an average of 6000 patches per slide. This approach reflects the large-scale nature of the images under consideration. Despite the absence of cell-level annotations for quantifying detection and classification performance, this dataset plays a crucial role in evaluating the inference time of our models in extensive, real-world scenarios. Leveraging this dataset allows us to assess the efficiency and practicality of our proposed models on a scale that mirrors the challenges encountered in clinical settings.\color{black}

\subsection{Cell-DETR}
\label{sec:method_celldetr}

\begin{figure}
  \centering
    \includegraphics[width=0.80\linewidth]{figures/celldetr.pdf}
  \caption{Cell-DETR pipeline for efficient cell detection on WSIs.}
  \label{fig:windowattn}
  \small % Add this line to set the text size to small
  \textit{(1) Preprocessing:} The tissue area of the slide is identified, and the slide is split into tiles of size $2048\times2048$px. \textit{(2) Inference:} Each tile undergoes the \textit{window detection} procedure. The tiles are fed into the model, divided into overlapping windows, processed in parallel, and their predictions are merged. \textit{(3) Post-processing:} The outputs from all tiles are aggregated to derive the cell nuclei for the WSI. The figure depicts a heatmap illustrating the cell density.
  %The window detection approach is used to address the limitation of the number of detections on lare images. Cell-DETR splits the image in-device into overlapped windows and processes them in parallel to finally combine the predictions.
  \vskip -0.2in
\end{figure}

\paragraph{Architecture}
The architecture for Cell-DETR comprises a hierarchical backbone that generates a four-level feature pyramid for a given input image, followed by a multi-scale deformable transformer \cite{zhu2020deformable} consisting of 6 encoder and 6 decoder layers. The encoder enhances the input features through multi-scale deformable self-attention, while the decoder produces predictions for bounding boxes and labels based on a set of input object queries. The initial states of these queries are foreground proposals generated from the output of the encoder \cite{zhu2020deformable}. Both the backbone and the transformer are pre-trained on the COCO dataset \cite{lin2014microsoft}. In Section \ref{sec:results_metrics}, we conduct experiments using both ResNet-50 \cite{he2016deep} and Swin Transformer \cite{liu2021Swin} backbones. Additionally, in Appendix \ref{sec:appdx_ablations_backbone} we explore the impact of the output resolution and the number of levels in the extracted image features.

\paragraph{Data augmentation}
Data augmentation plays a crucial role in the domain of digital pathology. Images exhibit substantial diversity due to various factors, including differences in staining protocols, elapsed time since slide staining before digitization, and the diverse tissue types. Acknowledging and addressing these variations through data augmentation is key for obtaining robust performance across different conditions. Drawing from the observations in \cite{10.1117/12.2293048}, our data augmentation pipeline includes not only traditional rotation, flipping, and blurring augmentations but also a combination of elastic transformation and stain augmentations. The latter involves transforming the RGB image into the Hematoxylin-Eosin-DAB space (HED), randomly corrupting the channels separately, and then transforming the image back to the RGB space.
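
One way to write such a channel-wise perturbation is sketched below. It is given as an assumption about the general form of HED stain augmentation, not as the exact scheme used here, and the strength parameter $\sigma$ is a hypothetical symbol introduced for illustration:
\[
\tilde{c}_k = \alpha_k\, c_k + \beta_k, \qquad \alpha_k \sim \mathcal{U}(1-\sigma,\, 1+\sigma), \quad \beta_k \sim \mathcal{U}(-\sigma,\, \sigma), \quad k \in \{\mathrm{H}, \mathrm{E}, \mathrm{D}\},
\]
where $c_k$ denotes the $k$-th channel of the image after conversion to the HED space and $\tilde{c}_k$ is the perturbed channel that is transformed back to RGB.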

\paragraph{Loss function}
We utilize the standard loss function recommended for DETRs in natural images, which combines an L1 bounding box regression loss, a generalized intersection over union (GIoU) loss, and a focal loss for classification. In contrast to ACFormer \cite{Huang_2023_ICCV}, which limits its predictions to the nuclei centroids as they are the primary interest in cell detection, we keep the boxes' width and height in the loss computation, since we have observed a slight performance decline when excluding them. Our hypothesis is that incorporating feedback on the box size during training aids the network in disambiguating detections and predicting the class label. The focal loss is used for classification rather than the standard cross-entropy loss to account for the class imbalance between cell nuclei types. The corresponding hyperparameters and the difference in performance when excluding the bounding box size from the target can be found in Appendix \ref{sec:appdx_imp} and Appendix \ref{sec:appdx_ablations_loss}, respectively.
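
For reference, in the form commonly used for DETR-like models, and with notation introduced here only for illustration, the loss for a prediction $(\hat{c}, \hat{b})$ matched to a ground-truth nucleus with class $c$ and box $b$ can be written as
\[
\mathcal{L} = \lambda_{\mathrm{cls}}\, \mathcal{L}_{\mathrm{focal}}(c, \hat{c}) + \lambda_{\mathrm{L1}}\, \lVert b - \hat{b} \rVert_1 + \lambda_{\mathrm{giou}}\, \mathcal{L}_{\mathrm{GIoU}}(b, \hat{b}),
\qquad
\mathcal{L}_{\mathrm{focal}} = -\alpha_t\, (1 - p_t)^{\gamma} \log p_t ,
\]
where $p_t$ is the predicted probability of the target class. The weights $\lambda_{\mathrm{cls}}$, $\lambda_{\mathrm{L1}}$, $\lambda_{\mathrm{giou}}$ and the focal parameters $\alpha_t$, $\gamma$ are placeholder symbols, not values taken from this work.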

\subsection{Large-scale cell detection}
Large-scale cell detection and classification on WSIs poses a formidable challenge owing to their gigapixel size. Traditional segmentation approaches face significant hurdles in terms of computational and memory resources, making them less practical for this task. In this section, we describe how Cell-DETR can be applied to larger tiles. By adopting a \textit{window detection} procedure, we facilitate the application of Cell-DETR to larger images, enabling efficient and scalable inference on WSIs. This approach provides a practical solution for large-scale cell detection and classification despite the massive scale of histopathological images.

\label{sec:method_scale}
\subsubsection{Dealing with large image tiles}
A significant constraint of DETR-like models is the necessity for the number of queries in the decoder to surpass the potential objects present in an image. In regions characterized by a high cell density, a $256\times256$px image patch may contain up to 300 cell nuclei. Consequently, increasing the input image size to larger tiles, such as $1024\times1024$px or $2048\times2048$px, becomes non-trivial. The number of cell nuclei, and therefore the required input DETR queries, can substantially increase, potentially resulting in prohibitive computational demands.
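
As a rough illustration, assume that the worst-case density of about 300 nuclei per $256\times256$px patch holds uniformly over a square tile of side $s$ pixels. This is an upper-bound assumption for the sake of the example, not a measured value. The number of queries $N_q$ then has to grow with the tile area:
\[
N_q(s) \gtrsim 300 \left( \frac{s}{256} \right)^{2}, \qquad N_q(1024) \gtrsim 4800, \qquad N_q(2048) \gtrsim 19200 .
\]
Under this assumption, a $2048\times2048$px tile would need 64 times as many queries as a $256\times256$px patch.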

%A notable constraint of DETR-like models is the requirement for the number of queries in the decoder to exceed the potential objects present in an image. \textcolor{blue}{In regions characterized by a high cell density, augmenting the input image size may lead to a substantial increase in the number of cell nuclei, which is of the order of hundreds in a $256\times256$px image ($\approx 4000 \mu m^2$). Consequently, applying DETR models to larger image tiles such as $1024\times1024$px becomes non-trivial, as the number of input queries may significantly increase, potentially leading to prohibitive computational demands.}

To address this challenge when working with larger image tiles, we adopt a \textit{window detection} procedure. This involves training the model using randomly selected image crops of the desired size extracted from the original images. During inference, an overlapping sliding window approach is adopted, which is executed in-device to minimize GPU-CPU communication and enhance inference speed. Concretely, the model splits the original image into overlapping windows, processes them in parallel and finally combines the outputs to derive the final results. This strategy allows Cell-DETR to overcome the limitations associated with large images and regions containing a high density of cell nuclei, ensuring the model's adaptability and efficiency in real-world applications.
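
To make the sliding-window step concrete, consider a square tile of side $T$ split into windows of side $w$ with an overlap of $o$ pixels between neighbouring windows. The symbols $w$ and $o$ and the numerical values below are assumptions introduced for this example only and do not describe the configuration actually used. The stride is $s = w - o$, and the number of windows per axis is
\[
n = \left\lceil \frac{T - w}{s} \right\rceil + 1 ,
\]
so that, for instance, $T = 2048$, $w = 256$ and $o = 64$ would give $s = 192$, $n = 11$ and $n^2 = 121$ windows per tile, all of which can be batched and processed in parallel.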

\subsubsection{Inference on WSIs}
Scalability of cell detection and classification on WSIs is central to our approach, driven by the gigapixel size of these images. We have devised a pipeline tailored for inference on WSIs that leverages the \textit{window detection} approach. Specifically, the tissue regions of the slide are subdivided into $2048 \times 2048$px tiles, and these tiles are processed using the \textit{window detection} procedure. Subsequently, all predictions are aggregated to obtain the final results. This approach is more suitable than partitioning the slide into smaller patches that could be fed directly into the model. The in-device execution of the \textit{window detection} ensures an efficient and streamlined process, making our pipeline well-suited to the substantial scale of WSIs.


\begin{table}[ht]
\vskip -0.1in
\centering
\caption{Detection and classification metrics on PanNuke dataset.}
\label{tab:pannuke}
\vskip 0.1in
\begin{adjustbox}{width=\textwidth}
\begin{tabular}{l|ccc|ccc|ccc|ccc|ccc|ccc}
\toprule
\multirow{2}{*}{\textbf{Method}} & \multicolumn{3}{c|}{\textbf{Detection}} & \multicolumn{3}{c|}{\textbf{Neoplastic}} & \multicolumn{3}{c|}{\textbf{Epithelial}} & \multicolumn{3}{c|}{\textbf{Inflammatory}} & \multicolumn{3}{c|}{\textbf{Connective}} & \multicolumn{3}{c}{\textbf{Necrosis}} \\
\cmidrule{2-19}
& $P_{det}$ & $R_{det}$ & $F_{det}$ & $P_{neo}$ & $R_{neo}$ & $F_{neo}$ & $P_{epi}$ & $R_{epi}$ & $F_{epi}$ & $P_{inf}$ & $R_{inf}$ & $F_{inf}$ & $P_{con}$ & $R_{con}$ & $F_{con}$ & $P_{nec}$ & $R_{nec}$ & $F_{nec}$ \\
\midrule
\textbf{DIST} \cite{naylor2018segmentation} & 0.74 & 0.71 &0.73 & 
       0.49 & 0.55 & 0.50 & 
       0.38 & 0.33 & 0.35 &
       0.42 & 0.45 & 0.42 &
       0.42 & 0.37 & 0.39 &
       0.00 & 0.00 & 0.00 \\
\textbf{Mask-RCNN} \cite{he2017mask} & 0.76 & 0.68 & 0.72 &
            0.55 & 0.63 & 0.59 &
            0.52 & 0.52 & 0.52 &
            0.46 & 0.54 & 0.50 &
            0.42 & 0.43 & 0.42 &
            0.17 & 0.30 & 0.22 \\
\textbf{Micro-Net} \cite{raza2019micro} & 0.78 & 0.82 & 0.80 &
            0.59 & 0.66 & 0.62 &
            0.63 & 0.54 & 0.58 &
            0.59 & 0.46 & 0.52 &
            0.40 & 0.45 & 0.47 &
            0.23 & 0.17 & 0.19 \\
\textbf{HoVer-Net} \cite{graham2019hover} & 0.82 & 0.79 & 0.80 &
            0.58 & 0.67 & 0.62 &
            0.54 & 0.60 & 0.56 &
            0.56 & 0.51 & 0.54 &
            0.52 & 0.47 & 0.49 &
            0.28 & 0.35 & 0.31 \\
\textbf{CellViT} \cite{hörst2023cellvit} & 0.83 & 0.82 & \textbf{0.82} &
          0.69 & 0.70 & 0.69 &
          0.68 & 0.71 & 0.70 &
          0.59 & 0.58 & 0.58 &
          0.53 & 0.51 & 0.52 &
          0.39 & 0.35 & 0.37\\
\midrule
\textbf{Cell-DETR R50} & 0.85 & 0.78 & 0.81 &
                0.72 & 0.67 & 0.69 &
                0.71 & 0.67 & 0.69 &
                0.59 & 0.60 & 0.59 &
                0.57 & 0.49 & 0.53 &
                0.54 & 0.32 & 0.40 \\
\textbf{Cell-DETR SwinL} & 0.85 & 0.80 & \textbf{0.82} &
                  0.74 & 0.70 & \textbf{0.72} &
                  0.74 & 0.74 & \textbf{0.74} &
                  0.60 & 0.63 & \textbf{0.61} &
                  0.60 & 0.52 & \textbf{0.56} &
                  0.56 & 0.41 & \textbf{0.47} \\
\bottomrule
\multicolumn{19}{l}{\small Metrics for the other methods are taken from \cite{hörst2023cellvit}.}

\end{tabular}
\end{adjustbox}
\vskip -0.1in
\end{table}

\begin{table}[ht]
\vskip -0.1in
\centering
\caption{Detection and classification F-Score on CoNSeP dataset. }
\label{tab:consep}
\vskip 0.1in
\begin{adjustbox}{width=0.8\textwidth}
\begin{tabular}{l|c|cccc}
\toprule
\textbf{Method} & \textbf{Detection} & \textbf{Epithelial} & \textbf{Inflammatory} & \textbf{Spindle-shaped} & \textbf{Miscellaneous} \\
\midrule
\textbf{DIST} \cite{naylor2018segmentation} & 0.71 & 0.62 & 0.53 & 0.51 & 0.00 \\
\textbf{Micro-Net} \cite{raza2019micro} & 0.74 & 0.62 & 0.59 & 0.53 & 0.12 \\
\textbf{Mask-RCNN} \cite{he2017mask} & 0.69 & 0.60 & 0.59 & 0.52 & 0.10 \\
\textbf{HoVer-Net} \cite{graham2019hover} & \textbf{0.75} & 0.64 & 0.63 & \textbf{0.57} & \textbf{0.43} \\
\textbf{ACFormer} \cite{Huang_2023_ICCV} & 0.74 & 0.64 & 0.64 & - & - \\
\midrule
\textbf{Cell-DETR R50} & 0.74 & 0.61 & 0.63 & 0.51 & 0.21 \\
\textbf{Cell-DETR SwinL} & 0.74 & \textbf{0.65} & \textbf{0.67} & 0.56 & 0.40 \\
\textbf{Cell-DETR SwinL}* & \textit{0.77} & \textit{0.70} & \textit{0.70} & \textit{0.61} & \textit{0.55} \\
\bottomrule
\multicolumn{6}{l}{\small *Model pre-trained on the first fold of the PanNuke dataset.}

\end{tabular}
\end{adjustbox}
\vskip -0.15in
\end{table}

\section{Results}
\label{sec:results}
In this section, we conduct an extensive set of experiments and present the results to provide insights into the capabilities of Cell-DETR. We assess the performance in terms of F-Score and inference time in Section \ref{sec:results_metrics} and Section \ref{sec:results_time}, respectively.

\subsection{Detection and classification performance}
\label{sec:results_metrics}

Table \ref{tab:pannuke} presents the detection and classification metrics on the PanNuke dataset for Cell-DETR with ResNet-50 and Swin Transformer (large) backbones, in comparison to other state-of-the-art segmentation and detection methods. The table reports the precision (P), recall (R), and F1-score (F1) for detection and for each nucleus type, averaged across the three standard splits publicly available for this dataset. A more detailed description of the metrics can be found in Appendix \ref{sec:appdx_metrics}. Notably, our results match the current state of the art in cell detection, and we achieve state-of-the-art performance in cell classification. With the ResNet-50 backbone, the classification F1-score is slightly above that of the strongest baseline for most nucleus types, while the Swin-L backbone surpasses it for every nucleus type by a larger margin. Although the Swin-L backbone increases the number of parameters, these results showcase the potential of transformers for medical image analysis.

Given that the CoNSeP dataset consists of tiles of $1000 \times 1000$ pixels, the potential number of nuclei in a single image is very high. We therefore employ the \textit{window detection} procedure outlined in Section \ref{sec:method_scale} for both training and evaluating Cell-DETR on this dataset. For training, crops of $250 \times 250$ pixels are randomly sampled. During the evaluation phase, the tiles are processed with \textit{window detection}, using a window size of 250 pixels and a stride of 187 pixels. The resulting predictions are then combined, retaining only those detections within the central crop of $187 \times 187$ pixels of each window. The detection and classification F-Score results are presented in Table \ref{tab:consep}, showing performance on par with the state of the art. These findings align with the results presented in Table \ref{tab:pannuke} and validate the effectiveness of the \textit{window detection} approach in handling scenarios with high nucleus abundance.
%Local crops of size $250 \times 250$ pixels randomly sampled from the original images in each iteration are used for training. During the evaluation phase, the entire tiles are processed by the model, which divides the images into overlapping windows of size $250 \times 250$ pixels, with a stride of 187, and processes them in parallel. The resulting predictions are then combined, retaining only those detections within the central crop of 187x187 pixels for each window. The detection and classification F-Score results are presented in Table \ref{tab:consep}, showcasing state-of-the-art performance. These findings align with the results presented in Table \ref{tab:pannuke} and validate the effectiveness of the \textit{window detection} approach in handling scenarios with high nucleus abundance.

\begin{figure}
  \centering
   \subfigure[Model inference time.]{
    \includegraphics[width=0.45\linewidth]{figures/times_proc.pdf}
    \label{fig:time_proc}
  }
  \subfigure[Post-processing time.]{
    \includegraphics[width=0.45\linewidth]{figures/times_post.pdf}
    \label{fig:time_post}
  }
  \caption{Model inference and post-processing times as a function of the slide area and the number of nuclei.}
  \label{fig:time}
  \vskip -0.1in
  \small{Cell-DETR is $3.4\times$ faster than HoVer-Net for inference, leading to large differences when the area of the slide is large. Additionally, in contrast to HoVer-Net, the post-processing time of Cell-DETR is virtually constant with respect to the number of detected nuclei.}
  \vskip -0.15in
\end{figure}

\subsection{Time performance}
\label{sec:results_time}
To evaluate the scalability and feasibility of Cell-DETR models together with the \textit{window detection} procedure, we conduct experiments on a subset of 111 slides from the Camelyon16 dataset and compare the time performance with HoVer-Net. The inference process can be divided into three steps: (i) model loading and slide pre-processing, (ii) model inference and (iii) post-processing. In the pre-processing step, which is shared between methods, tissue regions are segmented from the thumbnail of the original WSI and tiles of size $2048 \times 2048$ are extracted, ensuring that only tissue areas are processed by the model. An average of $1{,}300$ tiles are extracted per slide. For model inference, we employ the \textit{window detection} approach for Cell-DETR, whereas we follow the official processing pipeline for HoVer-Net. Figure \ref{fig:time_proc} shows the model inference time for HoVer-Net and Cell-DETR as a function of the total area of the slide. Notably, Cell-DETR is $3.4\times$ faster in this step, taking an average of $1450$s per slide versus the $4912$s required by HoVer-Net. These timings are measured on four 16GB GPUs. As previously argued, segmentation methods are more computationally demanding due to their dense output, making detection models a more suitable solution for inference on large WSIs. Finally, the post-processing step of Cell-DETR consists only of combining the predictions of multiple tiles, while HoVer-Net requires expensive post-processing, first to obtain the instance segmentation masks from the raw predicted maps and then to extract per-nucleus information such as the centroids. Figure \ref{fig:time_post} shows the post-processing time as a function of the number of nuclei detected in the slide. As expected, the post-processing time of HoVer-Net increases with the number of nuclei, which is in the order of millions for the WSIs.
In contrast, Cell-DETR exhibits a virtually constant post-processing time.
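
As a rough sanity check on these figures (a back-of-the-envelope calculation from the averages reported above, not an additional measurement), the speed-up and the approximate wall-clock cost per tile on the four GPUs are
\begin{equation}
    \frac{4912\,\mathrm{s}}{1450\,\mathrm{s}} \approx 3.4, \qquad
    \frac{1450\,\mathrm{s}}{1300\ \text{tiles}} \approx 1.1\,\mathrm{s/tile}, \qquad
    \frac{4912\,\mathrm{s}}{1300\ \text{tiles}} \approx 3.8\,\mathrm{s/tile},
\end{equation}
where the per-tile values assume that the average number of tiles is the same for both methods, which is reasonable given that the pre-processing step is shared.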

%\section{Discussion}

\section{Conclusions}
\label{sec:conclusions}

In this manuscript, we introduce a novel perspective on applications involving cell-level information in Whole Slide Images (WSIs), moving beyond conventional cell segmentation methods to prioritize detection while addressing reliability and scalability. First, through a careful examination of design components, we enhance trustworthiness, achieving state-of-the-art performance in cell detection and classification that outperforms semantic segmentation methods. Second, we tackle the scalability challenges associated with large histopathological images, extending our approach to process WSIs with remarkable efficiency. Consequently, our work provides insights for the development of diagnostics and interpretability applications that leverage the wealth of cell-level information within extensive histopathology slides.


\midlacknowledgments{
This work has been supported by the Spanish Research Agency (AEI) under project PID2020-116907RB-I00 of the call MCIN/ AEI /10.13039/501100011033 and the FI-AGAUR grant funded by Direcció General de Recerca (DGR) of Departament de Recerca i Universitats (REU) of the Generalitat de Catalunya.}

\bibliography{midl23_130}

\appendix


\section{Implementation details}
\label{sec:appdx_imp}

Table \ref{tab:window_params} presents the image and window sizes used during the training and evaluation phases for the three datasets included in our experiments. For the PanNuke dataset, with $256 \times 256$-pixel images, no window partitioning is required for either training or evaluation. CoNSeP images are larger, at $1000 \times 1000$ pixels. Given the potential abundance of nuclei in a $1000 \times 1000$ image, we adopt the \textit{window detection} procedure. Training involves random crops of size $250 \times 250$ from the original image, with one crop sampled per image at each epoch. During inference, we process the large images using sliding windows of size $250 \times 250$ and a stride of 187.

To merge predictions from overlapping sliding windows during inference, only detections whose centroid falls within the central crop of the window are considered. Specifically, for a window size of $250 \times 250$ and a stride of 187, the size of the selected central crop is $187 \times 187$, leaving a total margin of $250 - 187 = 63$ pixels, i.e. a border of about 31 pixels on each side. Detections within these borders are excluded, as they belong to the central crop of the neighboring window. For windows at the tile borders, this exclusion applies solely to the sides overlapped by another window, not to those corresponding to the image borders.
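
The geometry of this merging rule can be made explicit. As a sketch, assuming square windows of side $w$, a stride $s$ and a central crop centred in the window (the symbols $w$, $s$, $b$, $n$ and $L$ are introduced here for illustration only), the margin discarded on each overlapped side is half the overlap between consecutive windows,
\begin{equation}
    b = \frac{w - s}{2} = \frac{250 - 187}{2} = 31.5 \text{ px},
\end{equation}
so that the central crops of adjacent windows, each of side $w - 2b = s$, tile the image without gaps and without counting any detection twice. Under the same assumptions, the number of window positions needed to cover one side of an image of side $L$ is
\begin{equation}
    n = \left\lceil \frac{L - w}{s} \right\rceil + 1, \qquad \text{e.g. } \left\lceil \frac{1000 - 250}{187} \right\rceil + 1 = 6 .
\end{equation}
How the last window is aligned with the image border (for instance by padding or by shifting it inwards) is an implementation choice that this formula does not fix.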

The Camelyon16 dataset lacks cell annotations for training, so we use it only to assess the scalability of cell detection and classification on Whole Slide Images (WSIs). The original images are divided into patches of size $2048 \times 2048$, which are processed similarly to CoNSeP images, with a sliding window of size $256 \times 256$ and a stride of 187.

All our models are implemented in PyTorch, with hyperparameters taken from the original Deformable DETR \cite{zhu2020deformable} rather than from an exhaustive hyperparameter search (see Table \ref{tab:train_params}). Training is performed on four NVIDIA Quadro RTX 16GB GPUs. The base learning rate, defined for a batch size of 16 in the original paper, is linearly scaled to our batch size. The learning rate is decayed by a factor of 0.1 at 70\% and 90\% of training. For CoNSeP, the number of epochs is increased because of the limited number of training images and because only one crop of size $250 \times 250$ is sampled from each image at each epoch, so that only 1/16 of the image is fed into the model.
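
The learning-rate settings described above can be summarised as follows. This is a sketch in which $B$ (effective batch size), $E$ (total number of epochs), $e$ (current epoch), $\eta_{\text{base}}$ (base learning rate) and $\eta_0$ are notation introduced here rather than taken from the original implementation:
\begin{equation}
    \eta_0 = \eta_{\text{base}} \cdot \frac{B}{16}, \qquad
    \eta(e) =
    \begin{cases}
        \eta_0 & \text{if } e < 0.7E, \\
        0.1\,\eta_0 & \text{if } 0.7E \le e < 0.9E, \\
        0.01\,\eta_0 & \text{if } e \ge 0.9E,
    \end{cases}
\end{equation}
which corresponds to learning-rate steps at epochs 70 and 90 for $E = 100$ and at epochs 700 and 900 for $E = 1000$. The fraction of a CoNSeP image seen per epoch follows from the ratio of areas,
\begin{equation}
    \frac{250 \times 250}{1000 \times 1000} = \frac{1}{16},
\end{equation}
so roughly 16 epochs are needed for the model to see an amount of pixels equivalent to one full image.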

\begin{table}[ht]
\centering
\caption{Window detection hyperparameters.}
\label{tab:window_params}
\vskip 0.1in
\begin{adjustbox}{width=0.6\textwidth}
\begin{tabular}{l|cc|ccc}
\toprule
\multirow{2}{*}{\textbf{Dataset}} & \multicolumn{2}{c}{\textbf{Training}} & \multicolumn{3}{c}{\textbf{Evaluation and Inference}} \\
\cmidrule(lr){2-3} \cmidrule(lr){4-6}
 & \textbf{Image size} & \textbf{Crop size} & \textbf{Patch size} & \textbf{Window size} & \textbf{Stride} \\
\midrule
\textbf{PanNuke} & 256 & 256  & 256 & 256 & - \\
\textbf{CoNSeP}  & 1000 & 250 & 1000 & 250 & 187 \\
\textbf{Camelyon16} & - & - & 2048 & 256 & 187 \\
\bottomrule
\end{tabular}
\end{adjustbox}
\vskip -0.1in
\end{table}

\begin{table}[ht]
\centering
\caption{Training hyperparameters.}
\label{tab:train_params}
\vskip 0.1in
\begin{adjustbox}{width=0.9\textwidth}
\begin{tabular}{l|ccccc|ccc|cccc}
\toprule
\multirow{2}{*}{\textbf{Dataset}} & \multicolumn{5}{c}{\textbf{Solver}} & \multicolumn{3}{c}{\textbf{Matcher}} & \multicolumn{4}{c}{\textbf{Loss}} \\
\cmidrule(lr){2-6} \cmidrule(lr){7-9} \cmidrule(lr){10-13}
 & \textit{epochs} & \textit{base lr} & \textit{batch size} & \textit{lr drop} & \textit{lr steps} & $\lambda_{giou}$ & $\lambda_{bbox}$ & $\lambda_{focal}$ & $\lambda_{giou}$ & $\lambda_{bbox}$ & $\lambda_{focal}$ & $\alpha_{focal}$  \\
\midrule
\textbf{PanNuke} & 100 & 2e-4 & 2 & 0.1 & 70, 90 & 2 & 2 & 5 & 2 & 1 & 5  & 0.25 \\
\textbf{CoNSeP}  & 1000 & 2e-4 & 2 & 0.1 & 700, 900 & 2 & 2 & 5 & 2 & 1 & 5  & 0.25 \\
\bottomrule
\end{tabular}
\end{adjustbox}
\vskip -0.1in
\end{table}

%\section{Background on DETR and nuclei detection}
%\subsection{DETR matching and loss function}
\section{Evaluation metrics}
\label{sec:appdx_metrics}
The evaluation protocol for nuclei detection and classification follows the methodology outlined in \cite{graham2019hover}, employing the F1-score as the evaluation metric for comparability with prior work. Initially, a bipartite matching process pairs ground truth nuclei centroids with detected ones, allowing matches only within a radius of 12 pixels. The detection counts, namely true positives ($TP_{det}$), false positives ($FP_{det}$), and false negatives ($FN_{det}$), are derived from the outcome of this matching between ground truth and predicted nuclei. The detection F1-score ($F_{det}$) is computed as the harmonic mean of detection precision ($P_{det}$) and recall ($R_{det}$).
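
For reference, with these counts the detection metrics take the standard form
\begin{equation}
    P_{det} = \frac{TP_{det}}{TP_{det} + FP_{det}}, \qquad
    R_{det} = \frac{TP_{det}}{TP_{det} + FN_{det}}, \qquad
    F_{det} = \frac{2\,P_{det}\,R_{det}}{P_{det} + R_{det}} = \frac{2\,TP_{det}}{2\,TP_{det} + FP_{det} + FN_{det}} .
\end{equation}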

For classification, the matched detections counted in $TP_{det}$ are further split, for each class $c$, into correctly and incorrectly classified nuclei of class $c$, denoted as $TP_c$ and $FP_c$, respectively. Additionally, misclassified elements from class $c$ are captured as $FN_c$ and, following \cite{graham2019hover}, $TN_c$ counts the correctly classified nuclei of classes other than $c$. The F1-score, precision, and recall for each class are then calculated as follows:
\begin{equation}
    F_c = \frac{2(TP_c +TN_c)}{2(TP_c +TN_c)+2FP_c +2FN_c +FP_{det} +FN_{det}}
\end{equation}

\begin{equation}
    P_c = \frac{TP_c +TN_c}{TP_c +TN_c +2FP_c +FP_{det}}
\end{equation}

\begin{equation}
    R_c = \frac{TP_c +TN_c}{TP_c +TN_c +2FN_c +FN_{det}}
\end{equation}
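
As a consistency check on the per-class expressions above, $F_c$ is the harmonic mean of $P_c$ and $R_c$. Writing $A_c = TP_c + TN_c$ (a shorthand introduced here only to keep the derivation compact),
\begin{equation}
    \frac{2\,P_c\,R_c}{P_c + R_c}
    = \frac{2A_c}{\left(A_c + 2FP_c + FP_{det}\right) + \left(A_c + 2FN_c + FN_{det}\right)}
    = \frac{2A_c}{2A_c + 2FP_c + 2FN_c + FP_{det} + FN_{det}} = F_c .
\end{equation}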


\section{Ablations}
\label{sec:appdx_ablations}

\subsection{Backbone feature levels and resolution}
\label{sec:appdx_ablations_backbone}
Deformable DETR accepts multi-scale input features extracted from the backbone, which boosts the capabilities of the model. By default, Deformable DETR extracts three feature levels from the backbone and adds another virtual level on top of them with a convolutional layer of kernel size 3 and stride 2. The first level is extracted from the second block of the backbone at a resolution of $1/8$, and the remaining two levels are at $1/16$ and $1/32$.

The histopathology images employed in this work are scanned at $\times 40$ magnification, with a resolution of $0.245\,\mu m / px$. Given the small and possibly elongated shape of cell nuclei, the width or height of some instances can be no more than 10px. Consequently, if the first feature level is extracted at $1/8$, these small objects could be lost in the backbone output representations. Table \ref{tab:backbone} shows the performance on the first split of the PanNuke dataset for different configurations of the backbone and the output features. Generally, the Swin transformer backbone performs better than the ResNet50, underlining the relevance of transformer architectures for medical image analysis. Adding the $1/4$ feature level also increases the F-Score, with larger margins for the smaller nucleus types, such as inflammatory cells and necrosis.
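
To make the scale argument concrete, consider a nucleus whose shorter side measures 10px, i.e. $10 \times 0.245 = 2.45\,\mu m$. The following is an illustrative calculation only, assuming that a feature level at resolution $1/k$ has one feature location per $k \times k$ pixel block:
\begin{equation}
    \frac{10\,\text{px}}{8} = 1.25 \text{ locations at } 1/8, \qquad
    \frac{10\,\text{px}}{4} = 2.5 \text{ locations at } 1/4 .
\end{equation}
Under the same assumption, for a $256 \times 256$ PanNuke image the first feature level has $32 \times 32$ locations at $1/8$ and $64 \times 64$ locations at $1/4$, so the finer level describes such a nucleus with roughly four times as many feature locations.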

\begin{table}[ht]
\centering
\caption{Performance with distinct backbones.}
\label{tab:backbone}
\vskip 0.1in
\begin{adjustbox}{width=0.9\textwidth}
\begin{tabular}{lc|c|ccccc}
\toprule
\textbf{Backbone} & \textbf{Output levels} & \textbf{Detection} & \textbf{Neoplastic} & \textbf{Epithelial} & \textbf{Inflammatory} & \textbf{Connective} & \textbf{Necrosis} \\
\midrule
\textbf{ResNet50} & $1/8, 1/16, 1/32$ & 0.81 & 0.69 & 0.68 & 0.57 & 0.52 & 0.35 \\
\textbf{ResNet50} & $1/4, 1/8, 1/16, 1/32$ & 0.81 & 0.69 & 0.69 & 0.59 & 0.52 & 0.36 \\
\textbf{Swin} & $1/8, 1/16, 1/32$ & 0.82 & 0.72 & 0.72 & 0.59 & 0.55 & 0.42  \\
\textbf{Swin} & $1/4, 1/8, 1/16, 1/32$ & 0.82 & 0.73 & 0.74 & 0.61 & 0.55 & 0.45 \\
\bottomrule
\end{tabular}
\end{adjustbox}
\vskip 0.1in
\end{table}

%\subsection{Data augmentation}

\subsection{Loss function}
\label{sec:appdx_ablations_loss}
The object detection loss of DETR involves predicting the bounding box centroid $(c_x, c_y)$ as well as its size $(w, h)$. Nonetheless, as mentioned in \cite{Huang_2023_ICCV}, for cell detection it is enough to predict the centroid of the cells. Indeed, the evaluation metrics only take into account the centroid, not the bounding box. In this section we explore the influence of including the box size information in the target. The results in Table \ref{tab:boxsize} show a slight decline in classification performance when the generalized intersection over union loss and the $(w, h)$ values of the L1 regression loss are excluded from the overall loss computation. Although the box size may be ignored during inference, the supervision signal generated by its prediction could provide valuable feedback to the network and help disambiguate the predictions of the multiple queries. Additionally, the generalized intersection over union is included because the L1 loss is highly influenced by the scale of the object \cite{carion2020end}. If this term is removed, the model learns to focus on the bigger nuclei to minimize the loss function.
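
The two targets compared in this ablation can be written schematically as follows. This is a sketch in the usual DETR notation, assuming the loss weights $\lambda$ of Table \ref{tab:train_params} and omitting normalisation constants and auxiliary decoder losses; $b = (c_x, c_y, w, h)$ is a ground-truth box and hatted symbols denote the matched prediction:
\begin{equation}
    \mathcal{L}_{(c_x, c_y, w, h)} = \lambda_{focal}\,\mathcal{L}_{focal}
    + \lambda_{bbox}\,\lVert b - \hat{b} \rVert_1
    + \lambda_{giou}\,\mathcal{L}_{giou}(b, \hat{b}),
\end{equation}
\begin{equation}
    \mathcal{L}_{(c_x, c_y)} = \lambda_{focal}\,\mathcal{L}_{focal}
    + \lambda_{bbox}\,\lVert (c_x, c_y) - (\hat{c}_x, \hat{c}_y) \rVert_1 .
\end{equation}
The second form drops both the generalized intersection over union term and the $(w, h)$ components of the L1 term, which is the configuration reported in the first row of Table \ref{tab:boxsize}.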

\begin{table}[ht]
\centering
\caption{Performance with and without bounding box size in the loss function.}
\label{tab:boxsize}
\vskip 0.1in
\begin{adjustbox}{width=0.9\textwidth}
\begin{tabular}{l|c|ccccc}
\toprule
\textbf{Target} & \textbf{Detection} & \textbf{Neoplastic} & \textbf{Epithelial} & \textbf{Inflammatory} & \textbf{Connective} & \textbf{Necrosis} \\
\midrule
$(c_x, c_y)$ & 0.82 & 0.71 & 0.73 & 0.59 & 0.55 & 0.42 \\
$(c_x, c_y, w, h)$ & 0.82 & 0.73 & 0.74 & 0.61 & 0.55 & 0.45 \\
\bottomrule
\end{tabular}
\end{adjustbox}
\vskip 0.1in
\end{table}


\section{Where is the model attending to?}
\label{sec:appdx_attn}
%The deformable attention mechanism only focuses on a subset of points for a given location, allowing for a detailed examination of the attended points within the predicted bounding boxes. Figure \ref{fig:attn} illustrates the attention locations for the query corresponding to the predicted bounding box. Specifically, we present the detection and attention maps in both low and high cell density regions (Figure \ref{fig:attn_sparse} and Figure \ref{fig:attn_dense}, respectively). The sampling locations are normalized to be within the proposed bounding box. In the central images of the subfigures, the sampling locations are uniformly distributed along the bounding box. However, as depicted in the right images, where different colors represent different attention heads, each head focuses on a specific direction within the bounding box. This phenomenon may be attributed to the ellipsoidal shape of cell nuclei and their frequent orientation.

The deformable attention mechanism only focuses on a subset of points for a given location, allowing for a detailed examination of the attended points within the predicted bounding boxes. In Figure \ref{fig:attn} we present the detections and attention maps in both low and high cell density regions (Figure \ref{fig:attn_sparse} and Figure \ref{fig:attn_dense}, respectively). Concretely, we show the detected nuclei and bounding boxes (left), the attended sampling locations for each detection colored by the corresponding attention weight (middle) and the sampling locations colored by attention head (right). It can be observed that the sampling locations are uniformly distributed across the bounding box. However, as depicted in the right images, each head focuses on a specific direction within the bounding box. This phenomenon may be attributed to the ellipsoidal shape of cell nuclei and their frequent orientation.

\begin{figure}
    \vskip -0.2in
  \centering
   \subfigure[Detections on a sparse region.]{
    \includegraphics[width=0.45\linewidth]{figures/attn_sparse.pdf}
    \label{fig:attn_sparse}
  }
  \subfigure[Detections on a dense region.]{
    \includegraphics[width=0.45\linewidth]{figures/attn_dense.pdf}
    \label{fig:attn_dense}
  }
  \caption{Deformable attention maps.}
  \label{fig:attn}
  %\vskip -0.1in
  \small{
    Deformable attention maps show that different heads have learned to look at distinct directions.
  }
  \vskip -0.2in
\end{figure}

\end{document}
