\documentclass{midl}

\usepackage{mwe}
\usepackage{pifont}

\jmlryear{2024}
\jmlrworkshop{Full Paper -- MIDL 2024}
\jmlrvolume{-- nnn}
\editors{Accepted for publication at MIDL 2024}

\title[IHCScoreGAN: A framework for unsupervised, end-to-end Ki67 scoring]{IHCScoreGAN: An unsupervised generative adversarial network for end-to-end Ki67 scoring for clinical breast cancer diagnosis}

\midlauthor{
	\Name{Carl P. Molnar\nametag{$^{1}$}} \Email{molnar.carl@mayo.edu}\\
	\Name{Thomas E. Tavolara\nametag{$^{1}$}} \Email{tavolara.thomas@mayo.edu} \\
 	\Name{Christopher A. Garcia\nametag{$^{1}$}} \Email{garcia.christopher@mayo.edu} \\
 	\Name{David S. McClintock\nametag{$^{1}$}} \Email{mcclintock.david@mayo.edu} \\
 	\Name{Mark D. Zarella\nametag{$^{1}$}} \Email{zarella.mark@mayo.edu} \\
 	\Name{Wenchao Han\nametag{$^{1}$}} \Email{han.wenchao@mayo.edu} \\
	\addr $^{1}$ Division of Computational Pathology and AI, Mayo Clinic, Rochester, USA
      }

\usepackage{amsmath}
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\usepackage{amsfonts}
\usepackage{booktabs}
\usepackage{siunitx}
\usepackage{multirow}
\usepackage{caption,setspace}
\usepackage{enumitem}

\begin{document}
\setlength{\abovecaptionskip}{0pt}
\maketitle

\begin{abstract}

Ki67 is a biomarker whose activity is routinely measured and scored by pathologists through immunohistochemistry (IHC) staining, which informs clinicians of patient prognosis and guides treatment. Currently, most clinical laboratories rely on a tedious, inconsistent manual scoring process to quantify the percentage of Ki67-positive cells. While many works have shown promise for Ki67 quantification using computational approaches, the current state-of-the-art methods have limited real-world feasibility: they either require large datasets of meticulous cell-level ground truth labels to train, or they provide pre-trained weights that may not generalize well to in-house data. To overcome these challenges, we propose IHCScoreGAN, the first unsupervised deep learning framework for end-to-end Ki67 scoring without the need for any ground truth labels. IHCScoreGAN only requires IHC image samples and unpaired synthetic data, yet it learns to generate colored cell segmentation masks while simultaneously predicting cell center points and biomarker expressions for Ki67 scoring, made possible through our novel dual-branch generator structure. We validated our framework on a large cohort of 2,126 clinically signed-out cases, yielding an accuracy of 0.97 and an F1-score of 0.95 and demonstrating substantially better performance than a pre-trained state-of-the-art supervised model. By removing ground truth requirements, our unsupervised technique constitutes an important step towards easily trained Ki67 scoring solutions that can train on out-of-domain data without labels. Our code and model weights are available at \url{https://github.com/WenchaoHan0718/IHCScoreGAN}.
\end{abstract}

\begin{keywords}
Generative Adversarial Networks, Unsupervised Learning, Computational Pathology, Ki67 Scoring, Breast Cancer
\end{keywords}

\section{Introduction}
Ki67 is a protein biomarker, and the percentage of Ki67-positive tumor cells has been shown to be an effective indicator of tumor proliferation in breast cancer cases \cite{davey2021ki}. In a clinical setting, the Ki67 score is reported routinely for breast cancer IHC tissue samples to aid in diagnosis and guide treatment decisions \cite{faneyte2003breast, chang2000apoptosis, petit2004comparative}. Currently, in most centers, Ki67 scoring is manually performed by pathologists, which is time-consuming and subject to inter-observer variability \cite{reisenbichler2020prospective}. Therefore, automatic Ki67 scoring for breast tissue samples is highly desirable.

Unfortunately, automatic stain scoring is challenging due to the prohibitive data demands for training or fine-tuning supervised deep learning-based IHC scoring models. These models require pixel-level annotations of hundreds of thousands of cells, painstakingly labeled by highly trained technicians \cite{9098652, fassler2020deep, graham2019hover, wen2023deep, van2018segmentation, priego2022deep, zhang2020generative}. DeepLIIF \cite{ghahremani2022deep} is a supervised framework which instead trains using multiple co-registered multiplex immunofluorescence (mpIF) images as its ground truth to eliminate cell annotation errors. However, mpIF assays are expensive and not widely available. On the other hand, pre-trained supervised models are rarely publicly available and not always applicable to in-house stain data: tissue stains are collected from a particular area of the body with distinct tissue features, using a unique arrangement of best practices, tissue scanners, chemical mixtures, and tissue qualities, resulting in a unique in-house data distribution \cite{wagner2024built, therrien2018role}.  

Recently, several studies have emerged in an attempt to completely eliminate the need for ground truth labels. To this end, unsupervised deep learning methods typically leverage CycleGAN \cite{zhu2017unpaired}, a well-known framework for unpaired and unsupervised domain transfer. One such example is automatic stain transfer, where the goal is to transfer IHC stain images into hematoxylin-eosin (H\&E) stain images in an unsupervised manner \cite{liu2021unpaired, trullo2022image, lin2023unpaired}. However, performing a stain transfer from IHC to H\&E loses the biomarker expression information associated with the IHC stain, making IHC scoring impossible. Unsupervised nuclear semantic segmentation \cite{le2022unsupervised} and instance segmentation \cite{wang2023unsupervised} were recently shown to be possible by leveraging binary cell segmentation masks created from H\&E images; however, both of these approaches lack the ability to retain vital IHC biomarker expression information. In the field, there remains an unmet need for an end-to-end IHC scoring method which can avoid the need for ground truth labeling.

In this paper, we propose IHCScoreGAN, the first unsupervised deep learning framework to provide end-to-end scoring for IHC sample images. We leverage a domain transfer strategy to formulate a novel learning task, along with a novel model architecture which facilitates this task with a dual-branch generator. Our learning task seeks to transfer Ki67 images into synthetic cell segmentation masks extracted from an unpaired public dataset of H\&E tissue images, instilling cell center points and synthetic cell colors. Significantly, our model automatically learns the correspondence between the Ki67-positive and -negative cells and the synthetic cell colors while also generating center point predictions, which together provide a predicted Ki67 score. We validated our framework on a large cohort of clinical cases sourced from Mayo Clinic, demonstrating strong performance on 2,126 clinically signed-out breast cancer cases collected over the span of 11 years, and further validated it on an external dataset for cell counting, where it yielded competitive performance against supervised methods. Our proposed contributions are as follows:
\vspace{-0.5em}
\begin{itemize}[style=unboxed,itemsep=-0.5mm]
    \item We propose a novel domain transfer strategy for unsupervised, unpaired end-to-end IHC scoring, where the learning task is constructed from synthetic colored cell segmentation masks and center points easily extracted from public H\&E images. 
    \item We propose a novel model architecture to accommodate our learning task, which can generate colored cell segmentation masks as a proxy task while simultaneously generating supplementary information for IHC scoring during inference.
\end{itemize}

\section{Methodology}

Our proposed framework, IHCScoreGAN, is trained by leveraging a novel learning task constructed from unpaired, publicly-sourced H\&E data. It does not require any preprocessing or postprocessing of IHC stain images (Figure  \ref{fig:overview}a). To facilitate our learning task, we build a novel model architecture which achieves end-to-end Ki67 scoring by splitting our learning task into two parts: 1) its primary goal is to generate a colored cell segmentation mask, which serves as a proxy task for 2) predicting cell center points and biomarker expressions (Figure \ref{fig:overview}b). From this information we can extract a Ki67 score, which is aggregated per slide and evaluated against clinically-derived scores (Figure \ref{fig:overview}c). 

\subsection{Dataset}

\subsubsection{Internal IHC Dataset}
\label{ref:internal_dataset}

Our internal dataset consists of 2,126 Ki67 digitized slides, each corresponding to a distinct breast cancer case. All cases in this study are sourced from Mayo Clinic and collected from 2012 through 2023, primarily by 6 pathologists. The Ki67 slides were scanned at $20\times$ (0.5{\textmu}m/pixel) magnification by Aperio\textsuperscript{\textregistered} AT scanners. Clinical diagnoses were performed on selected invasive tumor regions. Typically, one pathologist is involved in the clinical diagnosis for each case. We divided each Ki67 tissue slide into tiles of size $256\times 256$ pixels within the selected regions, resulting in $N_\textsc{Ihc}=678,134$ total tiles. 
We formally define our internal IHC dataset as $S_\textsc{Ihc}=\{x_i\}_{i=1}^{N_\textsc{Ihc}}$, where $x\in \mathbb{R}^{256\times 256\times 3}$ is a Ki67 stain tile.

\begin{figure}[t]
\floatconts
  {fig:overview}
  {\caption{Overview of our proposed framework. a) Ki67 images, which have associated diagnostic data, are tiled (top). Unpaired H\&E slides are passed through a model which predicts cell centers and contours; from this, we produce synthetic masks (bottom). b) The flow of training our model, which generates a prediction of cell center points and biomarker expressions (Fake K) for end-to-end Ki67 scoring. c) Scoring information is extracted from generated Fake K, aggregated  per slide, and compared with our diagnostic data.}}
  {\includegraphics[width=1.0\linewidth]{images/Figure2.PNG}}
\end{figure}

\subsubsection{Target Dataset}

We generated our target dataset using unpaired H\&E-stained tissue slides sourced from The Cancer Genome Atlas (TCGA) database \cite{weinstein2013cancer}, a public clinical data repository. We randomly sampled H\&E tissue slides of $20\times$ magnification from the Breast Invasive Carcinoma project (TCGA-BRCA) and manually selected slides to represent a variety of cell sizes, structures, and shapes (i.e., to discourage the model from using these features to discriminate real vs. fake data), resulting in 23 slides. We then divided each slide into tiles of size $256\times 256$ and sampled $N_\textsc{Mask}=10,000$ total tiles. Next, we passed the H\&E tiles through a publicly available HoVerNet \cite{graham2019hover} model, pre-trained on the CPM-17 dataset \cite{vu2019methods} to predict cell instance contours. 

We generated synthetic colored cell segmentation masks for each H\&E tile by assembling the predicted cell instance contours into a cell segmentation mask, drawing a ratio from a uniform distribution $\mathcal{U}_{[0, 1]}$, and then randomly picking cells to color green based on the ratio and coloring the rest red. We also drew a value from a uniform distribution $\mathcal{U}_{[0, 1]}$ for each cell and multiplied the cell's pixel intensities by this value in order to emulate different IHC biomarker expression intensities (see ``Real Y'' in Figure \ref{ref:network_design}). We then generated distinct binary masks for our two synthetic cell colors, which detach cell expression prediction from the intensity values. Finally, we generated cell center point distance maps, which encode the 2-norm distance between each pixel in each cell and its corresponding predicted center point.

We formally define our external target dataset as $S_\textsc{Mask}=\{(y_i, k_i)\}_{i=1}^{N_\textsc{Mask}}$, where $y\in \mathbb{R}^{256\times 256\times 3}$ is a synthetic colored segmentation mask tile and $k\in \mathbb{R}^{256\times 256\times 3}$ is its matching cell center point distance map and two binary cell expression masks.

\subsection{IHCScoreGAN}

\subsubsection{Proxy Task}

Unsupervised nuclear cell segmentation and biomarker expression identification together form a proxy task for our model, which builds on CycleGAN \cite{zhu2017unpaired}. We train our model using a domain transfer strategy to learn how to convincingly transform our Ki67 stain input tiles $x$ into synthetic colored segmentation masks $y$.

The objective of our model is to learn an unpaired domain transfer between domains $\mathcal{X}$ and $\mathcal{Y}$, where we seek to train a generator $G: \mathcal{X} \rightarrow \mathcal{Y}$ which can transform an input $x\in \mathcal{X}$ following $\hat{y}=G(x)$, where $\hat{y}\in \mathcal{\hat{Y}}$. To enforce a robust mapping onto the distribution of $\mathcal{Y}$, an inverse generator $F: \mathcal{Y} \rightarrow \mathcal{X}$ is simultaneously trained to maintain ``cycle consistency''. Generator $F$ can transform an input $y\in \mathcal{Y}$ following $\hat{x}=F(y)$, where $\hat{x}\in \mathcal{\hat{X}}$. The idea is that an input $x$ should be penalized for being different from its reconstruction $\hat{x}=F(G(x))$ through a cycle consistency loss, and similarly for an input $y$ in the other mapping direction:
\begin{equation}
\label{eqn:loss_cyc_xy}
    \begin{split}
        \mathcal{L}_{\textsc{Cyc}_{X}}&=\mathbb{E}_x\left[||F(G(x))-x||_1\right] \\
        \mathcal{L}_{\textsc{Cyc}_{Y}}&=\mathbb{E}_y\left[||G(F(y))-y||_1\right]
    \end{split}
\end{equation}

Next, the generated domains $\mathcal{\hat{Y}}$ and $\mathcal{\hat{X}}$ are constrained to be indistinguishable from domains $\mathcal{Y}$ and $\mathcal{X}$, respectively. To this end, we train discriminators $D_Y$ and $D_X$, which produce logits predicting whether an input is from $\mathcal{Y}$ or $\mathcal{\hat{Y}}$, or from $\mathcal{X}$ or $\mathcal{\hat{X}}$, respectively, through an adversarial loss:
\begin{equation}
\label{eqn:loss_gan_xy}
    \begin{split}
        \mathcal{L}_{\textsc{Gan}_{X}} &=\mathbb{E}_x\left[\log(1-D_Y(G(x)))\right] + \mathbb{E}_x\left[\log(D_X(x))\right] \\
        \mathcal{L}_{\textsc{Gan}_{Y}} &=\mathbb{E}_y\left[\log(1-D_X(F(y)))\right] + \mathbb{E}_y\left[\log(D_Y(y))\right]
    \end{split}
\end{equation}
where generators $G$ and $F$ are updated such that $\mathcal{L}_{\textsc{Gan}_{X}}$ and $\mathcal{L}_{\textsc{Gan}_{Y}}$ are minimized, and discriminators $D_X$ and $D_Y$ are updated such that $\mathcal{L}_{\textsc{Gan}_{X}}$ and $\mathcal{L}_{\textsc{Gan}_{Y}}$ are maximized.

Finally, an input $x$ should be identical to itself after mapping onto its own distribution $\hat{x}=F(x)$, and similarly for $y$, constrained through an identity loss:
\begin{equation}
\label{eqn:loss_idt_xy}
    \begin{split}
        \mathcal{L}_{\textsc{Idt}_{X}}&=\mathbb{E}_x\left[||F(x)-x||_1\right] \\
        \mathcal{L}_{\textsc{Idt}_{Y}}&=\mathbb{E}_y\left[||G(y)-y||_1\right]
    \end{split}
\end{equation}
Our model's generators are each composed of encoder-decoder architectures similar to a four-block U-Net \cite{ronneberger2015u}. Our discriminators each resemble a contracting convolutional neural network. Network details are further elaborated in Appendix \ref{ref:network_design}.

\subsubsection{Center Point and Cell Type Generation}
\label{ref:center_point_and_cell_type_generation}

We previously defined our target dataset as $S_\textsc{Mask}=\{(y_i, k_i)\}_{i=1}^{N_\textsc{Mask}}$, where $y$ is a synthetic colored segmentation mask tile and $k$ is its matching cell center point distance map and binary cell expression masks. In this section, we aim to predict $k\in \mathcal{K}$ from a given Ki67 input tile $x$, which contains the critical information for achieving simple end-to-end quantification.

To generate $\hat{k}$, we add a second decoder branch $\textsc{Dec}_k$ to our generator $G$ so that it generates outputs $\{(\hat{y}, \hat{k})\}=G(x)$ through its main and secondary branches simultaneously. For clarity, we hereafter denote the output of $G$ through $\textsc{Dec}_k$ as $\hat{k}=G_K(x)$, where $\hat{k}\in \mathcal{\hat{K}}$. Our generator thus resembles $G: \mathcal{X} \rightarrow \{(\mathcal{Y}, \mathcal{K})\}$. Our second decoder branch $\textsc{Dec}_k$ follows the same structure and input as its main decoder branch $\textsc{Dec}_y$, but we detach the input embeddings from the gradient graph in this branch, since the proxy task already embeds all necessary information for $\textsc{Dec}_k$ in its encoder embeddings (i.e., cell expression and shape).
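
As an illustrative formalization of this design, let $\textsc{Enc}$ denote the shared encoder of $G$ and let $\operatorname{sg}(\cdot)$ denote a stop-gradient operator, which acts as the identity in the forward pass and blocks gradients in the backward pass. Both symbols are introduced here only for illustration and are not used elsewhere in this paper. Under this reading, the two branches can be sketched as:
\begin{equation*}
    \begin{split}
        \hat{y} &= \textsc{Dec}_y\left(\textsc{Enc}(x)\right) \\
        \hat{k} &= G_K(x) = \textsc{Dec}_k\left(\operatorname{sg}\left(\textsc{Enc}(x)\right)\right)
    \end{split}
\end{equation*}
so that losses computed on $\hat{k}$ update $\textsc{Dec}_k$ but do not alter the encoder, which is shaped by the proxy task alone.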

To train $\textsc{Dec}_k$, we add three learning objectives to the model. First, we introduce an additional cycle consistency loss between real $k\in \mathcal{K}$ and reconstructed $\hat{k}\in \mathcal{\hat{K}}$:
\begin{equation}
\label{eqn:loss_cyc_k}
    \begin{split}
        \mathcal{L}_{\textsc{Cyc}_K}&=\mathbb{E}_{y, k}\left[||G_K(F(y))-k||_1\right]
    \end{split}
\end{equation}

To encourage generation of realistic cell center points and expressions, we add a new discriminator $D_K$ which produces logits predicting whether an input is from domain $\mathcal{K}$ or $\mathcal{\hat{K}}$, along with a new adversarial loss: 
\begin{equation}
\label{eqn:loss_gan_k}
    \begin{split}
        \mathcal{L}_{\textsc{Gan}_K} &= \mathbb{E}_x\left[\log(1-D_K(G_K(x)))\right] + \mathbb{E}_k\left[\log(D_K(k))\right]
    \end{split}
\end{equation}
where generator $G_K$ is updated such that $\mathcal{L}_{\textsc{Gan}_K}$ is minimized, and discriminator $D_K$ is updated such that $\mathcal{L}_{\textsc{Gan}_K}$ is maximized. Discriminator $D_K$ follows the same internal structure as discriminators $D_X$ and $D_Y$.

Finally, we tie our identity constraint to the encoding of $y$ by evaluating $\hat{k}=G_K(y)$ and comparing it with its matching $k$:
\begin{equation}
\label{eqn:loss_idt_k}
    \begin{split}
        \mathcal{L}_{\textsc{Idt}_K}&=\mathbb{E}_{y, k}\left[||G_K(y)-k||_1\right]
    \end{split}
\end{equation}

The model architecture and loss constraints introduced in this section encourage generation of convincing $\hat{k}$ through $\hat{k}=G_K(x)$, which is a simple, single-stage process.

\subsection{Objective Function}

Our model's overall objective function combines Equations \ref{eqn:loss_cyc_xy} through \ref{eqn:loss_idt_k}:
\begin{equation}
    \begin{split}
        \mathcal{L}_{model}=
        \lambda\mathcal{L}_{\textsc{Cyc}_{X}}+\beta\mathcal{L}_{\textsc{Gan}_{X}}&+\gamma\mathcal{L}_{\textsc{Idt}_{X}}\\
        +\lambda\mathcal{L}_{\textsc{Cyc}_{Y}}+\beta\mathcal{L}_{\textsc{Gan}_{Y}}&+\gamma\mathcal{L}_{\textsc{Idt}_{Y}}\\
        +\lambda\mathcal{L}_{\textsc{Cyc}_K}+\beta\mathcal{L}_{\textsc{Gan}_K}&+\gamma\mathcal{L}_{\textsc{Idt}_K}
    \end{split}
\end{equation}
where $\lambda$, $\beta$, and $\gamma$ are scaling terms. We use $\lambda=10$, $\beta=1$, and $\gamma=10$ in this paper, following CycleGAN \cite{zhu2017unpaired}. $\mathcal{L}_{model}$ is simultaneously minimized with respect to generators $G$ and $F$ and maximized with respect to discriminators $D_X$, $D_Y$, and $D_K$.
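
Read as a single optimization problem, the statement above corresponds to the usual adversarial min-max form. The following is only a compact restatement of the previous sentence, where the starred symbols (introduced here for illustration) denote the trained generators:
\begin{equation*}
    G^{*}, F^{*} = \argmin_{G, F}\;\max_{D_X, D_Y, D_K}\;\mathcal{L}_{model}
\end{equation*}
Objectives of this kind are typically optimized by alternating gradient updates between the generators and the discriminators; the exact update schedule is not specified in this section.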

\subsection{End-To-End Scoring}

During inference, we extract predicted cell center points and cell expressions from the generated $\hat{k}=G_K(x)$ using a local maxima algorithm (see Appendix \ref{ref:e2e_quant}). We then aggregate the cell counts at the slide level and calculate the slide-level scoring following $\frac{\text{Ki67}^+}{{\text{Ki67}^+}+{\text{Ki67}^-}}$. 
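
To make the aggregation explicit, one way to write the slide-level score is the following, where $n_t^{+}$ and $n_t^{-}$ are illustrative symbols (not used elsewhere in this paper) for the numbers of predicted Ki67-positive and Ki67-negative cells in region $t$ of a slide with $T$ regions:
\begin{equation*}
    \text{Score} = \frac{\sum_{t=1}^{T} n_t^{+}}{\sum_{t=1}^{T} n_t^{+}+\sum_{t=1}^{T} n_t^{-}}
\end{equation*}
In this form, counts are pooled over the slide before the ratio is taken, rather than averaging per-region ratios. As a purely hypothetical example, a slide with $150$ predicted positive cells and $450$ predicted negative cells in total would receive a score of $150/(150+450)=0.25$, i.e., $25\%$.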

\section{Results}

\subsection{Experiment Design}

\begin{description}[style=unboxed,leftmargin=0cm]

\item[Model Comparison] We compared our method against a state-of-the-art supervised model, DeepLIIF \cite{ghahremani2022deep}, and a supervised U-Net \cite{ronneberger2015u}, both trained on the public \textbf{Breast Tumor Cell Dataset (BCData)} \cite{huang2020bcdata} using the same settings as in their original papers. 
Dataset details are provided in Section \ref{ref:head2head}. 

\item[Error Metrics] Mayo Clinic pathologists used a 20\% cut-off for diagnosing Ki67-high vs. Ki67-low cases based on the ASCO guideline \cite{asco2022guideline} in the clinical diagnosis reports. Therefore, in our internal data experiments, we used the same cut-off value to classify a case from each method's case-level Ki67 score prediction. We then compared results by evaluating precision, recall, accuracy, and F1 score. For our experiments on the public BCData dataset, we instead evaluated the mean absolute error (MAE) of cell counts across images, as used in the BCData paper. In the tables, arrows represent the direction of optimal performance for each metric. Error metric details are provided in Appendix \ref{ref:error_metrics}.


\end{description}

% \item[Random Split]
\subsection{Internal Dataset Experiments}
\label{ref:two_split}

We performed a two-split experiment to reflect the performance of the model in a general use case, where we randomly assigned 1,532 cases to the train split and 594 cases to the test split, with tiles corresponding to the cases aggregated into their split. 
% Our selection of case counts per split correspond to Section \ref{ref:generalization_studies} to enable comparison. 
Our framework outperformed both supervised models on all error metrics except precision (Table \ref{tab:random}).

\begin{table}[h!]
    \centering
    \sisetup{round-mode = places, 
        round-precision = 2, 
        text-series-to-math = true,
        propagate-math-font = true
    }
    \caption{Random train/test split; comparison against clinical diagnosis}%
    \begin{tabular}{lSSSS} 
        \toprule
        {Model} & {Precision$\uparrow$} & {Recall$\uparrow$} & {Accuracy$\uparrow$} & {F1$\uparrow$} \\
        \midrule\midrule
        {UNet} & \textbf{\num{1.0}} & \num{0.6712328767123288} & \num{0.9190556492411467} & \num{0.8032786885245902} \\ 
        {DeepLIIF} & \num{0.9537037037037037} & \num{0.7054794520547946} & \num{0.9190556492411467} & \num{0.8110236220472441} \\ 
        \midrule
        {IHCScoreGAN} & \num{0.9366} & \textbf{\num{0.9568}} & \textbf{\num{0.9747}} & \textbf{\num{0.9466}} \\
        \bottomrule
    \end{tabular}
    \label{tab:random}
\end{table}

% \item[Generalization Studies]
% \subsection{Clinical Simulation Experiment}
\label{ref:generalization_studies}

We also evaluated the clinical use case of our proposed framework, mimicking deployment in a real-world setting where the model was trained on older data collected from 2012 to 2018 and then evaluated on cases collected after 2018. This division resulted in 1,532 cases in the train split and 594 cases in the test split, with tiles corresponding to the cases aggregated into their split. Each model exhibited a reduction in performance in this experiment (Table \ref{tab:generalization}). We observed that the after-2018 samples were more challenging due to an increase in background noise, such as darker stroma staining (Figure \ref{fig:qualitative}).

\begin{table}[h!]
    \centering
    \sisetup{round-mode = places, 
        round-precision = 2, 
        text-series-to-math = true,
        propagate-math-font = true
    }
    \caption{Chronological train/test split; comparison against clinical diagnosis}%
    \begin{tabular}{lSSSS} 
        \toprule
        {Model} & {Precision$\uparrow$} & {Recall$\uparrow$} & {Accuracy$\uparrow$} & {F1$\uparrow$} \\
        \midrule\midrule
        {UNet} & \textbf{\num{0.9836065573770492}} & \num{0.4580152671755725} & \num{0.8787878787878788} & \num{0.625} \\ 
        {DeepLIIF} & \num{0.8961038961038961} & \num{0.5267175572519084} & \num{0.8821548821548821} & \num{0.6634615384615384} \\ 
        \midrule
        {IHCScoreGAN} & \num{0.89} & \textbf{\num{0.95}} & \textbf{\num{0.96}} & \textbf{\num{0.92}} \\
        \bottomrule
    \end{tabular}
    \label{tab:generalization}
\end{table}

\begin{figure}[h!]
\floatconts
  {fig:qualitative}
  {\caption{Qualitative comparison. Red and blue indicate positive and negative cell predictions, respectively; magenta indicates situations where the supervised model failed.}}
  {\includegraphics[width=1\linewidth]{images/Qualitative5.PNG}}
\end{figure}
\subsection{Cutoff Analysis}

In practice, cases very close to the classification cutoff may be equivocal. Using the results in the clinical simulation experiment in Section \ref{ref:generalization_studies}, we relaxed the classification of equivocal cases by using each interval from $\pm 1\%$ to $\pm 10\%$ around the 20\% ASCO cutoff point (Figure \ref{fig:interval}) and classifying cases both predicted and labeled within the interval as correct. Each model improved as the cutoff interval was relaxed, but we note that IHCScoreGAN exhibited near-optimal performance when relaxing the cutoff interval by just $\pm 2\%$, reporting 0.93, 0.99, 0.98, and 0.96 in precision, recall, accuracy and F1 score, respectively (see Appendix \ref{ref:cutoff_details} for precise values).
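
One way to formalize this relaxation, under our reading of the procedure, is as follows. Let $s$ and $\hat{s}$ be the clinically reported and predicted Ki67 scores of a case (in percent), and let $\epsilon$ be the interval half-width; these symbols are introduced here only for illustration. In addition to cases whose $s$ and $\hat{s}$ fall on the same side of the $20\%$ cutoff, a case is counted as correct if
\begin{equation*}
    \left|s - 20\right| \leq \epsilon \quad\text{and}\quad \left|\hat{s} - 20\right| \leq \epsilon
\end{equation*}
For instance, with $\epsilon=2$, a hypothetical case with $s=19$ and $\hat{s}=21$ falls on opposite sides of the cutoff but would be counted as correct, since both values lie in $[18, 22]$.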

\begin{figure}[h!]
\floatconts
  {fig:interval}
  {\caption{Effect of relaxing the case classification cutoff interval.}}
  {\includegraphics[width=0.90\linewidth]{images/Interval2.PNG}}
\end{figure}
\subsection{External Dataset Experiments}
\label{ref:head2head}

We validated our framework on a public breast cancer dataset, \textbf{BCData}, which features 1,338 Ki67 stain images taken at $40\times$ magnification by a Motic BA600-4 scanner, along with 181,074 manually annotated cell center points and cell types \cite{huang2020bcdata}. BCData splits 803, 133, and 402 images for training, validation, and testing, respectively. We compared the cell counting performance of our framework against two supervised models, both of which were trained using BCData's training images and cell annotation labels. The results are in Table \ref{tab:head2head}, where `MP', `MN', and `MA' represent the MAE for positive cells, the MAE for negative cells, and the average of the two, respectively. Our unsupervised framework did not outperform the supervised state-of-the-art DeepLIIF, yet still achieved comparable performance without needing the annotation labels. Both our framework and DeepLIIF outperformed all supervised baseline models reported in \cite{huang2020bcdata}. It is possible that the training sample size limited our framework's performance, as suggested by our sample size experiments in Appendix \ref{ref:sample_size}.
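
As a quick check of how the columns of Table \ref{tab:head2head} relate, the average value is the mean of the positive and negative MAE. Taking the UNet row with its underlying values given to three decimals, for example:
\begin{equation*}
    \text{MA} = \frac{\text{MP}+\text{MN}}{2} \approx \frac{7.881+22.893}{2} \approx 15.39
\end{equation*}
The same relation holds for the other rows, e.g., $(6.428+14.321)/2 \approx 10.37$ for IHCScoreGAN trained on BCData.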

\begin{table}[htbp]
    \centering
    \caption{Experiments on the external BCData dataset; comparison of cell counting}%
    \sisetup{round-mode = places, round-precision = 2, 
        text-series-to-math = true,
        propagate-math-font = true}
    \begin{tabular}{cccSSS} 
        \toprule
        {Model} & {Trained On} && {\text{MP}$\downarrow$} & {\text{MN}$\downarrow$} & {\text{MA}$\downarrow$} \\
        \midrule\midrule
        {UNet} & {BCData} && {\num{7.880597014925373}} & {\num{22.893034825870647}} & {\num{15.38675}} \\
        {DeepLIIF} & {BCData} && {\textbf{\num{5.440298507462686}}} & {\textbf{\num{12.22636815920398}}} & {\textbf{\num{8.83325}}} \\
        \midrule
        {IHCScoreGAN} & {Internal} && {\num{9.912935323383085}} & {\num{17.875621890547265}} & {\num{13.89425}} \\
        {IHCScoreGAN} & {BCData} && {\num{6.4278606965174125}} & {\num{14.32089552238806}} & {\num{10.3743}} \\
        \bottomrule
    \end{tabular}
    \label{tab:head2head}
\end{table}

\vspace{-2em}

\section{Discussion}

In this work, we proposed the first unsupervised framework, IHCScoreGAN, for end-to-end
Ki67 scoring. We validated our method on 2,126 breast cancer cases and showed high
agreement with clinical diagnoses provided by our pathologists. Experimental comparison against pre-trained supervised methods on our internal dataset showed a substantial advantage for our framework; pre-trained DeepLIIF often erroneously segmented stroma tissue and missed or misclassified challenging cells, likely as a result of differences from its training domain (Figure \ref{fig:qualitative}). On external data, our framework achieved cell counting performance close to that of the fully supervised state-of-the-art DeepLIIF, and superior to that of other supervised models, without needing the training annotations. 
This work is limited by the lack of comparisons with supervised models fine-tuned on our dataset; such an experiment is outside the scope of this work due to the complexity of assessing it fairly. We also do not account for inter-observer variability in our clinical diagnoses.

\acks{
We acknowledge all the support from the Division of Computational Pathology and AI, Department of Laboratory Medicine and Pathology, Mayo Clinic: Dr. Steven N. Hart, Debra A. Novak, Katelyn A. Reed, Dr. Daniel Macaulay. }

\bibliography{midl24_298}


\appendix
\section{Additional Implementation Details}

\subsection{Network Design}
\label{ref:network_design}
Our generators $G$ and $F$ follow encoder-decoder architectures similar to a four-block U-Net \cite{ronneberger2015u}. The encoders consist of four contracting ``down-convolution'' blocks, each consisting of a convolution layer (with a kernel size of 4, stride of 2, and padding of 1), followed by a LeakyReLU activation layer and an instance normalization layer. There are 64 convolution channels in the first encoder block, which increase by a factor of 2 in each block (up to 512 channels). The encoders have 8 additional residual blocks, each consisting of an in-place convolution (with kernel size 3), again followed by a LeakyReLU activation layer and an instance normalization layer. The decoders consist of 8 residual blocks of the same definition, followed by four expanding ``up-convolution'' blocks, each consisting of a transposed convolution layer (with a kernel size of 4, stride of 2, and padding of 1), followed by a LeakyReLU activation layer and an instance normalization layer. There are 512 convolution channels in the first decoder block, which decrease by a factor of 2 in the first three blocks (down to 64 channels) and end with a convolution onto 3 channels in the last decoder block. The output of each down-convolution block in the encoder is given a skip-connection into the corresponding up-convolution block in the decoder (i.e., the output of the third down-convolution block is concatenated with the output of the first up-convolution block, which is the input of the second up-convolution block). Finally, we apply a $\tanh$ activation to the outputs of the last decoder block, which yields the final model prediction.

As mentioned in Section \ref{ref:center_point_and_cell_type_generation}, generator $G$ additionally has a second decoder with the same structure as the main decoder described above, including identical skip-connections coming from the encoder, except that its gradients do not flow into the encoder during backpropagation.

Our discriminators $D_X$, $D_Y$, and $D_K$ consist of similar four-block encoders, described above, followed by a convolution layer (with a kernel size of 4, stride of 1, and padding of 1) onto a single output channel. The result is a patch-wise real/fake logit prediction.
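
As a sanity check on these dimensions, the standard output-size formulas for convolutions and transposed convolutions can be applied to a $256\times 256$ input tile. This is a sketch assuming no dilation, no output padding, and residual blocks that preserve spatial size, none of which is stated explicitly above; $n$, $k$, $s$, and $p$ denote the input size, kernel size, stride, and padding:
\begin{equation*}
    \begin{split}
        n_{\text{down}} &= \left\lfloor \frac{n+2p-k}{s} \right\rfloor + 1 = \left\lfloor \frac{256+2-4}{2} \right\rfloor + 1 = 128 \\
        n_{\text{up}} &= (n-1)s-2p+k = (16-1)\cdot 2-2+4 = 32
    \end{split}
\end{equation*}
Each down-convolution block therefore halves the spatial resolution ($256\rightarrow 128\rightarrow 64\rightarrow 32\rightarrow 16$), and each up-convolution block doubles it. Under the same assumptions, the final stride-1 convolution of a discriminator maps a $16\times 16$ feature map to size $16+2-4+1=15$, i.e., a $15\times 15$ grid of patch-wise logits.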

\subsection{End-To-End Scoring Details}
\label{ref:e2e_quant}

During inference, we achieve end-to-end scoring from the generated $\hat{k}=G_K(x)$ by using a simple local maxima algorithm on the cell center point maps for instance detection, followed by an $\argmax$ operation for each detected instance across the binary type maps for cell expression prediction. We stitch together $\hat{k}$ before we run the extraction algorithm, which resolves prediction artifacts at the tile borders.

To formally define the local maxima algorithm, we further define $\{(\hat{m}, \hat{b})\}\in \hat{k}$, where $\hat{m}$ is the center point distance map and $\hat{b}$ is the group of binary type segmentation masks. The local maxima algorithm can then be defined as:
\begin{equation}
    T = \{\argmax(\hat{b}_{ij}): (\hat{m}_{ij}>\omega) \land (\hat{m}_{ij} = \max_{i-\delta\leq h \leq i+\delta, j-\delta\leq w \leq j+\delta}\hat{m}_{hw})\}
\end{equation}
where $\delta$ is a given neighborhood size, $\omega$ is a given threshold, $(i, j)$ are indices of pixels within the center point distance map $\hat{m}$, and the $\argmax$ is taken across the binary type maps $\hat{b}$ at pixel $(i, j)$.

This is a common, simple 2D local maximum algorithm (e.g., \cite{brieu2019domain}). We used the \verb|maximum_filter| function implemented in the \verb|scipy| package in Python and marked a pixel as a cell instance if it is the maximum of its own neighborhood (within $\delta$ pixels) and exceeds the threshold $\omega$. In our experiments, we used a neighborhood size $\delta$ of 25 and a threshold $\omega$ of 0.5.
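
The extraction step can be sketched in Python as follows. This is a minimal illustration rather than the released implementation: the function name \verb|extract_cells|, the array layout, and the channel order of the type maps are hypothetical, and the window side length $2\delta+1$ is an assumption chosen to match the equation above.

\begin{verbatim}
import numpy as np
from scipy.ndimage import maximum_filter


def extract_cells(center_map, type_maps, delta=25, omega=0.5):
    """Return (row, col, type_index) for each detected cell center.

    center_map: (H, W) array, the center point map m-hat.
    type_maps:  (C, H, W) array, the binary type maps b-hat.
    """
    # Maximum of the (2*delta+1)-wide square window around each pixel.
    window_max = maximum_filter(center_map, size=2 * delta + 1)
    # Keep pixels that equal their window maximum and exceed omega.
    is_peak = (center_map == window_max) & (center_map > omega)
    rows, cols = np.nonzero(is_peak)
    # Argmax across the type channels at each detected center.
    types = np.argmax(type_maps[:, rows, cols], axis=0)
    return list(zip(rows.tolist(), cols.tolist(), types.tolist()))


# Toy example with hypothetical values.
m = np.zeros((60, 60))
m[10, 10] = 0.9  # cell center
m[12, 12] = 0.7  # within delta of a stronger peak: suppressed
m[30, 30] = 0.3  # below omega: ignored
m[50, 50] = 0.8  # second cell center
b = np.zeros((2, 60, 60))
b[0, 10, 10] = 1.0
b[1, 50, 50] = 1.0
cells = extract_cells(m, b)  # [(10, 10, 0), (50, 50, 1)]
\end{verbatim}

In this sketch, a flat plateau above $\omega$ would be reported as several adjacent instances, since every pixel on the plateau equals its window maximum.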

\subsection{Formal Definition of Error Metrics}
\label{ref:error_metrics}

For our internal dataset experiments, we compare classification against a two-class ground truth, based on counting of cell types. For this binary classification problem, we evaluate precision, recall, accuracy, and F1 score, which are commonly used binary classification metrics. We define positive cases classified as positive as True Positives (TP), negative cases classified as negative as True Negatives (TN), positive cases classified as negative as False Negatives (FN), and negative cases classified as positive as False Positives (FP). We then formally define our binary classification error metrics:

\begin{equation}
    \begin{split}
    \text{Precision} &= \frac{\text{TP}}{\text{TP}+\text{FP}} \\
    \text{Recall} &= \frac{\text{TP}}{\text{TP}+\text{FN}} \\
    \text{Accuracy} &= \frac{\text{TP}+\text{TN}}{\text{TP}+\text{TN}+\text{FP}+\text{FN}} \\
    \text{F1 Score} &= \frac{2\,\text{TP}}{2\,\text{TP}+\text{FP}+\text{FN}}
    \end{split}
\end{equation}
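
The four definitions above translate directly into code. The following Python sketch uses a hypothetical helper name and made-up counts, and assumes all denominators are nonzero.

\begin{verbatim}
def binary_metrics(tp, tn, fp, fn):
    """Precision, recall, accuracy, and F1 from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, accuracy, f1


# Hypothetical counts, for illustration only.
p, r, a, f1 = binary_metrics(tp=90, tn=300, fp=10, fn=20)
\end{verbatim}

The count-based F1 expression used here is equivalent to the harmonic mean of precision and recall, $2PR/(P+R)$.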

For external dataset experiments on BCData, we used the Mean Absolute Error (MAE) to compare cell counting performance, following the BCData authors, who used it to evaluate cell counting on their dataset \cite{huang2020bcdata}. It is formally defined as:

\begin{equation}
    \begin{split}
    \text{MAE}^{category} &= \frac{\sum_{i=1}^{n}|c_i^{category} - \hat{c}_i^{category}|}{n} \\
    \end{split}
\end{equation}
where $category$ denotes either the cell count for a given biomarker expression or the total cell count, $n$ is the number of images, and $c_i^{category}$ and $\hat{c}_i^{category}$ are the ground truth and predicted cell counts for image $i$ in that category, respectively.
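
As an illustration of the definition above, the MAE for one category can be computed as in the following Python sketch. The function name and the per-image counts are hypothetical.

\begin{verbatim}
def mean_absolute_error(true_counts, predicted_counts):
    """MAE between per-image true and predicted cell counts."""
    if len(true_counts) != len(predicted_counts):
        raise ValueError("count lists must have the same length")
    n = len(true_counts)
    pairs = zip(true_counts, predicted_counts)
    return sum(abs(c - c_hat) for c, c_hat in pairs) / n


# Hypothetical counts for one category on three images.
mae = mean_absolute_error([10, 20, 30], [12, 18, 33])  # 7 / 3
\end{verbatim}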

\subsection{Training Hyperparameters}

In each experiment, we trained our model for 40,000 iterations with the Adam optimizer, a learning rate of 0.0002, and a batch size of 4. Each value was selected through a multi-run grid search to optimize the model's quantitative performance on a separate internal test dataset that is distinct from the data used in the experiments in this work.

\subsection{Hardware Details}

We trained our model and performed all experimental validations in this paper using an NVIDIA\textsuperscript{\textregistered} RTX\textsuperscript{\texttrademark} A4000 GPU with 16~GB of memory. Our CPU was a 12-core, 3.2~GHz Intel\textsuperscript{\textregistered} Xeon\textsuperscript{\textregistered} w5-3435X.

\section{Supplementary Experimental Results}
\subsection{Cutoff Analysis Details}
\label{ref:cutoff_details}

Table \ref{tab:full_dataset} lists the exact values of the points plotted in Figure \ref{fig:interval}.

\begin{table}[htbp]
    \centering
    \caption{Effect of relaxing the 20\% Ki67-positive cutoff threshold for case stratification}%
    \label{tab:full_dataset}
    \sisetup{round-mode = places, round-precision = 2}
    \begin{tabular}{ccSSSScSSSS} 
        \toprule
        \multirow{2}{*}{Cutoff} && \multicolumn{4}{c}{DeepLIIF} && \multicolumn{4}{c}{IHCScoreGAN} \\
        \cmidrule{3-6} \cmidrule{8-11}
        && {Precision} & {Recall} & {Accuracy} & {F1} && {Precision} & {Recall} & {Accuracy} & {F1} \\
        \midrule\midrule
        {$20\%$} && \num{0.388} & \num{0.855} & \num{0.670} & \num{0.533} && \num{0.892} & \num{0.947} & \num{0.963} & \num{0.919}\\
        {$20\pm1\%$}  && \num{0.388} & \num{0.855} & \num{0.670} & \num{0.533} && \num{0.901} & \num{0.962} & \num{0.968} & \num{0.931}\\
        {$20\pm2\%$}  && \num{0.398} & \num{0.880} & \num{0.675} & \num{0.548} && \num{0.926} & \num{0.986} & \num{0.978} & \num{0.955}\\
        {$20\pm3\%$}  && \num{0.411} & \num{0.904} & \num{0.682} & \num{0.566} && \num{0.929} & \num{0.986} & \num{0.978} & \num{0.957}\\
        {$20\pm4\%$}  && \num{0.433} & \num{0.930} & \num{0.692} & \num{0.591} && \num{0.939} & \num{0.994} & \num{0.982} & \num{0.966}\\
        {$20\pm5\%$}  && \num{0.462} & \num{0.942} & \num{0.700} & \num{0.620} && \num{0.948} & \num{0.994} & \num{0.983} & \num{0.971}\\
        {$20\pm6\%$}  && \num{0.480} & \num{0.945} & \num{0.704} & \num{0.636} && \num{0.957} & \num{0.994} & \num{0.985} & \num{0.975}\\
        {$20\pm7\%$}  && \num{0.510} & \num{0.972} & \num{0.714} & \num{0.669} && \num{0.960} & \num{0.995} & \num{0.985} & \num{0.977}\\
        {$20\pm8\%$}  && \num{0.532} & \num{0.984} & \num{0.722} & \num{0.690} && \num{0.967} & \num{0.995} & \num{0.987} & \num{0.981}\\
        {$20\pm9\%$}  && \num{0.560} & \num{0.995} & \num{0.732} & \num{0.717} && \num{0.969} & \num{0.996} & \num{0.987} & \num{0.982}\\
        {$20\pm10\%$} && \num{0.599} & \num{0.996} & \num{0.746} & \num{0.748} && \num{0.980} & \num{0.996} & \num{0.990} & \num{0.988}\\
        \bottomrule
    \end{tabular}
\end{table}

\subsection{Ablation Studies}

We evaluated the effectiveness of (1) including random color intensities when creating the synthetic color masks and (2) detaching gradients from the inputs of the $\textsc{Dec}_K$ branch. We otherwise trained and evaluated our model as in Section \ref{ref:generalization_studies}. Both ablated variants attain higher precision, but the full model achieves the highest recall, accuracy, and F1 score (Table \ref{tab:ablation}).

\begin{table}[htbp]
    \centering
    \sisetup{round-mode = places, 
        round-precision = 2, 
        text-series-to-math = true,
        propagate-math-font = true
    }
    \caption{Model ablation studies; comparison against clinical diagnosis}%
    \begin{tabular}{lSSSS} 
        \toprule
        {Model} & {Precision} & {Recall} & {Accuracy} & {F1} \\
        \midrule\midrule
        {IHCScoreGAN, no color variation} & \textbf{\num{0.9818}} & \num{0.8244} & \num{0.9579} & \num{0.8963} \\
        {IHCScoreGAN, no detach} & \num{0.9810} & \num{0.7710} & \num{0.9461} & \num{0.8632} \\
        {IHCScoreGAN} & \num{0.8921} & \textbf{\num{0.9466}} & \textbf{\num{0.9630}} & \textbf{\num{0.9185}} \\
        \bottomrule
    \end{tabular}
    \label{tab:ablation}
\end{table}

\subsection{Sample Size Experiments}
\label{ref:sample_size}

We evaluated the data dependency of IHCScoreGAN by gradually increasing the number of randomly selected samples seen by the model during training. We used the same train and test splits as in Section \ref{ref:generalization_studies}. To evaluate model stability, we trained our framework five separate times with the same training samples and then calculated the mean and standard deviation of each error metric. Our results are shown in Table \ref{tab:sample_size}, where $N$ represents the number of unique samples seen by the model during training.
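
The per-metric summary over repeated runs can be sketched in Python as follows. The function name and the five values are hypothetical, and the use of the sample (rather than population) standard deviation is an assumption, since the estimator is not specified above.

\begin{verbatim}
from statistics import mean, stdev


def summarize_runs(values):
    """Mean and sample standard deviation of one metric over runs."""
    return mean(values), stdev(values)


# Hypothetical F1 scores from five training runs.
f1_mean, f1_std = summarize_runs([0.90, 0.92, 0.94, 0.96, 0.98])
\end{verbatim}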

\begin{table}[htbp]
    \centering
    \caption{IHCScoreGAN sample size experiments; comparison against clinical diagnosis}%
    \sisetup{round-mode = places, round-precision = 2}
    \begin{tabular}{ScScScScS} 
        \toprule
        {$N$} && {Precision} && {Recall} && {Accuracy} && {F1} \\
        \midrule\midrule
        {100} &&{\num{0.8084640798064209}$\pm$\num{0.17809812511608722}} && {\num{0.9251181388585967}$\pm$\num{0.0994355037817925}} && {\num{0.8132303426421074}$\pm$\num{0.1698040514456183}} && {\num{0.7445892841381222}$\pm$\num{0.164599708205284}} \\
        {250} &&{\num{0.847932245372544}$\pm$\num{0.16887567155393735}} && {\num{0.9142130134496547}$\pm$\num{0.08829934806266039}} && {\num{0.8682192293303406}$\pm$\num{0.13878631675422193}} && {\num{0.815417713704595}$\pm$\num{0.12972527834749867}} \\
        {500} &&{\num{0.9355340838059041}$\pm$\num{0.04324916802126467}} && {\num{0.9567430025445293}$\pm$\num{0.058579462764485175}} && {\num{0.8932178932178932}$\pm$\num{0.07478049367809014}} && {\num{0.8618735272439432}$\pm$\num{0.0576104616826226}} \\
        {800} && {\num{0.9447956496963259}$\pm$\num{0.06058034626388691}} && {\num{0.9089422028353326}$\pm$\num{0.08941704111421751}} && {\num{0.9402356902356903}$\pm$\num{0.029900525486092285}} && {\num{0.87009725903167}$\pm$\num{0.044627276835862525}} \\
        {1000} && {\num{0.9114289113247094}$\pm$\num{0.05446620313808181}} && {\num{0.943002544529262}$\pm$\num{0.07520106447660407}} && {\num{0.9032751760024488}$\pm$\num{0.06307001875537573}} && {\num{0.8873390802264581}$\pm$\num{0.030495162392353645}} \\
        {2000} && {\num{0.9328839853987615}$\pm$\num{0.04472989149667449}} && {\num{0.9451769604441362}$\pm$\num{0.06810739290151448}} && {\num{0.926557239057239}$\pm$\num{0.06322068589731805}} && {\num{0.89528317091843}$\pm$\num{0.04374295865107108}} \\
        {4000} && {\num{0.894589724886403}$\pm$\num{0.06639818635231225}} && {\num{0.9219025249559601}$\pm$\num{0.07716251457886178}} && {\num{0.9347175458286571}$\pm$\num{0.045157596339712384}} && {\num{0.8875390071650611}$\pm$\num{0.020995686182201528}} \\
        {8000} && {\num{0.9715795717493858}$\pm$\num{0.012747559234607731}} && {\num{0.9431721798134013}$\pm$\num{0.06254690098248215}} && {\num{0.9476243920688366}$\pm$\num{0.010836324996603237}} && {\num{0.8772634211428447}$\pm$\num{0.02723724047113123}} \\
        {16000} && {\num{0.955119846410489}$\pm$\num{0.051176725773578645}} && {\num{0.9541984732824428}$\pm$\num{0.06087756347290652}} && {\num{0.9263468013468015}$\pm$\num{0.034330812638528614}} && {\num{0.8831235349927588}$\pm$\num{0.030886007065784777}} \\
        {32000} && {\num{0.9562053816846742}$\pm$\num{0.04651851553465304}} && {\num{0.958969465648855}$\pm$\num{0.058377921862375684}} && {\num{0.9436026936026934}$\pm$\num{0.025229132449678962}} && {\num{0.9043361161005555}$\pm$\num{0.0092092548098679}} \\
        {40000} && {\num{0.9475469657715558}$\pm$\num{0.05557846805002715}} && {\num{0.9516539440203561}$\pm$\num{0.061333184698188034}} && {\num{0.9404040404040405}$\pm$\num{0.029709869923846152}} && {\num{0.8901450494620755}$\pm$\num{0.03071321238057966}} \\
        \bottomrule
    \end{tabular}
    \label{tab:sample_size}
\end{table}

\subsection{Cross Validation Studies}

\sisetup{round-mode = places, 
    round-precision = 2, 
    text-series-to-math = true,
    propagate-math-font = true
}

We evaluated our model for case-level classification using 5-fold cross-validation, with case-level folds chosen randomly across our entire internal dataset. Specifically, each training split consisted of approximately 1701 cases and each testing split of approximately 425 cases, with the tiles of each case assigned to the same split as the case. We additionally evaluated DeepLIIF and U-Net on each testing split for reference. The results averaged across all five splits are shown in Table \ref{tab:cross_validation}.
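
The case-level splitting described above can be sketched in Python as follows. The function name, the tile-to-case mapping, and the round-robin assignment of shuffled cases to folds are hypothetical; the sketch only illustrates that all tiles of a case land in the same fold.

\begin{verbatim}
import random
from collections import defaultdict


def case_level_folds(tile_to_case, k=5, seed=0):
    """Split tiles into k folds so that no case spans two folds."""
    cases = sorted(set(tile_to_case.values()))
    random.Random(seed).shuffle(cases)
    # Assign shuffled cases to folds in round-robin order.
    fold_of_case = {case: i % k for i, case in enumerate(cases)}
    folds = defaultdict(list)
    for tile, case in tile_to_case.items():
        folds[fold_of_case[case]].append(tile)
    return [folds[i] for i in range(k)]


# Hypothetical data: 10 cases with 3 tiles each.
tiles = {f"case{c}_tile{t}": f"case{c}"
         for c in range(10) for t in range(3)}
folds = case_level_folds(tiles)
\end{verbatim}

In each of the $k$ experiments, one fold would serve as the testing split and the remaining folds as the training split.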

\begin{table}[htbp]
    \centering
    \sisetup{round-mode = places, 
        round-precision = 2, 
        text-series-to-math = true,
        propagate-math-font = true
    }
    \caption{Cross-validation studies; comparison against clinical diagnosis}%
    \begin{tabular}{cSSSS} 
        \toprule
        {Model} & {Precision} & {Recall} & {Accuracy} & {F1} \\\midrule\midrule
        {U-Net} & \textbf{\num{0.9932027649769586}$\pm$\num{0.008342795848033038}} & {\num{0.6253934570468725}$\pm$\num{0.03019333836073179}} & {\num{0.9115636564484948}$\pm$\num{0.00988038074175724}} & {\num{0.7671259515900561}$\pm$\num{0.023610771717636734}} \\
        {DeepLIIF} & {\num{0.9559490813462889}$\pm$\num{0.023461691836143155}} & {\num{0.6726033434650456}$\pm$\num{0.03948162547440419}} & {\num{0.9162695388014359}$\pm$\num{0.006475824186659931}} & {\num{0.788446068897315}$\pm$\num{0.021775208636155355}} \\
        {IHCScoreGAN} & {\num{0.9355741447323938}$\pm$\num{0.08521560552463697}} & \textbf{\num{0.9047564389697649}$\pm$\num{0.06439350754345188}} & \textbf{\num{0.9612137531068765}$\pm$\num{0.017594904377147128}} & \textbf{\num{0.9146023806088008}$\pm$\num{0.03424031670019001}} \\
        \bottomrule
    \end{tabular}
    \label{tab:cross_validation}
\end{table}

\end{document}