\documentclass{midl} % Include author names

% The following packages will be automatically loaded:
% jmlr, amsmath, amssymb, natbib, graphicx, url, algorithm2e
% ifoddpage, relsize and probably more
% make sure they are installed with your latex distribution

\usepackage{mwe} % to get dummy images
\usepackage{amsmath,graphicx}
\usepackage{multirow}
\usepackage{booktabs}
\usepackage{graphicx}
\usepackage{subcaption}
\usepackage{algorithm} 
\usepackage{adjustbox}
\usepackage{algpseudocode}\usepackage{xcolor}


\jmlryear{2025}
\jmlrworkshop{Full Paper -- MIDL 2025 submission}
\jmlrvolume{-- nnn}
\editors{Accepted for publication at MIDL 2025}

\title[Unsupervised Cellular Anomaly Detection]{Unsupervised Cellular Anomaly Detection in Toxicological Histopathology}
 % Use \Name{Author Name} to specify the name.
 % If the surname contains spaces, enclose the surname
 % in braces, e.g. \Name{John {Smith Jones}} similarly
 % if the name has a "von" part, e.g \Name{Jane {de Winter}}.
 % If the first letter in the forenames is a diacritic
 % enclose the diacritic in braces, e.g. \Name{{\'E}louise Smith}

 % Two authors with the same address
 % \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\and
 %  \Name{Author Name2} \Email{xyz@sample.edu}\\
 %  \addr Address}

 % Three or more authors with the same address:
 % \midlauthor{\Name{Author Name1} \Email{an1@sample.edu}\\
 %  \Name{Author Name2} \Email{an2@sample.edu}\\
 %  \Name{Author Name3} \Email{an3@sample.edu}\\
 %  \addr Address}


% Authors with different addresses:
% \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\\
% \addr Address 1
% \AND
% \Name{Author Name2} \Email{xyz@sample.edu}\\
% \addr Address 2
% }

%\footnotetext[1]{Contributed equally}

% More complicate cases, e.g. with dual affiliations and joint authorship
\midlauthor{\Name{Saketh Juturu\midljointauthortext{Contributed equally}}\Email{saketh.juturu@airamatrix.com}\\
\Name{Geetank Raipuria\midljointauthortext{Contributed equally}}\Email{geetank.raipuria@airamatrix.com}\\
\Name{Raghav Amaravadi} \Email{raghav.amaravadi@airamatrix.com}\\
\Name{Aman Srivastava} \Email{aman.shrivastava@airamatrix.com}\\
\Name{Malini Roy} \Email{malini.roy@airamatrix.com}\\
\Name{Nitin Singhal} \Email{nitin.singhal@airamatrix.com}\\
AIRA Matrix, Mumbai, India
}

\begin{document}

\maketitle

\begin{abstract}

Irregularities in cellular representation play a crucial role in assessing drug-induced tissue alterations in toxicological histopathology studies. However, annotating rare abnormal cellular variations to train supervised deep learning models is challenging and does not scale. While anomaly detection is well-suited for this purpose, it has not yet been explored for cellular-level analysis. In this study, we evaluate cellular anomaly detection using datasets derived from the kidney and liver tissue of Wistar rats. Our findings show that a K-Nearest-Neighbor (KNN) distance-based anomaly detection method benefits significantly from a feature extractor pre-trained on extensive unsupervised histopathology datasets. With the best-performing feature extractor, the KNN-distance method surpasses state-of-the-art anomaly detection models, including the denoising diffusion probabilistic model, by over 4.84\% (AUC) in detecting cellular anomalies. Additionally, we assess how well this method identifies differences in anomalous cell counts between control and treated animal tissues within a toxicological study, and find a statistically significant difference between the two dosage groups.

% We explore the use of K-Nearest-Neighbor(KNN) distance for AD and exploit foundation model trained on large-scale histopathology data.
\end{abstract}

\begin{keywords}
Anomaly Detection, Out-of-distribution Detection, Toxicology, Histopathology, Foundation Models, Cellular Analysis, Drug Safety Assessment.
\end{keywords}

\section{Introduction}

%Toxicological histopathology plays a critical role in non-clinical drug safety assessment, by determining the level of toxicity caused by a test drug in various tissues. It encompasses the study of whole slide images (WSI) derived from laboratory animals exposed to a test drug and aims to discover microscopic tissue alterations as a sign of toxicity. The variations in tissue from drug-treated animals are observed against tissue from a control group to identify drug induced abnormal characteristics \cite{greaves2011histopathology}. 
%Deep Learning based model have been explored extensively of assist pathologists in assessing tissue variations in form of cellular distribution, arrangement, and morphology \cite{mehrvar2021deep}. 
Toxicological histopathology is essential for non-clinical drug safety evaluations, as it assesses the extent of toxicity induced by a test drug across tissues. It involves analyzing whole slide images (WSI) obtained from laboratory animals that have been exposed to the test drug, with the goal of identifying microscopic tissue changes indicative of toxicity. By comparing the tissue variations in drug-treated animals to those in a control group, researchers can pinpoint abnormal characteristics caused by the drug \cite{greaves2011histopathology}.

Detecting deviations from normal cell representation is a crucial aspect of a pathologist's routine. For instance, single cell necrosis in liver tissue and neutrophils in kidney tissue are key indicators. Figure \ref{fig:subfigures} provides samples of normal and abnormal cells in liver and kidney tissue. These abnormalities occur in a very small fraction of the tissue and require analysis at high magnifications, at which the microscopic details of the cell are clearly visible. Consequently, the manual examination of tissue sections for cellular irregularities is labor-intensive and prone to interobserver variability.


Deep learning approaches for cell detection and classification have been widely investigated \cite{graham2019hover, baumann2024hover, horst2024cellvit} to aid pathologists in identifying cellular abnormalities. However, generating a large-scale labeled dataset by annotating various cellular anomalies among millions of normal cells is a time-consuming task, even for experienced pathologists. Additionally, while only a few cellular abnormalities are frequently observed, many others are rare.

%Identifying deviation from normal cell representation is an essential part of pathologist's routine. For example, Single Cell Necrosis in liver tissue and Neutrophils in kidney tissue. Such abnormalities are observed in a minuscule fraction of the tissue, and needs to be analysed at a high magnification. Thus, manual inspection of tissue sections for cellular abnormalities is intensive and suffers from interobserver variability. Deep learning based cell detection and classification has been extensively explored in the field of histopathology \cite{graham2019hover, baumann2024hover, horst2024cellvit}, to assist the pathologist in identifying cellular abnormalities. However, creating a large scale labeled dataset by annotating 
%various cellular abnormalities from million of normal cells is an exceedingly lengthy process, even for expert pathologists. Furthermore, only a few cellular abnormalities are commonly observed, and a large number of abnormalities are rare.

% However,  as it requires extensive review of high resolution images at high magnification, to identify cellular variations.



Anomaly detection (AD) is a vital component of medical image analysis, aimed at identifying deviations from established normal patterns. While there is an abundance of data exhibiting normal characteristics available for training AD models, abnormal data, which encompasses a wide range of variations from the normal, is often scarce or even unknown. AD alleviates the reliance on annotated data and enables the detection of previously unseen variations. This approach is particularly well-suited for preclinical toxicological studies, where unfamiliar representations of cellular variation may arise, making it impractical to train a generalized supervised model.

%Anomaly detection (AD) is an essential task in medical image analysis, to identify deviation from the known normal patterns. A large amount of data with normal characteristics is readily available for training the AD model, while abnormal data that consists of diverse variations from the normal is rare or possibly unknown. AD reduces dependency on annotated data, and allows identifying unknown variations. AD is especially suitable for preclinical toxicological studies, where an unknown representation of cellular variation may appear for which training a generalized supervised model is infeasible.  

\subsection{Related Work}
{Numerous studies have explored anomaly detection (AD) in medical image analysis \cite{bao2024bmad, cai2024medianomaly}. Among a variety of anomaly detection methods \cite{ruff2021unifying}, reconstruction-based, distance-based, or one-class classifier methods have been widely used. Reconstruction-based methods include autoencoders \cite{baur2021autoencoders}, Generative Adversarial Networks (GANs) \cite{goodfellow2020generative} and Denoising Diffusion Probabilistic Models (DDPMs) \cite{ho2020denoising} that learn to reconstruct normal images. The reconstruction error serves as a scoring function for detecting anomalous samples. Given that the reconstruction model has only seen normal images, a high loss is observed for anomalous samples.}
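
In generic form, and with notation introduced here only for illustration, let $g_{\theta}$ denote a reconstruction model trained on normal images. The anomaly score of an image $x$ can then be sketched as
\[
s_{rec}(x) = \lVert x - g_{\theta}(x) \rVert_2^2 ,
\]
where the squared $\ell_2$ error is one common choice and is assumed here; individual methods may use other error measures.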

%Numerous studies have explored anomaly detection (AD) in medical image analysis \cite{bao2024bmad, cai2024medianomaly}. This process involves training a computational model on a normal in-distribution (ID) dataset to identify unseen anomalies in a test dataset. AD methods can be broadly categorized into two types: reconstruction-based and projection-based. 


{Autoencoders and their variants, such as the Variational AE (VAE) \cite{kingma2013auto} and the Denoising AE \cite{kascenas2022denoising}, learn to reconstruct the input image from a low-dimensional latent space representation.}
GANs employ a generative adversarial approach to learn representations of normal images. For instance, F-AnoGAN \cite{schlegl2019f} uses a WGAN architecture combined with an additional encoder to map images into latent space for anomaly detection. Another study \cite{zehnder2022multiscale} incorporates multi-scale input images and perceptual loss to enhance contextual understanding.

{DDPMs partially corrupt normal tissue images by adding noise and then denoise the image for a fixed number of timesteps, reconstructing it from the remaining signal.} AnoDDPM \cite{wyatt2022anoddpm} suggests using Simplex noise for effective image corruption, while AutoDDPM \cite{bercea2023mask} enhances the robustness of diffusion models through the integration of automatic masking, stitching, and resampling techniques. Prior work \cite{cai2024medianomaly, bercea2023mask} found that AutoDDPM outperformed all other reconstruction-based methods.



%Numerous studies have examined AD for medical image analysis \cite{bao2024bmad, cai2024medianomaly}, including radiology, retinopathy and histopathology. It involves training a computational model on normal indistribution (ID) dataset to identify unseen anomalies in test dataset. The AD methods can be broadly segregated into two, Reconstruction-based and Projection-based methods. Reconstruction-based methods employ Generative Adversarial Models (GAN)\cite{goodfellow2020generative} or Denoising Diffusion Probabilistic Models(DDPM)\cite{ho2020denoising} learn to reconstruct normal images, and use the reconstruction error as the scoring function to identify anomalous samples. GANS uses generative adversarial technique to learn normal image representation. F-AnoGAN \cite{schlegl2019f} uses WGAN architecture and an additional encoder to map images to the latent space for anomaly detection. \cite{zehnder2022multiscale} uses multi-scale input images and a perceptual loss to enhance context. DDPMs apply partial diffusion to corrupt normal tissue images, followed by a denoising process to reconstruct the normal image. AnoDDPM \cite{wyatt2022anoddpm} proposed to use Simplex noise instead of Gaussian to effectively corrupt the images, \cite{bercea2023mask} further improved the robustness of diffusion models by the integration of automatic masking, stitching, and re-sampling techniques for anomaly detection. \cite{cai2024medianomaly} found AutoDDPM to outperform all other reconstruction-based methods. 



% a method for identifying data points that significantly deviate from the expected distribution (considered "out-of-distribution" or OOD) by projecting the data onto a lower dimensional space and analyzing how much the projections deviate from the "normal" pattern observed in the training data, essentially treating large deviations in the projected space as anomalies

% \textcolor{red}{Projection-based methods project data into a learned feature embedding space, to create a better separation between normal and anomalous samples. Samples deviating from normal training data are identified as anomalous. The feature extractor for these methods is trained on a class-labeled dataset comprising normal in-distribution class(es) \cite{wang2022vim, salehi2021multiresolution, li2023rethinking}, often through a proxy task such as tissue type classification \cite{zingman2024learning, dippel2024ai}, or it may be fine-tuned on the in-distribution dataset \cite{reiss2021panda}. One-class classifier \cite{zingman2024learning, ruff2018deep, yi2020patch, scholkopf2001estimating}, representation discrepancy in teacher-student pair \cite{salehi2021multiresolution, yamada2021reconstruction}, K-Nearest Neighbor (KNN) distance \cite{li2023rethinking, reiss2021panda, sun2022out} or a combination of information from logits and feature embeddings \cite{wang2022vim} can be used to obtain anomaly scores.}

{One-class classifier-based methods \cite{zingman2024learning, ruff2018deep, yi2020patch, scholkopf2001estimating} try to find a hypersphere enclosing the normal data to identify anomalous samples. Non-parametric K-nearest-neighbor (KNN) distance-based approaches search within a memory bank of normal features to obtain a distance-based anomaly score \cite{sun2022out}. Recently, representation discrepancy in a teacher-student pair \cite{salehi2021multiresolution, yamada2021reconstruction} and a combination of information from logits and feature embeddings \cite{wang2022vim} have also been explored. The above methods project data into a learned feature embedding space to create a better separation between normal and anomalous samples. The feature extractor for these methods is trained on a class-labeled dataset comprising normal in-distribution class(es) \cite{wang2022vim, salehi2021multiresolution, li2023rethinking}, often through a proxy task such as tissue type classification \cite{zingman2024learning, dippel2024ai}, or it may be fine-tuned on the in-distribution dataset \cite{reiss2021panda}.}


% Projection-based methods project data into a learned feature embedding space, to create a better separation between normal and anomalous samples. Samples deviating from normal training data are identified as anomalous. The feature extractor for these methods is trained on a class-labeled dataset comprising normal in-distribution class(es) \cite{wang2022vim, salehi2021multiresolution, li2023rethinking}, often through a proxy task such as tissue type classification \cite{zingman2024learning, dippel2024ai}, or it may be fine-tuned on the in-distribution dataset \cite{reiss2021panda}. One-class classifier \cite{zingman2024learning, ruff2018deep, yi2020patch, scholkopf2001estimating}, representation discrepancy in teacher-student pair \cite{salehi2021multiresolution, yamada2021reconstruction}, K-Nearest Neighbor (KNN) distance \cite{li2023rethinking, reiss2021panda, sun2022out} or a combination of information from logits and feature embeddings \cite{wang2022vim} can be used to obtain anomaly scores.

% has often been used to distinguish anomalous samples 

% , , or 

% Numerous 
% scoring function has been explored, including 

% Classifier probabilities \cite{zingman2024learning, dippel2024ai}, K-Nearest Neighbor (KNN) distance \cite{li2023rethinking, reiss2021panda, sun2022out}, or a combination of information from logits and feature embeddings \cite{wang2022vim} are used as the scoring function. Studies \cite{cai2024medianomaly, linmans2024diffusion} have shown that DDPMs outperform all projection-based models in anomaly detection for histopathology.

%Projection-based methods exploit the feature embedding space representation of normal and anomalous data, to delineate the two classes. Feature extractor for projection-based method is trained on a supervised dataset of normal indistribution classes \cite{wang2022vim, salehi2021multiresolution}, a proxy task like tissue type classification\cite{zingman2024learning, dippel2024ai} or fine-tuned on in-distribution dataset\cite{reiss2021panda}, followed by using classifier probabilities \cite{zingman2024learning, dippel2024ai} or K-Nearest Neighbor(KNN)-distance \cite{reiss2021panda,sun2022out} or a combination of information from logits and feature embedding \cite{wang2022vim} as scoring function. \cite{cai2024medianomaly, linmans2024diffusion} found DDPMs to outperform all projection-based models on AD for histopathology.

 % fine-tunes a pre-trained feature extractor on indistirbution data with center loss, and use KNN-distance as anomaly score. Other methods train feature extractor on auxiliary task like tissue classification  or use other tissue as proxy to anomlolous samples \cite{dippel2024ai},

% trains a feature extractor using an axillary loss followed by a one-class SVM classifier.  use tissue samples from  to train a classifier.





% \cite{linmans2023predictive} explored AnoDDPM
% \cite{bercea2023mask,linmans2024diffusion, cai2024medianomaly}. Similarly, GANS are used 

% for denoising corrupted images .


% Numerous studies have examined AD for medical image analysis \cite{bao2024bmad, cai2024medianomaly}, including radiology, retinopathy and histopathology. It involves training a computational model on normal indistribution (ID) dataset to identify unseen anomalies in test dataset. The AD methods can be broadly segregated into two, Reconstruction-based methods \cite{schlegl2019f, akcay2019ganomaly, graham2023denoising, bercea2023mask, wyatt2022anoddpm}  \cite{wang2022vim, sun2022out, reiss2021panda, li2021cutpaste, salehi2021multiresolution, ruff2018deep}. 
% Reconstruction-based methods employ generative models that learn to generate normal images, and use a reconstruction error as the scoring function to identify anomalous samples.

% Reconstruction-based methods  Pr

% \begin{figure}
% 	\centering
% 	\begin{subfigure}{\linewidth}
% 		\includegraphics[scale=0.5]{images/isbi-paper_Liver.png}
% 		\caption{Cells in Wistar Rat Liver Tissue}
% 		\label{fig:subfigA}
% 	\end{subfigure}
% 	\begin{subfigure}{\linewidth}
% 		\includegraphics[scale=0.5]{images/isbi-paper_Kidney.png}
% 		\caption{Cells in Wistar Rat Kidney Tissue}
% 		\label{fig:subfigB}
% 	\end{subfigure}
% 	\caption{Example cells in Liver and Kidney tissue. First Row: Normal cells, Second Row: Abnormal cells }
% 	\label{fig:cells}
% \end{figure}

% \begin{figure}[t!]
% 	\centering
% 	\begin{subfigure}[t]{0.3\linewidth}
%             \centering
% 		\includegraphics[width=\textwidth]{images/isbi-paper_Liver.png}
% 		\caption{Cells in Wistar Rat Liver Tissue}	    
% 	\end{subfigure}}
%         \vfill
% 	\begin{subfigure}[t]{0.3\textwidth}
%             \centering
% 		\includegraphics[width=\linewidth]{images/isbi-paper_Kidney.png}
% 		\caption{Cells in Wistar Rat Kidney Tissue}
% 	\end{subfigure}
% 	\caption{Showing three cars in different colors horizontally.}
% 	\label{fig:subfigures}
% \end{figure}

\begin{figure}[t!]
        \centering
        \includegraphics[scale=0.14]{images/isbi-paper_1.drawio.png}
	\caption{Examples of cells found in the liver and kidney tissues of Wistar rats. Anomalous cells include single cell necrosis, mitosis, and microgranuloma in liver; neutrophils and medullary nephrocalcinosis in kidney. {Each patch represents a tissue area of approximately $17 \times 17$ micrometers at $40\times$ magnification.}}
	\label{fig:subfigures}
\end{figure}

% Previous works have explored anomaly detection for histopathology \cite{zehnder2022multiscale, zingman2024learning, dippel2024ai, stepec2021unsupervised, pocevivciute2021unsupervised, linmans2024diffusion, cai2024medianomaly} to detect anomalous tissue using both generative and projection based approaches. Generative Adversarial Network (GAN) are trained to reconstruct healthy tissue and obtain higher reconstruction loss on unseen anomalous tissue \cite{zehnder2022multiscale, stepec2021unsupervised, pocevivciute2021unsupervised}. The reconstruction error between raw image and GAN output used as anomaly score. Similarly,  Lastly, denoising Auto-encoder have also been found to perform at par with DDPMs \cite{linmans2024diffusion}. 
% Projection-based methods for anomaly detection in Histhology involve training a feature extractor on auxiliary task to generate a compact representation of normal data distribution, followed by using classification probability as anomaly scores\cite{zingman2024learning, dippel2024ai}. 

\subsection{Motivation}
Existing approaches often validate their performance using anomalous samples that exhibit significant semantic differences from normal in-distribution data, for instance, tissue necrosis in liver tissue \cite{zingman2024learning} or tumors among benign tissue regions \cite{cai2024medianomaly, bao2024bmad, linmans2024diffusion, zingman2024learning}. Such far-out-of-distribution (Far-OOD) samples \cite{winkens2020contrastive, linmans2023predictive} are generally easier to differentiate from normal data.
In contrast, as illustrated in Figure \ref{fig:subfigures}, the anomalous cells we observe are classified as near-out-of-distribution (Near-OOD). These cells share semantic similarities with normal cells and only exhibit subtle differences. Our experiments indicate that this similarity results in limited performance for models benchmarked for Far-OOD detection. Additionally, many state-of-the-art {distance}-based methods rely on classifiers trained on {labeled class datasets} \cite{wang2022vim, salehi2021multiresolution, dippel2024ai}, which poses a challenge when such datasets are unavailable for pre-training. We aim to leverage advancements in foundation models that have been trained on large-scale unsupervised data, which have demonstrated the ability to outperform models trained with supervised data \cite{caron2021emerging, kang2023benchmarking, wolflein2023benchmarking}. This potential has largely been overlooked in previous research on anomaly detection.

%However, the existing approaches validate the performance on anomolous samples that have significant semantic dissimilarity from the normal indistribution data. For example, tissue necrosis in liver tissue \cite{zingman2024learning}or tumor alongst benign tissue\cite{cai2024medianomaly, bao2024bmad, linmans2024diffusion, zingman2024learning}. Such Far-OOD (Out-of-Distribution) samples \cite{winkens2020contrastive, linmans2023predictive}  are easier to distinguish from normal data. As seen in figure \ref{fig:subfigures}, the anomalous cells are Near-OOD, that is, cells exhibit semantic similarities to normal cells and only present subtle differences. We observe that this leads to the limited performance of the model benchmarked for Far-OOD detection, based on our experimentation.  Also, many of the state-of-the-art projection-based methods require a classifier trained on a labeled dataset \cite{wang2022vim,salehi2021multiresolution,dippel2024ai}. This is particularly limiting when no such dataset is available for pre-training. Lastly, we would like to exploit improvements in foundation models trained on large-scale unsupervised data, that have shown to even outperform models trained with supervised data \cite{caron2021emerging, kang2023benchmarking, wolflein2023benchmarking}. This has been largely unexplored by previous works on anomaly detection. 

% \cite{bao2024bmad, cai2024medianomaly, lagogiannis2023unsupervised, linmans2024diffusion}.
 
 % using multiscale data as input to a GAN, where as



% We found that SOTA methods lack the ability to effectively segregate different types of anomalous cells from normal cell distribution in preclinical toxicological studies on Liver and Kidney Tissue. 



% This is due to a multitude of reasons, as we observe in this work:

% \begin{enumerate}
%     \item 
%     \item Pre-trained transformers have shown to outperform ResNets, especially for Near-OOD detection, due to robustness to input distribution shifts, and natural adversarial examples and exhibit less texture bias \cite{fort2021exploring}. However, SOTA methods use features from ResNets for feature extraction.
%     \item Most existing Projection-based methods use feature extractor pre-trained on ImageNet weights that do not generalize well to histopathology or limited data for an auxiliary histopathology task that requires additional dataset curation.
    
% \end{enumerate}

% Goal of this work is to present a modern AD method which overcomes the shortcoming of the existing methods and provides a strong baseline for future development of Cellular Anomaly Detection. 

% For this, we  




% and can be further propelled by foundation models. 

% Recent developments in foundation models have revolutionized histopathology, and 

% Multiple foundation models have 
% Recent imprvement in 
% leverage Foundatio
% n


% We find that KN based on foundation model outperform existingmethods

We introduce a cutting-edge AD method that significantly surpasses existing techniques and establishes a robust baseline for future advancements. Our proposed method calculates the anomaly score based on the distance of a test sample to its K-Nearest Neighbors within the in-distribution feature embedding space, which consists of normal samples.
In contrast to previous studies that assessed KNN-distance-based anomaly detection \cite{reiss2021panda, sun2022out, linmans2024diffusion}, our approach leverages foundation models trained on extensive histopathology datasets to effectively differentiate between ID and OOD samples in the feature embedding space, leading to a notable enhancement in model performance. The key contributions of this work are summarized below:

%We present a modern anomaly detection method, that significantly outperforms existing methods and provides a strong baseline for future development. The proposed anomaly detection method uses the distance of a test sample to K-Nearest Neighbors in the in-distribution feature embedding space consisting of normal samples, as the anomaly score. Unlike prior work that evaluated KNN-distance based anomaly detection \cite{reiss2021panda, sun2022out, linmans2024diffusion}, our method exploits foundation models trained on large-scale histopathology data to segregate in-distribution and out-of-distriution samples in the feature embedding space, which significantly boost the model performance. The main contributions of this work are summarized below: 
\begin{enumerate}
    \item  To the best of our knowledge, we are the first to assess a deep learning model for cellular analysis in toxicological histopathology data using unsupervised anomaly detection techniques.
    % \item We train a transformer model using a self-supervised training regime, on 750M+ cell crops from 8 sources and two mice tissue types (Liver and Kidney).

    \item We evaluate state-of-the-art foundation models trained on large-scale histopathology datasets for KNN-distance-based unsupervised cellular anomaly detection.
    
    \item We demonstrate that our KNN-distance-based anomaly detection method, when paired with an effective feature extractor, outperforms state-of-the-art anomaly detection models, including diffusion models, in the context of cellular anomaly detection.
    
    % \item We create two cellular anomaly datasets to evaluate cellular anomaly benchmark, one for each liver and Kidney; to show that Cell-UAD  , beats state-of-the-art methods for cellular anomaly detection.
    
    \item Finally, in our evaluation of toxicological studies, we demonstrate that the unsupervised method is capable of identifying a higher proportion of cellular abnormalities in drug treated tissues compared to control tissues. 

\end{enumerate}

\begin{figure}[]
        \centering
        \includegraphics[scale=0.12]{images/flow_chart_30_01.drawio.png}
	\caption{KNN-distance based anomaly detection approach for cellular anomaly detection. A feature extractor trained on large-scale unsupervised histopathology data is employed to obtain feature embeddings from in-distribution data derived from control whole slide images (WSI). The anomaly score for a test patch is determined by calculating the distance to its K-nearest neighbors within the in-distribution feature space. }
	\label{fig:flowchart}
\end{figure}

In the following sections, we outline the KNN-distance-based anomaly detection method, then describe the experimental setup and the results used to identify the best foundation model for our approach. We also compare its performance with that of state-of-the-art generative models. Throughout this paper, the terms ``anomaly detection'' and ``out-of-distribution (OOD) detection'' are used interchangeably.

%In the following sections we describe the KNN-distance based anomaly detection method, followed by the experimental setup and corresponding results to evaluate the best foundation model for our method, and compare the performance with state-of-the-art generative models. Note, through the paper, the terms anomaly detection and out-of-distribution (OOD) detection are interchangeability used.
% Implementation details for various models are provided in the appendix.




\section{Method}
We illustrate our approach in Figure \ref{fig:flowchart} and Algorithm \ref{algo}; it can be classified as a {distance}-based method. The method uses feature embeddings ($Z_{ind}$) extracted from healthy (training) tissue samples ($D_{in}$) with the feature extractor $ft$, thereby creating a feature space ($D^R$). The anomaly score for a test sample ($x_{test}$) is determined by its proximity to the in-distribution data within this feature space. A cell that closely resembles a healthy cell and lies in a high-density region of the in-distribution feature space receives a low anomaly score, while a cell that differs from the in-distribution data and lies in a low-density region is assigned a high score. The distance to the K-Nearest Neighbors serves as the scoring function. Specifically, we compute the average distance between the embedding of each test sample and its K-Nearest Neighbors in the in-distribution dataset.
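
As a sketch of this scoring function, let $Z_{test} = ft(x_{test})$ and let $\mathcal{N}_K(Z_{test}) \subset Z_{ind}$ denote the $K$ in-distribution embeddings closest to $Z_{test}$. The anomaly score can then be written as
\[
s(x_{test}) = \frac{1}{K} \sum_{z \in \mathcal{N}_K(Z_{test})} d(Z_{test}, z),
\]
where the symbols $s$, $\mathcal{N}_K$, and $d$ are notation introduced here for illustration, and $d(\cdot,\cdot)$ stands for a distance in the feature space (for example, Euclidean or cosine distance; the choice of metric is an assumption of this sketch).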

%We illustrate the approach in Figure \ref{fig:flowchart}, which can be categorized as a projection-based method. The method leverages feature embeddings ($Z_{ind}$) extracted for healthy (training) tissue samples ($D_{in}$), using the feature extractor $ft$, to form a feature space ($D^R$). A test sample ($x_{test}$) is scored based on its proximity to the in-distribution data in this feature space. A cell similar to a healthy cell, present in a high density region in indistribution feature space, would get a low anomaly score, whereas a cell distinct from the in-distribution that is present in a low density region would obtain a high score. Distance to K-Nearest Neighbors is used as the scoring function. Specifically, we compute the average of K-Nearest Neighbor distances between the embedding of each test sample and the in-distribution dataset. Algorithm \ref{algo} summarises the approach.

 
\begin{algorithm}[]
	\caption{Anomaly Detection Algorithm} 
	\begin{algorithmic}[1]
        \Statex \textbf{Input}: Normal (training) dataset $D_{in}$, pre-trained feature extractor $ft$, test sample $x_{test}$.
        \State For all $x \in D_{in}$, obtain the feature vector representations $Z_{ind}$ using $ft$.
        \State \textbf{Testing}: Given a test sample $x_{test}$, obtain its feature vector $Z_{test}$ using $ft$ and find its K nearest neighbors among $Z_{ind}$.
        \State Compute the average distance from $Z_{test}$ to these K nearest neighbors.
        \Statex \textbf{Output}: Anomaly score based on the KNN distance.
	\end{algorithmic} 
    \label{algo}
\end{algorithm}

We use feature extractors pre-trained on large-scale histopathology datasets using self-supervised learning. These foundation models have been shown to surpass feature extractors trained on supervised datasets when evaluated on KNN-distance based patch classification, nuclei instance segmentation, and image retrieval \cite{caron2021emerging, kang2023benchmarking}, which makes them well suited to our approach. In Section \ref{compare-foundation}, we compare various state-of-the-art foundation models for cellular anomaly detection.


Our proposed method offers two major benefits over existing methods:
\begin{enumerate}
    % \item \textbf{The method uses feature embeddings, without needing class probabilities}. Prior work \cite{wang2022vim, zingman2024learning, dippel2024ai} train a classifier 
    
    % Training a classifier requires supervised data for an auxiliary task which would be cumbersome, especially in cellular anomaly detection. Using feature extractors from a pre-trained model allows skipping the need for a supervised dataset for an auxiliary task.

    \item \textbf{No training required}. The method does not require training on normal in-distribution data. Feature extraction is performed with frozen weights followed by a nearest neighbor search to assign anomaly scores. This significantly reduces the resource requirement for model development. Also, additional in-distribution data can be added at no cost, without model re-training.

        
    \item \textbf{The model's performance can be enhanced by improvement in foundation models.} The method exploits feature embeddings to identify test samples that lie in low-density regions of the in-distribution dataset. Its performance therefore improves with features that better separate normal and anomalous samples, which allows us to benefit from foundation models trained on large-scale and diverse unlabeled datasets. 

    
    % \item \textbf{Better explainability}. The method uses distance to nearest neighbour(s) in the feature embedding space, these can be exploited to visualize the nearest neighbor for better explainability.   
    
\end{enumerate}




% \textbf{Notation}: We first define the notation for AD task. We assume to have a in-distribution dataset ${Data_{Ind}}$ which consists of normal cell samples and are in abundance. The goal of anomaly detection is to identify if an anomalous cell that is out-of-distribution from normal the ${D_{Ind}}$. To evaluate the algortihhm, we futher create a balanced test set consisting of both normal and anomolous cells - ${D_{test}}$. \\ \\
% \textbf{Method}: Given the ${Data_{Ind}}$, to evaluate AD performance on ${D_{test}}$ we follow the below method
% \begin{enumerate}
%     \item Given ${D_{train}}$ \& ${D_{test}}$ in high dimensional image space $R^{(WxHxD)}$, a feature extractor $Ft$ is used to obtain a feature representation of each sample $R^{D}$.
%     \item For $ s \in {D_{test}}$, calculate the distance to $K$ nearest neighbour in ${D_{train}}$. This serves as the AD score for $s$.  
% \end{enumerate} 
% The method scores a test sample based  $R^D$,  Thus, 

% 

% We find that K value of 10 gave the best AUC scores, evaluating between K=5 and k=25. Lastly, we compare the method, using the best performing $Ft$, with state-of-the-art Anomaly Detection methods.


% \begin{table}[]
% \centering
% \begin{tabular}{llcc}
% \hline
% \multicolumn{1}{c}{} &                & Liver      & Kidney               \\ \hline
% Train                & InDistribution & 1.8M       & 2.2M                          \\
% Test                 & InDistribution & 22993      & 5179              \\
% Test                 & Anomolous      & 23883      & 5070                          \\ \hline
% \end{tabular}
% \caption{The dataset used for performance evaluation}
% \label{table:data}
% \end{table}

% \begin{table}[]
% \centering
% \begin{tabular}{lcccc}
% \hline
% \multicolumn{1}{c}{} &                & Liver & Kidney & \# WSI              \\ \hline
% Train                & InDistribution & 1.8M  & 2.2M   & 4                   \\ \hline
% Test                 & InDistribution & 11496 & 10358  & \multirow{2}{*}{10} \\
% Test                 & Anomolous      & 11941 & 10140  &                     \\ \hline
% \end{tabular}
% \caption{The dataset used for performance evaluation}
% \label{table:data}
% \end{table}






% For training and testing, two sets of cellular annotations are generated by an expert pathologist. 


% are created for a few High Dosage animal tissue to evaluate the effectiveness of different anomaly detection methods.



% The study data is split into two parts 1) ${Data_{InD}}$ to create indistribution for normal data from Control group WSI 2) ${Data_{Test}}$, including Normal and Anomalous cells, from Control and High Dose group, based on pathologists annotations. Table \ref{table:data} lists the number of patches for available for each sub group. Eight WSI are used for the creation of the dataset, stratified between Train and Test to prevent data leakage.
% along with pathologists assessment. 


% Two levels of pathologist feedback is generated, cellular level annotations and WSi level grading. 


 


% Specifically, relevant for this work, 



% In this section, we explore different design options including feature extractor architecture and  pre-text task, followed by comparison with state-of-the-art Projection and Reconstruction based methods. Table \ref{table:ft_ext} and \ref{table:SOTA} summarize the results. 




\section{Experiments}
We aim to establish the best method for detecting cellular anomalies by comparing KNN-distance based anomaly detection with state-of-the-art methods. In this section, we first describe the dataset used for evaluation, then compare various state-of-the-art foundation models as feature extractors for the KNN-distance based method, and finally compare the KNN-distance based method with state-of-the-art methods for unsupervised cellular anomaly detection. {AUC is used as the performance metric, following prior work \cite{sun2022out, reiss2021panda, cai2024medianomaly, bao2024bmad, fort2021exploring, graham2023denoising}.
AUC does not depend on a specific threshold on the anomaly score and therefore provides a robust measure of model performance.} 
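
For reference, AUC admits a standard ranking interpretation, sketched here with notation introduced only for illustration: $s(\cdot)$ denotes the anomaly score, and $x_a$ and $x_n$ denote randomly drawn anomalous and normal test samples, respectively.
\begin{equation}
\mathrm{AUC} = P\left(s(x_a) > s(x_n)\right) + \tfrac{1}{2}\, P\left(s(x_a) = s(x_n)\right).
\end{equation}
An AUC of 100\% therefore means that every anomalous sample scores above every normal one, while 50\% corresponds to chance-level ranking.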

\subsection{Dataset}
The dataset used to evaluate cellular anomaly detection comes from a toxicological histopathology study, consisting of WSI of liver and kidney tissue from control and high-dose Wistar rats. Specifically, the study consists of 14 tissue samples from the control group and 10 samples from the drug-treated group, for both kidney and liver tissue.


For model evaluation, we created training and testing cell patch datasets. The training data is created by annotating cells in fields-of-view from control WSI that contain only normal cells. This gives us a large pool of in-distribution data with near-zero abnormal cells. {The test set is created by annotating cells in fields-of-view from WSI of the dosed animal group.} Anomalous cell annotations include, for liver: Single Cell Necrosis, Mitosis, Extramedullary Hematopoiesis \& Microgranuloma; and for kidney: Neutrophils \& Medullary Nephrocalcinosis. Table \ref{table:data} gives an overview of the dataset. {A comparable number of normal cells is obtained from the dosed animal group.} For each annotated cell, a crop of size 64x64px centered on the cell is extracted at 40x magnification. 


\begin{table}[!b]
\centering
\begin{adjustbox}{width=1\textwidth}
\small
\begin{tabular}{cccllcccc}
\cline{1-3} \cline{5-9}
                     & \textbf{\# Liver WSI}       & \textbf{\# Kidney WSI}      &  & \multicolumn{1}{c}{} &                & Liver & Kidney & \# WSI              \\ \cline{1-3} \cline{5-9} 
Control              & 14                   & 14                   &  & Train                & In-Distribution & 1.8M  & 2.2M   & 4                   \\ \cline{5-9} 
Treated              & 10                   & 10                   &  & Test                 & Normal & 11496 & 10358  & {10} \\ \cline{1-3}
\multicolumn{1}{l}{} & \multicolumn{1}{l}{} & \multicolumn{1}{l}{} &  & Test                 & Anomalous      & 11941 & 10140  & {10}                     \\ \cline{5-9} 
\end{tabular}
\end{adjustbox}
\caption{The dataset used for performance evaluation. Left: The number of WSI in the toxicological study for each organ. Right: Number of patches and WSI used to create training and testing dataset.}
\label{table:data}
\end{table}


\subsection{Evaluating the best feature extractor}
\label{compare-foundation}
Since the performance of the KNN-distance based anomaly detection method depends strongly on the feature extractor's ability to separate normal and anomalous samples in the feature space, we first evaluate different foundation models \cite{chen2022scaling, kang2023benchmarking, wang2022transformer, filiot2023scaling, chen2024towards, zimmermann2024virchow, nechaev2024hibou, lu2024visual} that were trained using varying self-supervised learning techniques and diverse datasets.\footnote{Note that none of these foundation models were trained on cell patch data; they were trained on patches of size 224x224 extracted at different magnifications.} Table \ref{table:ft_ext} provides the results. 
% For the best performing feature extractors, we compare KNN using Euclidean vs Mahanalobis distance, and found the two distance function to give similar results.

{As expected, when comparing ResNet\cite{he2016deep} \& ViT\cite{dosovitskiy2020image} models pre-trained on ImageNet with those pre-trained on histopathology datasets, a significant gain in performance is observed with domain-specific pre-training.} Next, we compare the ResNet\cite{he2016deep} and ViT\cite{dosovitskiy2020image} architectures using weights optimized on the same histopathology pre-training dataset \cite{kang2023benchmarking}. The Vision Transformer outperforms ResNet, suggesting that transformers learn better features from large-scale pre-training datasets than ResNet, as shown in previous work \cite{caron2021emerging, kang2023benchmarking}.


We observe that the performance of the KNN-distance based anomaly detection model improves when using foundation models trained on data that includes samples at 40x magnification, the magnification at which the cell patch dataset is extracted. Performance also correlates with model size and the amount of pre-training data: larger models and larger pre-training datasets improve performance. Virchow2 \cite{zimmermann2024virchow}, which uses the largest amount of pre-training data with tiles extracted at multiple magnifications, is the best performing feature extractor for anomalous cell patch detection.  




% For Liver cell dataset best scores are obtained when using features from Lunit\cite{kang2023benchmarking}, where as in case of Kidney dataset UNI\cite{chen2024towards} obtained the best performance. Amongst different Self-Supervised Learning methods, it is observed that DINO\cite{caron2021emerging} outperforms. 

\begin{table*}[]
\centering
\begin{adjustbox}{width=1\textwidth}
\small
\begin{tabular}{cccc|ccc}
\hline
\textbf{Method}                             & \textbf{Model}      &  \textbf{\#WSI}    & \textbf{Magnification} & \textbf{Liver}     & \textbf{Kidney} & \textbf{Mean}\\ \hline
DINO\cite{caron2021emerging}          & ResNet-50                 & NA           & NA    & 85.06              & 53.02	 & 69.04  \\
DINO\cite{caron2021emerging}          & ViT-S                 & NA           & NA    & 82.90              & 51.58   & 67.24  \\
DINO\cite{caron2021emerging}         & ViT-B                & NA           & NA    & 85.81              & 52.11    & 68.96  \\
HIPT\cite{chen2022scaling}                 & ViT-S                 &  11K         & 20x  & 91.95              & 67.66    & 79.80\\
Lunit\cite{kang2023benchmarking}         & ResNet-50             &  21K         & 20x,40x  & 85.54              & 62.59    &  74.06\\
%Lunit-DINO\cite{kang2023benchmarking}        & ViT-S (8x8)           &  21K           & \underline{95.36}  & \underline{88.12}      \\
Lunit\cite{kang2023benchmarking}        & ViT-S                 &  21K           &  20x,40x & \underline{95.50}  & {86.71}    & {91.10}\\
%prov-gigapath    & ViT-G                &  170K WSI         & 84.71  &          \\
CPath\cite{wang2022transformer}                             & ViT-S                 &  32K          &   20x    & 89.18            & 84.54  &  86.86\\
Phikon\cite{filiot2023scaling}           & ViT-B                 &  6.1K        &  20x   & 92.90             & 82.17  &  87.53\\
CONCH\cite{lu2024visual}                 & ViT-B                    & 1.1M*              &   20x   & 94.25    & \underline{91.71}  & \underline{92.98}   \\ 
UNI\cite{chen2024towards}                & ViT-L                   &  100K        & 20x   & {93.79}             & {88.22}  &   {91.00}\\
Virchow2\cite{zimmermann2024virchow}                      & ViT-H                    & 3.1M            &  5x,10x,20x,40x     & \textbf{96.97}    & \textbf{91.83}   & \textbf{94.40}  \\ 
% PhikonV2\cite{}                      & ViT-L                    & M                    & 89.33    & 87.52     \\ 
% Hibou-b\cite{nechaev2024hibou}                      & ViT-B                    & 1.2M          & DINOv2         & 85.38    & 73.76     \\ 
% Hibou-L\cite{}                      & ViT-L                    & M                    & 83.03    & 76.01     \\ 


\hline
\end{tabular}
\end{adjustbox}
\caption{Performance comparison of different feature extractors pre-trained on large-scale unlabeled histopathology datasets for cellular anomaly detection using the KNN-distance based method. The table reports the model architecture, the number of WSI used for pre-training, the magnification of the patches used for pre-training, and the AUC on anomaly scores for the liver and kidney {cell patch datasets}. The top two scores are highlighted in bold and underlined, respectively. *CONCH uses 1.1M image-text pairs. }
\label{table:ft_ext}
\end{table*}


\begin{table}[]
\centering
\begin{tabular}{l|ccc}
\hline
                     & Liver & Kidney & Mean\\ \hline
f-AnoGAN\cite{schlegl2019f}                 &   93.23    & 85.90 & 89.56\\
%Cut Paste            &       & \\
AutoDDPM\cite{bercea2023mask}           &   82.03    & 75.03  & 78.53\\
% UNet-DAE           &        &  & \\
PANDAS\cite{reiss2021panda}               &   91.74	  & 72.30 &  82.02\\ 
KNN with Virchow2 (ours)         &   \textbf{96.97}    & \textbf{91.83} & \textbf{94.40}\\ \hline
\end{tabular}
\caption{Performance comparison of anomaly detection methods on the liver and kidney cell {patch datasets, comparing the KNN-distance based method with state-of-the-art methods.} The table reports AUC on anomaly scores.}
\label{table:SOTA}
\end{table}

% \begin{table*}[]
% \centering
% \begin{adjustbox}{width=1\textwidth}
% \small
% \begin{tabular}{ccccc|ccc}
% \hline
% \textbf{Method}                             & \textbf{Model}      &  \textbf{\#WSI}   &\textbf{SSL} & \textbf{Mag.} & \textbf{Liver}     & \textbf{Kidney} & \textbf{Mean}\\ \hline
% ImageNet\cite{caron2021emerging}          & ViT-S                 & NA         &DINO   & NA    & 82.90              & 51.58   & 67.24  \\
% HIPT\cite{chen2022scaling}                 & ViT-S                 &  11K      &DINO   & 20x  & 91.95              & 67.66    & 79.80\\
% Lunit\cite{kang2023benchmarking}         & ResNet-50             &  21K       &DINO  & 20x,40x  & 85.54              & 62.59    &  74.06\\
% %Lunit-DINO\cite{kang2023benchmarking}        & ViT-S (8x8)           &  21K           & \underline{95.36}  & \underline{88.12}      \\
% Lunit\cite{kang2023benchmarking}        & ViT-S                 &  21K        &DINO   &  20x,40x & \underline{95.50}  & {86.71}    & {91.10}\\
% %prov-gigapath    & ViT-G                &  170K WSI         & 84.71  &          \\
% CPath\cite{wang2022transformer}                             & ViT-S                 &  32K      &MocoV3    &   20x    & 89.18            & 84.54  &  86.86\\
% Phikon\cite{filiot2023scaling}           & ViT-B                 &  6.1K      &iBOT   &  20x   & 92.90             & 82.17  &  87.53\\
% CONCH\cite{lu2024visual}                 & ViT-B                    & 1.1M*           & CoCa   &   20x   & 94.25    & \underline{91.71}  & \underline{92.98}   \\ 
% UNI\cite{chen2024towards}                & ViT-L                   &  100K      &DINOv2    & 20x   & {93.79}             & {88.22}  &   {91.00}\\
% Virchow2\cite{zimmermann2024virchow}                      & ViT-H                    & 3.1M           &   DINOv2 &  5x,10x,20x,40x     & \textbf{96.97}    & \textbf{91.83}   & \textbf{94.4}  \\ 
% % PhikonV2\cite{}                      & ViT-L                    & M                    & 89.33    & 87.52     \\ 
% % Hibou-b\cite{nechaev2024hibou}                      & ViT-B                    & 1.2M          & DINOv2         & 85.38    & 73.76     \\ 
% % Hibou-L\cite{}                      & ViT-L                    & M                    & 83.03    & 76.01     \\ 


% \hline
% \end{tabular}
% \end{adjustbox}{
% \caption{Performance comparison of different feature extractor pre-trained on large-scale unsupervised histopathology dataset for cellular anomaly detection using KNN based method. The table report AUC on anomaly scores of liver and kidney cellular dataset. Top two scores are highlighted in Bold and Underlined. }
% \label{table:ft_ext}
% \end{table*}

% \textbf{We also compare foundation models with feature extractor with feature from CellVit.}

%AUC based on the anomaly scores for the test dataset including normal and anomalous patches, are used as the evaluation metric. 

\subsection{Comparing with state-of-the-art methods}
Next, we compare the KNN-distance based method with state-of-the-art models. Based on previous work \cite{cai2024medianomaly, bao2024bmad}, we identify three of the best performing models: f-AnoGAN\cite{schlegl2019f}, AutoDDPM\cite{bercea2023mask} and PANDAS\cite{reiss2021panda}.
{f-AnoGAN trains a WGAN and an additional encoder using the output of the WGAN. Its combined anomaly score is computed from a discriminator feature residual error and an image reconstruction error. AutoDDPM consists of three stages, i.e., masking, stitching, and re-sampling. A diffusion process is used to generate an initial likelihood map of potential anomalies, which are then stitched with the original image, followed by re-sampling from the joint noised distribution.
PANDAS, in contrast, fine-tunes a pre-trained feature extractor using a compactness loss.}

Implementation details for each method are provided in Appendix \ref{Appendix:Details}. Table \ref{table:SOTA} provides the AUC scores for anomaly detection on liver and kidney tissue cell {patches}. Our approach, which uses a self-supervised pre-trained feature extractor and a KNN-distance based scoring function, outperforms the other methods for both tissue types. Interestingly, the KNN method with any of the four best-performing feature extractors achieves a higher AUC than all three comparison methods. Pre-training on a large-scale dataset enabled our method to achieve superior performance.

PANDAS adapts ResNet\cite{he2016deep} weights, trained on supervised ImageNet data, to the anomaly detection task using a compactness loss and an elastic weight consolidation loss. The AUC score for this method is higher than that of some of the feature extractors used with the KNN-distance based method, as seen in Table \ref{table:ft_ext}. Specifically, the KNN-distance based method with SSL weights pre-trained on ImageNet significantly underperforms PANDAS. However, fine-tuning on an in-distribution dataset falls short of feature extractors trained on larger and more diverse histopathology datasets.

Figure \ref{fig:reconstruction}, in the appendix, provides reconstructions of normal and anomalous patches for f-AnoGAN and AutoDDPM. We observe that AutoDDPM is able to reconstruct anomalous images with low error, which reduces its ability to identify these images. We believe this can be attributed to the subtle variation between normal and anomalous samples, which allows the model to achieve a low reconstruction loss for both image categories. f-AnoGAN\cite{schlegl2019f}, on the other hand, produces lower quality reconstructions for anomalous samples and thereby achieves higher scores than AutoDDPM. 
% Appendix section \ref{section:class-wise} also provides class-wise AUC for three liver classes and one kidney class.

% Prior work\cite{linmans2024diffusion,cai2024medianomaly} found diffusion models to outperform distance-based, however, the feature extractor was either trained using reconstruction loss \cite{linmans2024diffusion} on in-distribution dataset or pre-trained on Image-Net dataset\it, unlike our approach of using foundation models.

% We identify best performing models based on the prior works \cite{cai2024medianomaly, bao2024bmad, linmans2024diffusion}. In projection-based methods, PANDA\cite{reiss2021panda} was a top perform, along with 




% We compare Cell-UAD with f-AnoGAN\cite{schlegl2019f} and PANDAS\cite{reiss2021panda}, that have shown to obtain best results for medical anomaly benchmarks \cite{cai2024medianomaly,bao2024bmad}. 


% PANDAS\cite{reiss2021panda} fine-tunes Imagenet pretrained model on in-distribution data, however, 

% However, both methods fall short of the Cell-UAD using . 

% PANDAS adapts the ImageNet feature to the AD task using Compactness loss and Early Stopping. We believe that pre-training on a large-scale dataset and the use of Vision Transform enabled Cell-UAD to outperform. On the other hand,



% \cite{} KKn as shown promising results, which is validated firtyher for histopathology\usepackage{}

% Projection method, using distance to Nearest Neighbors from training data has recently shown promising results \cite{sun2022out} 

% \textbf{even though foundation models were not trained on cellular data, our approach is able to out perform the SOTA diffusion techniques}

% Finally, to evaluate the utility of cellular anomaly detection for tissue triaging 
% we compare anomalous cell count as a percentage of total cell count for Control and Treated WSI as described in section.Specifically, 10 images from Control and Treated group each are used for the tissue triaging, excluding ones used for creating training data to avoid data data leakage.
%  For this, cells are detected using pretrained Cell-ViT \cite{horst2024cellvit} model, and a crop of size 64x64 is taken for all the cells. 
 
% We also evaluate the effectiveness of the cellular anomaly detection algorithm for detecting drug-induced cellular changes in liver and kidney tissue, using the all WSI in the study except ones used for creating the training dataset. Specifically, 10 images from Control and Treated group each are used for the tissue triaging, excluding ones used for creating training data to avoid data data leakage. 

% For our KNN based method, the effectiveness of identifying anomalies depends upon the ability of the feature extractor to segregate normal and anomalous samples in low and high-density regions formed by the in-distribution samples.  











% identified 
%  as the leading generative methods, while 
% based method described in 

% Features are extracted from the final layer.



% e evaluate the above method with different feature extractors trained using self-supervised learning on diverse large-scale histopathology dataset. \textbf{TODO}





% \begin{figure}
% 	\centering
% 	\begin{subfigure}{0.45\linewidth}
% 		\includegraphics[width=\linewidth]{images/liver_study.png}
% 		\caption{Liver}
% 		\label{fig:subfigA}
% 	\end{subfigure}
% 	\begin{subfigure}{0.45\linewidth}
% 		\includegraphics[width=\linewidth]{images/kidney_study.png}
% 		\caption{Kidney}
% 		\label{fig:subfigB}
% 	\end{subfigure}
% 	\caption{Distribution of Percentage Anomalous Cells in Control and test WSI. }
% 	\label{fig:study}
% \end{figure}

% \vspace{-3mm}

\begin{figure}[]
        \centering
        \includegraphics[scale=0.075]{images/MIDL_paper.drawio.png}
	\caption{Evaluation of the KNN-distance based unsupervised cellular anomaly detection on a toxicology study, to detect changes in anomalous cell count on administration of a test drug. The figure shows box plots of the percentage of anomalous cells in control and drug-treated tissue. {The difference in the percentage of anomalous cells between the two groups yields p-values of 1.5e-5 \& 0.0147 for liver and kidney, respectively. Thus, a larger proportion of cells in the treated group have a higher anomaly score than in the control group.} }
	\label{fig:box_plots}
\end{figure}


\section{Evaluation on Toxicological Study}

% We also evaluate the effectiveness of the cellular anomaly detection algorithm for detecting drug-induced changes in cellular distribution in liver and kidney tissue, using all WSI in the study except ones used for creating the training dataset to avoid data data leakage. Specifically, 10 images from control and treated group each are used. For all the WSI, cells are detected using pretrained Cell-ViT \cite{horst2024cellvit} model and a crop of size 64x64 is taken centered at the cell. For these WSI, we also obtain pathologist assessment for the treated tissue. 


{We evaluate the capability of the KNN-distance based unsupervised anomaly detection method, with the best performing feature extractor, Virchow2\cite{zimmermann2024virchow}, to detect pathologically relevant changes in the tissue due to the administered drug. Annotating all anomalous cells across 20 WSI is not feasible due to the high cell count. Thus, instead of an AUC score, we analyze the cellular distribution for the toxicological study, comparing the number of abnormal cell patches in the control and drug-treated tissue.} A higher count of abnormal cells, such as single cell necrosis, mitosis, and microgranuloma, could indicate drug toxicity \cite{greaves2011histopathology}. We compare the anomalous cell count as a percentage of the total cell count, to account for tissue area. For both kidney and liver, 10 WSI each from the control and drug-treated groups are used; this excludes the four WSI used for creating the training data, to avoid data leakage. 

{A threshold based on the anomaly scores of the in-distribution data is used to classify a cell as abnormal. It is set to \( Q_3 + 1.5 \times \mathrm{IQR} \), where \( Q_3 \) is the third quartile of the in-distribution KNN distances and IQR is their interquartile range. Using \( Q_3 + 1.5 \times \mathrm{IQR} \) as the threshold allows the rejection of outliers from the in-distribution data. Further implementation details are provided in appendix section \ref{Impl_detail_study}.}
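
The decision rule above can be summarised as follows. This is a sketch in notation introduced here for illustration only: $\tau$ is the threshold, $s(\cdot)$ the anomaly score, $Q_1$ the first quartile of the in-distribution scores, and $N_w$ the number of detected cells $x_1, \dots, x_{N_w}$ in WSI $w$. The usual definition $\mathrm{IQR} = Q_3 - Q_1$ and a strict inequality at the threshold are assumptions:
\begin{equation}
\tau = Q_3 + 1.5 \times \mathrm{IQR}, \qquad \mathrm{IQR} = Q_3 - Q_1, \qquad
p_w = \frac{100}{N_w} \sum_{i=1}^{N_w} \mathbf{1}\left[ s(x_i) > \tau \right],
\end{equation}
where $p_w$ is the percentage of cells in WSI $w$ that are classified as abnormal.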

The box plots in Figure \ref{fig:box_plots} show the percentage of anomalous cells. A significant increase in the percentage of anomalous cells is observed in the drug-treated animal group. {P-values of 1.5e-5 \& 0.0147 were obtained for liver and kidney tissue, respectively. Thus, we can conclude that more cells in the treated group are far from the normal cellular representation in the feature embedding space, compared to the control group.} The compound was confirmed by a pathologist to induce toxicity in liver and kidney tissue, verifying the assessment made using cellular anomaly detection. Figure \ref{fig:qualitative} in the appendix provides examples of predictions by the unsupervised anomaly detection algorithm. 





% We use all the tissues in toxicological study, excluding the control WSI used for creating the training dataset, to detect changes in cellular distribution between control and treated tissue.  Figure \ref{fig:study} shows a box plot of  the percentage of detected anomalous cells in 10 control and 10 treated tissue WSI. 



% The drug 

%  Specific to cellular anomalies, treated tissue can have similar or a higher percentage of anomalous cells depending on the characteristics of the drug. That is, in case of increased drug-induced toxicity, a higher count of anomalous cells is expected in the treated tissue.  to evaluate the capability our cellular anomaly detection algorithm to detect changes in cellular distribution between control and treated tissue. 


% Cellular Anomaly Detection for Toxicological studies can help triage Treated Tissue WSI relative to Control WSI. Figure \ref{fig:study} shows the percentage of cell in 4 Control and 4 High dose WSI that were observed to be anomalous by pathologists, not used in the Train and Test data.  A separation is observed between the two sets of WSI, establishing the utility of Cellular Anomaly Detection for preclinical application.


\section{Conclusion}

We show that KNN-distance based unsupervised anomaly detection, using a vision transformer pre-trained on large-scale histopathology data as the feature extractor, achieves high AUC scores for cellular anomaly detection. 
By exploiting foundation models, the method outperforms state-of-the-art reconstruction-based methods. It also differentiates between control and drug-treated tissue based on the proportion of anomalous cells, which indicates drug toxicity. In the future, we plan to pre-train a feature extractor on large-scale cell {patch} data from multiple organs, to further improve model performance.
% In future work, we plan to train a self-supervised model on cellular data.


% \clearpage  % Acknowledgements, references, and appendix do not count toward the page limit (if any)
% % Acknowledgments---Will not appear in anonymized version
% \midlacknowledgments{We thank a bunch of people.}


\bibliography{midl25_145}
\newpage

\appendix

\section{Example predictions of cellular anomaly detection}

\begin{figure}[!h]
        \centering
        \includegraphics[scale=0.2]{images/MIDL_qualitative.drawio.png}
	\caption{{The figure shows example predictions of the cellular anomaly detection method, using the best performing feature extractor and the KNN-distance based anomaly score, for liver and kidney tissue. The cells predicted as anomalous are highlighted in green.} }
	\label{fig:qualitative}
\end{figure}


\section{Example cell reconstruction}

{Figure \ref{fig:reconstruction} provides sample reconstructions of normal and abnormal cells for both liver and kidney tissue, obtained from the AutoDDPM and f-AnoGAN methods. AutoDDPM reconstructs the image patches much better than f-AnoGAN, even for abnormal samples.}

\begin{figure}[!h]
        \centering
        \includegraphics[scale=0.3]{images/predictions_07_03.drawio.png}
	\caption{{The figure shows example reconstructions using f-AnoGAN \cite{schlegl2019f} and AutoDDPM\cite{bercea2023mask} for normal and abnormal cells. For liver, the abnormal cells consist of single cell necrosis (SCN), microgranuloma (MG) \& extramedullary hematopoiesis (EMH), from the 3rd to the 5th row; for kidney, all three abnormal cells are neutrophils.}}
	\label{fig:reconstruction}
\end{figure}


\section{Benchmarking Cellular anomaly detection based on anomaly type}
\label{section:class-wise}
{We created a class-labeled dataset to evaluate the class-wise performance of all foundation models with the KNN-distance based method, as well as the other state-of-the-art methods. For each class, 2000 cells were identified by the pathologist. The classes are, for liver: single cell necrosis (SCN), microgranuloma (MG) \& extramedullary hematopoiesis (EMH); and for kidney: neutrophils. Figure \ref{fig:class-scores} provides the class-wise results. }

{The KNN-distance based method with Virchow2 as the feature extractor obtains the highest scores for the SCN, MG and neutrophil classes. For the EMH class, however, f-AnoGAN achieves the highest AUC. The LUNIT, CONCH and UNI feature extractors also obtain high class-wise scores. }

\begin{figure}[]
        \centering
        \includegraphics[scale=0.33]{images/class_scores.pdf}
	\caption{{The figure provides class-wise AUC scores for four anomalous cell types, comparing all feature extractors benchmarked for KNN-distance based method and three state-of-the-art methods - AutoDDPM, f-AnoGAN and PANDAS. Virchow2 feature extractor obtains the highest overall results.}}
	\label{fig:class-scores}
\end{figure}

% \section{Evaluating sensitivity to threshold in }


\section{Implementation Details}
All models were trained and evaluated on NVIDIA RTX A4000 GPUs.

\label{Appendix:Details}

\subsection{KNN (Ours)}
The code is implemented in PyTorch and uses the Faiss library \cite{johnson2019billion} for the nearest-neighbor distance computation. The KNN-distance based anomaly score can be computed in two ways: as the mean distance to the $K$ nearest neighbors, or as the distance to the $K$th nearest neighbor. {With Virchow2 as the feature extractor, the AUC for liver drops to 96.66\% when using the distance to the $K$th nearest neighbor, compared to 96.97\% when using the mean distance.}
We also experimented with different values of $K$ (10, 25, 50, 100, 200, 500, 1000 and 2500) and found $K=200$ to give the best results, as seen in Figure \ref{fig:K-value}.
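
To make the two scoring variants concrete, they can be written as follows. This is an illustrative formalization rather than the exact implementation: the symbols $z$, $d_i(z)$, $s_{\mathrm{mean}}$ and $s_{K\mathrm{th}}$ are notation introduced here, and the distance is whichever metric the nearest-neighbor search uses, which is not restated in this section. Let $z$ be the feature vector of a test cell, and let $d_1(z) \le d_2(z) \le \dots \le d_K(z)$ be the distances from $z$ to its $K$ nearest neighbors among the training features. Then
\begin{align*}
s_{\mathrm{mean}}(z) &= \frac{1}{K} \sum_{i=1}^{K} d_i(z), &
s_{K\mathrm{th}}(z) &= d_K(z).
\end{align*}
Because the distances are sorted in ascending order, $s_{\mathrm{mean}}(z) \le s_{K\mathrm{th}}(z)$ for every $z$. The mean variant averages over the whole neighborhood, whereas the $K$th-neighbor variant depends on a single distance.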

\begin{figure}[]
        \centering
        \includegraphics[scale=0.33]{images/chart.pdf}
    \caption{{We evaluate the sensitivity of the KNN-distance based method to the number of neighbors ($K$) used to calculate the anomaly score. The figure shows the AUC for cellular anomaly detection with the Virchow2 \cite{zimmermann2024virchow} feature extractor for different values of $K$.}}
    \label{fig:K-value}
\end{figure}


\subsection{AutoDDPM}
 
We use the architecture and training procedure of \cite{bercea2023mask}, with the code provided at \url{https://github.com/ci-ber/autoDDPM/tree/main}. We use a 3-layer U-Net with [128, 256, 256] channels, one residual block per layer, and a single-headed attention block after each residual block with a corresponding spatial dimension of 2. The network takes three-channel images of size $64 \times 64$. The noise level is set to $t=200$ and the number of resampling steps to 5. We trained two separate models, one for the liver and one for the kidney cell dataset, as described in Table \ref{table:data}. Both models were trained for 200,000 iterations using the Adam optimizer and a cosine learning rate scheduler, with a maximum learning rate of 1e-4 and a batch size of 128.

\subsection{f-AnoGAN}
We train a Wasserstein GAN (WGAN) followed by an image-to-image (izi) mapping encoder, as described in \cite{schlegl2019f}, using the code available at \url{https://github.com/A03ki/f-AnoGAN}. The WGAN was trained for 20 epochs with a learning rate of 0.0002 using the Adam optimizer and a batch size of 32. The izi encoder was trained for 20 epochs with a learning rate of 0.0002 using the Adam optimizer and a batch size of 128. The anomaly score is a combination of the MSE between the original and reconstructed images and the MSE between the encoder mappings of the real and fake images, as provided by the GitHub repository.
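
Schematically, and with notation assumed here purely for illustration, a score of this kind has the form
\[
A(x) = \mathrm{MSE}\bigl(x, \hat{x}\bigr) + \kappa \, \mathrm{MSE}\bigl(\phi(x), \phi(\hat{x})\bigr),
\]
where $x$ is the input patch, $\hat{x}$ is its reconstruction, $\phi$ denotes the feature mapping applied to both images, and $\kappa$ is a weighting factor between the two terms. The symbols $\hat{x}$, $\phi$ and $\kappa$ are introduced here and are not defined elsewhere in this appendix; in particular, the value of $\kappa$ is not stated.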


% WGAN is trained on 2.1M kidney and 1.8M liver normal cell patches. 
% After the generator and discriminator training, another image to image mapping encoder is trained. The idea behind a izi encoder is to create mapping in the GAN’s Latent space. To train the izi mapping encoder, fake image and real image is passed through the encoder model and trained by reducing the MSE loss between the generated encoding between fake and real images.izi encoder is trained with the same normal images.  


\subsection{PANDAS}
We use the best-performing feature extractor, ResNet-152 pre-trained on the ImageNet dataset, and the training setup described by \cite{reiss2021panda}, using the code from \url{https://github.com/talreiss/PANDA}. The feature extractor is trained for 15 epochs on the training dataset described in Table \ref{table:data}, using a batch size of 1024 and a learning rate of 1e-2. We found $K=200$ to perform best. The anomaly score is the sum of the distances from each test feature to its $K$ nearest neighbors among the training features.
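
In symbols, using notation assumed here for illustration, let $d_1(z) \le \dots \le d_K(z)$ be the distances from a test feature $z$ to its $K$ nearest training features. The score is then
\[
s_{\mathrm{sum}}(z) = \sum_{i=1}^{K} d_i(z).
\]
For a fixed $K$, this is $K$ times the mean distance to the $K$ nearest neighbors, so the sum and the mean rank test samples identically and yield the same AUC.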



\subsection{Evaluation on Toxicological Study}
\label{Impl_detail_study}
For the analysis, cells are detected using a pretrained Cell-ViT \cite{horst2024cellvit} model, and a crop of size $64 \times 64$ px is taken around each cell at $40\times$ magnification. To identify anomalous samples in the test data, a threshold is applied to the anomaly score. The threshold is based on the anomaly scores of in-distribution data and is set to \( Q_3 + 1.5 \times \mathrm{IQR} \), where \( Q_3 \) is the third quartile and IQR is the interquartile range of these scores. Using \( Q_3 + 1.5 \times \mathrm{IQR} \) as the threshold allows outliers in the in-distribution data to be rejected. Figure \ref{fig:qualitative} provides example fields of view with anomalous cell predictions. For better visualization, a circular overlay is drawn around the centroid of each cell.
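
Written out, with notation assumed here for illustration, let $Q_1$ and $Q_3$ be the first and third quartiles of the anomaly scores $s$ computed on the in-distribution data. Then
\begin{align*}
\mathrm{IQR} &= Q_3 - Q_1, &
\tau &= Q_3 + 1.5 \times \mathrm{IQR},
\end{align*}
and a test cell $x$ is flagged as anomalous when $s(x) > \tau$ (the strict inequality is an assumption of this sketch). As a purely hypothetical numerical example, $Q_1 = 0.40$ and $Q_3 = 0.60$ give $\mathrm{IQR} = 0.20$ and $\tau = 0.60 + 1.5 \times 0.20 = 0.90$.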

\end{document}
