\documentclass{midl} % Include author names

% The following packages will be automatically loaded:
% jmlr, amsmath, amssymb, natbib, graphicx, url, algorithm2e
% ifoddpage, relsize and probably more
% make sure they are installed with your latex distribution


\usepackage{multirow}
\usepackage{graphicx}
\usepackage{rotating}
\usepackage{appendix}

\usepackage{mwe} % to get dummy images
\jmlrvolume{-- Under Review}
\jmlryear{2024}
% \jmlrworkshop{Full Paper -- MIDL 2024 submission}
% \editors{Under Review for MIDL 2024}


\jmlryear{2024}\jmlrworkshop{Full Paper -- MIDL 2024}\jmlrvolume{-- 211}\editors{Accepted for publication at MIDL 2024}
\title[MIRViT]{Analysis of Transformers for Medical Image Retrieval}

\midlauthor{\Name{Arvapalli Sai Susmitha} \Email{\lowercase{susmitha@cse.iitk.ac.in}}\\
\addr  IIT Kanpur, India
\AND
\Name{Vinay P. Namboodiri} \Email{\lowercase{vpn22@bath.ac.uk}}\\
\addr University of Bath, UK
}

\begin{document}
% \begin{sloppypar}

\maketitle
\vspace{-7mm}
\begin{abstract}

This paper investigates the application of transformers to medical image retrieval. Although various methods have been attempted in this domain, transformers have not been extensively explored. Leveraging vision transformers, we consider co-attention between image tokens. We investigate two main aspects: the analysis of various transformer architectures and parameters, and the evaluation of explanation techniques. Specifically, we employ contrastive learning to obtain attention-based image representations that account for the relationships between query and database images. Our experiments on diverse medical datasets, namely ISIC 2017, COVID-19 chest X-ray, and Kvasir, with multiple transformer architectures demonstrate superior performance compared to convolution-based methods and to transformers trained with cross-entropy loss. Further, we conduct a quantitative evaluation of various state-of-the-art explanation techniques using insertion and deletion metrics, in addition to basic qualitative assessments. Among these methods, Transformer Input Sampling (TIS) stands out, showing superior performance and enhancing interpretability, which distinguishes the resulting system from black-box models.
\end{abstract}

\begin{keywords}
Content-based Medical Image Retrieval, Vision Transformers, Deep Learning, Contrastive Learning, Explainable AI.
\end{keywords}
\vspace{-5mm}
\section{Introduction}

Over the past decade, the intersection of medical imaging and deep learning has witnessed significant advancements, addressing challenges in managing vast datasets, as highlighted by \cite{hwang2012medical}. Content-based medical image retrieval (MIR) has emerged as a crucial tool, aiding clinicians in recognizing related medical images and recalling prior cases during diagnosis \citep{agrawal2022content}.
Traditionally, MIR has relied heavily on convolutional neural networks (CNNs), as evidenced by \cite{shetty2023medical} and \cite{qayyum2017medical}. Despite their effectiveness in feature extraction and similarity identification, CNNs struggle to capture long-range dependencies and lack interpretability, the so-called ``black-box'' problem \citep{hu2022x}.

Our work addresses these challenges in medical image retrieval by adopting Vision Transformers (ViTs) \citep{dosovitskiy2021an}. ViTs, as explored by \cite{el2021training}, offer superior performance, excelling at capturing long-range dependencies and relationships between distant image regions through their multi-head attention mechanism \citep{zuo2022vision}. Inspired by the work of \cite{el2021training}, we experiment with Vision Transformers trained with contrastive learning and regularization. We compare them, under similar settings, with convolutional baselines trained with either cross-entropy or contrastive loss, and with Vision Transformers trained with cross-entropy loss. Our findings demonstrate the superiority of contrastively trained Vision Transformers over these baselines on all datasets. Recognizing the significance of model explainability in clinical applications, we apply various state-of-the-art eXplainable AI (XAI) techniques tailored to Vision Transformers. Beyond simple qualitative comparisons, we conduct quantitative evaluations of saliency maps using insertion and deletion metrics.\\
%\vspace{2mm}
\textbf{The primary contributions of our work are as follows:}
\begin{itemize}
    \item We provide a systematic analysis of vision transformer architectures (Section \ref{subsec:Evaluation of Architectures}) and hyperparameters/variants (Section \ref{subsec:Evaluation with parameters}). Our findings consistently highlight the superiority of transformers trained with contrastive loss over convolutional baselines. They also outperform vision transformers and convolutional baselines trained with cross-entropy loss across all datasets.
    \item We analyse various state-of-the-art explanation techniques for Vision Transformers (Section \ref{subsec:Explanation methods experimentation}) and quantitatively evaluate the resulting saliency maps using insertion and deletion metrics.
\end{itemize}
\vspace{-8mm}
\section{Related Works}
\vspace{-2mm}
Content-based medical image retrieval (MIR) plays a crucial role in enhancing diagnostic reliability for radiologists by retrieving pertinent medical cases that resemble a provided image. Early MIR methods relied on basic hand-crafted features and struggled to capture intricate relationships within medical images. The ``semantic gap'' between these features and the actual image content led to inaccurate retrieval \citep{xuan1995segmentation, zhang2008shape}. Convolutional neural networks (CNNs) revolutionized MIR and bridged the semantic gap by surpassing hand-crafted techniques \citep{sklan2015toward}.

In the context of the COVID, Kvasir, and ISIC datasets, \cite{tschandl2019diagnostic}, \cite{shetty2023medical}, and \cite{agrawal2022content} leveraged pre-trained CNN architectures such as ResNet, VGG, and DenseNet for medical image retrieval. Recently, \cite{ahmed2023content} proposed a novel relative difference-based similarity measure (RDBSM) for improved retrieval. Further, \cite{ozturk2023content} introduced an opponent class adaptive margin (OCAM) loss for S-bit hash code generation in image retrieval. The majority of published studies thus employ CNN-based architectures. However, the limitations of CNNs in capturing long-range dependencies prompted the exploration of vision transformers (ViTs) \citep{el2021training}. Inspired by the capabilities of ViTs in general computer vision, we employ them with contrastive loss and differential entropy regularization for the task of medical image retrieval.

A few recent studies have explored ViT-based approaches for medical image retrieval. \cite{trinh2021endoscopy} present a Mixer-MLP-based MIR method for endoscopic images. \cite{thakrar2023semantic} use a modified ViT for content-based image retrieval in chest X-rays, employing binary cross-entropy and L$_1$ losses for training. \cite{gupta2023medical} propose dense-link-search for efficient nearest-neighbour search in medical image retrieval. \cite{manzari2023medvit} introduce MedViT, a hybrid model combining ViTs and CNNs for medical image classification. However, these methods may lack interpretability, acting as black boxes in decision-making processes. \cite{hu2022x} address interpretability concerns with their X-MIR method for CNNs, employing deep metric learning and similarity-based saliency maps for visual explanations of retrieved images. In our work, we undertake a benchmarking study to analyse various architectures and loss functions and further analyse the interpretability of these methods.

Explaining Vision Transformers (ViTs) presents challenges, as traditional attention-weight-based methods designed for CNNs are inadequate due to the distinctive nature of ViTs, with their multiple attention heads and encoder blocks \citep{Serrano2019IsAI, stassin2023explainability}. To address this, various ViT-specific explainability methods have been proposed. Attention Rollout \citep{Abnar2020QuantifyingAF}, Chefer 2 \citep{chefer2021generic}, Transition Attention Maps (TAMs) \citep{yuan2021explaining}, Bidirectional Transformers (BT) \citep{chen2022beyond}, ViT-CX \citep{Xie2022ViTCXCE}, and Transformer Input Sampling (TIS) \citep{englebert2023explaining} each contribute to understanding ViT decision-making, providing attention-based, gradient-based, and perturbation-based explanations.
\vspace{-5mm}
\section{Method}
\vspace{-2mm}
\subsection{Transformers}

\begin{figure}[htbp]
 
\floatconts
  {fig:CBMIR}
  {\caption{Transformer model with contrastive loss and regularization for medical image retrieval on the COVID dataset}}
  {\includegraphics[width=0.9\linewidth]{Figures/CBMIR_new_v3_train_test_v1png.png}}
\end{figure}
\vspace{-5mm}
The vision transformer (ViT) of \cite{dosovitskiy2021an} tokenizes an input image by splitting it into a sequence of 2D patches (e.g., $16 \times 16$ pixels). These patches undergo a learnable linear projection, resulting in token embeddings. A special learnable [CLS] token is prepended to the sequence and serves as a global representation. The transformer encoder consists of a stack of layers, each composed of two sub-layers: a multi-head self-attention (MSA) layer and a multi-layer perceptron (MLP) layer. The resulting global representation, taken from the [CLS] token output, is used for subsequent processing.
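
As a worked example, assuming non-overlapping patches and a single [CLS] token (i.e., no additional special tokens such as a distillation token), the $224 \times 224$ inputs and $16 \times 16$ patches used by the model variants described later in this section yield
\[ \left( \frac{224}{16} \right)^{2} = 14^{2} = 196 \]
patch tokens, so that, together with the [CLS] token, the encoder processes a sequence of $197$ tokens.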

As shown in Figure \ref{fig:CBMIR}, we use transformers for content-based MIR. 

In the training phase, we start from a pre-trained Vision Transformer (ViT) and fine-tune it on each dataset with metric learning, specifically employing a contrastive loss. Through this process, the model learns feature embeddings, i.e., the output [CLS] token acts as the global image descriptor. Training maps similar images close together in a common feature space, and cosine similarity is used to retrieve the closest images. A cross-batch memory \citep{wang2020understanding} is used along with differential entropy regularization. The cross-batch memory reduces the dependency on the mini-batch for obtaining informative negative pairs, so that we can increase the number of hard negatives without incurring significant computational overhead. In the testing phase, we use the trained ViT model to generate feature embeddings for each query image in a separate test dataset. These embeddings are then used to rank database images, enabling efficient medical image retrieval. Finally, performance metrics such as precision, recall, and F1 score are calculated to assess the efficacy of the retrieval process.
% All these models are pre-trained on ImageNet1k and then fine-tuned for each dataset with metric learning, in particular with a contrastive loss.
 % This process has been observed to work well for natural images and we investigate its applicability for medical images. % that favors uniformity over the representation space. 

In our study, we explore three model variants: MIRDeiT\_small, MIRViT\_small, and MIRViT\_base. MIRDeiT\_small adopts the DeiT\_small architecture \citep{touvron2021training} (embedding length 384) with a $16 \times 16$ patch size and an image size of $224 \times 224$ pixels. Similarly, MIRViT\_small follows the ViT\_small architecture \citep{dosovitskiy2021an} (embedding length 384) with matching patch and image dimensions. In contrast, MIRViT\_base is based on the larger ViT\_base architecture (embedding length 768).
% All these models and the CNN models used for comparison are pre-trained and imported from the timm PyTorch library.
All these models and the CNN models used for comparison are imported from the timm PyTorch library.
The total loss ($L$) used for training is a combination of the contrastive loss and the differential entropy regularization: \( L = L_{\text{contr}} + \lambda L_{\text{KoLeo}} \).
 
The contrastive loss (\(L_{\text{contr}}\)) encourages similarity among samples with the same label and dissimilarity among samples with different labels. Mathematically, it is expressed as:
\[ L_{\text{contr}} = \frac{1}{N} \sum_{i=1}^{N} \left[ \sum_{j:y_i=y_j} \left[ 1 - z_i^T z_j \right] + \sum_{j:y_i \neq y_j} \left[ z_i^T z_j - \beta \right]_{+} \right] \]
Here, \(z_i\) is the \(\ell_2\)-normalized embedding vector of sample \(i\), so that the inner product \(z_i^T z_j\) equals the cosine similarity; \(N\) is the total number of samples, \(y_i\) is the label of sample \(i\), and \([\cdot]_{+} = \max(0, \cdot)\). The margin \(\beta\) prevents the training signal from being dominated by easy negatives: only negative pairs with a similarity higher than \(\beta\) contribute to the loss.
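
As an illustrative calculation (the numbers are hypothetical and not taken from our experiments), consider an anchor \(z_i\) with one positive of cosine similarity \(0.9\) and two negatives of cosine similarity \(0.7\) and \(0.2\), with margin \(\beta = 0.5\). Since only negatives whose similarity exceeds \(\beta\) contribute, the anchor's term in the loss is
\[ (1 - 0.9) + \max(0,\, 0.7 - 0.5) + \max(0,\, 0.2 - 0.5) = 0.1 + 0.2 + 0 = 0.3 . \]
The easy negative (similarity \(0.2\)) adds nothing, so the training signal comes from the positive pair and the hard negative.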

Simultaneously, $L_{\text{KoLeo}}$ is the differential entropy loss. It serves as a regularizer \citep{Sablayrolles2018SpreadingVF} and is based on the differential entropy estimator of \cite{kozachenko1987sample}. It prevents the representations of different samples from collapsing together by increasing the distance of each sample from its positive examples and hard negatives.
Mathematically, minimizing $L_{\text{KoLeo}}$ pushes each point away from its nearest neighbor:
\vspace{-1mm}
\[ L_{\text{KoLeo}} = -\frac{1}{N}\sum_{i=1}^{N}\log(\rho_i) \]
Here, $\rho_i = \min_{j \neq i} \| z_i - z_j \|$ is the minimum distance between the embedding vector $z_i$ and any other embedding vector $z_j$. The regularization term is weighted by the coefficient $\lambda$.
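
As an illustrative calculation (the values are hypothetical, and we assume Euclidean distances, $\rho_i = \min_{j \neq i} \| z_i - z_j \|_2$, and natural logarithms), consider $N = 3$ embeddings with nearest-neighbor distances $\rho_1 = \rho_2 = 0.5$ and $\rho_3 = 1$:
\[ L_{\text{KoLeo}} = -\frac{1}{3} \left( \log 0.5 + \log 0.5 + \log 1 \right) \approx -\frac{1}{3} \left( -0.693 - 0.693 + 0 \right) \approx 0.462 . \]
If the two closest embeddings are pushed apart until $\rho_1 = \rho_2 = 1$, the loss falls to $0$, which shows how the regularizer rewards spreading the embeddings over the representation space.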

During testing, each image in the test dataset, as depicted in Figure \ref{fig:CBMIR}, acts as a query, with the top \(k\) retrievals extracted for each query. The learned embedding network processes input images, denoted as \(x\), producing embedding feature vectors \(\mathbf{z}(x)\). This process applies to both query images (\(q\)) and retrieved images (\(r\)), resulting in feature vectors \(\mathbf{z}_q\) and \(\mathbf{z}_r\), respectively. To rank the retrieved images, a similarity score \(s\) between \(\mathbf{z}_q\) and \(\mathbf{z}_r\) is computed using cosine similarity: \( s(\mathbf{z}_q, \mathbf{z}_r) = \frac{\mathbf{z}_q \cdot \mathbf{z}_r}{\|\mathbf{z}_q\| \, \|\mathbf{z}_r\|} \).
%This similarity score \(s\) quantifies the likeness of each retrieved image to the query image. 
Standard image retrieval metrics, namely mean average precision (mAP), mean precision at $K$ (mP@K), and recall at $K$ (R@K), are calculated for evaluation.
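
For concreteness, these metrics can be sketched as follows, under the convention common in deep metric learning; the symbols $Q$ (the set of queries) and $\mathrm{rel}_q(k) \in \{0, 1\}$ (whether the $k$-th retrieval for query $q$ shares its label) are introduced here for illustration only:
\[ \text{mP@}K = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{K} \sum_{k=1}^{K} \mathrm{rel}_q(k), \qquad \text{R@}K = \frac{1}{|Q|} \sum_{q \in Q} \max_{1 \le k \le K} \mathrm{rel}_q(k). \]
For instance, a query whose top-5 retrievals have relevance $(1, 0, 1, 1, 0)$ contributes a precision of $3/5 = 0.6$ to mP@5 and a value of $1$ to R@5. Under this convention, R@1 and mP@1 coincide by definition.
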
\vspace{-4mm}
\subsection{Explainability Methods}
% Why raw attention in transformers is not satisfactory for attention - with a citation

% 
While vision transformers (ViTs) employ attention mechanisms, relying solely on raw attention is considered insufficient for comprehensive explanations: raw attention reflects only the query and key elements and overlooks the value component \citep{Jain2019AttentionIN, Serrano2019IsAI}. This has led to the development of methods specifically designed for ViTs. For instance, Attention Rollout \citep{Abnar2020QuantifyingAF} combines attention heads and adds an identity matrix to account for residual connections. Chefer 2 \citep{chefer2021generic} offers a generic explanation method for transformers, using gradients and identity matrices for attention score computation. TAMs \citep{yuan2021explaining} model the evolution of representations as a Markov chain, yielding class-specific explanations with integrated gradients. Bidirectional Transformer (BT) \citep{chen2022beyond} involves element-wise multiplication of Reasoning Feedback and Attention Perception, providing saliency maps in token-wise (BT-T) and head-wise (BT-H) variants. ViT-CX \citep{Xie2022ViTCXCE} avoids direct dependence on attention weights, using perturbation masks derived from patch embeddings. Transformer Input Sampling (TIS) \citep{englebert2023explaining} instead masks tokens before they are fed into the transformer, improving interpretability and reducing the number of tokens. We compare several of the above explainability methods to obtain explainable medical image retrieval and report the results in Section \ref{subsec:Explanation methods experimentation}. % we provide an analysis and comparison of the various explainability methods for the medical image retrieval task.



\vspace{-5mm}
\section{Results \& Discussions}
\vspace{-2mm}
\subsection{Datasets}
 %https://www.kaggle.com/datasets/francismon/curated-covid19-chest-xray-dataset

In our study, we use three medical image datasets. The curated COVID-19 Chest X-Ray Dataset \citep{sait2020curated} includes 1,281 COVID-19 X-rays, 3,270 normal X-rays, and 4,657 pneumonia X-rays (viral and bacterial combined); we do not differentiate between viral and bacterial pneumonia. The ISIC skin lesion dataset \citep{codella2018skin} comprises 2,750 images of benign nevi and seborrheic keratosis (both non-cancerous) and melanoma (malignant). The Kvasir-V2 dataset \citep{pogorelov2017Kvasir} contains 8,000 annotated endoscopy images, categorized into eight classes by experienced endoscopists based on anatomical landmarks, pathological findings, or specific endoscopic procedures.





% In our comprehensive study, we leverage three distinct datasets to explore diverse aspects of medical image analysis. The first dataset, the publicly accessible Curated COVID-19 Chest X-Ray Dataset \cite{sait2020curated}. The dataset, encompassing 1281 COVID-19 X-rays, 3270 Normal X-rays, and a merged category of 4657 pneumonia X-rays (combining viral and bacterial pneumonia cases)
% Our emphasis is on the overall classification of pneumonia without differentiation between bacterial and viral origins.

% Moving on to the ISIC Skin Lesion Dataset, specifically, the 2017 ISIC dataset tailored for skin lesion classification \cite{codella2018skin}, it encompasses images classified into benign nevi, seborrheic keratosis, and melanoma. Benign nevi and seborrheic keratosis represent non-cancerous conditions, while melanoma signifies malignant skin cancer. This dataset totals 2,750 images, with 2,000 for training, 150 for validation, and 600 for testing.

% Additionally, we utilized the Kvasir-V2 dataset \cite{pogorelov2017Kvasir}, a collection of 8,000 annotated endoscopy images, distributed evenly with 1,000 images per class. Verified by experienced endoscopists, the annotations categorize images into eight classes: dyed-lifted-polyps, dyed-resection-margins, esophagitis, normal-cecum, normal-pylorus, normal-z-line, polyps, and ulcerative-colitis. These classifications are based on anatomical landmarks, pathological findings, or specific endoscopic procedures.
\vspace{-5mm}
\subsection{Evaluation of Architectures} \label{subsec:Evaluation of Architectures}
We adopt the training details of \cite{el2021training}: the models are optimized with AdamW using a learning rate of $3 \times 10^{-5}$ and a weight decay of $5 \times 10^{-4}$ for 10k iterations. The contrastive loss margin ($\beta$) is set to 0.5. We train both without regularization ($\lambda = 0$) and with differential entropy regularization ($\lambda = 0.3$ and $\lambda = 0.7$). % to explore the impact of regularization on model performance. 
Standard data augmentation techniques are applied, including resizing images to $256 \times 256$, random cropping to $224 \times 224$, and random horizontal flipping. The size of the dynamic offline memory queue matches the dataset size. For the cross-entropy baselines, the same optimizer and iteration settings are used with a standard cross-entropy classification loss.

In our evaluation across the ISIC, COVID, and Kvasir datasets, detailed in Table \ref{tab:Combined_results}, we tested traditional CNNs (Densenet121, Resnet50), various vision transformers (DeiT\_small, ViT\_small, MedViT), and the MIRViT variants. MIRViT\_small achieves the highest mAP, R@1, and mP@K on all three datasets, outperforming the CNNs, the other vision transformers, and even MedViT (a CNN-Transformer hybrid), although some baselines reach higher R@5 and R@10.

To keep Table \ref{tab:Combined_results} concise, each row reports a single value of $\lambda$. For ISIC, MIRViT\_small uses $\lambda=0.3$ and MIRDeiT\_small uses $\lambda=0$. For Kvasir and COVID, MIRViT\_small uses $\lambda=0.7$, while MIRDeiT\_small uses $\lambda=0.3$ and $\lambda=0.7$, respectively. The results are based on three experimental runs. The complete results for all model combinations are provided in Figure \ref{fig:Model_variants}.
% \vspace{-6mm}
%========================

\begin{table}[ht]
\centering

% \small 
% \tiny
% \footnotesize
\scriptsize
% \setlength(\tabcolsep)(-2pt)
% \resizebox{\textwidth}{2}{%
\caption{Medical image retrieval results on the ISIC, COVID, and Kvasir datasets. The best mAP per dataset is shown in bold.}
\label{tab:Combined_results}
\begin{tabular}{|c|p{18.7mm}|p{9mm}|c|c|c|c|c|c|c|}
\hline
\bfseries Dataset & \bfseries Model  & \bfseries Loss & \bfseries R@1 & \bfseries R@5& \bfseries R@10 & \bfseries mAP & \bfseries mP@1 & \bfseries mP@5 & \bfseries mP@10 \\
\hline
\multirow{9}{0.7cm}{ISIC} & {Densenet121}  & \multirow{5}{1cm}{\begin{minipage}{6mm}Contra\\stive\end{minipage}} & 64.50 & 92.50 & 95.67 & 58.38$\pm$0.01 & 64.50 & 62.37 & 62.57 \\
\cline{4-10}
&  Resnet50 & & 65.33 & 92.33 & 97.00  & 57.72$\pm$0.02 &  65.33 & 65.13 & 63.90 \\
\cline{4-10}
&  MIRViT\textunderscore small & & 74.17 & 88.17 & 91.33 & {\textbf{70.96$\pm$0.01}} & 74.17 & 73.80 & 73.72 \\
\cline{4-10}
&  MIRDeiT\textunderscore small &  & 71.83 & 89.50 &  94.67 & {68.44$\pm$0.02} &  71.83 & 71.07 & 71.55 \\
\cline{4-10}
&  MedViT\textunderscore S & & 59.00  & 90.00 &  97.17 & {51.10$\pm$0.01} & 59.00  & 55.93 &54.65 \\
\cline{2-10}

&  DeiT\textunderscore small & \multirow{4}{1cm}{Cross Entropy} & 71.33 &90.50  & 95.17  & {63.32$\pm$0.02} &  71.33 & 70.87 & 70.10 \\
\cline{4-10}
&  ViT\textunderscore small & & 69.00 & 90.50 & 95.83 & {60.11$\pm$0.0} & 69.00 &   65.67 & 64.57 \\
\cline{4-10}
&  Densenet121 & & 60.50 & 91.00  & 96.17 & {58.80$\pm$0.01} &  60.50 & 60.87& 60.18 \\
\cline{4-10}
&  Resnet50 & & 67.33 & 90.50 &  97.17  & {53.99$\pm$0.01} &  67.33 & 64.73 &  63.15 \\
\hline
% \vspace{3mm}
\hline
\multirow{9}{0.85cm}{COVID} & {Densenet121}  & \multirow{5}{1cm}{\begin{minipage}{6mm}Contra\\stive\end{minipage}} & 96.20 & 98.80 & 99.19 & {94.62$\pm$0.01} & 96.20 & 95.87& 95.73 \\
\cline{4-10}
&  Resnet50 & & 94.24 &98.53 &99.02 & {91.33$\pm$0.01} &  94.24 &93.86 &93.91 \\
\cline{4-10}
&  MIRViT\textunderscore small & & 97.72 &98.26& 98.53 & {\textbf{96.96$\pm$0.01}} & 97.72 &97.52 &97.51 \\
\cline{4-10}
&  MIRDeiT\textunderscore small &  & 96.80 & 98.37 & 98.80  & {96.48$\pm$0.02} &  96.80 &  96.74 & 96.65 \\
\cline{4-10}
&  MedViT\textunderscore L & & 89.95 & 98.04 & 98.75 & {83.29$\pm$0.01} & 89.95 & 89.57 & 89.24 \\
\cline{2-10}
&  DeiT\textunderscore small & \multirow{4}{1cm}{Cross Entropy} & 95.11 & 98.53 & 98.86  & {92.93$\pm$0.01} &  95.11 &94.89 & 94.65 \\
\cline{4-10}
&  ViT\textunderscore small & & 93.05 & 97.94 & 98.59 & {93.24$\pm$0.01} & 93.05 &93.44 &93.30 \\
\cline{4-10}
&  Densenet121 & & 87.72 & 97.07 & 98.59 & {80.38$\pm$0.01} &  87.72 & 87.19 & 86.69 \\
\cline{4-10}
&  Resnet50 & & 92.07 & 97.66 & 98.59  & {82.35$\pm$0.02} &  92.07 & 90.87 & 90.20  \\
\hline
\hline
\multirow{9}{0.7cm}{Kvasir} & {Densenet121}  & \multirow{5}{1cm}{\begin{minipage}{6mm}Contra\\stive\end{minipage}} & 88.83 & 97.58 & 98.25 & {83.89$\pm$0.01} & 88.83 & 88.98 & 88.69 \\
\cline{4-10}
&  Resnet50 & & 90.42 & 97.46 & 98.67 & {84.85$\pm$0.02} &  90.42 & 89.75 & 89.63 \\
\cline{4-10}
&  MIRViT\textunderscore small & & 93.33 & 96.92 & 97.54 & {\textbf{90.16$\pm$0.01}} & 93.33 & 92.87 & 92.84 \\
\cline{4-10}
&  MIRDeiT\textunderscore small &  & 92.21 & 96.92 & 97.96 & {90.11$\pm$0.01} &  92.21 & 92.50  & 92.53 \\
\cline{4-10}
&  MedViT\textunderscore T & & 68.33 & 95.29 & 98.50  & {51.41$\pm$0.01} & 68.33 & 64.65 & 62.82 \\
\cline{2-10}
&  DeiT\textunderscore small & \multirow{4}{1cm}{Cross Entropy} & 91.46 & 97.38 & 98.25  & {88.15$\pm$0.02} &  91.46 &91.62 & 91.39 \\
\cline{4-10}
&  ViT\textunderscore small & & 86.50  & 96.71 & 98.00 & {79.33$\pm$0.01} & 86.50  & 86.59 & 86.24 \\
\cline{4-10}
&  Densenet121 & & 56.71 & 89.08 & 95.21  & {26.23$\pm$0.01} &  56.71 & 52.62 & 50.28 \\
\cline{4-10}
&  Resnet50 & & 66.96 & 92.17 & 95.88  & {40.59$\pm$0.01} &  66.96 & 63.74 & 61.42 \\

\hline
\end{tabular}%
% }
\end{table}

% MedViT\_T, MedViT\_S, and MedViT\_L share a common architecture but differ in MedViT Block repetition during the third stage  \cite{manzari2023medvit}.
MIRViT\_small's superiority is evident on the ISIC dataset, for example, where it achieves a higher R@1 and a 13.24 percentage-point increase in mAP compared to Resnet50 trained with contrastive loss. It also outperforms vision transformers trained with cross-entropy loss, with a 7.64-point gain in mAP over DeiT\_small, underlining its precision in top-$k$ scenarios.
Similarly, on the COVID and Kvasir datasets, MIRViT\_small outperforms convolution-based methods and transformers trained with cross-entropy loss, showing versatility across diverse medical imaging contexts. MedViT, while excelling in classification, lags in retrieval. %, likely due to its training from scratch on smaller datasets.
In summary, vision transformers, especially MIRViT\_small, exhibit good potential in medical image retrieval. Their consistent superiority in R@1, mean precision, and mAP underscores their effectiveness. % in clinical applications. Adoption 
The use of contrastive loss, with or without differential entropy regularization, is beneficial.
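
The mAP gains quoted above for ISIC can be checked directly against Table \ref{tab:Combined_results}. They are absolute differences in mAP:
\[ 70.96 - 57.72 = 13.24, \qquad 70.96 - 63.32 = 7.64 , \]
where $57.72$ is the mAP of Resnet50 with contrastive loss and $63.32$ that of DeiT\_small with cross-entropy loss. Expressed as relative improvements, these correspond to roughly $13.24 / 57.72 \approx 22.9\%$ and $7.64 / 63.32 \approx 12.1\%$.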

% In our comprehensive evaluation spanning diverse datasets, including ISIC, COVID, and Kvasir, we considered a range of models trained with specific loss functions and architectures. The model lineup comprised traditional CNN architectures, such as Densenet121 and Resnet50, various vision transformers (ViT) like deit\_small, ViT\_small, and MedViT, alongside our proposed MIRViT variants.

% MedViT\_T, MedViT\_S, and MedViT\_L share a common architecture, differing in the repetition of the MedViT Block during the third stage: twice for MedViT\_T, four times for MedViT\_S, and six times for MedViT\_L. Refer to the work by Manzari et al. \cite{manzari2023medvit} for a detailed description of the architecture.

% The training was conducted as outlined in the method section, utilizing the contrastive loss alone and incorporating the regularizer. For cross-entropy, the same optimizer and iteration settings were applied, utilizing basic cross-entropy loss for classification. The \texttt{cls} embedding served as the image representation for medical image retrieval.

% The results, detailed in \ref{tab:Combined_results}, uncover the comparative performance of various models across datasets.

% In this extensive evaluation, MIRViT\_small not only outperforms traditional CNN models like Densenet121 and Resnet50 but also surpasses other vision transformers, including deit\_small, ViT\_small and MedViT, showcasing its remarkable efficacy in medical image retrieval. Focusing on the ISIC dataset, MIRViT small exhibits substantial advancements, achieving a 9.84\% higher recall at \(k = 1\), a 4.92\% improvement at \(k = 5\), and a remarkable 6.67\% enhancement at \(k = 10\) and 13.24\% increase in mAP compared to Resnet50. The superiority of MIRViT\_small becomes even more apparent when compared to vision transformers trained with cross-entropy loss, notably achieving a 7.64\% surge in mAP compared to deit\_small, the top performer among models trained with cross-entropy loss. This underscores its adeptness in retrieving pertinent images with precision, especially in top-k scenarios.

% The robust performance of MIRViT\_small extends consistently across various datasets, including COVID and Kvasir, establishing its efficacy in diverse medical imaging applications. It consistently outshines convolution-based methods and transformers employing cross-entropy losses, underscoring its versatility and reliability in different medical imaging contexts.

% Despite MedViT showcasing state-of-the-art results in medical image classification, its performance in image retrieval tasks falls comparatively lower. This discrepancy can be attributed to MedViT's training from scratch with a hybrid architecture, devoid of reliance on pre-trained weights. The smaller size of these datasets, particularly in contrast to larger ones like MedMNIST, likely contributes to the relatively lower performance of models such as MedViT in retrieval tasks.

% In summary, our findings underscore the unparalleled potential of Vision Transformers, especially MIRViT\_small, in medical image retrieval tasks. Their consistent superiority across datasets highlights their effectiveness in clinical applications, boasting superior recall, precision, and mAP scores.  The study emphasizes the importance of adopting a contrastive loss, both alone and with differential entropy regularization during the training process \ref{fig:Model_variants}, providing crucial insights for the advancement of medical image retrieval methodologies.
\vspace{-4mm}
\subsection{Evaluation of Hyperparameters and Transformer Variants} \label{subsec:Evaluation with  parameters}
The image retrieval results in Figure \ref{fig:Model_variants} show that MIRViT\_small consistently outperforms MIRDeiT\_small and MIRViT\_base across the ISIC, COVID, and Kvasir datasets. On ISIC, MIRViT\_small performs best at $\lambda = 0.3$. On COVID and Kvasir, MIRViT\_small achieves the highest mAP at $\lambda = 0.7$, closely followed by $\lambda = 0.3$. Notably, the larger MIRViT\_base is not the top performer on any dataset, indicating that an embedding length of 384 performs well for retrieval on these medical datasets. In conclusion, MIRViT\_small with $\lambda = 0.3$ emerges as a strong overall choice for medical image retrieval tasks.
\vspace{-0.5mm}
\begin{figure}[htbp]
  % Caption and label go in the first argument and the figure contents
  % \centering
  % go in the second argument
  \floatconts
    {fig:Model_variants}
    {\caption{mAP Values for Different Datasets and Model Variants}}
    {\scalebox{1}{\includegraphics[width=1.0\linewidth]{Figures/mAP_grouped_dataset_plot.png}}}
\end{figure}
% \vspace{-4mm}
 %, demonstrating its robust state-of-the-art performance across various datasets and distinguishing it from both CNN architectures and transformer variants trained on cross-entropy loss.
\vspace{-4mm}
\subsection{Evaluation of Image Retrieval Explanations} \label{subsec:Explanation methods experimentation}

To assess visual explanations, we employ the insertion and deletion causal metrics in image retrieval, gauging how well the generated explanations capture the causes behind predictions. {We measure changes in image similarity as a result of changes to the retrieved image. The insertion metric measures the increase in image similarity by starting with a blurred version of the original retrieved image and gradually revealing pixels from highest
to lowest relevance. Conversely, the deletion metric assesses the decline in image similarity as we gradually mask out pixels on the retrieved image with a constant gray value, from highest to lowest relevance based on the
computed saliency map. We then compute the similarity score $s$ between the query image $q$ and perturbed versions of the retrieved image $\hat{r}$ (either in the form of insertion onto a blurred image or deletion using a constant gray value): \( s(\mathbf{z}_{\mathbf{q}}, \mathbf{z}_{\mathbf{\hat{r}}}) = \max\left(0, \frac{\mathbf{z}_{\mathbf{q}} \cdot \mathbf{z}_{\mathbf{\hat{r}}}}{\|\mathbf{z}_{\mathbf{q}}\| \|\mathbf{z}_{\mathbf{\hat{r}}}\|}\right) \).} To avoid negative outputs, all similarity values are clipped to a minimum of zero. The area under the curve (AUC) is used to measure the effectiveness of saliency maps. Higher AUC values are preferred for insertion, and lower AUC values are desirable for deletion.
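
As a sketch of how this AUC can be computed, assume that the perturbation proceeds in $n$ equal steps and that the curve is integrated with the trapezoidal rule; both $n$ and the integration rule are illustrative assumptions rather than details specified above. Writing $\hat{r}_k$ for the retrieved image after $k$ perturbation steps and $s_k = s(\mathbf{z}_{\mathbf{q}}, \mathbf{z}_{\mathbf{\hat{r}}_k})$ for the corresponding similarity, the normalized area is
\[
\mathrm{AUC} \approx \frac{1}{n} \sum_{k=1}^{n} \frac{s_{k-1} + s_k}{2}.
\]
Because every $s_k$ lies in $[0,1]$, so does the AUC. Under this sketch, a saliency map that ranks pixels well makes $s_k$ rise quickly during insertion (large AUC) and fall quickly during deletion (small AUC).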

We implemented all explainability methods using the same hyperparameters as in \cite{englebert2023explaining}. In the saliency maps, vibrant (red) regions signify the primary focus, while cooler (blue) areas have less impact. The maps are rescaled using bilinear interpolation to align with the input image resolution.
% The saliency maps value range from 0 to 1, visually highlighting crucial decision-making areas. Vibrant (red) regions signify primary focus, while cooler (blue) areas have less impact. 


% \subsection{Comparison of different explainability methods}
Chefer2 and TIS consistently emerge as strong performers in Table \ref{tab:AUC_results}, showing top-tier scores across datasets and metrics. However, the qualitative evaluation through visual explanation maps shown in Figure \ref{fig:All_explanations} suggests that TIS is qualitatively better. %, effectively highlighting the model's focus areas during image retrieval. TIS, although having slightly lower AUC scores, excels in offering a deeper understanding of the model's decision-making process.
% \vspace{-5mm}
\begin{table}[ht]
% \footnotesize
\centering
\caption{AUC values for insertion and deletion metrics on MIRViT\_small with various explanation techniques}
\label{tab:AUC_results}
% \vspace{2mm}
\begin{tabular}{|c|c|c|c|c|c|c|c|c|}
\hline
\bfseries Dataset & \bfseries Metric & \bfseries bth &\bfseries btt  & \bfseries chefer2& \bfseries rollout & \bfseries tam & \bfseries tis & \bfseries vitcx  \\
\hline
\multirow{2}{1.5cm}{ISIC} & Insertion  & 0.79 & 0.79  & 0.79 & 0.78 & 0.79 & 0.78 & 0.72 \\
\cline{2-9}
&  Deletion  & 0.46 & 0.43 & 0.41 & 0.45 & 0.44 & 0.41 & 0.53 \\
\cline{1-9}
\multirow{2}{1.5cm}{COVID} & Insertion  & 0.67 & 0.66  & 0.70 & 0.69 & 0.66 & 0.67 & 0.62 \\
\cline{2-9}
&  Deletion  & 0.45 & 0.47  & 0.42 & 0.44 & 0.47 & 0.46 & 0.51 \\
\cline{1-9}

\multirow{2}{1.5cm}{Kvasir} & Insertion  & 0.72 & 0.72 & 0.74 & 0.74 & 0.71 & 0.72 & 0.68 \\
\cline{2-9}
&  Deletion  & 0.46 & 0.46 & 0.40 & 0.42 & 0.48 & 0.42 & 0.49 \\
\hline
\end{tabular}
\end{table}
% \vspace{-1mm}
% Chefer2 and TIS consistently emerge as strong performers \ref{tab:AUC_results}, showcasing top-tier scores across datasets and metrics. However, the qualitative evaluation through visual explanation maps \ref{fig:ISIC_explanations} ,\ref{fig:COVID_explanations}  reveals TIS's prowess in providing lucid and insightful explanations \ref{fig:Kvasir_explanations}, effectively highlighting the model's focus areas during image retrieval. TIS, although having slightly lower AUC scores, excels in offering a deeper understanding of the model's decision-making process.   
On the other hand, Rollout, while maintaining competitive AUC values, tends towards higher deletion scores and falls short in delivering detailed visual explanations. The other methods in the comparison, BTT, BTH, VitCX, and TAM, are inconsistent, excelling in one metric while lagging in another. For detailed results, please refer to Appendix \ref{appendix:Insertion_deletion}.
\vspace{-2mm}
\begin{figure}[htbp]
 % Caption and label go in the first argument and the figure contents
 % go in the second argument
\floatconts
  {fig:All_explanations}
  {\caption{MIRViT\_small Top-3 Retrievals and Explanation Maps on All Datasets}}
  {\includegraphics[width=1\linewidth]{Figures/all_tis_chefer2_cropped.png}}
\end{figure}
\vspace{-2mm}
%TIS stands out as a consistent performer, achieving high insertion and low deletion AUC scores. This remarkable consistency implies that TIS maintains its effectiveness across a range of image retrieval scenarios. The higher insertion scores suggest that TIS rapidly increases image similarity with added pixels, while the lower deletion scores indicate its ability to minimize the impact of removed pixels on image similarity. These consistent results indicate that TIS has proven to be the most effective method for the MIRViT Transformer models in medical image retrieval tasks, highlighting the crucial features contributing to the model's decision-making process.
\vspace{-6mm}
\section{Conclusion}
In conclusion, our study presents an analysis of various architectures and parameters for transformers, along with an evaluation of explanation techniques. MIRViT\_small emerges as the top-performing model across the ISIC, COVID, and Kvasir datasets.
Our exploration of loss functions underscores the limitations of simple cross-entropy and the effectiveness of the contrastive approach. Further, our analysis of state-of-the-art eXplainable AI methods suggests that Transformer Input Sampling (TIS) provides qualitatively better explanations than the other methods considered. In the future, we intend to further explore advances in these algorithms for the medical image retrieval task.


\bibliography{midl24_211}

\newpage
\appendix


\section{}
\label{appendix:Insertion_deletion}
% As mentioned earlier to quantitatively evaluate the interpretability of the explanation maps, we employ the causal metrics of insertion and deletion, as illustrated in Figure \ref{fig:Ins_Del}. 

Figure \ref{fig:Ins_Del} visually demonstrates the intermediate steps of the insertion and deletion metrics using the TIS explainability method on the MIRViT\_small model. The deletion process is performed on a Kvasir image, while the insertion process is applied to a COVID chest X-ray image. Both processes highlight the evolving significance of pixels in the explanation sequences.
\begin{figure}[htbp]
  \centering
  % Caption and label go in the first argument and the figure contents
  % go in the second argument
  \floatconts
    {fig:Ins_Del}
    {\caption{TIS Insertion \& Deletion Steps}}
    {\includegraphics[width=1\linewidth]{Figures/Ins_del_v2.png}}
    % {\includegraphics[width=1\linewidth, trim=1cm 5cm 1.5cm 4.5cm]{Kvasir_tis_chefer2_v4.png}}
\end{figure}
The value \( \textbf{p} \) represents the similarity between the query image and the intermediate perturbed versions of the retrieved image. As anticipated, the \( \textbf{p} \) value decreases during deletion as pixels are gradually removed and increases during insertion with the progressive addition of pixels. The area under the curve (AUC) of these graphs serves as the quantitative measure of the effectiveness of the saliency maps.
% The value \textbf{p} signifies the distance between the query image and the intermediate perturbed versions of the retrieved image. As expected the \textbf{p} value decreases during deletion as pixels are systematically removed and increases during insertion with the progressive addition of pixels. The area under the curve (AUC) of the graph is used as the quantitative measure of the 'goodness' of the saliency maps. 
% Higher AUC values indicate superior performance for insertion, whereas lower AUC values are preferable for deletion. This metric serves as a robust evaluation tool for assessing the efficacy of the explanation maps. 

In Figure \ref{fig:ISIC_right_wrong_explanations_all_methods}, we present top-3 retrievals using various explainability techniques for the ISIC dataset. Subfigure a) illustrates a query image with accurate retrievals (indicated by a green border). Here the saliency maps concentrate distinctly on lesion regions, providing meaningful insights. On the other hand, Subfigure b) showcases two incorrect retrievals (marked with a red border), where the saliency maps tend to focus on areas around the lesion rather than precisely on the lesion itself. In the first incorrect retrieval, TIS focuses on a scale at the bottom of the image, while in the second incorrect retrieval it predominantly concentrates on non-lesion regions around the lesion.

Additionally, Figure \ref{fig:COVID_Kvasir_explanations_all_methods} displays an example query image alongside its top-3 retrievals for the COVID and Kvasir datasets, where TIS explanations are visually better than other methods. Figure \ref{fig:Kvasir_explanations_8_classes} further illustrates TIS explanation maps for the top-3 retrievals, each representing one example image per class, demonstrating the effectiveness of saliency maps in highlighting relevant regions.

\begin{figure}[htbp]
 % Caption and label go in the first argument and the figure contents
 % go in the second argument
\floatconts
  {fig:ISIC_right_wrong_explanations_all_methods}
  {\caption{Top-3 Retrieval Explanations for Two Images from ISIC Dataset Using Different Explanation Methods.}}
  % \hspace{-3.5cm}
  {\includegraphics[width=0.9\linewidth]{Figures/isic_wrong_right_retrievals_all_methods_v1_crop.png}}
\end{figure}

\begin{figure}[htbp]
 % Caption and label go in the first argument and the figure contents
 % go in the second argument
\floatconts
  {fig:COVID_Kvasir_explanations_all_methods}
  {\caption{Top-3 Retrieval Explanations for One Example Image from COVID and Kvasir Using Different Explanation Methods.}}
  % \hspace{-3.5cm}
  {\includegraphics[width=1.1\linewidth]{Figures/Covid_Kvasir_all_method_retrievals_v2.png}}
\end{figure}
\begin{figure}[htbp]
 % Caption and label go in the first argument and the figure contents
 % go in the second argument
\floatconts
  {fig:Kvasir_explanations_8_classes}
  {\caption{Top-3 Retrieval Explanations for an Example Image from Each of the 8 Classes in Kvasir using TIS.}}
  % \hspace{-3.5cm}
  {\includegraphics[width=1.4\linewidth]{Figures/kvasir_8_classes_tis_v2.png}}
\end{figure}
\newpage


\begin{table}[h]
\centering
\caption{{Model and the Number of Learnable Parameters}}
% \caption{Model and the Number of Learnable Parameters}
\vspace{3mm}
\label{tab:model_params}
{
\begin{tabular}{|l|c|}
\hline
\textbf{Model}                    & \textbf{\# Parameters} \\ \hline
MedViT\_small (CNN-Transformer Hybrid)  & 31.14M                  \\ \hline
MedViT\_base (CNN-Transformer Hybrid)   & 44.41M                  \\ \hline
MedViT\_large (CNN-Transformer Hybrid)  & 57.68M                  \\ \hline
ResNet50                                 & 23.51M                  \\ \hline
DenseNet121                              & 6.95M                   \\ \hline
MIRViT\_small                              & 21.67M                  \\ \hline
MIRViT\_base                              & 85.80M                  \\ \hline
MIRDeiT\_small                            & 21.67M                  \\ \hline
\end{tabular}
}
\end{table}

As can be observed from Table \ref{tab:model_params}, MIRViT\_small and MIRDeiT\_small (21.67M parameters each) have fewer parameters than ResNet50 (23.51M) while achieving improved performance. DenseNet121 is more compact (6.95M), but in general, the parameter counts of the other small and base models are comparable.
% document body
% \end{sloppypar}
\end{document}
