\section{Experiments and Results}
\label{section:experiments}


\subsection{Dataset and Implementation Details} 
\subsubsection{Datasets}
We evaluate our approach on three diverse medical imaging datasets spanning different anatomical regions and diagnostic tasks to assess the generalizability of ICL across various clinical scenarios.\\
\underline{\textit{Skin Dataset (ISIC)}}~\cite{tschandl2018ham10000,codella2018skin,combalia2019bcn20000}: This dataset comprises dermatological images for skin lesion classification, including multiple categories of benign and malignant conditions such as melanoma, basal cell carcinoma, actinic keratosis, and benign nevi. The dataset presents challenges due to significant intra-class variability in lesion appearance, color, texture, and morphology across different skin types and imaging conditions. We perform binary classification between melanoma (MEL) and melanocytic nevi (NV) with 4522 samples in each test class.\\
\underline{\textit{OCT (Optical Coherence Tomography)}}~\cite{kermany2018identifying}: The OCT dataset contains cross-sectional retinal scans used for diagnosing various retinal pathologies. Images include normal retinal tissue as well as pathological conditions such as diabetic macular edema (DME), choroidal neovascularization (CNV), and drusen. OCT imaging provides high-resolution visualization of retinal layers, making it valuable for early detection and monitoring of retinal diseases. We perform binary classification between CNV and Normal scans with 1000 samples per class.\\
\underline{\textit{DR (Diabetic Retinopathy)}}~\cite{eyepacs2025messidor}: This dataset consists of fundus photographs for diabetic retinopathy screening and grading. Images are categorized based on DR severity levels ranging from no DR to Proliferative Diabetic Retinopathy (PDR), with intermediate stages including mild, moderate, and severe non-proliferative DR. The task requires identifying subtle vascular abnormalities, microaneurysms, hemorrhages, and neovascularization. We perform binary classification between no DR and Proliferative Diabetic Retinopathy (PDR) with 1466 samples from each test class.

For all three datasets, we formulate the evaluation as binary classification tasks to facilitate consistent comparison across different medical imaging modalities. Importantly, the test sets are carefully balanced to ensure equal representation of positive and negative classes, eliminating class imbalance as a confounding factor in performance evaluation and providing a fair assessment of model discrimination capabilities.
\subsubsection{Implementation Details}

To contextualize our implementation, we briefly outline the foundation models used in our experiments. Evaluating PIKACHU across these backbones allows us to assess the generality of our in-context learning strategy under differing pretraining objectives, data sources, and representational characteristics:
\begin{enumerate}[i.]
    \item \textit{PubMedCLIP}~\cite{eslami2023pubmedclip} is a vision-language model pretrained on large-scale biomedical image–text pairs, providing domain-specific representations well suited for clinical imagery; its medical grounding makes it a strong baseline for specialized tasks.
    \item \textit{SigLIP (Sigmoid Loss for Language-Image Pre-Training)}~\cite{zhai2023sigmoid} is a contrastive vision–language model trained on massive web-scale datasets using a sigmoid loss formulation, enabling robust and semantically aligned visual features that generalize effectively beyond its training distribution.
    \item \textit{DINOv2}~\cite{oquab2023dinov2} is a self-supervised vision model trained by self-distillation without labels on a large curated image dataset, producing highly transferable visual embeddings that perform well across diverse downstream tasks.
    \item \textit{Vision Transformer (ViT)}~\cite{wu2020visual} represents a purely vision-based architecture trained on generic large-scale image datasets, offering a neutral baseline with no modality-specific inductive biases. 
\end{enumerate}


All experiments are implemented in PyTorch, with pretrained foundation model backbones loaded through their respective public model hubs. Model weights are downloaded once and cached locally for reproducibility and efficient re-use. Each backbone is kept fully frozen throughout training and inference to preserve its pretrained representations and ensure a consistent evaluation of in-context adaptation. All input images are converted to PIL format and preprocessed using the normalization pipeline associated with each model before feature extraction.


The only trainable parameter is the logarithm of the temperature, $\log T$, optimized with Adam at a learning rate of $1 \times 10^{-4}$. Training runs for 5--10 epochs depending on the dataset, with each epoch consisting of thousands of randomly sampled few-shot episodes. Evaluation uses the identical episodic structure but with no gradient updates. We report accuracy, AUROC, F$_1$-score, and confusion matrices.
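
To make the role of the temperature concrete, the following is a minimal sketch rather than the exact scoring rule. It assumes that each class prototype is the average of the $K$ support embeddings of that class and that queries are scored by temperature-scaled cosine similarity. The symbols $f$ (frozen encoder), $\mathbf{x}_{c,i}$ (the $i$-th support image of class $c$), $\mathbf{p}_c$ (prototype), $\mathbf{x}_q$ (query image), $\mathrm{sim}$ (cosine similarity), and $\theta$ are illustrative notation introduced here:
\[
\mathbf{p}_c = \frac{1}{K}\sum_{i=1}^{K} f(\mathbf{x}_{c,i}), \qquad
P(y = c \mid \mathbf{x}_q) = \frac{\exp\bigl(\mathrm{sim}(f(\mathbf{x}_q), \mathbf{p}_c)/T\bigr)}{\sum_{c'} \exp\bigl(\mathrm{sim}(f(\mathbf{x}_q), \mathbf{p}_{c'})/T\bigr)}, \qquad
T = e^{\theta}.
\]
Under this assumed form, optimizing $\theta = \log T$ rather than $T$ keeps the temperature positive without an explicit constraint, and a smaller $T$ sharpens the predicted class distribution while the frozen embeddings stay unchanged.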

All experiments are conducted on a single NVIDIA H100 GPU, although CPU inference is feasible due to the minimal number of trainable parameters. 

\subsection{Experiments}
Our unified experimental design isolates the contribution of in-context learning itself, allowing us to compare improvement margins across architectures with differing representational strengths. For each of the four foundation models above, the baseline setting corresponds to the standard zero-shot or fixed-feature configuration, where the encoder remains frozen and predictions are produced without any access to support examples. This involves computing cosine similarity between query embeddings and predefined classifier weights (see Appendix~\ref{append:implementation}). In the baseline settings, no task-specific adaptation or additional fine-tuning is performed. 
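
For concreteness, the baseline decision rule described above can be sketched as follows. The symbols $f$ (frozen encoder), $\mathbf{x}_q$ (query image), and $\mathbf{w}_c$ (predefined classifier weight vector for class $c$) are notation introduced here for illustration, and the use of an $\arg\max$ over classes is an assumption; how the weights are constructed is left to Appendix~\ref{append:implementation}:
\[
\hat{y} = \arg\max_{c} \; \frac{f(\mathbf{x}_q)^{\top}\mathbf{w}_c}{\|f(\mathbf{x}_q)\|_2 \, \|\mathbf{w}_c\|_2}.
\]
No quantity in this rule depends on support examples, which is what distinguishes the baseline from the in-context setting.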


\subsection{Results \& Ablations}
\paragraph{Baseline vs.\ ICL strategy} Table~\ref{table:basevsicl} shows that ICL outperforms the baseline for every vision model and every dataset. Most baselines sit near 0.50 accuracy, which is chance level on our balanced binary test sets, and ICL improves accuracy by at least 0.19 in all but one case (DINOv2 on DR, 0.76 to 0.80). The improvement is particularly pronounced on the OCT dataset, where SigLIP rises from 0.50 to 0.83 accuracy. PubMedCLIP likewise gains on OCT (0.50 to 0.72 accuracy) and on DR (0.40 to 0.79 accuracy), highlighting ICL's effectiveness in medical image analysis tasks. Interestingly, both the domain-specific model (PubMedCLIP) and the general-purpose models (SigLIP, DINOv2, and ViT) benefit from ICL. Skin lesion classification shows more modest but consistent improvements (0.19--0.24 accuracy gain), while OCT (0.22--0.33) and DR (up to 0.42 for ViT) exhibit the largest gains. These findings suggest that ICL is particularly effective for medical imaging tasks in which subtle pathological features must be distinguished, and they highlight its potential as a strategy for improving diagnostic accuracy without model retraining or fine-tuning.
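
As a quick arithmetic check on these margins, the mean accuracy gain over the four backbones can be computed per dataset directly from Table~\ref{table:basevsicl}, taking the backbones in the order SigLIP, PubMedCLIP, DINOv2, ViT. The symbol $\bar{\Delta}$ is notation introduced here:
\[
\begin{array}{rcl}
\bar{\Delta}_{\mathrm{ISIC}} &=& (0.24 + 0.19 + 0.24 + 0.19)/4 = 0.215,\\
\bar{\Delta}_{\mathrm{OCT}} &=& (0.33 + 0.22 + 0.31 + 0.31)/4 = 0.2925,\\
\bar{\Delta}_{\mathrm{DR}} &=& (0.27 + 0.39 + 0.04 + 0.42)/4 = 0.28.
\end{array}
\]
OCT therefore shows the largest average gain, with DR close behind and ISIC the smallest.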

\begin{table}[h]
\centering
\begin{tabular}{cllll}
\cline{3-5}
\multicolumn{1}{l}{} &  & \multicolumn{3}{c}{\textbf{Dataset}} \\ \cline{3-5} 
\textbf{Model} & \textbf{Strategy} & \multicolumn{1}{l|}{\textbf{ISIC}} & \multicolumn{1}{l|}{\textbf{OCT}} & \textbf{DR} \\ \hline
\multirow{2}{*}{SigLIP} & Baseline & \multicolumn{1}{l|}{0.49} & \multicolumn{1}{l|}{0.50} & 0.50 \\
 & ICL & \multicolumn{1}{l|}{0.73} & \multicolumn{1}{l|}{0.83} & 0.77 \\ \hline
\multirow{2}{*}{PubMedCLIP} & Baseline & \multicolumn{1}{l|}{0.50} & \multicolumn{1}{l|}{0.50} & 0.40 \\
 & ICL & \multicolumn{1}{l|}{0.69} & \multicolumn{1}{l|}{0.72} & 0.79 \\ \hline
\multirow{2}{*}{DINOv2} & Baseline & \multicolumn{1}{l|}{0.50} & \multicolumn{1}{l|}{0.51} & 0.76 \\
 & ICL & \multicolumn{1}{l|}{0.74} & \multicolumn{1}{l|}{0.82} & 0.80 \\ \hline
\multirow{2}{*}{ViT} & Baseline & \multicolumn{1}{l|}{0.50} & \multicolumn{1}{l|}{0.52} & 0.39 \\
 & ICL & \multicolumn{1}{l|}{0.69} & \multicolumn{1}{l|}{0.83} & 0.81 \\ \hline
\end{tabular}
\caption{Accuracy of four vision models (SigLIP, PubMedCLIP, DINOv2, and ViT) under the baseline (zero-shot) and ICL (in-context learning) strategies on three medical imaging datasets: ISIC, OCT (Optical Coherence Tomography), and DR (Diabetic Retinopathy).}
\label{table:basevsicl}
\end{table}


Figure~\ref{fig:confusion_matrix} shows the confusion matrices for the binary classification performance of all four FMs on each of the medical imaging datasets, with and without the proposed ICL method. Performance improves in all cases, which is visible as predictions shifting from the off-diagonal entries (baseline) to the diagonal entries (\ourmethod).
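
For reference, the scalar metrics relate to the entries of each $2 \times 2$ confusion matrix through the standard definitions below, where $\mathrm{TP}$, $\mathrm{TN}$, $\mathrm{FP}$, and $\mathrm{FN}$ denote the counts of true positives, true negatives, false positives, and false negatives:
\[
\mathrm{Accuracy} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}}, \qquad
F_1 = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}.
\]
Because each test set is balanced, a classifier that assigns every image to a single class reaches an accuracy of exactly $0.5$. On OCT, for example, with 1000 scans per class, this gives $(1000 + 0)/2000 = 0.50$, so an accuracy near $0.50$ corresponds to chance level.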

\paragraph{Effect of Support Set Size}
To investigate the impact of support set size on ICL performance, we vary the number of support samples per class, $K \in \{1, 5, 10\}$ (Figure~\ref{fig:effect_K}). Accuracy increases with $K$ across all three medical imaging datasets. With a single support sample per class, models achieve modest improvements over the baseline, showing that even minimal context can enhance classification accuracy. Increasing the support set to 5 samples per class yields substantial gains, particularly on the OCT and DR datasets, where the additional examples provide richer context for distinguishing subtle pathological variations. At 10 support samples per class, we observe further improvements, though with diminishing returns compared to the jump from 1 to 5 samples, suggesting a saturation effect in which additional examples provide incrementally less new information.
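
One possible reading of this saturation, offered as a heuristic rather than a result established here, assumes that the $K$ support embeddings $z_1, \dots, z_K$ of a class are independent draws with a common per-dimension variance $\sigma^2$ and that they are averaged into a single class representation. Both the independence assumption and the symbols $z_i$ and $\sigma^2$ are introduced for illustration. The variance of the average then scales as
\[
\mathrm{Var}\Bigl(\frac{1}{K}\sum_{i=1}^{K} z_i\Bigr) = \frac{\sigma^2}{K},
\]
so moving from $K=1$ to $K=5$ removes 80\% of the variance ($\sigma^2 \to 0.2\,\sigma^2$), whereas moving from $K=5$ to $K=10$ removes only a further 10\% of the original value ($0.2\,\sigma^2 \to 0.1\,\sigma^2$).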

\begin{figure}[h]
    \centering
    \includegraphics[width=\linewidth]{images/cfm.png}
    \caption{Confusion matrices comparing the binary classification performance of the four FMs on three medical imaging datasets (ISIC, OCT, DR). In each pair of matrices, the first confusion matrix (left) represents the baseline performance, while the second (right) shows results using the ICL method (\ourmethod) with $K=5$. Each pair shows a clear redistribution of predictions from off-diagonal elements in the baseline to diagonal elements with the proposed ICL method, indicating a substantial reduction in misclassification rates and improved class-wise discrimination.}
    \label{fig:confusion_matrix}
\end{figure}


\begin{figure}[t]
    \centering

    \begin{minipage}{0.32\textwidth}
        \centering
        \includegraphics[width=\linewidth]{images/isic.png}
        \caption*{(a) Skin data}
    \end{minipage}
    \hspace{-4mm}
    \begin{minipage}{0.32\textwidth}
        \centering
        \includegraphics[width=\linewidth]{images/oct.png}
        \caption*{(b) OCT}
    \end{minipage}
    \hspace{-4mm}
    \begin{minipage}{0.32\textwidth}
        \centering
        \includegraphics[width=\linewidth]{images/dr.png}
        \caption*{(c) DR}
    \end{minipage}
    \caption{Impact of support set size ($K$) on ICL performance. Accuracy increases with more support samples per class ($K = 1, 5, 10$) across all datasets and models, with steeper improvements in OCT and DR compared to Skin data. Performance gains are most pronounced between $K=1$ and $K=5$.}
    \label{fig:effect_K}
\end{figure}


\subsection{Limitations}
While \ourmethod\ provides a simple and effective mechanism for few-shot adaptation in medical image classification, several limitations warrant consideration. First, the framework relies on the representational quality of the underlying frozen backbone. If the pretrained features fail to capture clinically meaningful disease cues, prototype construction may be insufficient to recover task-specific distinctions. Second, the method assumes that a small number of support examples is representative of each class, which may not hold in settings with extreme intra-class variability or subtle disease patterns. Additionally, our current formulation treats each support example independently and does not account for label noise, class imbalance, or image acquisition artifacts commonly encountered in clinical datasets. Prototype averaging may also oversimplify complex class manifolds, particularly for multimodal or heterogeneous disease categories, where more expressive task embeddings could yield further gains. Finally, our experiments focus on episodic evaluation under controlled few-shot settings. Additional work is needed to assess robustness under real-world deployment scenarios, such as varying support set quality, shifting patient populations, or incomplete class coverage. Taken together, these limitations outline opportunities for improving the adaptability and reliability of in-context learning in medical imaging.
