\section{Introduction} \label{section:introduction}

Medical imaging is central to clinical diagnosis and treatment across a wide range of diseases and healthcare environments, and this breadth produces substantial heterogeneity in real-world imaging data. For example, chest radiographs acquired across community clinics and tertiary-care hospitals often exhibit substantial differences in disease appearance, image quality, and acquisition protocols~\cite{irvin2019chexpert,johnson2019mimic}. Dermatology images captured using consumer-grade smartphones show large variations in lighting, color balance, and zoom level~\cite{tschandl2018ham10000}, while MRI scans vary widely due to changes in echo time, slice thickness, and scanner field strength~\cite{zhou2019models}. Despite this variability, most deep learning models are optimized for narrow, predefined tasks, such as classifying thoracic diseases in CheXpert~\cite{irvin2019chexpert} or detecting retinal fluid in OCT~\cite{kermany2018identifying}. When deployed in new settings involving previously unseen diseases (e.g., COVID-19 pneumonia) or changes in acquisition protocol, model accuracy often degrades sharply because of distribution shift~\cite{cohen2020covid}. Maintaining reliable performance across such shifts typically requires costly retraining and large new annotated datasets, which are rarely available in emerging or resource-constrained clinical environments.


Recently, self-supervised learning has enabled the training of large-scale foundation models (FMs) by deriving training labels directly from the data itself~\cite{eslami2023pubmedclip,oquab2023dinov2,zhai2023sigmoid}. These models can therefore leverage data from numerous sources and diverse diseases to learn generalizable representations applicable across a range of downstream tasks. After pre-training, downstream prediction tasks (e.g., classification) can be performed through zero-shot inference in vision-language FMs, or by training lightweight classifiers or adapters on top of frozen or fine-tuned vision FMs (e.g., with Tip-Adapter~\cite{zhang2022tip}, Proto-Adapter~\cite{kato2024proto}, or LoRA~\cite{hu2022lora}). While these strategies have shown success in numerous applications~\cite{clip,oquab2023dinov2,He2021MaskedAA}, healthcare settings pose greater challenges: labeled data are often extremely limited, and test distributions differ greatly from those seen during training. In these settings, common FM adaptation methods may overfit, fail to generalize, or underperform~\cite{robustnessvlm,kumar2022fine}. As a result, there remains a need for adaptation strategies that are effective in low-data, out-of-distribution medical applications.

A promising yet underexplored alternative is in-context learning (ICL), which enables models to infer a new task directly from a small set of example inputs and outputs. While ICL has transformed natural language processing, its adoption in medical imaging has been narrow, focusing almost exclusively on segmentation. Existing visual ICL methods, including SegGPT~\cite{wang2023seggpt} and Iris~\cite{gao2025show}, demonstrate strong adaptability to unseen anatomical structures through reference image--mask pairs. However, these approaches face key limitations: (i) they are restricted to segmentation tasks, leaving classification, detection, and diagnostic reasoning largely unexplored; (ii) many rely on computationally heavy architectures or inefficient inference (e.g., repeated reference encoding), limiting scalability; and (iii) they do not explicitly address the broader issue of clinical heterogeneity, where tasks vary not only by anatomy but also by disease definitions, scanner protocols, and institutional differences.

To overcome these limitations, we introduce Prototypical In-Context Knowledge Adaptation for Clinical Heterogeneous Usage (\ourmethod), a lightweight in-context learning framework designed specifically for domain and classification-task adaptation from a very small number of reference image-label pairs.

\ourmethod\ enables foundation models to adapt to entirely new classification tasks using only a few reference image-label pairs (i.e., \textit{support sets}), without fine-tuning, retraining, or relying on large annotated datasets. This framework uses a perceptual task encoding module that distills disease- and domain-specific cues from the support examples into compact, reusable task embeddings. The embeddings are then used to condition an inference module that performs classification in a single forward pass. Since the encoder parameters are never updated, \ourmethod\ avoids catastrophic forgetting and overfitting, trains efficiently, and remains robust across varied imaging modalities and distributions. Our contributions are summarized as follows:
\begin{enumerate}[i.]
\item We propose a universal in-context learning framework by introducing a task encoding mechanism that captures disease- and domain-specific signals from only a few reference samples during inference, without requiring parameter updates. 
\item We perform comprehensive evaluations across heterogeneous datasets, tasks, and foundation models (including both natural and medical imaging models, as well as vision-language FMs and vision-only FMs), demonstrating improved performance on, and generalization to, novel disease categories and distribution shifts.
\item We conduct ablation studies to determine the impact of the support set size (i.e., number of labeled samples per class) on classification performance.
\end{enumerate}

\ourmethod\ represents a step toward adaptable clinical AI: systems that learn new imaging tasks from a handful of examples, mirror the flexibility of clinicians, and operate robustly in heterogeneous real-world environments.