\section{Methodology}
\label{section:methodology}
\ourmethod\ implements an end-to-end in-context learning (ICL) pipeline that adapts foundation models to new classification tasks and domains using only a handful of labeled support examples. The method operates entirely in the feature space of a frozen pretrained encoder and learns only a single temperature parameter, which makes it computationally efficient and suitable for rapid deployment across diverse clinical workflows. Our methodology combines: (i) a frozen medical vision--language backbone; (ii) a prototypical in-context inference mechanism; and (iii) a lightweight learnable temperature parameter for task-specific calibration.

Unlike prior medical ICL work, which is restricted to segmentation or relies on heavy cross-attention adapters, \ourmethod\ performs few-shot classification directly from the representational geometry of a pretrained foundation model. Figure~\ref{fig:intro} summarizes the approach.

\begin{figure}
    \centering
    \includegraphics[width=\linewidth]{images/method.png}
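    \caption{Overview of \ourmethod. A frozen encoder embeds the support and query images, class prototypes are obtained by averaging the support embeddings, and the query is classified by temperature-scaled cosine similarity to the prototypes.}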
    \label{fig:intro}
\end{figure}


\subsection{Problem Formulation}
\label{subsec:problem}

Given a query image $x_q$ and a small support set $\mathcal{S} = \{(x_i, y_i)\}_{i=1}^{C \times K}$ containing $K$ labeled examples from each of $C$ classes, \ourmethod\ classifies $x_q$ in three sequential steps:
\begin{enumerate}[i.]
    \item extract normalized image embeddings using a frozen foundation model encoder;
    \item construct class prototypes by averaging the embeddings of the support examples of each class;
    \item compute the temperature-scaled cosine similarity between the query embedding and each prototype to obtain the final prediction.
\end{enumerate}
Because only the temperature parameter is learned during training, the method leaves the pretrained backbone unmodified while maintaining calibration and generalizing across diverse tasks.

\subsection{Frozen Medical Vision-Language Backbone}
\label{sec:frozen_backbone}

At the core of \ourmethod\ lies a pretrained medical vision-language foundation 
model, PubMedCLIP (BiomedCLIP), which provides a rich semantic feature space 
well-suited for downstream clinical tasks. To preserve the generalization 
capabilities learned from large-scale multimodal biomedical corpora, we adopt 
a strictly \emph{frozen-backbone} design: all parameters of the vision encoder 
remain fixed throughout both training and inference. This enables the framework 
to function as a true in-context learner, adapting to new tasks solely through 
the support set rather than through gradient-based fine-tuning.

Each input image $x$ (either support or query) is first converted to PIL format 
and preprocessed using the original PubMedCLIP normalization pipeline. The 
frozen encoder then maps the image to a high-dimensional embedding:
$\mathbf{f}(x) = \text{Encode}(x)$, which is subsequently L2-normalized to ensure stable geometric comparisons:
\begin{equation}
    \mathbf{f}(x) \leftarrow 
    \frac{\mathbf{f}(x)}{\|\mathbf{f}(x)\|_2}.
\end{equation}

This normalization step is critical: it projects embeddings onto the unit 
hypersphere, so that cosine similarity reduces to a dot product, which 
simplifies prototype-based reasoning. Because the encoder parameters are 
never updated, \ourmethod\ avoids catastrophic forgetting, trains efficiently, and 
remains robust across varied imaging modalities and distributions. 
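
To make the equivalence explicit, consider two unit-norm embeddings $\mathbf{u}$ and $\mathbf{v}$ (illustrative symbols introduced only for this remark). Their cosine similarity reduces to the dot product,
\[
    \cos(\mathbf{u}, \mathbf{v})
    = \frac{\mathbf{u}^\top \mathbf{v}}{\|\mathbf{u}\|_2 \, \|\mathbf{v}\|_2}
    = \mathbf{u}^\top \mathbf{v},
\]
and the squared Euclidean distance is a decreasing function of it,
\[
    \|\mathbf{u} - \mathbf{v}\|_2^2
    = \|\mathbf{u}\|_2^2 + \|\mathbf{v}\|_2^2 - 2\,\mathbf{u}^\top \mathbf{v}
    = 2 - 2\,\mathbf{u}^\top \mathbf{v}.
\]
On the unit hypersphere, ranking candidates by highest cosine similarity therefore gives the same ordering as ranking by smallest Euclidean distance.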

By operating entirely in the frozen foundation feature space, the model 
inherits strong priors from large-scale biomedical pretraining while enabling 
rapid few-shot adaptation to heterogeneous clinical tasks without any 
task-specific retraining.

\subsection{In-Context Prototypical Reasoning}
\label{sec:icl_reasoning}

\ourmethod\ performs few-shot classification entirely through in-context prototypical reasoning. Given a query image and a small labeled support set, the model constructs class-specific prototypes in the embedding space of the frozen 
PubMedCLIP encoder and compares the query embedding against these prototypes 
using temperature-scaled cosine similarity. This procedure enables rapid task 
adaptation without any modification to the backbone parameters.

\subsubsection{Support-Query Batch Construction}
\label{sec:support_query_batch}

Each episodic task consists of a \emph{query set} and a \emph{support set}. 
For a $C$-way, $K$-shot classification task, the support set is defined as
\[
    \mathcal{S} = 
    \{(x_i, y_i)\}_{i=1}^{C \times K},
\]
containing $K$ labeled examples from each of the $C$ classes. The dataloader 
constructs these support sets on the fly by grouping images by class and 
sampling $K$ shots per class. 

As in Section~\ref{sec:frozen_backbone}, both support and query images are 
converted to PIL format and preprocessed identically using the PubMedCLIP 
transform pipeline, after which the frozen encoder extracts L2-normalized embeddings:
\[
    \mathbf{f}(x) = 
    \frac{\text{Encode}(x)}{\|\text{Encode}(x)\|_2}.
\]
These embeddings serve as the basis for prototype computation and inference.

\subsubsection{Prototype Computation}
\label{sec:prototype_computation}

For each class $c \in \{1, \dots, C\}$, \ourmethod\ computes a class prototype by 
averaging the embeddings of the corresponding support images:
\begin{equation}
    \mathbf{p}_c = 
    \frac{1}{K} 
    \sum_{i \,:\, y_i = c} 
    \mathbf{f}(x_i).
\end{equation}

To stabilize similarity computations, each prototype is further normalized:
\begin{equation}
    \mathbf{p}_c 
    \leftarrow 
    \frac{\mathbf{p}_c}{\|\mathbf{p}_c\|_2}.
\end{equation}
If a class is absent from a particular episode (rare under balanced sampling), 
its prototype is set to the zero vector by design. The resulting prototype matrix 
$\mathbf{P} = [\mathbf{p}_1, \dots, \mathbf{p}_C]^\top$ forms the 
in-context representation against which query embeddings are compared.
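
As a toy illustration (the two-dimensional values are hypothetical and chosen only for readability), take $K = 2$ support embeddings $\mathbf{f}(x_1) = (1, 0)^\top$ and $\mathbf{f}(x_2) = (0, 1)^\top$ for a class $c$. Averaging gives
\[
    \mathbf{p}_c = \tfrac{1}{2}\bigl[(1, 0)^\top + (0, 1)^\top\bigr] = (0.5,\, 0.5)^\top,
    \qquad
    \|\mathbf{p}_c\|_2 = \sqrt{0.5} \approx 0.707,
\]
so the mean of unit-norm embeddings is in general no longer unit-norm. Renormalizing yields
\[
    \mathbf{p}_c \leftarrow \frac{(0.5,\, 0.5)^\top}{\sqrt{0.5}} \approx (0.707,\, 0.707)^\top,
\]
which places the prototype back on the unit hypersphere, so that its dot product with a normalized query embedding is again a cosine similarity.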

\subsubsection{Cosine-Similarity Inference with Temperature Scaling}
\label{sec:similarity_inference}

Given a query embedding $\mathbf{f}(x_q)$ and the set of class prototypes 
$\{\mathbf{p}_c\}_{c=1}^{C}$, \ourmethod\ performs classification via temperature-scaled 
cosine similarity:
\begin{equation}
    \text{logits}_{q,c} 
    = \exp(\tau) \cdot 
    \mathbf{f}(x_q)^\top \mathbf{p}_c,
\end{equation}
where $\tau$ is a learnable temperature parameter (optimized during training). 
The scale $\exp(\tau)$ is always positive and acts as an inverse temperature: 
larger values sharpen the similarity distribution and smaller values smooth it. 
Since both embeddings and prototypes lie on the unit hypersphere, the dot 
product is equivalent to cosine similarity. 

The final class probabilities are obtained by applying a softmax:
\begin{equation}
    p(y_q = c \mid x_q, \mathcal{S})
    = \frac{
        \exp(\exp(\tau) \cdot 
        \mathbf{f}(x_q)^\top \mathbf{p}_c)
    }{
        \sum_{c'=1}^{C} 
        \exp(\exp(\tau) \cdot 
        \mathbf{f}(x_q)^\top \mathbf{p}_{c'})
    }
\end{equation}
Crucially, the temperature $\tau$ is the \emph{only} trainable parameter in the entire framework. This ensures that \ourmethod\ retains the in-context property of 
few-shot learning while requiring minimal computation and avoiding any task-specific fine-tuning of the foundation model. Algorithm~\ref{alg:pikachu} (Appendix~\ref{appendix:pikachu_algorithm}) details the steps of our implementation.
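
As a worked example with hypothetical numbers, let $C = 2$ and write $s_c = \mathbf{f}(x_q)^\top \mathbf{p}_c$ (shorthand introduced only for this illustration), with $s_1 = 0.8$ and $s_2 = 0.6$. For $\tau = 0$, i.e.\ $\exp(\tau) = 1$,
\[
    p(y_q = 1 \mid x_q, \mathcal{S})
    = \frac{e^{0.8}}{e^{0.8} + e^{0.6}}
    = \frac{1}{1 + e^{-0.2}}
    \approx 0.550,
\]
whereas for $\tau = \ln 10 \approx 2.30$, i.e.\ $\exp(\tau) = 10$, the logits become $8$ and $6$ and
\[
    p(y_q = 1 \mid x_q, \mathcal{S})
    = \frac{1}{1 + e^{-2}}
    \approx 0.881.
\]
Because cosine similarities are bounded in $[-1, 1]$, the unscaled logits can only produce soft predictions; a larger scale turns the same similarity gap of $0.2$ into a markedly more confident one.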

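This section does not fix the training objective for $\tau$. Assuming, purely for illustration, a standard cross-entropy loss on the query label, $\mathcal{L} = -\log p(y_q \mid x_q, \mathcal{S})$, and writing $s_c = \mathbf{f}(x_q)^\top \mathbf{p}_c$ as shorthand, the chain rule gives
\[
    \frac{\partial \mathcal{L}}{\partial \tau}
    = \exp(\tau) \Bigl( \sum_{c=1}^{C} p(y_q = c \mid x_q, \mathcal{S}) \, s_c \; - \; s_{y_q} \Bigr).
\]
Under this assumption, gradient descent increases $\tau$ (sharpening the distribution) whenever the similarity to the true-class prototype exceeds the expected similarity under the current prediction, and decreases it otherwise. Since the encoder is frozen, this scalar derivative is the only gradient that has to be computed.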




