\section{Methodology}
\label{section:methodology}
\ourmethod\ implements an in-context learning (ICL) pipeline designed to adapt VLFMs to new classification tasks and domains using only 
a handful of labeled support examples. The method operates entirely in a feature space defined by a frozen pretrained image encoder and learns only a single temperature parameter. This preserves the generalization capabilities acquired during large-scale multimodal biomedical pre-training and makes the method suitable for rapid deployment across diverse clinical workflows. Our methodology combines: (i) a frozen image encoder or foundation model; (ii) a prototypical in-context inference mechanism; and (iii) a lightweight learnable temperature parameter enabling task-specific calibration.

Unlike prior medical ICL work, which is restricted to segmentation or relies on heavy cross-attention adapters, \ourmethod\ performs few-shot classification directly, using the representational geometry of a pretrained foundation model. Figure~\ref{fig:intro} summarizes our approach.

\subsection{Problem Formulation}
\label{subsec:problem}

Given a query image $x_q$ and a small support set $\mathcal{S}$ containing $K$ labeled examples from each of $C$ classes, \ourmethod\ classifies $x_q$ using three sequential steps:
\begin{enumerate}[i.]
    \item Extracting normalized image embeddings using the frozen vision encoder of a foundation model (Section~\ref{sec:support_query_batch});
    \item Constructing class prototypes by averaging, for each class, the image embeddings of all support examples in that class (Section~\ref{sec:prototype_computation});
    \item Computing the temperature-scaled cosine similarity between the query embedding and each prototype to obtain a final prediction (Section~\ref{sec:similarity_inference}). Because only the temperature parameter is learned during training, the method maintains strong calibration and generalizes across diverse tasks without modifying the pretrained backbone.
\end{enumerate}

\begin{figure}[t]
    \centering
    \includegraphics[width=\linewidth]{images/method.png}
    \caption{Overview of the \ourmethod\ framework. The method operates in three stages: (1) Feature Extraction: Query images from the test dataset and support set samples (e.g., Class A: NV, Class B: MEL) are encoded using a frozen pretrained foundation model (e.g., the PubMedCLIP encoder). The encoder is pretrained on diverse medical images whose distributions differ from that of the test data. (2) Prototype Computation: L2-normalized feature embeddings from the support set are aggregated into class prototypes $p_c$ by mean pooling of class-specific features. (3) Similarity and Prediction: Query features $f_q$ are compared against the class prototypes using temperature-scaled cosine similarity, and a softmax converts the similarities into prediction probabilities. The final prediction $\hat{y}_q$ is the class with the maximum posterior probability.}
    \label{fig:intro}
\end{figure}
\subsection{Support-Query Batch Construction and Encoding}
\label{sec:support_query_batch}

\ourmethod\ performs few-shot classification entirely through in-context prototypical reasoning. Each task consists of a \emph{query image} and a \emph{support set}. 
For a $C$-way, $K$-shot classification task, the support set is defined as
\[
    \mathcal{S} = 
    \{(x_i, y_i)\}_{i=1}^{C \times K},
\]
containing $K$ labeled examples from each of the $C$ classes. Support sets are constructed by grouping images by class and sampling $K$ labeled examples per class. Both support and query images are then passed through a frozen vision encoder to produce high-dimensional embeddings
$\mathbf{f}(x) = \text{Encode}(x)$, which are subsequently L2-normalized to ensure stable geometric comparisons:
\begin{equation}
    \mathbf{f}(x) \leftarrow 
    \frac{\mathbf{f}(x)}{\|\mathbf{f}(x)\|_2}.
\end{equation}

These embeddings serve as the basis for prototype computation and inference.

For each task, the support set is constructed by randomly sampling $K$ examples from each class. For a given support set size $K$, the sampled support set is fixed and reused across all corresponding evaluations, ensuring a controlled comparison while maintaining random class-balanced selection. The default value is $K = 5$.
\subsection{Prototype Computation}
\label{sec:prototype_computation}

For each class $c \in \{1, \dots, C\}$, \ourmethod\ computes a class prototype by 
averaging the embeddings of the corresponding support images (in the low-data regime of 1--10 examples, a different aggregation strategy might lead to instability):
\begin{equation}
    \mathbf{p}_c = 
    \frac{1}{K} 
    \sum_{(x_i, y_i) \in \mathcal{S} \,:\, y_i = c} 
    \mathbf{f}(x_i).
\end{equation}
    

To stabilize similarity computations, each prototype is further normalized:
\begin{equation}
    \mathbf{p}_c 
    \leftarrow 
    \frac{\mathbf{p}_c}{\|\mathbf{p}_c\|_2}.
\end{equation}
If a class is absent in a particular episode (rare in balanced sampling), a 
zero vector is used by design. The resulting prototype matrix 
$\mathbf{P} = [\mathbf{p}_1, \dots, \mathbf{p}_C]^\top$ forms the 
in-context representation against which query embeddings are compared.
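
\noindent As a small worked illustration of why the second normalization is needed (the two-dimensional embeddings below are hypothetical and chosen only for readability), consider $K = 2$ unit-norm support embeddings $\mathbf{f}(x_1) = (1, 0)^\top$ and $\mathbf{f}(x_2) = (0, 1)^\top$ belonging to the same class $c$:
\[
    \mathbf{p}_c = \tfrac{1}{2}\big[(1, 0)^\top + (0, 1)^\top\big] = (0.5, 0.5)^\top,
    \qquad
    \|\mathbf{p}_c\|_2 = \sqrt{0.5} \approx 0.707,
    \qquad
    \frac{\mathbf{p}_c}{\|\mathbf{p}_c\|_2} \approx (0.707, 0.707)^\top.
\]
The mean of unit vectors has norm at most one, with equality only when all embeddings coincide. Without renormalization, the dot product with a query would therefore understate the cosine similarity by a factor that differs from class to class.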

\subsection{Cosine-Similarity Inference with Temperature Scaling}
\label{sec:similarity_inference}

Given a query embedding $\mathbf{f}(x_q)$ and the set of class prototypes 
$\{\mathbf{p}_c\}$, \ourmethod\ performs classification via temperature-scaled 
cosine similarity:
\begin{equation}
    \text{logits}_{q,c} 
    = \exp(\tau) \cdot 
    \mathbf{f}(x_q)^\top \mathbf{p}_c,
\end{equation}
    

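\noindent where $\tau$ is a learnable temperature parameter (optimized during training) 
that sharpens or smooths the similarity distribution: larger values of $\tau$ sharpen it 
and smaller values smooth it. Parameterizing the scale as $\exp(\tau)$ keeps it strictly 
positive. Since both embeddings and prototypes lie on the unit hypersphere, the dot 
product is equivalent to cosine similarity. 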
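
\noindent To make this equivalence explicit (a short derivation that assumes only the unit-norm property stated above, $\|\mathbf{f}(x_q)\|_2 = \|\mathbf{p}_c\|_2 = 1$):
\[
    \cos\big(\mathbf{f}(x_q), \mathbf{p}_c\big)
    = \frac{\mathbf{f}(x_q)^\top \mathbf{p}_c}{\|\mathbf{f}(x_q)\|_2 \, \|\mathbf{p}_c\|_2}
    = \mathbf{f}(x_q)^\top \mathbf{p}_c,
    \qquad
    \|\mathbf{f}(x_q) - \mathbf{p}_c\|_2^2
    = 2 - 2\, \mathbf{f}(x_q)^\top \mathbf{p}_c.
\]
Under this assumption, ranking classes by cosine similarity is the same as ranking them by increasing squared Euclidean distance to the prototypes.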

The final class probabilities are obtained by applying a softmax:
\begin{equation}
    p(y_q = c \mid x_q, \mathcal{S})
    = \frac{
        \exp(\exp(\tau) \cdot 
        \mathbf{f}(x_q)^\top \mathbf{p}_c)
    }{
        \sum_{c'=1}^{C} 
        \exp(\exp(\tau) \cdot 
        \mathbf{f}(x_q)^\top \mathbf{p}_{c'})
    }.
\end{equation}
Crucially, the temperature $\tau$ is the \emph{only} trainable parameter in the entire framework. This ensures that \ourmethod\ maintains the in-context property of 
few-shot learning while requiring minimal computation and avoiding any task-specific fine-tuning of the foundation model. Algorithm~\ref{alg:pikachu} (Appendix~\ref{appendix:pikachu_algorithm}) shows the detailed steps for our implementation.
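
As a worked illustration of how the scale $\exp(\tau)$ shapes the prediction (the similarity values below are hypothetical, not measured results), consider $C = 2$ with cosine similarities $\mathbf{f}(x_q)^\top \mathbf{p}_1 = 0.8$ and $\mathbf{f}(x_q)^\top \mathbf{p}_2 = 0.6$. For two classes, the softmax reduces to a sigmoid of the scaled similarity gap:
\[
    p(y_q = 1 \mid x_q, \mathcal{S})
    = \frac{1}{1 + \exp\!\big(-\exp(\tau)\,(0.8 - 0.6)\big)}
    \approx
    \begin{cases}
        0.550, & \exp(\tau) = 1, \\
        0.881, & \exp(\tau) = 10, \\
        1.000, & \exp(\tau) = 100.
    \end{cases}
\]
In this example, the unscaled softmax barely separates the two classes, whereas larger scales yield increasingly peaked predictions. The class ranking itself is unchanged by $\tau$, since $\exp(\tau) > 0$ preserves the ordering of the similarities.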





