
\section{Ablations}
\subsection{Comparisons to other few-shot classification methods}

While PIKACHU operates through in-context prototypical reasoning with a single learnable temperature parameter, we contextualize our approach by comparing against three representative methods: Tip-Adapter~\cite{zhang2022tip}, Proto-Adapter~\cite{kato2024proto}, and LoRA~\cite{hu2022lora}.

\paragraph{Tip-Adapter:} Tip-Adapter~\cite{zhang2022tip} augments frozen CLIP-style models with a cache of support embeddings and performs prediction via weighted retrieval, optionally combining cache-based scores with zero-shot logits. The training-free variant requires no optimization, while the trainable variant learns cache keys with $N \times K \times D$ parameters. However, inference requires computing similarities against all cached support samples, leading to $\mathcal{O}(N \times K)$ complexity and linear memory growth with the support set size. In contrast, PIKACHU compresses all support information into $C$ class prototypes, enabling $\mathcal{O}(C)$ inference and a substantially lower memory footprint.%, resulting in a $\sim$5$\times$ inference speedup for 5-shot binary classification.
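
As a rough illustration of this scaling argument, assume that the number of prototypes equals the number of support classes, $C = N$ (an assumption that the text does not state explicitly). The per-query similarity counts then compare as
\[
\frac{\text{cache comparisons}}{\text{prototype comparisons}} = \frac{N \times K}{C} = K \quad \text{when } C = N,
\]
so a 5-shot binary task ($N = 2$, $K = 5$) involves 10 cached keys versus 2 prototypes. This ratio counts similarity evaluations only and should not be read as a measured speedup.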

\paragraph{Proto-Adapter:} Proto-Adapter~\cite{kato2024proto} introduces lightweight bottleneck modules that refine features through $\text{Adapter}(\mathbf{f}) = \mathbf{W}_{\text{up}} \cdot \text{GELU}(\mathbf{W}_{\text{down}} \cdot \mathbf{f})$ with dimensionality reduction ($D \rightarrow d \rightarrow D$, typically $d=64$). This yields $2 \times D \times d + 1$ trainable parameters, enabling task-specific feature space transformations. While this expressiveness may benefit tasks where pretrained features are misaligned, it introduces more parameters than PIKACHU and risks overfitting in extreme low-data regimes ($K \leq 5$). PIKACHU's design philosophy instead trusts the discriminative power of modern medical foundation models (e.g., PubMedCLIP~\cite{eslami2023pubmedclip}), requiring no feature manipulation.

\paragraph{LoRA:} LoRA~\cite{hu2022lora}, originally developed for large language models, adapts frozen weights through low-rank decompositions: $\mathbf{W}' = \mathbf{W} + \frac{\alpha}{r} \mathbf{B} \mathbf{A}$ where $\mathbf{A} \in \mathbb{R}^{r \times D}$, $\mathbf{B} \in \mathbb{R}^{D \times r}$, and $r \ll D$. With rank $r \in \{8, 16\}$, LoRA introduces $2 \times r \times D + 1 \approx 12$--25K parameters. While LoRA demonstrates strong empirical success in NLP and offers a compelling balance between simplicity and expressiveness, it still requires roughly four orders of magnitude more parameters than PIKACHU and necessitates careful rank selection. Additionally, LoRA's injection of low-rank updates into weight matrices may inadvertently disrupt pretrained medical knowledge when support sets are limited.

\begin{table}[h]
\centering
\begin{tabular}{c|ccc|c}
\cline{2-4}
 & \multicolumn{3}{c|}{\textbf{Datasets}} &  \\ \cline{2-5} 
\textbf{Few-Shot Method} & \multicolumn{1}{c|}{\textbf{ISIC}} & \multicolumn{1}{c|}{\textbf{OCT}} & \textbf{DR} & \textbf{Trainable Params} \\ \hline
Tip-Adapter & \multicolumn{1}{c|}{0.71} & \multicolumn{1}{c|}{0.78} & 0.74 & 1 \\
Proto-Adapter & \multicolumn{1}{c|}{\textbf{0.73}} & \multicolumn{1}{c|}{0.82} & 0.76 & 99,137 \\
LoRA (rank=8) & \multicolumn{1}{c|}{0.72} & \multicolumn{1}{c|}{\textbf{0.83}} & 0.76 & 12,289 \\
LoRA (rank=16) & \multicolumn{1}{c|}{0.73} & \multicolumn{1}{c|}{0.82} & \textbf{0.77} & 24,577 \\
PIKACHU & \multicolumn{1}{c|}{\textbf{0.73}} & \multicolumn{1}{c|}{\textbf{0.83}} & \textbf{0.77} & 1 \\ \hline
\end{tabular}
\caption{Few-shot comparison showing that PIKACHU attains competitive performance relative to Tip-Adapter, Proto-Adapter, and LoRA, despite introducing only a single trainable parameter. SigLIP is used as the backbone foundation model for all methods.}
\label{tab:peft_comparisons}
\end{table}

Table~\ref{tab:peft_comparisons} compares performance and trainable parameter counts across methods. PIKACHU occupies an extreme point in the parameter-efficient fine-tuning (PEFT) design space, sacrificing feature-space adaptation for parameter efficiency. Across ISIC, OCT, and DR, PIKACHU matches the top accuracy achieved by the Proto-Adapter and LoRA variants (0.73, 0.83, and 0.77, respectively) despite introducing only a single trainable temperature parameter. In contrast, these methods require orders of magnitude more parameters (tens of thousands for LoRA and nearly 100K for Proto-Adapter). This suggests that much of the few-shot performance gain can be attributed to effective prototype construction and similarity calibration rather than extensive parameter adaptation, and that PIKACHU offers a favorable accuracy--efficiency trade-off.
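
The parameter counts in Table~\ref{tab:peft_comparisons} can be checked against the formulas given above. As a back-of-the-envelope sketch, assume a feature dimension of $D = 768$; this value is inferred from the reported LoRA counts rather than stated in the text:
\begin{align*}
\text{LoRA } (r = 8):&\quad 2 \times 8 \times 768 + 1 = 12{,}289, \\
\text{LoRA } (r = 16):&\quad 2 \times 16 \times 768 + 1 = 24{,}577, \\
\text{Proto-Adapter } (d = 64):&\quad 2 \times 768 \times 64 + 1 = 98{,}305.
\end{align*}
The two LoRA values match the table exactly. The Proto-Adapter value is 832 below the reported 99,137; this gap equals $d + D = 64 + 768$, which would be consistent with the bias terms of the two adapter layers also being counted, although the text does not state this.
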
\subsection{In-Context Learning Aggregation Strategy: K-Nearest Neighbors with Weighted Aggregation}
The prototypical network approach (described in Section~\ref{sec:prototype_computation}) has demonstrated strong performance for few-shot medical image classification, offering an elegant and computationally efficient solution through class-level prototype representations. For completeness, and to provide a more comprehensive evaluation of in-context learning paradigms, we conduct an ablation study that explores an alternative aggregation strategy: \textit{K-nearest neighbors with weighted aggregation}. This alternative preserves fine-grained support set structure rather than collapsing samples into class-level prototypes, which may be beneficial in edge cases involving complex multimodal distributions or overlapping decision boundaries.


Unlike prototypical networks, which condense all support samples of a given class into a single representative vector, the K-nearest neighbors (KNN) approach with weighted aggregation preserves the contribution of individual support examples. This enables the model to adapt its predictions based on local similarity structure rather than relying solely on global class prototypes.

\paragraph{Feature Extraction and Similarity Computation.}
Given a query image $x_q$ and support set $\mathcal{S} = \{(x_i, y_i)\}_{i=1}^{N \times K}$, we first extract and normalize features using the frozen encoder $E(\cdot)$ as in the prototypical network:
\begin{equation}
\mathbf{f}_i = \frac{E(x_i)}{\|E(x_i)\|_2}, \quad \mathbf{f}_q = \frac{E(x_q)}{\|E(x_q)\|_2}
\end{equation}

For each query feature $\mathbf{f}_q$, we compute cosine similarities to all support features:
\begin{equation}
s_i = \mathbf{f}_q^\top \mathbf{f}_i, \quad \forall (x_i, y_i) \in \mathcal{S}
\end{equation}

Since both query and support features are L2-normalized, this dot product directly corresponds to cosine similarity.

\paragraph{K-Nearest Neighbor Selection.}
We identify the $K$ support samples with the highest similarity scores:
\begin{equation}
\mathcal{N}_K(x_q) = \{(x_{i_1}, y_{i_1}), \ldots, (x_{i_K}, y_{i_K})\}
\end{equation}
where $s_{i_1} \geq s_{i_2} \geq \cdots \geq s_{i_K}$ denote the top-$K$ similarity scores among all support samples. This selection process focuses attention on the most relevant support examples for each query.

\paragraph{Distance-Weighted Aggregation.}
Rather than treating all $K$ neighbors equally, we employ similarity-weighted voting in which each neighbor's contribution increases with its similarity to the query. The weights are computed via a temperature-scaled softmax:
\begin{equation}
w_j = \frac{\exp(s_{i_j} / \tau)}{\sum_{k=1}^{K} \exp(s_{i_k} / \tau)}
\end{equation}
where $\tau$ is a learnable temperature parameter (optimized as $\log \tau$ during training).

The final class probabilities are obtained by aggregating weighted votes across the $K$ nearest neighbors:
\begin{equation}
P(y = c \mid x_q, \mathcal{S}) = \sum_{j=1}^{K} w_j \cdot \mathbb{1}[y_{i_j} = c]
\end{equation}
where $\mathbb{1}[\cdot]$ denotes the indicator function.
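
As an illustrative sketch of the weighting and voting steps above, consider hypothetical values that are not taken from our experiments: $K = 3$ neighbors with similarities $(s_{i_1}, s_{i_2}, s_{i_3}) = (0.9, 0.8, 0.5)$, labels $(y_{i_1}, y_{i_2}, y_{i_3}) = (1, 0, 1)$, and $\tau = 0.1$. Subtracting the maximum exponent, which leaves the softmax unchanged, gives
\begin{align*}
(w_1, w_2, w_3) &= \frac{(e^{0},\, e^{-1},\, e^{-4})}{e^{0} + e^{-1} + e^{-4}} \approx (0.721,\, 0.265,\, 0.013), \\
P(y = 1 \mid x_q, \mathcal{S}) &= w_1 + w_3 \approx 0.735, \qquad P(y = 0 \mid x_q, \mathcal{S}) = w_2 \approx 0.265.
\end{align*}
In the limits, $\tau \to \infty$ yields uniform weights $w_j = 1/K$ (a plain majority vote among the neighbors), whereas $\tau \to 0$ concentrates all weight on the single most similar neighbor, assuming no ties.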

% RESULTS

\begin{table}[h]
\centering
\begin{tabular}{cc|ccc|}
\cline{3-5}
 &  & \multicolumn{3}{c|}{\textbf{Datasets}} \\ \cline{3-5} 
\textbf{Model} & \textbf{Strategy} & \multicolumn{1}{c|}{\textbf{ISIC}} & \multicolumn{1}{c|}{\textbf{OCT}} & \textbf{DR} \\ \hline
\multirow{3}{*}{SigLIP} & Baseline & \multicolumn{1}{c|}{0.49} & \multicolumn{1}{c|}{0.50} & 0.50 \\
 & ICL (KNN weighted) & \multicolumn{1}{c|}{0.61} & \multicolumn{1}{c|}{0.64} & 0.70 \\
 & PIKACHU & \multicolumn{1}{c|}{\textbf{0.73}} & \multicolumn{1}{c|}{\textbf{0.83}} & \textbf{0.77} \\ \hline
\end{tabular}
\caption{Performance of different aggregation strategies, prototype- and KNN-based, for in-context learning (ICL) on ISIC, OCT, and DR datasets.}
\label{table:aggregation_strategy}
\end{table}

Table~\ref{table:aggregation_strategy} compares in-context aggregation strategies using a fixed SigLIP backbone. The zero-shot baseline performs near chance on all three datasets (0.49--0.50), while KNN-weighted in-context learning provides a clear improvement by leveraging support examples (0.61--0.70). PIKACHU yields a further gain of 0.12 on ISIC, 0.19 on OCT, and 0.07 on DR over KNN-weighted ICL, outperforming both alternatives on every dataset. This suggests that prototype-based aggregation with similarity calibration is substantially more effective than instance-level retrieval for in-context adaptation on these medical imaging tasks.
