\section{Additional Implementation Details}
\label{append:implementation}
\subsection{Training with DINOv2 and ViT}
\paragraph{DINOv2} In this setup, the backbone is kept fully frozen and classification is performed by a prototype-based similarity head. Images are first encoded with a frozen DINOv2-Large model, and the resulting CLS-token features are L2-normalized. Class prototypes are computed on the fly as the mean of the support features for each class and are then L2-normalized. Query logits are the similarities to these prototypes, scaled by a learnable temperature. During training, gradients flow exclusively to this temperature parameter, while the backbone and the prototype construction remain fixed. This design enables stable and efficient adaptation by calibrating the scale of the similarity logits, without learning a conventional linear layer or modifying the pretrained DINOv2 representations.
\paragraph{ViT} Training in this setup follows an episodic in-context learning (ICL) regime that mirrors few-shot inference rather than conventional supervised fine-tuning. For each training batch, episodes are constructed, each consisting of a support set and the corresponding query samples. Query and support images are first passed through a frozen ViT-Base backbone to extract L2-normalized CLS-token features. Class prototypes are then computed on the fly as the mean of the support features for each class within the episode. Classification logits for the query samples are obtained via cosine similarity to these prototypes, scaled by a learnable temperature. During backpropagation, only the temperature parameter is updated, using the cross-entropy loss on the query labels, while the backbone and the prototype computation remain fixed. This episodic training calibrates the similarity scale across tasks without altering the representation space, which keeps optimization stable and training consistent with inference in the few-shot setting.
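
In symbols, the prototype head used in both setups can be sketched as follows. The notation ($f$, $z$, $\mu_k$, $c_k$, $s$, $\mathcal{S}$, $\mathcal{Q}$) is introduced here for illustration only. Two details are assumptions rather than statements about the implementation: the temperature is written as a multiplicative scale $s$ (dividing by $\tau = 1/s$ is equivalent), and the loss is averaged over the query set. Let $f$ denote the frozen backbone that returns the CLS-token feature, let $\mathcal{S}_k$ be the support samples of class $k$ in an episode, and let $\mathcal{Q}$ be the query set. Then
\begin{equation}
z(x) = \frac{f(x)}{\| f(x) \|_2}, \qquad
\mu_k = \frac{1}{|\mathcal{S}_k|} \sum_{x \in \mathcal{S}_k} z(x), \qquad
c_k = \frac{\mu_k}{\| \mu_k \|_2},
\end{equation}
\begin{equation}
\ell_k(x) = s \, z(x)^{\top} c_k,
\end{equation}
\begin{equation}
\mathcal{L}(s) = - \frac{1}{|\mathcal{Q}|} \sum_{(x, y) \in \mathcal{Q}} \log \frac{\exp\bigl(\ell_y(x)\bigr)}{\sum_{k'} \exp\bigl(\ell_{k'}(x)\bigr)}.
\end{equation}
Because $f$ is frozen and each $c_k$ is a deterministic function of the support features, $s$ is the only quantity that receives a gradient under this formulation.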
