Keywords: Few-shot adaptation, representation learning, Vision-language Models
Abstract: Foundation Vision-Language Models (VLMs) like CLIP generalize well due to
large-scale pretraining, but their performance degrades under significant distribution
shifts in appearance and label semantics. Few-shot adaptation methods based on adapters or
prompt tuning address limited-data tasks but are not specifically designed to
handle such extreme domain shifts. Some cross-domain few-shot methods do consider
these shifts, but they typically rely on episodic settings with fixed classes, limiting
real-world applicability. To address this gap, we propose MIST
(Multiple Stochastic Prompt Tuning), a novel framework that adapts CLIP to extreme domain shifts
using only a few labeled examples across all classes simultaneously. MIST learns multiple
prompts per class to capture diverse modes in the visual features; each prompt is modeled
as a Gaussian distribution to improve generalization and reduce overfitting. Extensive
experiments demonstrate the effectiveness of the proposed framework.
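The sketch below illustrates one plausible reading of "multiple stochastic prompts per class": each class holds several learnable prompt distributions parameterized by a mean and a log-variance, and prompt embeddings are sampled via the reparameterization trick during training. This is not the authors' code; the names and shapes (num_prompts, prompt_len, embed_dim) and the plain PyTorch interface are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class StochasticClassPrompts(nn.Module):
    """Hypothetical sketch: K Gaussian-distributed learnable prompts per class."""

    def __init__(self, num_classes: int, num_prompts: int, prompt_len: int, embed_dim: int):
        super().__init__()
        shape = (num_classes, num_prompts, prompt_len, embed_dim)
        self.mu = nn.Parameter(torch.randn(shape) * 0.02)     # prompt means
        self.log_var = nn.Parameter(torch.full(shape, -4.0))  # prompt log-variances

    def forward(self) -> torch.Tensor:
        # Reparameterized sample: prompt = mu + sigma * eps, eps ~ N(0, I).
        # At inference time, use the mean (or average several samples).
        if self.training:
            eps = torch.randn_like(self.mu)
            return self.mu + torch.exp(0.5 * self.log_var) * eps
        return self.mu


# Example: 10 classes, 4 stochastic prompts each, 16 context tokens of width 512.
prompts = StochasticClassPrompts(num_classes=10, num_prompts=4, prompt_len=16, embed_dim=512)
sampled = prompts()  # shape (10, 4, 16, 512); these would be fed to CLIP's text encoder
print(sampled.shape)
```

Sampling prompts from learned Gaussians, rather than using point estimates, is a common way to regularize few-shot prompt tuning; the exact way MIST combines the sampled prompts with CLIP's text encoder is described in the paper itself.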
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 10893