MIST: Multiple Stochastic Prompt Tuning for Few-shot Adaptation under Extreme Domain Shift

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Few-shot adaptation, representation learning, Vision-Language Models
Abstract: Foundation Vision-Language Models (VLMs) such as CLIP generalize well thanks to large-scale pretraining, but their performance degrades under significant distribution shifts in both appearance and label semantics. Few-shot adaptation via adapter or prompt tuning addresses limited-data tasks, but these methods are not specifically designed to handle such extreme domain shifts. Some cross-domain few-shot methods do consider domain shift, but they typically rely on episodic settings with fixed classes, limiting real-world applicability. To address this gap, we propose MIST (Multiple Stochastic Prompt Tuning), a novel framework that adapts CLIP to extreme domain shifts using a few labeled examples across all classes simultaneously. MIST learns multiple prompts per class to capture the diverse modes in visual features, modeling each prompt as a Gaussian distribution to improve generalization and reduce overfitting. Extensive experiments demonstrate the effectiveness of the proposed framework.
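To make the core idea concrete, below is a minimal sketch of what "multiple stochastic prompts per class" could look like in PyTorch. All names, shapes, and hyperparameters (number of prompts, context length, embedding dimension) are illustrative assumptions, not the paper's actual implementation: each of the K prompts per class is a sequence of context vectors whose mean and diagonal log-variance are learned, and prompts are drawn with the reparameterization trick so gradients flow to both parameters.

```python
import torch
import torch.nn as nn


class StochasticPrompts(nn.Module):
    """Sketch: K Gaussian prompt distributions per class (hypothetical shapes).

    Each prompt is a length-L sequence of D-dim context vectors whose mean
    and diagonal log-variance are learnable; sampling uses the
    reparameterization trick, mu + eps * std.
    """

    def __init__(self, num_classes: int, num_prompts: int = 4,
                 prompt_len: int = 8, dim: int = 512):
        super().__init__()
        shape = (num_classes, num_prompts, prompt_len, dim)
        self.mu = nn.Parameter(torch.randn(shape) * 0.02)     # prompt means
        self.log_var = nn.Parameter(torch.full(shape, -4.0))  # log-variances

    def forward(self) -> torch.Tensor:
        # One reparameterized draw per (class, prompt) pair.
        std = torch.exp(0.5 * self.log_var)
        eps = torch.randn_like(std)
        return self.mu + eps * std  # (C, K, L, D) sampled context vectors


# Usage sketch: sampled context vectors would be prepended to each
# class-name token embedding and passed through a frozen CLIP text
# encoder; at inference one could average over the K prompts (or use
# the means) to score each class.
prompts = StochasticPrompts(num_classes=10)
ctx = prompts()     # fresh stochastic sample each forward pass
print(ctx.shape)    # torch.Size([10, 4, 8, 512])
```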
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 10893