Distilling Prompts at Test-Time for Multimodal Few-Shot Learning

Published: 10 Jun 2025, Last Modified: 11 Jul 2025, PUT at ICML 2025 Poster, CC BY 4.0
Keywords: Large Multimodal Models, Knowledge Distillation, Meta Learning, Prompt Tuning
TL;DR: A meta-learning method using soft prompts and an attention-mapper boosts few-shot performance in Large Multimodal Models at test-time.
Abstract: In-Context Learning (ICL) is a well-established paradigm for adapting Large Multimodal Models (LMMs) to novel tasks with minimal supervision. However, the ICL performance of LMMs improves inconsistently as the number of examples grows, because image embeddings carry additional information that is irrelevant to the downstream task. To address this, we introduce a meta-learning strategy that distills task-relevant image features into a fixed set of soft prompts, which can be fine-tuned with just a few examples at test time. To facilitate this distillation, we further propose an attention-mapper module, integrated into the LLaVA v1.5 architecture and trained alongside the soft prompts to enable rapid adaptation under low-data conditions. We show that on the VL-ICL Benchmark, our method outperforms ICL and other prompt distillation approaches and boosts the few-shot visual question-answering performance of LMMs.
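
The abstract describes two components: a fixed set of learnable soft prompts and an attention-mapper that distills task-relevant image features into them. Below is a minimal PyTorch sketch of how such a module could be structured, assuming the prompts act as queries that cross-attend to the vision encoder's patch embeddings; all names (`AttentionMapper`, `num_prompts`) and dimensions are illustrative assumptions, and the meta-learning and test-time fine-tuning loops from the paper are not reproduced here.

```python
import torch
import torch.nn as nn

class AttentionMapper(nn.Module):
    """Hypothetical sketch: cross-attends from learnable soft prompts to
    image embeddings, distilling task-relevant visual features into a
    fixed set of prompt tokens for the LMM."""

    def __init__(self, num_prompts: int = 16, dim: int = 4096, num_heads: int = 8):
        super().__init__()
        # Fixed set of soft prompts, meta-trained and fine-tuned at test time.
        self.soft_prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        # Prompts act as queries; image embeddings serve as keys and values.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_embeds: torch.Tensor) -> torch.Tensor:
        # image_embeds: (batch, num_patches, dim) from the vision encoder.
        batch = image_embeds.size(0)
        queries = self.soft_prompts.unsqueeze(0).expand(batch, -1, -1)
        distilled, _ = self.cross_attn(queries, image_embeds, image_embeds)
        # Residual connection keeps the learned prompts in the output tokens.
        return self.norm(distilled + queries)


# Usage: the distilled prompt tokens would be prepended to the text token
# embeddings before they enter the language model.
mapper = AttentionMapper()
image_embeds = torch.randn(2, 576, 4096)   # e.g., LLaVA-style patch features
prompt_tokens = mapper(image_embeds)       # shape: (2, 16, 4096)
```

At test time, only the soft prompts (and possibly the mapper) would be updated on the few available examples, which keeps the number of tunable parameters small under low-data conditions.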
Submission Number: 38