PGMPL: Prototype-Guided Multi-modal Prompt Learning for Vision-Language Models

17 Sept 2025 (modified: 29 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Prompt Learning, Vision-Language Models, Transfer Learning
TL;DR: Prototype-Guided Multi-modal Prompt Learning for Vision-Language Models
Abstract: Vision-language models (VLMs) have been widely applied to various visual tasks due to their strong zero-shot transfer capabilities. However, their performance on downstream tasks often remains suboptimal. While fine-tuning can improve accuracy on base classes, it often compromises generalization to novel classes. To address this challenge, we propose Prototype-Guided Multi-modal Prompt Learning (PGMPL), which guides representation learning with a supervisory signal that summarizes intra-class information. Specifically, we construct a category-level prototype for each class by aggregating multi-image features with textual semantics. This prototype serves as a cross-modal, summarizing supervisory signal, strengthening image-text alignment and enhancing the generalization of the learned representations. To further optimize the prototypes and their guidance of representation learning, we refine multi-modal representations via prompt learning and introduce bidirectional cross-attention to alleviate the image-text matching inconsistency induced by the newly inserted prompts. Extensive experiments demonstrate the effectiveness of PGMPL, which achieves a higher overall harmonic mean of base and novel accuracy than state-of-the-art methods across 11 datasets. Our code is available at https://anonymous.4open.science/r/PGMPL.
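As a rough illustration of the prototype construction described in the abstract, the sketch below averages per-class image embeddings and fuses them with the corresponding class text embedding, then uses the resulting prototypes as a supervisory signal. The function names, the convex-combination fusion weight `alpha`, and the contrastive guidance loss are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def build_prototypes(image_feats, labels, text_feats, num_classes, alpha=0.5):
    """Aggregate per-class image features and fuse them with the class text
    feature to form a category-level prototype (hypothetical formulation;
    the simple averaging and the fusion weight `alpha` are assumptions).

    image_feats: (N, D) L2-normalized image embeddings
    labels:      (N,)   integer class labels
    text_feats:  (C, D) L2-normalized class text embeddings
    """
    device, D = image_feats.device, image_feats.size(1)
    proto = torch.zeros(num_classes, D, device=device)
    counts = torch.zeros(num_classes, 1, device=device)
    proto.index_add_(0, labels, image_feats)                      # sum image features per class
    counts.index_add_(0, labels, torch.ones(labels.size(0), 1, device=device))
    proto = proto / counts.clamp(min=1)                           # per-class mean image feature
    proto = alpha * proto + (1 - alpha) * text_feats              # fuse with textual semantics
    return F.normalize(proto, dim=-1)                             # cross-modal class prototypes

def prototype_guidance_loss(image_feats, labels, prototypes, tau=0.07):
    """Use the prototypes as a summarizing supervisory signal: a contrastive
    loss pulling each image embedding toward its class prototype (one
    plausible choice of guidance loss, not necessarily the one in PGMPL)."""
    logits = image_feats @ prototypes.t() / tau
    return F.cross_entropy(logits, labels)
```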
Supplementary Material: zip
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 8732