Abstract: Large pre-trained vision-language models such as CLIP have shown impressive zero-shot recognition performance. To adapt pre-trained vision-language models to downstream tasks, recent studies have focused on the "learnable context + class name" paradigm, which learns continuous prompt contexts on downstream datasets. In practice, the learned prompt context tends to overfit the base categories and generalizes poorly to novel categories outside the training data. Recent works have noticed this problem and proposed several improvements. In this work, we draw a new insight from empirical analysis: uninformative class names lead to degraded base-to-novel generalization in prompt learning, an issue usually overlooked by existing works. Motivated by this, we advocate improving the base-to-novel generalization of prompt learning by enhancing the semantic richness of class names. We coin our approach Information Disengagement based Associative Prompt Learning (IDAPL), a mechanism that performs associative yet decoupled learning of the prompt context and class name embeddings. IDAPL effectively alleviates the overfitting of the learnable context to the base classes while learning more informative semantic representations of the base classes by fine-tuning the class name embeddings, leading to improved performance on both base and novel classes. Experimental results on eleven widely used few-shot learning benchmarks clearly validate the effectiveness of our proposed approach. Code is available at https://github.com/tiggers23/IDAPL
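To make the "learnable context + class name" paradigm and the decoupled optimization of the two parts concrete, the following is a minimal PyTorch sketch. It is an illustrative reconstruction under stated assumptions, not the released IDAPL implementation; all names (PromptLearner, n_ctx, ctx_dim, detach_ctx, detach_cls) are hypothetical.

```python
# Minimal sketch (assumption-based, not the authors' code) of the
# "learnable context + class name" paradigm where both the shared context
# vectors and the per-class name embeddings are learnable, and each branch
# can be updated with the other part's gradient detached (decoupled learning).
import torch
import torch.nn as nn


class PromptLearner(nn.Module):
    def __init__(self, n_classes: int, n_ctx: int = 4, ctx_dim: int = 512):
        super().__init__()
        # Shared learnable context (soft tokens replacing "a photo of a").
        self.ctx = nn.Parameter(torch.randn(n_ctx, ctx_dim) * 0.02)
        # Class-name embeddings; in practice these would be initialized from
        # the text encoder's token embeddings of the actual class names.
        self.cls_emb = nn.Parameter(torch.randn(n_classes, 1, ctx_dim) * 0.02)

    def forward(self, detach_ctx: bool = False, detach_cls: bool = False):
        # Decoupling: one branch freezes the context gradient, the other
        # freezes the class-name gradient, so each part is optimized without
        # dragging the other toward the base classes.
        ctx = self.ctx.detach() if detach_ctx else self.ctx
        cls_emb = self.cls_emb.detach() if detach_cls else self.cls_emb
        ctx = ctx.unsqueeze(0).expand(cls_emb.size(0), -1, -1)
        # Concatenate [context tokens; class-name token] per class.
        return torch.cat([ctx, cls_emb], dim=1)  # (n_classes, n_ctx + 1, ctx_dim)


if __name__ == "__main__":
    learner = PromptLearner(n_classes=10)
    prompts_ctx = learner(detach_cls=True)  # branch updating the context only
    prompts_cls = learner(detach_ctx=True)  # branch fine-tuning class-name embeddings only
    print(prompts_ctx.shape, prompts_cls.shape)  # torch.Size([10, 5, 512]) each
```

In a full pipeline, each branch's prompt tokens would be fed through the frozen CLIP text encoder and paired with image features under the usual contrastive objective; the sketch only shows how the two learnable parts are assembled and decoupled.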
Primary Subject Area: [Content] Vision and Language
Relevance To Conference: This work proposes a novel prompt learning approach, IDAPL, for Large Vision-Language Models (LVLMs) that enhances the semantic richness of class names and decouples the learnable context and class name embeddings. IDAPL improves the performance of the prompt-learning fine-tuning paradigm of LVLMs on downstream tasks while enhancing the cross-category generalization ability of prompt learning. Therefore, we believe this work could make a significant contribution to multimodal processing.
Supplementary Material: zip
Submission Number: 3089