M$^3$PL: Identifying and Exploiting View Bias of Prompt Learning

Published: 21 Sept 2024 · Last Modified: 21 Sept 2024 · Accepted by TMLR · CC BY 4.0
Abstract: Prompt learning is an effective means of fine-tuning multi-modal foundation models such as CLIP. Despite existing success, the inner mechanism of multi-modal prompt learning has not been well understood. In this work, we identify an inductive bias of multi-modal prompt learning, which we refer to as view bias: the learned prompts may extract only a subset of the useful features (views) and ignore others. This bias can undermine the model's generalization ability, particularly under distribution shifts. We further observe that independently trained prompts exhibit distinct view biases, contrary to the existing belief that they converge to similar local optima because they share the same cross-modal representation-matching objective. Based on these observations, we propose Multi-modal Matching Multi-Prompt Learning (M$^3$PL), which incorporates multiple paired prompts and a cross-modal contrastive regularizer that encourages the prompt pairs to capture a broader spectrum of views. Extensive experiments show that M$^3$PL effectively boosts the model's generalization capability, achieving state-of-the-art performance under various distribution shifts.
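
The abstract does not spell out the regularizer's exact form, but a minimal sketch of what a cross-modal contrastive regularizer over multiple prompt pairs might look like is shown below. The function name `cross_modal_contrastive_reg`, the symmetric InfoNCE-style loss, and the temperature value are illustrative assumptions, not the paper's actual formulation; see the released code at the link below for the real implementation.

```python
# Minimal sketch (assumptions: PyTorch, K prompt pairs producing one text and
# one image feature each; the InfoNCE-style form and temperature are illustrative).
import torch
import torch.nn.functional as F

def cross_modal_contrastive_reg(text_feats, image_feats, temperature=0.07):
    """
    text_feats:  (K, D) features from K learned text prompts
    image_feats: (K, D) features from the K paired visual prompts
    Treats the matched pair (k, k) as a positive and features from the other
    prompt pairs as negatives, pushing different pairs to cover different
    views instead of collapsing onto the same one.
    """
    text_feats = F.normalize(text_feats, dim=-1)
    image_feats = F.normalize(image_feats, dim=-1)
    logits = text_feats @ image_feats.t() / temperature  # (K, K) similarity matrix
    targets = torch.arange(text_feats.size(0), device=text_feats.device)
    # Symmetric InfoNCE over the two modalities
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage (shapes only): K = 4 prompt pairs, D = 512-dim CLIP embeddings
reg = cross_modal_contrastive_reg(torch.randn(4, 512), torch.randn(4, 512))
```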
Submission Length: Regular submission (no more than 12 pages of main content)
Code: https://github.com/cdyyyy/M3PL
Supplementary Material: zip
Assigned Action Editor: ~Grigorios_Chrysos1
Submission Number: 2772