Abstract: Pre-trained vision-language models (VLMs), equipped with parameter-efficient tuning (PET) methods such as prompting, have shown impressive knowledge transferability to new downstream tasks, but they remain prone to catastrophic forgetting and overfitting due to large gaps among tasks. Furthermore, the underlying mechanisms of prompt-based tuning methods (especially visual prompting) remain largely unexplored: it is unclear why adaptation works when it relies solely on learnable parameters serving as prompts. To address these challenges, we present a new prompt-based framework for vision-language models, termed Uni-prompt. Our framework transfers VLMs to downstream tasks by designing visual prompts from an attention perspective that reduces the transfer/solution space, enabling the vision model to focus on task-relevant regions of the input image while also learning task-specific knowledge. Additionally, Uni-prompt aligns visual and textual prompt learning through a masked representation modeling pretext task, which implicitly learns a global cross-modal matching between visual and language concepts for consistency. We conduct extensive experiments on few-shot classification and achieve significant improvements with Uni-prompt while requiring minimal extra parameters.
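To make the general idea of attention-level visual prompting concrete, the following is a minimal sketch (not the paper's Uni-prompt implementation) of plain visual prompt tuning: a small set of learnable prompt tokens is prepended to frozen ViT patch embeddings so that only the prompts and a light classification head are updated for the downstream task. All class and parameter names here (e.g., `VisualPromptedEncoder`, `num_prompts`) are hypothetical and chosen for illustration.

```python
# Illustrative sketch only: generic visual prompt tuning on top of a frozen
# transformer encoder standing in for a pre-trained ViT backbone.
import torch
import torch.nn as nn


class VisualPromptedEncoder(nn.Module):
    def __init__(self, embed_dim=768, num_prompts=8, num_layers=4,
                 num_heads=8, num_classes=10):
        super().__init__()
        # Learnable prompt tokens: the only new parameters besides the head.
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, embed_dim) * 0.02)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Freeze the (stand-in) pre-trained backbone; only prompts + head train.
        for p in self.encoder.parameters():
            p.requires_grad = False
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, patch_embeddings):
        # patch_embeddings: (batch, num_patches, embed_dim), e.g. from a frozen ViT stem.
        b = patch_embeddings.size(0)
        prompts = self.prompts.expand(b, -1, -1)
        # Prompt tokens attend jointly with image patches, steering attention
        # toward task-relevant regions without touching backbone weights.
        x = torch.cat([prompts, patch_embeddings], dim=1)
        x = self.encoder(x)
        # Pool the prompt positions as the task representation.
        return self.head(x[:, : prompts.size(1)].mean(dim=1))


if __name__ == "__main__":
    model = VisualPromptedEncoder()
    dummy_patches = torch.randn(2, 196, 768)  # 14x14 patches of a 224x224 image
    logits = model(dummy_patches)
    print(logits.shape)  # torch.Size([2, 10])
```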
External IDs: doi:10.1109/tmm.2024.3521785