MMRL++: Parameter-Efficient and Interaction-Aware Representation Learning for Vision-Language Models
Abstract: Large-scale pre-trained Vision-Language Models (VLMs) have significantly advanced transfer learning across a wide range of tasks. However, adapting these models with limited few-shot data often leads to overfitting, undermining their ability to generalize to new tasks. To address this challenge, we propose a novel framework, Multi-Modal Representation Learning (MMRL), which introduces a shared, learnable, and modality-agnostic representation space. Specifically, MMRL generates a set of space tokens that are projected into both the text and image encoders as representation tokens, facilitating more effective cross-modal interactions. Unlike prior methods that primarily optimize class token features, MMRL integrates representation tokens into the higher layers of the encoders, where task-specific features are more prominent, while preserving general knowledge in the lower layers. During training, both class and representation features are jointly optimized: a trainable projection layer is applied to the representation tokens for task adaptation, while the projection layer for the class token remains frozen to retain pre-trained knowledge. To further promote generalization, we introduce a regularization term that aligns class and text features with the frozen VLM's zero-shot features. During inference, we employ a decoupling strategy: both class and representation features are used for base tasks, while only the more generalizable class features are used for novel tasks. Building on this, we propose MMRL++, a parameter-efficient and interaction-aware extension that significantly reduces the number of trainable parameters and enhances intra-modal interactions, particularly across the layers of representation tokens, allowing gradient sharing and instance-specific information to propagate more effectively through the network. Extensive experiments on 15 datasets demonstrate that MMRL and MMRL++ consistently outperform state-of-the-art methods, achieving a strong balance between task-specific adaptation and generalization.
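As a rough illustration of the mechanism described in the abstract, the following is a minimal PyTorch sketch, assuming CLIP-like widths (768 for the image encoder, 512 for the text encoder and joint embedding). The class names (RepresentationSpace, MMRLHead), the number of space tokens, the randomly initialized frozen class-token projection (a stand-in for the pre-trained visual projection), and the cosine-based alignment regularizer are illustrative assumptions, not the authors' implementation; the encoders themselves are omitted.

```python
# Minimal sketch of the MMRL idea; names, dimensions, and the specific
# regularizer are assumptions for illustration, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RepresentationSpace(nn.Module):
    """Shared, modality-agnostic space tokens projected into each encoder."""

    def __init__(self, num_tokens=5, space_dim=512, text_dim=512, image_dim=768):
        super().__init__()
        # Learnable space tokens shared across modalities.
        self.space_tokens = nn.Parameter(torch.randn(num_tokens, space_dim) * 0.02)
        # Trainable projections mapping the space tokens into each encoder's width.
        self.to_text = nn.Linear(space_dim, text_dim)
        self.to_image = nn.Linear(space_dim, image_dim)

    def forward(self):
        # Representation tokens that would be inserted into the *higher* layers
        # of the text and image encoders (encoders are omitted in this sketch).
        return self.to_text(self.space_tokens), self.to_image(self.space_tokens)


class MMRLHead(nn.Module):
    """Joint loss over class features (frozen projection) and representation
    features (trainable projection), with alignment to zero-shot features."""

    def __init__(self, image_dim=768, embed_dim=512):
        super().__init__()
        # Frozen class-token projection: stand-in for the pre-trained visual
        # projection, kept fixed to retain general knowledge.
        self.class_proj = nn.Linear(image_dim, embed_dim, bias=False)
        self.class_proj.requires_grad_(False)
        # Trainable projection for representation tokens (task adaptation).
        self.rep_proj = nn.Linear(image_dim, embed_dim, bias=False)

    def forward(self, class_feat, rep_feat, text_feat, zs_class_feat, zs_text_feat,
                labels, novel=False, alpha=0.5, lam=1.0):
        f_cls = F.normalize(self.class_proj(class_feat), dim=-1)
        f_rep = F.normalize(self.rep_proj(rep_feat), dim=-1)
        t = F.normalize(text_feat, dim=-1)
        # Decoupled inference: base tasks mix both features, novel tasks use
        # only the more generalizable class features.
        img = f_cls if novel else F.normalize(alpha * f_cls + (1 - alpha) * f_rep, dim=-1)
        ce = F.cross_entropy(100.0 * img @ t.t(), labels)
        # Regularization: keep class and text features close to the frozen
        # VLM's zero-shot features.
        reg = ((1 - F.cosine_similarity(f_cls, zs_class_feat, dim=-1)).mean()
               + (1 - F.cosine_similarity(t, zs_text_feat, dim=-1)).mean())
        return ce + lam * reg


if __name__ == "__main__":
    B, C = 4, 10
    space, head = RepresentationSpace(), MMRLHead()
    text_rep, image_rep = space()  # tokens to feed into the encoders' upper layers
    loss = head(torch.randn(B, 768), torch.randn(B, 768), torch.randn(C, 512),
                F.normalize(torch.randn(B, 512), dim=-1),
                F.normalize(torch.randn(C, 512), dim=-1),
                torch.randint(0, C, (B,)))
    loss.backward()
```

In this toy example the space tokens receive no gradient because the encoders that would consume them are not modeled; the sketch only shows the split between the frozen class-token path, the trainable representation path, and the decoupled base/novel inference switch.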
External IDs: dblp:journals/ijcv/GuoG26