Abstract: Continual learning for multi-modal tasks has been receiving increasing attention. However, most existing work overlooks explicit cross-modal and cross-task interactions. In this paper, we propose Low-rank Prompt Interaction (LPI) to address this general problem of multi-modal understanding, considering both cross-modal and cross-task interaction. For cross-modal interaction, we employ multi-modal correlation modules for corresponding Transformer layers. Because the number of trainable parameters scales with the number of layers and tasks, we propose Low-rank Interaction-augmented Decomposition to avoid memory explosion while enhancing cross-modal association by sharing and separating common and specific low-rank factors. In addition, because the low-rank initialization carries multi-modal semantic differences, we adopt hierarchical low-rank contrastive learning to ensure training robustness. For cross-task interaction, we first conduct a visual analysis and find that tasks differ clearly in their proximity to one another. We therefore introduce explicit task contrastive constraints into the prompt learning process, based on task semantic distance. Experiments on two retrieval tasks show performance improvements with only a minimal number of additional parameters, demonstrating the effectiveness of our method.
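The parameter-saving idea behind sharing and separating low-rank factors can be illustrated with a minimal sketch. It is an assumption-laden toy, not the decomposition used in the paper: the function names, the factor shapes, and the choice to add a common left factor to a modality-specific one are all hypothetical, and only a single layer and task are modeled.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def rand_matrix(rows, cols, rng):
    """Small Gaussian initialization (an assumed, not sourced, choice)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def build_prompts(prompt_len, dim, rank, seed=0):
    """Build one prompt per modality as (common + specific) @ projection.

    Hypothetical layout: one left factor shared by both modalities,
    one left factor per modality, and one right factor per modality.
    """
    rng = random.Random(seed)
    modalities = ("vision", "text")
    common = rand_matrix(prompt_len, rank, rng)
    specific = {m: rand_matrix(prompt_len, rank, rng) for m in modalities}
    proj = {m: rand_matrix(rank, dim, rng) for m in modalities}
    return {m: matmul(add(common, specific[m]), proj[m]) for m in modalities}


def full_param_count(prompt_len, dim):
    """Two full-rank prompts, one per modality."""
    return 2 * prompt_len * dim


def low_rank_param_count(prompt_len, dim, rank):
    """One common and two specific left factors, plus two right factors."""
    return 3 * prompt_len * rank + 2 * rank * dim


prompts = build_prompts(prompt_len=8, dim=64, rank=4)
print(full_param_count(8, 64), low_rank_param_count(8, 64, 4))
```

Under these assumed sizes the factorized form stores fewer values than two full prompts, and the gap widens as the embedding dimension grows relative to the rank.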
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: Our work focuses on multimodal retrieval tasks in a continual learning setting, covering image-text retrieval and referring expression comprehension. This setting is closer to real-world conditions. The approach we propose improves retrieval accuracy, mitigates the catastrophic forgetting often encountered in continual learning, and reduces computational and storage costs, offering a viable solution for research and applications in this direction.
Supplementary Material: zip
Submission Number: 3151