Incremental Learning of Vision-Language Models via Task Subspace Projection and Dynamic LoRA

16 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Multimodal learning, incremental learning, continual learning, vision-language model
Abstract: In practice, recent pre-trained vision-language models often face the Multi-Domain Task-Incremental Learning (MTIL) setting, where multi-modal tasks, each with its own set of classes, arrive incrementally. Due to privacy concerns and memory constraints, MTIL with pre-trained models suffers from forgetting of old-task knowledge, degradation of zero-shot transfer capability, and underfitting of new-task knowledge. To overcome these challenges, previous MTIL methods attempt to learn a discriminative cross-task identification (CTI) module and an effective new-task adaptation (NTA) module. However, current CTI modules suffer from severe task confusion between seen and unseen tasks, and current NTA modules cannot adaptively balance performance against parameter cost while incorporating task-specific knowledge. To alleviate these dilemmas, we propose TSP-DLoRA, an effective and efficient MTIL method consisting of a Task Subspace Projection (TSP) module and a Dynamic Low-Rank Adapter (DLoRA) module. Specifically, the TSP module combines a task-identity classifier built on task-specific subspaces with a feature projection strategy that determines the task identity of samples from both seen and unseen tasks. The DLoRA module improves the adaptation of new-task knowledge by dynamically allocating Low-Rank Adapters (LoRA) across transformer layers based on the task distributions. Experimental evaluations across 11 datasets, using three performance metrics, demonstrate the effectiveness of the proposed method.
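The abstract gives no implementation details, but the two mechanisms can be illustrated with a minimal sketch. In the Python snippet below, everything is a hypothetical stand-in rather than the paper's specification: the function names, the SVD-based construction of task subspaces, the reconstruction-error threshold used to route samples to a seen task or fall back to the zero-shot head, and the proportional rank-budget heuristic for per-layer LoRA allocation are all our assumptions.

```python
import torch


def build_task_subspace(features: torch.Tensor, energy: float = 0.95) -> torch.Tensor:
    """Orthonormal basis for one task's feature subspace via SVD.

    features: (n, d) frozen-backbone features collected on that task.
    Returns a (d, r) basis keeping `energy` of the spectral energy.
    (Assumed construction; the paper's exact recipe is not stated.)
    """
    X = (features - features.mean(dim=0)).T            # (d, n), centered
    U, S, _ = torch.linalg.svd(X, full_matrices=False)
    ratio = torch.cumsum(S ** 2, dim=0) / torch.sum(S ** 2)
    r = int(torch.searchsorted(ratio, torch.tensor(energy)).item()) + 1
    return U[:, :r]


def identify_task(feature: torch.Tensor, bases: list, threshold: float) -> int:
    """Route a test feature to the seen task whose subspace reconstructs it best.

    Returns -1 (treat the sample as coming from an unseen task and fall back
    to the frozen zero-shot head) when every reconstruction error exceeds
    `threshold` -- a hypothetical rule for seen/unseen separation.
    """
    errors = [torch.norm(feature - U @ (U.T @ feature)).item() for U in bases]
    best = min(range(len(errors)), key=errors.__getitem__)
    return best if errors[best] <= threshold else -1


def assign_lora_ranks(layer_scores, rank_budget: int, r_min: int = 1) -> list:
    """Split a total LoRA rank budget across transformer layers in proportion
    to per-layer importance scores (e.g., feature-shift magnitude on the new
    task). A heuristic stand-in for the paper's dynamic assignment.
    """
    scores = torch.tensor(layer_scores, dtype=torch.float)
    ranks = (scores / scores.sum() * rank_budget).round().long().clamp(min=r_min)
    return ranks.tolist()


# Hypothetical usage on random data:
if __name__ == "__main__":
    torch.manual_seed(0)
    bases = [build_task_subspace(torch.randn(200, 64) + k) for k in range(3)]
    print(identify_task(torch.randn(64) + 1, bases, threshold=5.0))
    print(assign_lora_ranks([0.9, 0.4, 0.1, 0.6], rank_budget=16))
```

Under these assumptions, the threshold-based fallback is what would preserve zero-shot transfer on unseen tasks, while the budgeted rank allocation is one way to trade new-task fit against parameter cost.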
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6996