Towards Consistent Cross-Modal Alignment in Continual Learning for Vision-Language Models

ICLR 2026 Conference Submission 17971 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Continual learning, Vision-Language model, Prototype Learning, Prompt Tuning
Abstract: Vision-language models (VLMs) such as CLIP face significant challenges in continual learning (CL), where they must retain both pre-trained and incremental knowledge. Existing methods often rely on reference datasets or domain discriminators, leading to high overhead or limited generalization. Moreover, the semantic gap between modalities hinders effective alignment. While prototypes can partially mitigate this issue, they introduce new challenges: 1) inconsistent prototype fidelity across classes can impede modality fusion and fine-grained alignment, and 2) prototype separability degrades as tasks accumulate in CL. To tackle these challenges, we propose residual prototypes coupled with uncertainty-aware fusion to achieve consistent CLIP alignment. Class-wise prototypes derived from the backbone capture task-specific distributions, supporting both knowledge retention and generalization. Residual prototypes then refine these class representations, mitigating fidelity inconsistency and preserving cross-task separability. In parallel, a Bayesian uncertainty-aware estimation and fusion scheme draws on the complementarity between visual prototypes and textual descriptions to dynamically balance multiple objectives, promoting more robust modality fusion and unbiased semantic alignment. Extensive experiments across challenging CL scenarios demonstrate that our method outperforms state-of-the-art approaches, including strong rehearsal-based baselines, on key metrics.
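To make the abstract's mechanism concrete, below is a minimal, hypothetical sketch (not the authors' code) of how class-wise visual prototypes, learnable residual prototypes, and an uncertainty-weighted fusion of prototype and text similarities could be wired together for CLIP-style features. All names (`ResidualPrototypeFusion`, the per-branch log-variance weights) and shapes (e.g., a 512-dimensional embedding) are assumptions for illustration; the paper's actual Bayesian estimation and training objectives are not reproduced here.

```python
# Hypothetical sketch: residual prototypes + uncertainty-weighted fusion of
# image-to-prototype and image-to-text similarities. Illustrative only.
import torch
import torch.nn.functional as F


class ResidualPrototypeFusion(torch.nn.Module):
    def __init__(self, num_classes: int, dim: int):
        super().__init__()
        # Class-wise visual prototypes, e.g., running means of backbone features.
        self.register_buffer("prototypes", torch.zeros(num_classes, dim))
        # Learnable residuals that refine each class prototype.
        self.residual = torch.nn.Parameter(torch.zeros(num_classes, dim))
        # Per-branch log-variances acting as simple uncertainty weights.
        self.log_var_visual = torch.nn.Parameter(torch.zeros(()))
        self.log_var_text = torch.nn.Parameter(torch.zeros(()))

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, dim) image embeddings; txt_feat: (C, dim) class text embeddings.
        proto = F.normalize(self.prototypes + self.residual, dim=-1)
        img = F.normalize(img_feat, dim=-1)
        txt = F.normalize(txt_feat, dim=-1)
        sim_visual = img @ proto.t()   # (B, C) image-to-prototype similarity
        sim_text = img @ txt.t()       # (B, C) image-to-text similarity
        # Uncertainty-aware fusion: lower predicted variance -> higher weight.
        w_v = torch.exp(-self.log_var_visual)
        w_t = torch.exp(-self.log_var_text)
        return (w_v * sim_visual + w_t * sim_text) / (w_v + w_t)


if __name__ == "__main__":
    torch.manual_seed(0)
    model = ResidualPrototypeFusion(num_classes=10, dim=512)
    logits = model(torch.randn(4, 512), torch.randn(10, 512))
    print(logits.shape)  # torch.Size([4, 10])
```

In this sketch the fusion weights are shared scalars learned jointly with the residuals; the paper's method presumably estimates uncertainty per sample or per class, but the same weighting pattern applies.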
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 17971