Keywords: Continual Multimodal Learning; Prototype Optimization; Mixture-of-Experts
Abstract: Vision-language models such as CLIP achieve strong zero-shot performance through contrastive pre-training but face significant challenges in continual learning scenarios. When learning new tasks sequentially, current methods suffer from degraded prototype quality due to passive averaging and underutilize CLIP's visual adaptation capabilities. We propose RLAP-CLIP, which addresses these limitations through three components. First, Reinforcement Learning-based Prototype Optimization (RLPO) formulates prototype construction as a reinforcement learning problem, actively optimizing class separability rather than relying on simple averaging. Second, difficulty-aware cross-modal fusion uses a mixture-of-experts to route samples through specialized processing pathways based on their complexity. Third, dual-modal prompting balances visual and textual adaptation. Experiments across eight datasets demonstrate consistent improvements: RLAP-CLIP achieves average-accuracy gains of 3.72-4.46 points and final-accuracy improvements of 0.49-4.48 points over prior methods, establishing state-of-the-art performance. Our source code is available at [RLAP-CLIP](https://anonymous.4open.science/r/197165541613026132779/RLAP-CLIP).
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 8836