Keywords: Continual Learning, Visual Question Answering, Compositional Generalization, Mixture of Experts
Abstract: Continual visual question answering with multimodal large language models (MLLMs) is promising because of their strong reasoning and generative capabilities, yet it remains hindered by catastrophic forgetting, concept drift across tasks, and the need for compositional generalization. Prior work has mainly targeted forgetting while overlooking inter-task composition, even though real-world visual question answering requires combining knowledge across tasks. We introduce dual-purpose experts within a Mixture of Experts framework that addresses these challenges without a replay buffer. Our approach expands expert layers in the multimodal space using low-rank adaptation and trains each expert jointly on Visual Question Answering and Visual Question Generation with a shared MLLM backbone. This unified design enriches multimodal knowledge, while knowledge sharing through the extraction and fusion of information from past experts further mitigates forgetting and strengthens composition. A lightweight language-based router then enables effective expert selection. To better evaluate this setting, we also propose a benchmark of realistic compositional questions. Experiments on diverse benchmarks show that our method substantially reduces forgetting and improves compositional generalization compared with previous generative continual visual question answering approaches.
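The abstract describes the architecture only at a high level. Below is a minimal, hypothetical PyTorch sketch of the general idea of a Mixture-of-LoRA-experts layer with a lightweight language-based router. All names (`LoRAExpert`, `MoELoRALayer`, `add_expert`) and design choices (rank, cosine-similarity routing keys, softmax fusion over experts) are illustrative assumptions, not the paper's actual method; in particular, the dual-purpose VQA/VQG training objective and the exact extraction-and-fusion mechanism for past experts are not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRAExpert(nn.Module):
    """A single low-rank adapter (hypothetical rank and scaling choices)."""

    def __init__(self, in_dim: int, out_dim: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.down = nn.Linear(in_dim, rank, bias=False)
        self.up = nn.Linear(rank, out_dim, bias=False)
        nn.init.zeros_(self.up.weight)  # zero-init so a fresh expert starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x)) * self.scale


class MoELoRALayer(nn.Module):
    """A frozen base projection expanded with one LoRA expert per task.

    Routing is language-based: each expert owns a learned key, and the
    question embedding scores experts by cosine similarity. The softmax
    fusion over all experts (past ones included) is an illustrative
    stand-in for the paper's extraction and fusion of past-expert knowledge.
    """

    def __init__(self, base: nn.Linear, text_dim: int, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the shared MLLM backbone stays frozen
        self.rank = rank
        self.text_dim = text_dim
        self.experts = nn.ModuleList()
        self.keys = nn.ParameterList()  # one routing key per expert

    def add_expert(self) -> None:
        """Expand the layer with a fresh expert when a new task arrives."""
        self.experts.append(
            LoRAExpert(self.base.in_features, self.base.out_features, self.rank)
        )
        self.keys.append(nn.Parameter(torch.randn(self.text_dim)))

    def forward(self, x: torch.Tensor, question_emb: torch.Tensor) -> torch.Tensor:
        h = self.base(x)  # frozen backbone output
        if len(self.experts) == 0:
            return h
        keys = torch.stack(list(self.keys))  # (E, text_dim)
        scores = F.cosine_similarity(
            question_emb.unsqueeze(1), keys.unsqueeze(0), dim=-1
        )  # (B, E)
        weights = scores.softmax(dim=-1)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, T, D)
        while weights.dim() < expert_out.dim():
            weights = weights.unsqueeze(-1)  # broadcast over sequence/feature dims
        return h + (weights * expert_out).sum(dim=1)
```

If the router is meant to pick a single expert per question, as the abstract's "expert selection" wording suggests, a hard top-1 choice (e.g., `weights.argmax(dim=-1)`) could replace the softmax fusion; the soft variant is used here only to keep the sketch differentiable end to end.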
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 20944