Keywords: Continual Learning, Visual Question Answering, Compositional Generalization, Mixture of Experts
Abstract: Continual visual question answering with multimodal large language models (MLLMs) is promising because of their strong reasoning and generative capabilities, yet it remains hindered by catastrophic forgetting, concept drift across tasks, and the need for compositional generalization. Prior work has mainly targeted forgetting while overlooking inter-task composition, even though real-world visual question answering requires combining knowledge across tasks. We introduce dual-purpose experts within a Mixture of Experts framework that addresses these challenges without a replay buffer. Our approach expands expert layers in the multimodal space using low-rank adaptation and trains each expert jointly on Visual Question Answering and Visual Question Generation with a shared MLLM backbone. This unified design enriches multimodal knowledge, while knowledge sharing through the extraction and fusion of information from past experts further mitigates forgetting and strengthens composition. A lightweight language-based router then enables effective expert selection. To better evaluate this setting, we also propose a benchmark of realistic compositional questions. Experiments on diverse benchmarks show that our method substantially reduces forgetting and improves compositional generalization compared with previous generative continual visual question answering approaches.
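The abstract describes the architecture only at a high level. Below is a minimal, hypothetical PyTorch sketch of the general idea of a Mixture-of-LoRA-experts layer with a lightweight language-based router. All names (`LoRAExpert`, `MoELoRALayer`, `add_expert`) and design choices (rank, cosine-similarity routing keys, softmax fusion over experts) are illustrative assumptions, not the paper's actual method; in particular, the dual-purpose VQA/VQG training objective and the exact extraction-and-fusion mechanism for past experts are not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRAExpert(nn.Module):
    """A single low-rank adapter (hypothetical rank and scaling choices)."""

    def __init__(self, in_dim: int, out_dim: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.down = nn.Linear(in_dim, rank, bias=False)
        self.up = nn.Linear(rank, out_dim, bias=False)
        nn.init.zeros_(self.up.weight)  # zero-init so a fresh expert starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x)) * self.scale


class MoELoRALayer(nn.Module):
    """A frozen base projection expanded with one LoRA expert per task.

    Routing is language-based: each expert owns a learned key, and the
    question embedding scores experts by cosine similarity. The softmax
    fusion over all experts (past ones included) is an illustrative
    stand-in for the paper's extraction and fusion of past-expert knowledge.
    """

    def __init__(self, base: nn.Linear, text_dim: int, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the shared MLLM backbone stays frozen
        self.rank = rank
        self.text_dim = text_dim
        self.experts = nn.ModuleList()
        self.keys = nn.ParameterList()  # one routing key per expert

    def add_expert(self) -> None:
        """Expand the layer with a fresh expert when a new task arrives."""
        self.experts.append(
            LoRAExpert(self.base.in_features, self.base.out_features, self.rank)
        )
        self.keys.append(nn.Parameter(torch.randn(self.text_dim)))

    def forward(self, x: torch.Tensor, question_emb: torch.Tensor) -> torch.Tensor:
        h = self.base(x)  # frozen backbone output
        if len(self.experts) == 0:
            return h
        keys = torch.stack(list(self.keys))  # (E, text_dim)
        scores = F.cosine_similarity(
            question_emb.unsqueeze(1), keys.unsqueeze(0), dim=-1
        )  # (B, E)
        weights = scores.softmax(dim=-1)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, T, D)
        while weights.dim() < expert_out.dim():
            weights = weights.unsqueeze(-1)  # broadcast over sequence/feature dims
        return h + (weights * expert_out).sum(dim=1)
```

If the router is meant to pick a single expert per question, as the abstract's "expert selection" wording suggests, a hard top-1 choice (e.g., `weights.argmax(dim=-1)`) could replace the softmax fusion; the soft variant is used here only to keep the sketch differentiable end to end.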
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 20944