Keywords: Audio-Visual Question Answering, Continual Learning, Mixture-of-Experts, Anchor-Based Routing, Catastrophic Forgetting
Abstract: Audio-Visual Question Answering Continual Learning (AVQACL) aims to enable models to adapt to new tasks while preserving prior knowledge. Existing methods face two primary challenges: first, fine-tuning induces catastrophic forgetting because gradient signals from diverse tasks conflict within shared backbone parameters and overwrite prior knowledge; second, most approaches rely on task labels at inference, which is impractical in open-world settings where task boundaries are fluid and unlabeled. To address these challenges, we introduce AVQACL-MoE, an anchor-based Mixture-of-Experts (MoE) framework that reframes continual learning as incremental expert composition within a frozen pre-trained audio-visual backbone. First, we train a task-specialized expert for each task and freeze its parameters after training, structurally reducing the catastrophic forgetting caused by shared backbone parameters. Second, to select the appropriate task-specialized expert without task labels, we refine each task's signature into lightweight, modality-specific anchors during training; at inference, a task-independent router assigns each input to the expert whose anchor has the highest cosine similarity with the input sample. On the AVQACL benchmark, AVQACL-MoE achieves state-of-the-art performance, reducing forgetting from 27% to 2% and improving final accuracy by over 30.9%. By shifting the stability-plasticity trade-off from weight adaptation to expert assembly, our approach enables scalable continual learning for open-world AVQA applications. The code is available at https://anonymous.4open.science/r/AVQACL-MoE-C454/.
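To make the routing mechanism concrete, here is a minimal PyTorch sketch of task-label-free, anchor-based expert selection as the abstract describes it. All names (`AnchorRouter`, `add_task`) and design details are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnchorRouter(nn.Module):
    """Minimal sketch of anchor-based, task-label-free routing (hypothetical
    interface; the paper's actual design may differ). One frozen expert and
    one anchor vector are stored per learned task."""

    def __init__(self):
        super().__init__()
        self.experts = nn.ModuleList()  # frozen task-specialized experts
        self.anchors = []               # one L2-normalized anchor per task

    def add_task(self, anchor: torch.Tensor, expert: nn.Module) -> None:
        # Freeze the newly trained expert so later tasks cannot overwrite it.
        for p in expert.parameters():
            p.requires_grad_(False)
        self.anchors.append(F.normalize(anchor, dim=-1))
        self.experts.append(expert)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim) pooled audio-visual embedding of each input sample.
        anchors = torch.stack(self.anchors)        # (num_tasks, dim)
        sims = F.normalize(x, dim=-1) @ anchors.T  # cosine similarities
        idx = sims.argmax(dim=-1)                  # nearest anchor per sample
        return torch.stack(
            [self.experts[i](xi) for i, xi in zip(idx.tolist(), x)]
        )

# Toy usage: three tasks, each contributing an (anchor, expert) pair.
dim = 256
router = AnchorRouter()
for _ in range(3):
    router.add_task(torch.randn(dim),     # e.g., a mean training embedding
                    nn.Linear(dim, dim))  # stand-in for a task expert
out = router(torch.randn(4, dim))
print(out.shape)  # torch.Size([4, 256])
```

Because each expert is frozen once added, learning a new task only appends an (anchor, expert) pair; under this reading, that is the structural property that bounds forgetting while leaving prior experts untouched.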
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 16188