Keywords: audio-visual learning, mixture of experts, parameter-efficient transfer learning
Abstract: With the rapid development of multimedia technology, audio-visual learning has emerged as a promising research topic within the field of multimodal analysis. In this paper, we explore parameter-efficient transfer learning for audio-visual learning and propose the Audio-Visual Mixture of Experts (\ourmethodname) to inject adapters into pre-trained models flexibly. Specifically, we introduce unimodal and cross-modal adapters as multiple experts to specialize in intra-modal and inter-modal information, respectively, and employ a lightweight router to dynamically allocate the weights of each expert according to the specific demands of each task. Extensive experiments demonstrate that our proposed approach \ourmethodname achieves superior performance across multiple audio-visual tasks,
including AVE, AVVP, AVS, and AVQA. Furthermore, visual-only experimental results also indicate that our approach can tackle challenging scenes where modality information is missing.
The source code is available at \url{https://github.com/yingchengy/AVMOE}.
Supplementary Material: zip
Primary Area: Deep learning architectures
Submission Number: 15047
Loading