Mixtures of Experts for Audio-Visual Learning

Ying Cheng; Yang Li; Junjie He; Rui Feng

Published: 25 Sept 2024, Last Modified: 06 Nov 2024, NeurIPS 2024 poster, CC BY 4.0

Keywords: audio-visual learning, mixture of experts, parameter-efficient transfer learning

Abstract: With the rapid development of multimedia technology, audio-visual learning has emerged as a promising research topic within the field of multimodal analysis. In this paper, we explore parameter-efficient transfer learning for audio-visual learning and propose Audio-Visual Mixture of Experts (AVMoE), which flexibly injects adapters into pre-trained models. Specifically, we introduce unimodal and cross-modal adapters as multiple experts that specialize in intra-modal and inter-modal information, respectively, and employ a lightweight router to dynamically allocate the weight of each expert according to the specific demands of each task. Extensive experiments demonstrate that AVMoE achieves superior performance across multiple audio-visual tasks, including AVE, AVVP, AVS, and AVQA. Furthermore, visual-only experimental results indicate that our approach can handle challenging scenes in which modality information is missing. The source code is available at https://github.com/yingchengy/AVMOE.
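The routing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class names, the bottleneck adapter form, the additive fusion used by the cross-modal expert, and the random untrained weights are all assumptions made for illustration. It shows only how a router could produce softmax weights over a unimodal and a cross-modal expert and mix their outputs into a residual update.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; a trained model would learn these instead."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class BottleneckAdapter:
    """Hypothetical unimodal expert: down-project, ReLU, up-project."""

    def __init__(self, dim, bottleneck, rng):
        self.down = random_matrix(bottleneck, dim, rng)
        self.up = random_matrix(dim, bottleneck, rng)

    def __call__(self, x):
        hidden = [max(0.0, h) for h in matvec(self.down, x)]
        return matvec(self.up, hidden)


class CrossModalAdapter(BottleneckAdapter):
    """Hypothetical cross-modal expert. Additive fusion is an assumed
    stand-in for whatever inter-modal interaction the paper uses."""

    def __call__(self, x, other):
        fused = [a + b for a, b in zip(x, other)]
        return super().__call__(fused)


class AdapterMixture:
    """Router-weighted mixture of the two experts with a residual connection."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        self.unimodal = BottleneckAdapter(dim, bottleneck, rng)
        self.cross_modal = CrossModalAdapter(dim, bottleneck, rng)
        self.router = random_matrix(2, dim, rng)  # one logit per expert

    def route(self, x):
        """Return one non-negative weight per expert, summing to 1."""
        return softmax(matvec(self.router, x))

    def __call__(self, x, other):
        weights = self.route(x)
        outputs = [self.unimodal(x), self.cross_modal(x, other)]
        mixed = [
            sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(len(x))
        ]
        # Residual: the frozen backbone feature plus the mixed adapter update.
        return [xi + mi for xi, mi in zip(x, mixed)]


# Toy 4-dimensional audio and visual token features (made-up values).
audio = [0.5, -1.0, 0.25, 2.0]
visual = [1.0, 0.0, -0.5, 0.75]

layer = AdapterMixture(dim=4, bottleneck=2)
weights = layer.route(visual)
output = layer(visual, audio)

# Visual-only case: the missing audio stream is replaced by zeros (an assumption).
visual_only = layer(visual, [0.0] * len(visual))
```

In this sketch the router looks only at the modality's own feature vector; conditioning it on both modalities or on a task embedding would be an equally plausible design, and the abstract does not say which the authors chose.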

Supplementary Material: zip

Primary Area: Deep learning architectures

Submission Number: 15047
