MOMA: Mixture-of-Modality-Adaptations for Transferring Knowledge from Image Models Towards Efficient Audio-Visual Action Recognition
Abstract: In this work, we investigate how to transfer the knowledge learned by pre-trained image models to the audio-visual domain without relying on a full finetuning paradigm. To this end, we propose a novel parameter-efficient scheme called Mixture-of-Modality-Adaptations (MoMA) for audio-visual action recognition, which consists of a dual-path spatial-temporal adaptation for the visual modality, an acoustic-aware adaptation for the audio modality, and an audio-visual multimodal adaptation for cross-modal interaction. By freezing the original parameters of the pre-trained image backbones and introducing lightweight parameter-efficient adapters, our proposed MoMA efficiently adapts image models to learn audio-visual representations without employing any audio-specific encoders or full finetuning. Experimental results on action recognition benchmarks indicate that MoMA achieves competitive or even better performance than existing methods while involving significantly fewer tunable parameters.
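To make the parameter-efficient idea described above concrete, the following is a minimal sketch of the general recipe the abstract outlines: a frozen pre-trained image backbone block wrapped with a small, tunable bottleneck adapter. Module and parameter names here are hypothetical illustrations, not the authors' actual MoMA implementation, which additionally includes the modality-specific and multimodal adaptations described in the paper.

```python
# Illustrative sketch (assumed PyTorch-style code, not the official MoMA code):
# only the adapter parameters are trained; the image-model block stays frozen.
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Lightweight adapter: down-project, non-linearity, up-project, residual."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # initialize as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


class AdaptedBlock(nn.Module):
    """Wraps one frozen block of a pre-trained image model with a tunable adapter."""

    def __init__(self, frozen_block: nn.Module, dim: int):
        super().__init__()
        self.block = frozen_block
        for p in self.block.parameters():
            p.requires_grad = False  # keep original image-model weights frozen
        self.adapter = BottleneckAdapter(dim)  # only these weights are updated

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.block(tokens))
```

Under this scheme, the number of tunable parameters per block is roughly 2 * dim * bottleneck, which is typically a small fraction of the frozen backbone's parameter count; this is the source of the "significantly fewer tunable parameters" claim in the abstract.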