Towards Efficient Audio-Visual Learners via Empowering Pre-trained Vision Transformers with Cross-Modal Adaptation
Abstract: In this paper, we explore the cross-modal adaptation of pre-trained Vision Transformers (ViTs) for the audio-visual domain by incorporating a limited set of trainable parameters. To this end, we propose a Spatial-Temporal-Global Cross-Modal Adaptation (STG-CMA) that gradually equips frozen ViTs with the capability to learn audio-visual representations. STG-CMA consists of modality-specific temporal adaptation for temporal reasoning within each modality, cross-modal spatial adaptation for refining spatial information with cues from the counterpart modality, and cross-modal global adaptation for global interaction between the audio and visual modalities. STG-CMA reveals a meaningful finding: a shared pre-trained image model with inserted lightweight adapters is sufficient for spatial-temporal modeling and feature interaction across the audio and visual modalities. Extensive experiments indicate that our STG-CMA achieves state-of-the-art performance on various audio-visual understanding tasks, including AVE, AVS, and AVQA, while requiring significantly fewer tunable parameters. The code is available at https://github.com/kaiw7/STG-CMA.
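The sketch below illustrates, in broad strokes, the adapter-based design the abstract describes: a frozen ViT block shared by both modalities, with a few trainable bottleneck adapters providing temporal, cross-modal spatial, and cross-modal global adaptation. This is not the authors' implementation; the module names, bottleneck size, token-averaging fusion, and the `STGCMABlockSketch` wrapper are illustrative assumptions made here for exposition only.

```python
# A minimal sketch (assumptions, not the paper's code) of inserting
# lightweight bottleneck adapters around a frozen, shared ViT block.
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Down-project, non-linearity, up-project; only these weights are trained."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))  # residual adapter


class STGCMABlockSketch(nn.Module):
    """Hypothetical wrapper: one frozen ViT block shared by audio and visual
    tokens, plus trainable temporal / spatial / global adapters."""
    def __init__(self, frozen_vit_block: nn.Module, dim: int):
        super().__init__()
        self.block = frozen_vit_block
        for p in self.block.parameters():
            p.requires_grad = False                  # keep the pre-trained ViT frozen
        self.temporal_a = BottleneckAdapter(dim)     # audio temporal adaptation
        self.temporal_v = BottleneckAdapter(dim)     # visual temporal adaptation
        self.spatial = BottleneckAdapter(dim)        # cross-modal spatial adaptation
        self.global_ = BottleneckAdapter(dim)        # cross-modal global adaptation

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # audio / visual: (batch, tokens, dim) token sequences for each modality.
        a = self.temporal_a(audio)                   # modality-specific temporal reasoning
        v = self.temporal_v(visual)
        # Cross-modal spatial adaptation: refine each modality with a mean-pooled
        # cue from its counterpart (a simple placeholder for the paper's design).
        a = self.spatial(a + v.mean(dim=1, keepdim=True))
        v = self.spatial(v + a.mean(dim=1, keepdim=True))
        # The shared, frozen pre-trained block processes both modalities.
        a, v = self.block(a), self.block(v)
        # Cross-modal global adaptation over the concatenated token set.
        fused = self.global_(torch.cat([a, v], dim=1))
        return fused[:, : a.shape[1]], fused[:, a.shape[1]:]


# Usage with a stand-in "frozen block" (a LayerNorm + Linear placeholder here).
block = nn.Sequential(nn.LayerNorm(768), nn.Linear(768, 768))
layer = STGCMABlockSketch(block, dim=768)
audio = torch.randn(2, 49, 768)     # e.g. 2 clips, 49 audio patch tokens
visual = torch.randn(2, 196, 768)   # e.g. 2 clips, 196 visual patch tokens
a_out, v_out = layer(audio, visual)
```

Only the adapter parameters receive gradients in this sketch, which is what keeps the number of tunable parameters small relative to full fine-tuning of the ViT backbone.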