MAViL: Masked Audio-Video Learners

Published: 21 Sept 2023, Last Modified: 02 Nov 2023 · NeurIPS 2023 poster
Keywords: self-supervised learning, audio representation learning, audio classification
TL;DR: SSL via aligning and reconstructing contextualized audio-video representations delivers SOTA performance on 7 audio-visual classification and textless vision-language tasks.
Abstract: We present Masked Audio-Video Learners (MAViL) to learn audio-visual representations with three complementary forms of self-supervision: (1) reconstructing masked raw audio and video inputs, (2) intra-modal and inter-modal contrastive learning with masking, and (3) self-training to predict aligned and contextualized audio-video representations learned from the first two objectives. Empirically, MAViL achieves state-of-the-art audio-video classification performance on AudioSet (53.3 mAP) and VGGSound (67.1% accuracy), surpassing recent self-supervised models and supervised models that utilize external labeled data. Notably, pre-training with MAViL not only enhances performance in multimodal classification and retrieval tasks, but also improves the representations of each modality in isolation, without relying on information from the other modality during uni-modal fine-tuning or inference. The code and models are available at https://github.com/facebookresearch/MAViL.
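For intuition, here is a minimal PyTorch-style sketch of how the three objectives described in the abstract might be combined into a single training loss. All module names, dimensions, and the temperature value are hypothetical placeholders, not the authors' implementation (the intra-modal contrastive term is omitted for brevity); see the official repository linked above for the actual code.

```python
import torch
import torch.nn.functional as F
from torch import nn


class ToyMAViL(nn.Module):
    """Illustrative sketch of MAViL's three-part objective (not the official model)."""

    def __init__(self, dim=256):
        super().__init__()
        # Stand-ins for the modality-specific Transformer encoders/decoders.
        self.audio_enc = nn.Linear(128, dim)   # hypothetical audio patch encoder
        self.video_enc = nn.Linear(768, dim)   # hypothetical video patch encoder
        self.audio_dec = nn.Linear(dim, 128)   # reconstructs masked audio patches
        self.video_dec = nn.Linear(dim, 768)   # reconstructs masked video patches

    def forward(self, audio_patches, video_patches, audio_teacher, video_teacher):
        # Encode the (masked) patch sequences of each modality.
        za = self.audio_enc(audio_patches)     # (B, Na, dim)
        zv = self.video_enc(video_patches)     # (B, Nv, dim)

        # (1) Masked reconstruction of raw inputs (MAE-style MSE loss).
        loss_recon = (F.mse_loss(self.audio_dec(za), audio_patches) +
                      F.mse_loss(self.video_dec(zv), video_patches))

        # Pool to clip-level embeddings for contrastive learning.
        a = F.normalize(za.mean(dim=1), dim=-1)
        v = F.normalize(zv.mean(dim=1), dim=-1)

        # (2) Inter-modal contrastive loss: InfoNCE over in-batch negatives,
        # pulling paired audio/video clips together.
        logits = a @ v.t() / 0.07              # temperature is a placeholder
        targets = torch.arange(a.size(0), device=a.device)
        loss_contrast = (F.cross_entropy(logits, targets) +
                         F.cross_entropy(logits.t(), targets)) / 2

        # (3) Self-training: regress contextualized targets produced by a
        # teacher (e.g., a copy of the encoders trained with objectives 1-2).
        loss_distill = (F.smooth_l1_loss(za, audio_teacher) +
                        F.smooth_l1_loss(zv, video_teacher))

        return loss_recon + loss_contrast + loss_distill


# Usage with dummy tensors (batch of 4 clips, 16 audio / 8 video patches).
model = ToyMAViL()
loss = model(torch.randn(4, 16, 128), torch.randn(4, 8, 768),
             torch.randn(4, 16, 256), torch.randn(4, 8, 256))
loss.backward()
```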
Supplementary Material: pdf
Submission Number: 2572