Listen to Motion: Robustly Learning Correlated Audio-Visual Representations

21 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: audio-visual representation learning
Abstract: Audio-visual correlation learning has many applications and is pivotal to broader multimodal understanding and generation. Many recent methods learn audio-visual contrastive representations from web-scale videos and show impressive performance. However, these methods mainly focus on the correlation between audio and static visual information (such as objects and background) while ignoring the crucial role of motion in determining the sounds in a video. In addition, the widespread presence of false positive and multi-positive audio-visual pairs in web-scale unlabeled videos limits the quality of the learned representations. In this paper, we propose \textbf{Li}sten to \textbf{Mo}tion (LiMo) to capture motion information explicitly and to align motion and audio robustly. Specifically, to model motion in video, we extract temporal visual semantics by facilitating interaction between frames, while retaining the static visual-audio correlation knowledge acquired by previous models. To promote more robust audio-visual alignment, we learn motion-audio alignment at a finer granularity by distinguishing different clips within the same video. We further quantitatively measure the likelihood that each sample is a false positive or contains multiple positive instances, and adaptively reweight samples in the final learning objective. Extensive experiments demonstrate the effectiveness of LiMo on various audio-visual downstream tasks. On audio-visual retrieval, LiMo achieves absolute improvements of at least 15\% top-1 accuracy on AudioSet and VGGSound. On our newly proposed motion-specific tasks, LiMo performs substantially better. Moreover, LiMo achieves strong accuracy on audio event recognition, demonstrating the enhanced discriminability of its audio representations.
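To make the adaptive-reweighting idea concrete, below is a minimal PyTorch sketch of a symmetric InfoNCE-style contrastive loss in which each audio-visual pair is reweighted by a per-sample confidence score. The weighting heuristic (softmax over the diagonal cross-modal similarities) and the function name `reweighted_infonce` are illustrative assumptions, not the paper's actual measure of false-positive or multi-positive likelihood, which is not specified in the abstract.

```python
import torch
import torch.nn.functional as F

def reweighted_infonce(audio_emb, video_emb, temperature=0.07):
    """Symmetric contrastive loss with per-sample reweighting.

    audio_emb, video_emb: (B, D) embeddings for B paired clips.
    NOTE: the weighting heuristic below is an assumption for
    illustration; LiMo's exact likelihood measure is not given
    in the abstract.
    """
    a = F.normalize(audio_emb, dim=-1)   # (B, D)
    v = F.normalize(video_emb, dim=-1)   # (B, D)
    logits = a @ v.t() / temperature     # (B, B) cross-modal similarities
    targets = torch.arange(a.size(0), device=a.device)

    # Heuristic confidence: pairs whose own (diagonal) similarity is low
    # relative to the batch are treated as likely false positives and
    # downweighted. Scaled so the weights average to ~1.
    with torch.no_grad():
        weights = torch.softmax(logits.diag(), dim=0) * a.size(0)

    # Per-sample audio-to-video and video-to-audio losses.
    loss_av = F.cross_entropy(logits, targets, reduction="none")
    loss_va = F.cross_entropy(logits.t(), targets, reduction="none")
    return (0.5 * (loss_av + loss_va) * weights).mean()
```

In this sketch the weights are computed under `torch.no_grad()` so the reweighting acts as a fixed per-batch importance score rather than a learned quantity; a finer-grained variant could treat different clips of the same video as additional candidates in the similarity matrix, in the spirit of the intra-video discrimination described above.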
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3658