The natural alignment of visual and audio information in videos provides a strong learning signal. However, commonly used large-scale video datasets contain audio-visual signals that are not aligned, e.g., background music. This limits the development of robust models that leverage the complementary nature of audio and video data. To address this limitation, we curate ACAV-1M, a new large-scale dataset that contains one million samples sourced from the ACAV-100M dataset. ACAV-1M is obtained through a pipeline that ensures the audio-visual correspondence and synchronization of its samples. Our pipeline transforms raw video and audio into text captions, followed by text summarization and an extensive filtering procedure. Filtering is based on audio-caption alignment, audio-visual instance semantic alignment, and temporal synchronization. Furthermore, we propose an audio-visual learning benchmark that supports a diverse range of downstream tasks. Empirical evaluations demonstrate that models trained on ACAV-1M achieve superior performance across all tasks compared to models trained on existing datasets. Our ACAV-1M dataset and the code to reproduce all benchmark results will be made publicly available upon acceptance.
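To make the three filtering criteria concrete, the following is a minimal, hypothetical sketch of how per-sample scores could be combined into a keep/drop decision. The field names, threshold values, and scoring models (e.g., a CLAP-style audio-caption similarity) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical filtering sketch: all names, thresholds, and score fields are
# illustrative assumptions, not the ACAV-1M pipeline itself.
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    video_id: str
    audio_caption_sim: float  # audio-caption alignment score (e.g., CLAP-style similarity)
    semantic_sim: float       # audio-visual instance semantic alignment score
    sync_score: float         # temporal synchronization score


def filter_candidates(
    candidates: List[Candidate],
    caption_thresh: float = 0.30,   # assumed threshold values
    semantic_thresh: float = 0.25,
    sync_thresh: float = 0.50,
) -> List[Candidate]:
    """Keep only samples that pass all three alignment criteria."""
    return [
        c for c in candidates
        if c.audio_caption_sim >= caption_thresh
        and c.semantic_sim >= semantic_thresh
        and c.sync_score >= sync_thresh
    ]


if __name__ == "__main__":
    pool = [
        Candidate("vid_001", 0.41, 0.33, 0.72),  # well-aligned sample: kept
        Candidate("vid_002", 0.12, 0.40, 0.80),  # background-music audio: dropped on caption score
    ]
    kept = filter_candidates(pool)
    print([c.video_id for c in kept])  # -> ['vid_001']
```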