Keywords: audio-visual learning, sound souce localization, audio-visual video parsing
TL;DR: In this paper, we curate ACAV-1M, a new large-scale dataset that contains one million samples sourced from the ACAV-100M dataset.
Abstract: The natural alignment of visual and audio information in videos provides a strong learning signal. However, commonly used large-scale video datasets contain audio-visual signals that are not aligned, e.g. background music. This limits the development of robust models that leverage the complementary nature of audio and video data. To address this limitation, we curate ACAV-1M, a new large-scale dataset that contains one million samples sourced from the ACAV-100M dataset. The ACAV-1M dataset is obtained through a pipeline that ensures the audio-visual correspondence and synchronization of samples in the dataset. Our pipeline transforms raw video and audio into text captions, followed by text summarization and an extensive filtering procedure. The filtering is done based on audio-caption alignment, audio-visual instance semantic alignment, and temporal synchronization. Furthermore, we propose an audio-visual learning benchmark that supports a diverse range of downstream tasks. Empirical evaluations demonstrate that models trained on ACAV-1M achieve superior performance compared to using existing datasets across all tasks. Our ACAV-1M dataset and code to reproduce all benchmark results will be made publicly available upon acceptance.
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2229
Loading