The natural alignment of visual and audio information in videos provides a strong learning signal. However, commonly used large-scale video datasets contain audio-visual signals that are not aligned, e.g., background music. This limits the development of robust models that leverage the complementary nature of audio and video data. To address this limitation, we curate ACAV-1M, a new large-scale dataset that contains one million samples sourced from the ACAV-100M dataset. ACAV-1M is obtained through a pipeline that ensures the audio-visual correspondence and synchronization of its samples. Our pipeline transforms raw video and audio into text captions, followed by text summarization and an extensive filtering procedure. Filtering is based on audio-caption alignment, audio-visual instance semantic alignment, and temporal synchronization. Furthermore, we propose an audio-visual learning benchmark that supports a diverse range of downstream tasks. Empirical evaluations demonstrate that models trained on ACAV-1M achieve superior performance across all tasks compared to models trained on existing datasets. Our ACAV-1M dataset and the code to reproduce all benchmark results will be made publicly available upon acceptance.
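To make the three filtering criteria concrete, the following is a minimal, hypothetical sketch of how per-sample scores could be combined into a keep/drop decision. The field names, threshold values, and scoring models (e.g., a CLAP-style audio-caption similarity) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical filtering sketch: all names, thresholds, and score fields are
# illustrative assumptions, not the ACAV-1M pipeline itself.
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    video_id: str
    audio_caption_sim: float  # audio-caption alignment score (e.g., CLAP-style similarity)
    semantic_sim: float       # audio-visual instance semantic alignment score
    sync_score: float         # temporal synchronization score


def filter_candidates(
    candidates: List[Candidate],
    caption_thresh: float = 0.30,   # assumed threshold values
    semantic_thresh: float = 0.25,
    sync_thresh: float = 0.50,
) -> List[Candidate]:
    """Keep only samples that pass all three alignment criteria."""
    return [
        c for c in candidates
        if c.audio_caption_sim >= caption_thresh
        and c.semantic_sim >= semantic_thresh
        and c.sync_score >= sync_thresh
    ]


if __name__ == "__main__":
    pool = [
        Candidate("vid_001", 0.41, 0.33, 0.72),  # well-aligned sample: kept
        Candidate("vid_002", 0.12, 0.40, 0.80),  # background-music audio: dropped on caption score
    ]
    kept = filter_candidates(pool)
    print([c.video_id for c in kept])  # -> ['vid_001']
```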