AVSET-10M: An Open Large-Scale Audio-Visual Dataset with High Correspondence

22 May 2024 (modified: 13 Nov 2024) · Submitted to NeurIPS 2024 Datasets and Benchmarks Track · CC BY 4.0
Keywords: audio-visual correspondence dataset, sound separation, audio-video retrieval
TL;DR: An Open, Large-Scale Audio-Visual Correspondence Dataset.
Abstract: Groundbreaking research from initiatives such as ChatGPT and Sora underscores the crucial role of large-scale data in advancing generative and comprehension tasks. However, the scarcity of comprehensive, large-scale audio-visual correspondence datasets poses a significant challenge to research in the audio-visual field. To address this gap, we introduce **AVSET-10M**, an audio-visual dataset with high correspondence comprising 10 million samples, featuring the following key attributes: (1) **High Audio-Visual Correspondence**: Through meticulous sample filtering, we ensure robust correspondence between the audio and visual components of each entry. (2) **Comprehensive Categories**: Encompassing 527 unique audio categories, AVSET-10M offers the most extensive range of audio categories available. (3) **Large Scale**: With 10 million samples, AVSET-10M is the largest publicly available audio-visual correspondence dataset. We have benchmarked two critical tasks on AVSET-10M: audio-video retrieval and vision-queried sound separation. These tasks highlight the essential role of precise audio-visual correspondence in advancing audio-visual research. For more information, please visit https://avset-10m.github.io/.
Supplementary Material: zip
Flagged For Ethics Review: true
Submission Number: 685