Keywords: audio-visual corresponding dataset, sound separation, audio-video retrieval
TL;DR: An Open Large-Scale Audio-Visual Corresponding Dataset.
Abstract: Groundbreaking research from initiatives such as ChatGPT and Sora underscores the crucial role of large-scale data in advancing generative and comprehension tasks. However, the scarcity of comprehensive, large-scale audio-visual correspondence datasets poses a significant challenge to audio-visual research. To address this gap, we introduce **AVSET-10M**, an audio-visual dataset with high audio-visual correspondence comprising 10 million samples, with the following key attributes: (1) **High Audio-Visual Correspondence**: Through meticulous sample filtering, we ensure robust correspondence between the audio and visual components of each entry. (2) **Comprehensive Categories**: Encompassing 527 unique audio categories, AVSET-10M offers the most extensive range of audio categories available. (3) **Large Scale**: With 10 million samples, AVSET-10M is the largest publicly available audio-visual corresponding dataset. We benchmark two critical tasks on AVSET-10M: audio-video retrieval and vision-queried sound separation. These tasks highlight the essential role of precise audio-visual correspondence in advancing audio-visual research. For more information, please visit https://avset-10m.github.io/.
Supplementary Material: zip
Flagged For Ethics Review: true
Submission Number: 685