Keywords: Relation Aware Text-to-Audio Generation, Audio Event Corpus, Relation Corpus
TL;DR: A new audio event corpus and a new relation corpus that support the relation aware text-to-audio (TTA) generation task and beyond
Abstract: We present Aurelius, a new framework that enables relation aware text-to-audio (TTA) generation research at scale. Given the lack of essential audio event and relation corpora, Aurelius contributes a large-scale audio event corpus, AudioEventSet, and a large-scale relation corpus, AudioRelSet. Comprising 110 event categories, AudioEventSet maximally covers commonly heard audio events, and each event is unique, realistic, and of high quality. AudioRelSet consists of 100 relations, comprehensively covering the relations that are present in the physical world or can be neatly described by text. As the two corpora provide audio events and relations independently, they can be combined to create massive <text,audio> pairs with our pair generation strategy to support relation aware TTA investigation at scale. We comprehensively benchmark all existing TTA models from both general and relation aware evaluation perspectives. We further provide an in-depth investigation into scaling existing TTA models' relation aware generation by either training from scratch or leveraging cross-domain general TTA knowledge. The introduced corpora and the findings from this investigation may facilitate future research on relation aware TTA generation.
Primary Area: datasets and benchmarks
Submission Number: 8576