Audio-FLAN: An Instruction-Following Dataset for Unified Understanding and Generation of Speech, Music, and Sound

ICLR 2026 Conference Submission 14832 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Instruction following, instruction tuning, audio-language model, large language model, zero-shot learning
TL;DR: This paper introduces Audio-FLAN, a large-scale instruction-following dataset designed to unify audio understanding and generation tasks across speech, music, and sound, enabling zero-shot learning for audio-language models.
Abstract: Instruction tuning has generalized well in language and vision, yet audio remains siloed by domain (speech, music, environmental sound) and by task type (understanding vs. generation). We present Audio-FLAN, a large-scale instruction-following corpus that unifies heterogeneous audio sources under a single instruction schema of instruction, input, and output. It supports both understanding (audio→text) and generation (text/audio/(audio, text)→audio) across speech, music, and general audio. The dataset contains 108.5M instances spanning 23 major and 80 minor tasks drawn from 52 datasets. Instruction tuning on a small subset of Audio-FLAN yields consistent gains on diverse understanding tasks, including zero-shot generalization. We further evaluate existing generation models, validating Audio-FLAN as an effective benchmark, and use hallucination probes to inform future data and training design. In summary, Audio-FLAN serves as both an effective training resource and a unified, extensible benchmark for instruction-following audio-language models. We release the dataset on HuggingFace (https://huggingface.co/datasets/Audio-FLAN/Audio-FLAN-Dataset).
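For concreteness, below is a minimal sketch of how one might load and inspect Audio-FLAN records with the Hugging Face datasets library. The repository ID and the instruction/input/output field names come from the abstract; the split name, the use of streaming, and any other field layout are assumptions for illustration, not details confirmed by the paper.

```python
# Hypothetical sketch: inspect a few Audio-FLAN records.
# Assumes a "train" split and top-level instruction/input/output fields.
from datasets import load_dataset

# Stream to avoid downloading the full 108.5M-instance corpus.
ds = load_dataset("Audio-FLAN/Audio-FLAN-Dataset", split="train", streaming=True)

for record in ds.take(3):
    # Each instance follows the unified schema described in the abstract:
    # a natural-language instruction, an input (text and/or audio), and an
    # output (text for understanding tasks, audio for generation tasks).
    print("instruction:", record.get("instruction"))
    print("input:", record.get("input"))
    print("output:", record.get("output"))
```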
Primary Area: datasets and benchmarks
Submission Number: 14832