A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions

ACL ARR 2026 January Submission 9509 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Unified Audio Generation, Natural Language Instruction, TTS, TTM, TTA
Abstract: Generative audio modeling has largely been fragmented into specialized tasks: text-to-speech (TTS), text-to-music (TTM), and text-to-audio (TTA), each operating under its own control paradigm. Unifying these modalities remains a fundamental challenge due to the intrinsic mismatch between structured semantic representations (speech and music) and unstructured acoustic textures (sound effects). In this paper, we introduce UniSonate, a unified flow-matching framework capable of synthesizing speech, music, and sound effects through a standardized, reference-free natural language instruction interface. To reconcile these structural disparities, we propose a novel dynamic token injection mechanism that projects unstructured environmental sounds into a structured temporal latent space, enabling precise duration control within a phoneme-driven Multimodal Diffusion Transformer (MM-DiT). Coupled with a multi-stage curriculum learning strategy, this approach effectively mitigates cross-modal optimization conflicts. Extensive experiments demonstrate that UniSonate achieves state-of-the-art performance in instruction-based TTS (WER 1.47%) and TTM (SongEval Coherence 3.18), while maintaining competitive fidelity in TTA. Crucially, we observe positive transfer: joint training on diverse audio data significantly enhances structural coherence and prosodic expressiveness compared to single-task baselines.
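The abstract names flow matching as the generative backbone but, being a submission page, gives no formulation. For orientation, below is a minimal sketch of the standard conditional flow-matching training step that such frameworks build on; the class and variable names (VelocityField, cond) are hypothetical placeholders, not the authors' implementation, and the MM-DiT and dynamic token injection specifics are deliberately abstracted away.

```python
# Minimal sketch of conditional flow matching with a linear interpolant.
# All names are hypothetical; this is not UniSonate's actual code.
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    """Stand-in for the MM-DiT backbone: predicts dx/dt from the noisy
    latent, the timestep, and pooled text-instruction features."""
    def __init__(self, latent_dim=64, cond_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim + 1, 256),
            nn.SiLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, x_t, t, cond):
        return self.net(torch.cat([x_t, cond, t.unsqueeze(-1)], dim=-1))

def flow_matching_loss(model, x1, cond):
    """x_t = (1 - t) * x0 + t * x1; the regression target is the
    constant velocity x1 - x0 of the straight-line path."""
    x0 = torch.randn_like(x1)                      # noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)  # per-sample timestep
    x_t = (1 - t).unsqueeze(-1) * x0 + t.unsqueeze(-1) * x1
    return (model(x_t, t, cond) - (x1 - x0)).pow(2).mean()

model = VelocityField()
x1 = torch.randn(8, 64)    # placeholder batch of audio latents
cond = torch.randn(8, 64)  # placeholder encoded instruction features
flow_matching_loss(model, x1, cond).backward()
```

At inference, such models integrate the learned velocity field from noise to data with an ODE solver; per the abstract, instruction text, phonemes, and the injected duration tokens would presumably all enter through the conditioning stream sketched as cond here.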
Paper Type: Long
Research Area: Speech Processing and Spoken Language Understanding
Research Area Keywords: Speech Recognition, Text-to-Speech, Spoken Language Understanding
Contribution Types: NLP engineering experiment
Languages Studied: English, Chinese
Submission Number: 9509