Text-to-Audio Generation via Bridging Audio Language Model and Latent Diffusion

Published: 10 Oct 2024 · Last Modified: 17 Oct 2024 · Audio Imagination: NeurIPS 2024 Workshop · License: CC BY 4.0
Keywords: Diffusion models, Auto-regressive models, Text-to-Audio Generation, Generative AI
Abstract: Diffusion models have become the foundation of most text-to-audio generation methods. These approaches rely on a large text encoder to process the textual description, which serves as a semantic condition guiding the audio generation process. Meanwhile, autoregressive language-model-based methods for audio generation have also emerged; these models offer flexibility by predicting discrete audio tokens, but they often fail to achieve high fidelity. In this work, we propose a system that integrates an autoregressive language model with a latent diffusion model to achieve flexible and refined audio generation. The autoregressive language model predicts discrete audio tokens conditioned on the text prompt, and these tokens are then fed into the diffusion model, which refines the details of the generated audio. Compared with baseline systems, our proposed approach delivers better results on most objective and subjective metrics on the AudioCaps test set. Audio demos generated by our best system are available at https://dcldmdemo.github.io.
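To make the two-stage pipeline described in the abstract concrete, the PyTorch sketch below shows one minimal way such a system could be wired together: a toy autoregressive transformer greedily decodes discrete audio tokens from a pooled text embedding, and a DDPM-style sampling loop then refines a latent conditioned on those tokens. All names here (`ARAudioLM`, `diffusion_refine`, the placeholder denoiser) are hypothetical illustrations, not the authors' implementation; a real system would use a trained text encoder, a causal attention mask, a learned noise predictor, and a decoder/vocoder to produce waveforms.

```python
import torch
import torch.nn as nn

# Hypothetical sketch -- NOT the paper's code. It only illustrates the shape of an
# "AR language model -> latent diffusion" two-stage text-to-audio pipeline.

class ARAudioLM(nn.Module):
    """Toy autoregressive model over a discrete audio-token vocabulary."""
    def __init__(self, vocab_size=1024, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab_size)

    @torch.no_grad()
    def generate(self, text_emb, n_tokens=16):
        # text_emb: (B, 1, dim) pooled text-prompt embedding used as the decoding prefix.
        seq, tokens = text_emb, []
        for _ in range(n_tokens):
            h = self.backbone(seq)
            tok = self.head(h[:, -1]).argmax(dim=-1)   # greedy next-token choice
            tokens.append(tok)
            seq = torch.cat([seq, self.embed(tok).unsqueeze(1)], dim=1)
        return torch.stack(tokens, dim=1)              # (B, n_tokens) discrete audio tokens

@torch.no_grad()
def diffusion_refine(denoiser, token_emb, steps=50):
    """Minimal DDPM-style sampler whose noise predictor is conditioned on the AR tokens."""
    x = torch.randn_like(token_emb)                    # start from noise in the latent space
    alpha_bar = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, steps), dim=0)
    for t in reversed(range(steps)):
        eps = denoiser(x, token_emb, t)                # predict the noise given the condition
        x0 = (x - (1 - alpha_bar[t]).sqrt() * eps) / alpha_bar[t].sqrt()
        if t > 0:                                      # re-noise the estimate for the next step
            x = alpha_bar[t - 1].sqrt() * x0 + (1 - alpha_bar[t - 1]).sqrt() * torch.randn_like(x0)
        else:
            x = x0
    return x                                           # refined latent; a decoder would map it to audio

# Usage with stand-ins for the trained components:
lm = ARAudioLM()
text_emb = torch.randn(1, 1, 256)                      # placeholder for a text-encoder output
tokens = lm.generate(text_emb)                         # stage 1: discrete audio tokens
token_emb = lm.embed(tokens)                           # conditioning signal for stage 2
denoiser = lambda x, cond, t: torch.zeros_like(x)      # placeholder noise predictor
latent = diffusion_refine(denoiser, token_emb)         # stage 2: diffusion refinement
```

The split mirrors the abstract's motivation: the token stage inherits the flexibility of language-model decoding, while the diffusion stage recovers the fidelity that purely autoregressive audio generation tends to lack.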
Submission Number: 46