Latent Diffusion Model for Audio: Generation, Quality Enhancement, and Neural Audio Codec

Published: 10 Oct 2024, Last Modified: 27 Oct 2024
Venue: Audio Imagination: NeurIPS 2024 Workshop
License: CC BY 4.0
Keywords: diffusion model, audio generation, audio codec
TL;DR: This report describes a demo of Latent Diffusion Models (LDMs) on three audio tasks: text-to-audio generation (AudioLDM-2), audio quality enhancement (AudioSR), and ultra-low-bitrate neural audio coding (SemantiCodec).
Abstract: In this demo, we explore the versatile application of Latent Diffusion Models (LDMs) to audio tasks, showcasing their capabilities across three state-of-the-art systems: AudioLDM-2 for text-to-audio generation, AudioSR for audio quality enhancement, and SemantiCodec for ultra-low-bitrate neural audio coding. AudioLDM-2 employs an LDM to decode high-quality audio from intermediate Audio Masked Autoencoder (AudioMAE) features, which are generated by a continuous language model conditioned on textual input. AudioSR leverages an LDM to perform robust audio super-resolution, enhancing the quality of low-resolution audio across diverse content types and input bandwidths, from speech and music to general sounds. SemantiCodec utilizes an LDM to efficiently decode audio from semantically rich, low-bitrate representations, demonstrating effective audio compression. Together, these systems illustrate the broad utility of LDMs as audio decoders for diverse audio generation, enhancement, and neural audio codec tasks. This report highlights the significance of these innovations and outlines our demo objectives.
Submission Number: 20