Coarse-to-Fine Text-to-Music Latent Diffusion

Published: 01 Jan 2025 · Last Modified: 25 Jul 2025 · ICASSP 2025 · CC BY-SA 4.0
Abstract: We introduce DiscoDiff, a text-to-music generative model that uses two latent diffusion models to generate high-fidelity 44.1 kHz music hierarchically. Our coarse-to-fine generation strategy, built on the residual vector quantization of the Descript Audio Codec, substantially improves audio quality. The design rests on a key observation: the audio latent representation can be split into a primary part and a secondary part, which control the musical content and the fine details, respectively. We validate the effectiveness of our approach and its text-audio alignment through various objective metrics. Furthermore, we release high-quality synthetic captions for the MTG-Jamendo and FMA datasets, and we open-source DiscoDiff's codebase and model checkpoints.
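The primary/secondary split can be pictured directly on the codec's residual vector quantization (RVQ) codes. Below is a minimal sketch, assuming the open-source `descript-audio-codec` package; the split point of three primary codebooks, the placeholder input, and the variable names are illustrative assumptions, not DiscoDiff's published configuration.

```python
# Sketch of a coarse/fine split over DAC RVQ codes.
# Assumes: pip install descript-audio-codec
import dac
import torch

# Load the pretrained 44.1 kHz codec (9 RVQ codebooks).
model = dac.DAC.load(dac.utils.download(model_type="44khz"))
model.eval()

audio = torch.randn(1, 1, 44100)  # placeholder: 1 s of mono 44.1 kHz audio
with torch.no_grad():
    x = model.preprocess(audio, sample_rate=44100)
    # codes: (batch, n_codebooks, frames) integer RVQ indices
    _, codes, _, _, _ = model.encode(x)

K = 3  # hypothetical boundary between primary and secondary codebooks
primary, secondary = codes[:, :K], codes[:, K:]

# In a coarse-to-fine scheme like the one described above, a first diffusion
# model would generate the primary part (musical content) from text, and a
# second model would fill in the secondary part (details) conditioned on it.
# Decoding the full code stack back to a waveform:
with torch.no_grad():
    z, _, _ = model.quantizer.from_codes(codes)
    reconstruction = model.decode(z)  # (1, 1, ~44100) waveform
```

Splitting along codebook depth is natural for RVQ: early codebooks quantize the bulk of the latent, while later ones quantize residuals, so they encode progressively finer detail.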