Keywords: Text-to-music generation, latent diffusion model, residual vector quantization
Abstract: We introduce DiscoDiff, a text-to-music generative model that uses two latent diffusion models to generate high-fidelity 44.1 kHz music hierarchically. Our approach substantially improves audio quality through a coarse-to-fine generation strategy, leveraging the residual vector quantization of the Descript Audio Codec. This coarse-to-fine design is grounded in a key observation: the audio latent representation can be split into a primary part and a secondary part, which control the musical content and the fine details, respectively. We validate the effectiveness of our approach and its text-audio alignment through various objective metrics. Furthermore, we provide access to high-quality synthetic captions for the MTG-Jamendo and FMA datasets, and we open-source DiscoDiff's codebase and model checkpoints.
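To make the primary/secondary split concrete, here is a minimal sketch of how residual vector quantization (RVQ) codes could be partitioned along their residual levels. The codebook count, split point, and tensor shapes are assumptions for illustration, not the paper's exact configuration:

```python
import torch

# Illustrative constants: DAC-style RVQ stacks several residual codebooks;
# the exact numbers below are assumptions, not DiscoDiff's configuration.
NUM_CODEBOOKS = 9
PRIMARY_LEVELS = 2  # assumed split point: coarse levels carry the content

def split_rvq_codes(codes: torch.Tensor):
    """Split RVQ codes of shape [batch, num_codebooks, time] into a
    primary part (first residual levels, controlling overall musical
    content) and a secondary part (remaining levels, controlling fine
    acoustic detail)."""
    primary = codes[:, :PRIMARY_LEVELS, :]
    secondary = codes[:, PRIMARY_LEVELS:, :]
    return primary, secondary

# Dummy code stack standing in for an encoded audio clip.
codes = torch.randint(0, 1024, (1, NUM_CODEBOOKS, 512))
primary, secondary = split_rvq_codes(codes)
```

Under this reading, coarse-to-fine generation would proceed in two stages: a first latent diffusion model generates the primary levels from text, and a second model generates the secondary levels conditioned on the primary ones, after which the full code stack is decoded back to 44.1 kHz audio.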
Submission Number: 15