Ainur: Harmonizing Speed and Quality in Deep Music Generation Through Lyrics-Audio Embeddings

Published: 01 Jan 2024 · Last Modified: 06 Aug 2024 · ICASSP 2024 · CC BY-SA 4.0
Abstract: In the domain of music generation, prevailing methods focus on text-to-music tasks and rely predominantly on diffusion models, yet they fail to achieve good vocal quality in synthetic music compositions. To tackle this critical challenge, we present Ainur, a hierarchical diffusion model that concentrates on the lyrics-to-music generation task. Through its use of multimodal Contrastive Lyrics-Audio Spectrogram Pretraining (CLASP) embeddings, Ainur distinguishes itself from past approaches by specifically enhancing the vocal quality of synthetically produced music. Notably, Ainur's training and testing processes are highly efficient, requiring only a single GPU. Experimental results show that Ainur matches or exceeds the quality of state-of-the-art models such as MusicGen, MusicLM, and AudioLDM2 in both objective and subjective evaluations. Additionally, Ainur offers near real-time inference speed, which facilitates its use in practical, real-world applications.
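For readers unfamiliar with contrastive multimodal pretraining, the sketch below illustrates how CLASP-style lyrics-audio embeddings could in principle be trained with a CLIP-style symmetric InfoNCE objective. This is a minimal, hypothetical example, not the authors' implementation: the class name `CLASPSketch`, the feature dimensions, and the assumption that pooled lyrics and spectrogram features are already available from upstream encoders are all illustrative.

```python
# Hypothetical sketch of CLIP-style contrastive pretraining between lyrics
# and audio-spectrogram embeddings, in the spirit of CLASP. All names and
# dimensions are illustrative assumptions, not the paper's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLASPSketch(nn.Module):
    def __init__(self, lyrics_dim=768, spec_dim=512, embed_dim=256):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.lyrics_proj = nn.Linear(lyrics_dim, embed_dim)
        self.spec_proj = nn.Linear(spec_dim, embed_dim)
        # Learnable temperature, as in CLIP.
        self.log_temp = nn.Parameter(torch.zeros(()))

    def forward(self, lyrics_feats, spec_feats):
        # L2-normalize so dot products are cosine similarities.
        z_l = F.normalize(self.lyrics_proj(lyrics_feats), dim=-1)
        z_s = F.normalize(self.spec_proj(spec_feats), dim=-1)
        logits = z_l @ z_s.t() * self.log_temp.exp()
        # Matched lyrics/audio pairs lie on the diagonal of the logit matrix.
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric InfoNCE: classify audio given lyrics and vice versa.
        loss = 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))
        return loss

# Usage with a batch of (assumed) precomputed, pooled encoder features.
model = CLASPSketch()
lyrics = torch.randn(8, 768)  # e.g. pooled text-encoder features
specs = torch.randn(8, 512)   # e.g. pooled spectrogram-encoder features
print(model(lyrics, specs).item())
```

Embeddings learned this way place lyrics and their matching audio close together in a shared space, which is what would let a downstream hierarchical diffusion model condition vocal synthesis on lyric content.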