ComposerFlow: Step-by-Step Compositional Song Generation

19 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Song generation, Text-to-Music, Music generation, Green AI
TL;DR: A song generation pipeline that leverages the knowledge of multiple models to achieve low-cost, fast training.
Abstract: Song generation models seek to produce audio recordings with vocals and instrumental accompaniment from user-provided lyrics and textual descriptions. While *end-to-end* approaches yield compelling results, they demand vast training data and computational resources. In this paper, we demonstrate that a *compositional* approach can make song generation far more data-efficient by decomposing the task into three sequential sub-tasks: melody composition, singing voice synthesis, and accompaniment generation. Although prior work exists for each sub-task, we show that naively chaining off-the-shelf models yields suboptimal outcomes; instead, these components must be re-engineered with song generation in mind. To this end, we introduce *MIDI-informed* singing accompaniment generation, a technique unexplored in prior literature, which conditions the accompaniment on a MIDI representation of the vocal melody and empirically improves rhythmic and harmonic consistency between singing and instrumentation. By integrating pre-existing models with our newly trained components (trained on only 6k hours of audio using a single RTX 3090 GPU), our pipeline achieves perceptual quality on par with leading end-to-end open-source models while offering advantages in training efficiency, licensed singing voices from professional artists, and editable intermediates. We provide audio demos and will open-source our model at https://composerflow.github.io/web/.
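To make the compositional structure concrete, below is a minimal Python sketch of the three-stage pipeline described in the abstract. All names and signatures here are hypothetical placeholders (the abstract does not specify the models' APIs); the sketch only illustrates the data flow, in particular that the accompaniment stage is conditioned on the same MIDI melody as the vocals.

```python
from dataclasses import dataclass
from typing import Callable

import numpy as np

# Hypothetical stage interfaces; illustrative only, not the paper's actual APIs.
ComposeMelody = Callable[[str, str], bytes]                 # (lyrics, description) -> melody MIDI
SynthesizeVocals = Callable[[str, bytes], np.ndarray]       # (lyrics, melody MIDI) -> vocal waveform
GenerateAccompaniment = Callable[[str, bytes], np.ndarray]  # (description, melody MIDI) -> accompaniment


@dataclass
class Song:
    melody_midi: bytes         # editable intermediate: the composed vocal melody
    vocals: np.ndarray         # mono waveform, e.g. float32 at 44.1 kHz
    accompaniment: np.ndarray
    mix: np.ndarray


def generate_song(
    lyrics: str,
    description: str,
    compose_melody: ComposeMelody,
    synthesize_vocals: SynthesizeVocals,
    generate_accompaniment: GenerateAccompaniment,
    vocal_gain: float = 1.0,
) -> Song:
    """Chain the three sub-tasks sequentially.

    The key point is MIDI-informed accompaniment generation: the
    accompaniment is conditioned on the vocal melody's MIDI, not on
    text alone, which is what ties singing and instrumentation together.
    """
    melody_midi = compose_melody(lyrics, description)                 # 1. melody composition
    vocals = synthesize_vocals(lyrics, melody_midi)                   # 2. singing voice synthesis
    accompaniment = generate_accompaniment(description, melody_midi)  # 3. MIDI-informed accompaniment

    # Naive mix; a real pipeline would align lengths and balance loudness.
    n = min(len(vocals), len(accompaniment))
    mix = vocal_gain * vocals[:n] + accompaniment[:n]
    return Song(melody_midi, vocals[:n], accompaniment[:n], mix)
```

Because the melody MIDI is an explicit intermediate in this design, it can be inspected or edited before the later stages run, which is one way to read the abstract's "editable intermediates" claim.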
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 18303