SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation

Zihan Liu; Shuangrui Ding; Zhixiong Zhang; Xiaoyi Dong; Pan Zhang; Yuhang Zang; Yuhang Cao; Dahua Lin; Jiaqi Wang

Published: 01 May 2025, Last Modified: 18 Jun 2025. Venue: ICML 2025 poster. License: CC BY-NC 4.0

TL;DR: We propose SongGen, a fully open-source, single-stage auto-regressive transformer designed for controllable song generation, supporting both mixed mode and dual-track mode generation.

Abstract: Text-to-song generation, the task of creating vocals and accompaniment from textual inputs, poses significant challenges due to domain complexity and data scarcity. Existing approaches often employ multi-stage generation procedures, leading to cumbersome training and inference pipelines, as well as suboptimal overall generation quality due to error accumulation across stages. In this paper, we propose **SongGen**, a fully open-source, single-stage auto-regressive transformer designed for controllable song generation. The proposed model facilitates fine-grained control over diverse musical attributes, including lyrics and textual descriptions of instrumentation, genre, mood, and timbre, while also offering an optional three-second reference clip for voice cloning. Within a unified auto-regressive framework, SongGen supports two output modes: **mixed mode**, which generates a mixture of vocals and accompaniment directly, and **dual-track mode**, which synthesizes them separately for greater flexibility in downstream applications. We explore diverse token pattern strategies for each mode, leading to notable improvements and valuable insights. Furthermore, we design an automated data preprocessing pipeline with effective quality control. To foster community engagement and future research, we will release our model weights, training code, annotated data, and preprocessing pipeline. The code is available at https://github.com/LiuZH-19/SongGen.
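As a rough illustration of what a token pattern for dual-track mode could look like, the sketch below interleaves two equal-length token streams frame by frame so that a single auto-regressive sequence carries both tracks, then splits them back apart. This is only one hypothetical pattern chosen for illustration: the abstract does not specify which patterns SongGen uses, and the function names `interleave_tracks` and `split_tracks` and the integer token values are assumptions, not part of the released code.

```python
from typing import List, Sequence, Tuple


def interleave_tracks(vocal: Sequence[int], accomp: Sequence[int]) -> List[int]:
    """Merge two per-frame token streams into one sequence: v0, a0, v1, a1, ...

    Hypothetical helper: assumes one token per frame per track.
    """
    if len(vocal) != len(accomp):
        raise ValueError("both tracks must have the same number of frames")
    merged: List[int] = []
    for v_tok, a_tok in zip(vocal, accomp):
        merged.extend((v_tok, a_tok))
    return merged


def split_tracks(merged: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Invert interleave_tracks: even positions are vocal, odd are accompaniment."""
    if len(merged) % 2 != 0:
        raise ValueError("interleaved sequence must have even length")
    return list(merged[0::2]), list(merged[1::2])


if __name__ == "__main__":
    vocal_tokens = [1, 2, 3]       # made-up token ids for illustration
    accomp_tokens = [10, 20, 30]   # made-up token ids for illustration
    sequence = interleave_tracks(vocal_tokens, accomp_tokens)
    print(sequence)
    print(split_tracks(sequence))
```

In mixed mode, by contrast, a single token stream would represent the already-mixed audio, so no such merge and split step would be needed.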

Lay Summary: Generating songs from text — including both vocals and accompaniment — is a complex task. Existing methods use multi-stage processes that are slow and often reduce the final output quality. We introduce SongGen, a single-stage AI model that creates songs from lyrics, musical descriptions, and short voice samples, delivering better quality and efficiency than multi-stage approaches. SongGen is fully open-source, aiming to make high-quality AI song generation more accessible and controllable.

Link To Code: https://github.com/LiuZH-19/SongGen

Primary Area: Deep Learning->Generative Models and Autoencoders

Keywords: text-to-song, song generation, auto-regressive transformer

Submission Number: 866
