SongComposer: A Large Language Model for Lyric and Melody Composition in Song Generation

ICLR 2025 Conference Submission 6213 Authors

26 Sept 2024 (modified: 02 Dec 2024) · ICLR 2025 Conference Submission · CC BY 4.0
Keywords: LLM, Symbolic Song Composition
TL;DR: SongComposer is an innovative Large Language Model (LLM) designed specifically for symbolic song composition, capable of generating both melodies and lyrics.
Abstract: A song typically comprises a vocal track and a music track. Creating lyrics and melodies for the vocal track in a symbolic format, known as song composition, plays a significant role in song generation. This delicate and complex task demands expert musical knowledge of melody, an advanced understanding of lyrics, and precise alignment between them. Despite achievements in sub-tasks such as lyric generation, lyric-to-melody generation, and melody-to-lyric generation, a unified model for song composition has not yet been achieved. In this paper, we introduce SongComposer, a pioneering step towards a unified song composition model that can readily create symbolic lyrics and melodies following instructions. SongComposer is a music-specialized large language model (LLM) that, for the first time, integrates the capability of simultaneously composing lyrics and melodies into LLMs. To achieve this goal, we introduce three non-trivial components. 1) Sheet music understanding: we design a flexible tuple format to load lyric and note attributes, fostering word-level alignment between lyrics and melodies and enabling SongComposer to generate lyrics with well-aligned accompanying melodies. 2) Song note tokenizing: we extend the tokenizer's vocabulary with song-note tokens and find that a proper scalar-manner initialization of the new tokens, based on musical priors, is essential for the model to understand musical rhythm. 3) Structural music generation: we propose a multi-stage pipeline for progressively capturing musical structure. Initially, we extract and feed motif-level melody patterns to SongComposer to build its basic generation capabilities. Later, we insert special tokens into the whole-song data to denote phrase-level structure, promoting logical repetition and smooth coherence. Extensive experiments demonstrate that SongComposer outperforms advanced LLMs, including GPT-4, in tasks such as lyric-to-melody generation, melody-to-lyric generation, song continuation, and text-to-song creation. We showcase generated samples on our anonymous project page https://songcomposer.github.io/. Due to the lack of high-quality symbolic song datasets with lyrics and melodies, we have carefully curated and will publicly release SongCompose, a large-scale song pretraining and supervised finetuning dataset that includes lyrics, melodies, and paired lyrics-melodies in both Chinese and English.
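The abstract sketches two mechanisms that lend themselves to a concrete illustration: the word-level tuple format that aligns each lyric token with note attributes, and the vocabulary extension whose new note tokens are initialized in a scalar manner from a musical prior. The following is a minimal, hypothetical sketch of both ideas, not the authors' implementation; the tuple layout, token names (e.g., `<dur_...>`), anchor indices, and the linear-interpolation scheme are all assumptions made for illustration.

```python
# Illustrative sketch only (assumptions, not SongComposer's actual code):
#   1) a word-level tuple pairing a lyric token with its note attributes;
#   2) extending an embedding table with new note/duration tokens whose
#      weights are initialized by scalar interpolation between two anchors.
from dataclasses import dataclass

import torch
import torch.nn as nn


@dataclass
class SongTuple:
    """One word-level unit aligning a lyric token with melody attributes (assumed layout)."""
    lyric: str        # a word or syllable of the lyric
    pitch: str        # e.g. "G4" (hypothetical note naming)
    duration: float   # note length in seconds (assumed attribute)
    rest: float       # trailing rest in seconds (assumed attribute)

    def to_text(self) -> str:
        # Serialize the tuple into a flat text line an LLM could consume.
        return f"<bol> {self.lyric} | {self.pitch} | <dur_{self.duration}> | <rest_{self.rest}> <eol>"


def extend_embeddings_with_scalar_init(
    embedding: nn.Embedding,
    num_new_tokens: int,
    anchor_low: int,
    anchor_high: int,
) -> nn.Embedding:
    """Grow the embedding table and initialize the new tokens by linearly
    interpolating between two existing anchor embeddings, so tokens for larger
    scalar values (e.g. longer durations) lie farther along one direction."""
    old_vocab, dim = embedding.weight.shape
    new_embedding = nn.Embedding(old_vocab + num_new_tokens, dim)
    with torch.no_grad():
        new_embedding.weight[:old_vocab] = embedding.weight
        low = embedding.weight[anchor_low]
        high = embedding.weight[anchor_high]
        for i in range(num_new_tokens):
            alpha = i / max(num_new_tokens - 1, 1)
            new_embedding.weight[old_vocab + i] = (1 - alpha) * low + alpha * high
    return new_embedding


if __name__ == "__main__":
    # One aligned lyric-melody unit, plus an embedding table extended
    # with 64 hypothetical duration tokens.
    unit = SongTuple(lyric="shine", pitch="G4", duration=0.5, rest=0.1)
    print(unit.to_text())

    base = nn.Embedding(32000, 4096)  # stand-in for the base LLM's embedding table
    extended = extend_embeddings_with_scalar_init(
        base, num_new_tokens=64, anchor_low=100, anchor_high=200
    )
    print(extended.weight.shape)      # torch.Size([32064, 4096])
```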
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6213