Keywords: Speech Language Model, Singing Voice Synthesis
TL;DR: Adapting a text-to-speech pretrained language model to the low-resource singing voice synthesis task.
Abstract: Speech Language Models (SLMs) have recently emerged as a unified paradigm for addressing a wide range of speech-related tasks, including text-to-speech (TTS), speech enhancement (SE), and automatic speech recognition (ASR). However, the generalization capability of large-scale pre-trained SLMs remains underexplored. In this work, we adapt a 1.7B-parameter TTS-pretrained SLM to singing voice synthesis (SVS), using only a 135-hour synthetic singing corpus, ACE-Opencpop. Building upon ESPNet-SpeechLM, our recipe involves the following procedure: (1) tokenization of music score conditions and singing waveforms, (2) multi-stream language model token prediction, (3) conditional flow matching-based mel-spectrogram generation, and (4) mel-to-waveform vocoding. Experimental results demonstrate that our adapted SLM generalizes well to SVS and achieves performance comparable to leading discrete token-based SVS models.
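As a concrete illustration of step (3), below is a minimal sketch of an optimal-transport conditional flow-matching training objective for mel-spectrogram generation, in the style popularized by Matcha-TTS-like models. It is not the paper's implementation: `VelocityNet`, its dimensions, and the conditioning interface to the language model outputs are hypothetical stand-ins.

```python
# Minimal sketch (not the authors' code) of an OT conditional flow-matching
# objective for mel-spectrogram generation. All module names and shapes are
# hypothetical assumptions for illustration only.
import torch
import torch.nn as nn


class VelocityNet(nn.Module):
    """Hypothetical velocity estimator conditioned on LM token embeddings."""

    def __init__(self, n_mels: int = 80, cond_dim: int = 512, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels + cond_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, n_mels),
        )

    def forward(self, x_t, t, cond):
        # x_t: (B, T, n_mels), t: (B,), cond: (B, T, cond_dim)
        t_feat = t[:, None, None].expand(-1, x_t.size(1), 1)
        return self.net(torch.cat([x_t, cond, t_feat], dim=-1))


def cfm_loss(model, mel, cond, sigma_min: float = 1e-4):
    """OT-CFM loss: regress the velocity field that transports Gaussian
    noise to the target mel-spectrogram along a straight-line path."""
    x1 = mel                                    # target mel (B, T, n_mels)
    x0 = torch.randn_like(x1)                   # noise sample
    t = torch.rand(x1.size(0), device=x1.device)
    t_exp = t[:, None, None]
    x_t = (1 - (1 - sigma_min) * t_exp) * x0 + t_exp * x1
    u_t = x1 - (1 - sigma_min) * x0             # target velocity
    v_pred = model(x_t, t, cond)
    return torch.mean((v_pred - u_t) ** 2)


# Usage sketch with random tensors standing in for LM token embeddings.
model = VelocityNet()
mel = torch.randn(2, 100, 80)
cond = torch.randn(2, 100, 512)
loss = cfm_loss(model, mel, cond)
loss.backward()
```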
Track: Paper Track
Confirmation: Paper Track: I confirm that I have followed the formatting guideline and anonymized my submission.
Submission Number: 80