Abstract: Zero-shot singing voice synthesis (SVS) with style transfer and style control has various potential applications in music composition and short video dubbing.
However, existing SVS models rely heavily on phoneme and note boundary annotations, which limits their robustness in zero-shot scenarios and yields poor transitions between phonemes and notes.
They also lack effective multi-level style control via diverse prompts.
To overcome these challenges, we introduce CusSinger, a multi-task multilingual zero-shot SVS model with style transfer and style control based on various prompts.
CusSinger comprises three key modules:
1) a Blurred Boundary Content (BBC) Encoder, which predicts durations, extends content embeddings, and masks boundary regions to enable smooth transitions;
2) a Custom Audio Encoder, which uses contrastive learning to extract representations aligned across singing, speech, and textual prompts;
3) a Flow-based Custom Transformer, which leverages Cus-MOE with F0 supervision to enhance both the synthesis quality and the style modeling of the generated singing voice.
Experimental results show that CusSinger outperforms baseline models in both subjective and objective metrics across multiple related tasks.
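The blurred-boundary idea behind the BBC Encoder can be illustrated with a minimal sketch: expand per-phoneme content embeddings by predicted durations, then mask the frames adjacent to each phoneme boundary so the decoder must infer smooth transitions. This is an assumption-laden toy illustration, not the paper's implementation; the function name, `blur` radius, and zero-masking choice are all hypothetical.

```python
# Toy sketch of blurred-boundary content expansion (hypothetical, not the
# paper's actual implementation).
import numpy as np

def expand_and_blur(phoneme_emb: np.ndarray, durations, blur: int = 2) -> np.ndarray:
    """phoneme_emb: (num_phonemes, dim); durations: predicted frames per phoneme.

    Returns frame-level embeddings with frames within `blur` frames of each
    internal phoneme boundary masked to zero.
    """
    # Repeat each phoneme embedding for its predicted number of frames.
    frames = np.repeat(phoneme_emb, durations, axis=0)  # (sum(durations), dim)
    # Internal boundaries fall at the cumulative durations (excluding the end).
    boundaries = np.cumsum(durations)[:-1]
    mask = np.zeros(len(frames), dtype=bool)
    for b in boundaries:
        mask[max(0, b - blur):min(len(frames), b + blur)] = True
    frames = frames.copy()
    frames[mask] = 0.0  # masked frames: the model must learn the transition
    return frames
```

For example, with three phonemes of durations [3, 2, 4] and `blur=1`, the two frames straddling each of the two internal boundaries are zeroed, leaving the phoneme-interior frames intact.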
Paper Type: Long
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: few-shot generation, model architectures, efficient models
Contribution Types: NLP engineering experiment
Languages Studied: Chinese, English, French, Spanish, German, Italian, Japanese, Korean, Russian
Submission Number: 1459