Abstract: Zero-shot Singing Voice Synthesis (SVS) with style transfer aims to generate high-quality singing voices of unseen timbres and styles (including singing methods, rhythm, techniques, and pronunciation) from a prompt audio. However, the multifaceted nature of singing voice styles poses a significant challenge for comprehensive modeling and effective transfer. Furthermore, existing SVS models often fail to generate singing voices rich in stylistic nuance for unseen singers. In this paper, we introduce TransferSinger, a novel zero-shot SVS model that employs three modules to address these challenges: 1) a style encoder that uses a Vector Quantization (VQ) model to condense style information into a compact latent space, facilitating subsequent prediction; 2) a Style and Duration Language Model (S\&D-LM) that jointly predicts style information and phoneme duration, improving the accuracy of both; and 3) a style adaptive decoder that applies a novel style adaptive normalization method to generate singing voices with finer detail. Experimental results show that TransferSinger outperforms baseline models in both synthesis quality and singer similarity across various tasks, including zero-shot SVS, controllable style synthesis, cross-lingual style transfer, and speech-to-singing style transfer. Singing voice samples can be accessed at \url{https://transfersinger.github.io/}.
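The abstract does not give the formula for the style adaptive normalization used in the decoder. As a rough illustration only, a minimal NumPy sketch of one common conditional-normalization pattern (AdaIN-style modulation, where a style embedding produces per-channel scale and shift) is shown below; the function name, the projection matrices `wg`/`wb`, and all shapes are hypothetical assumptions, not the paper's actual method.

```python
import numpy as np

def style_adaptive_norm(content, style_embedding, wg, wb, eps=1e-5):
    """Hypothetical AdaIN-style sketch: normalize content features per channel,
    then modulate them with a scale and shift derived from a style embedding.
    content: (T, C) frame-level features; style_embedding: (S,) style vector;
    wg, wb: (S, C) hypothetical linear projections to scale/shift."""
    mean = content.mean(axis=0, keepdims=True)
    std = content.std(axis=0, keepdims=True)
    normalized = (content - mean) / (std + eps)  # zero-mean, unit-variance per channel
    gamma = style_embedding @ wg  # per-channel scale conditioned on style
    beta = style_embedding @ wb   # per-channel shift conditioned on style
    return gamma * normalized + beta

rng = np.random.default_rng(0)
content = rng.normal(size=(100, 8))   # 100 frames, 8 channels (toy sizes)
style = rng.normal(size=(16,))        # toy style embedding
wg = rng.normal(size=(16, 8))
wb = rng.normal(size=(16, 8))
out = style_adaptive_norm(content, style, wg, wb)
```

Because the content features are normalized to zero mean per channel before modulation, the output's per-channel mean equals the style-derived shift, which is what lets the style prompt steer the decoder's feature statistics in this kind of scheme.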
Paper Type: long
Research Area: Speech recognition, text-to-speech and spoken language understanding
Contribution Types: NLP engineering experiment
Languages Studied: Chinese, English