UnifySpeech: A Unified Framework for Zero-shot Text-to-Speech and Voice Conversion

Haogeng Liu; Tao Wang; Ruibo Fu; Jiangyan Yi; Jianhua Tao; Zhengqi Wen

22 Sept 2022 (modified: 12 Oct 2025); ICLR 2023 Conference Withdrawn Submission; Readers: Everyone

Keywords: decoupling, zero-shot learning, text-to-speech, voice conversion, vector quantization

Abstract: Unlabeled speech contains rich speaker style information, which can improve few-shot modeling capability. This paper proposes UnifySpeech, which uses large amounts of unlabeled data for model training and improves the performance of text-to-speech (TTS) and voice conversion (VC) simultaneously. UnifySpeech brings TTS and VC into a unified framework for the first time. The model is based on the assumption that speech can be decoupled into three independent components: content information, speaker information, and prosody information. Both TTS and VC can be regarded as extracting these three components from the input and then reconstructing speech from them. In TTS the content information is derived from the text, while in VC it is derived from the source speech, so the two pipelines share every module except the content extraction module. We apply vector quantization and loss optimization to bridge the gap between the content domains of TTS and VC. Objective evaluation shows that UnifySpeech achieves higher speaker similarity and pitch prediction accuracy, indicating improved style modeling ability. Subjective evaluation shows that speech generated by UnifySpeech obtains a high mean opinion score (MOS), indicating that the audio is as natural as human speech.
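The shared-pipeline idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the encoders, codebook size, or losses, so every name and value below (`quantize`, `shared_decoder_input`, the two-dimensional codebook, the stand-in content vectors) is a hypothetical chosen only for illustration. The sketch shows how nearest-neighbor vector quantization can map content features from two different sources (text for TTS, source speech for VC) onto the same discrete codes, after which the speaker and prosody components and everything downstream can be shared.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def quantize(frame: Sequence[float],
             codebook: Sequence[Sequence[float]]) -> Tuple[int, Vector]:
    """Return (index, entry) of the codebook entry nearest to `frame`
    under squared Euclidean distance."""
    def sq_dist(entry: Sequence[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(frame, entry))

    index = min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))
    return index, list(codebook[index])


def quantize_sequence(frames: Sequence[Sequence[float]],
                      codebook: Sequence[Sequence[float]]) -> List[Vector]:
    """Quantize every frame of a content sequence independently."""
    return [quantize(frame, codebook)[1] for frame in frames]


def shared_decoder_input(content: Sequence[Vector],
                         speaker: Sequence[float],
                         prosody: Sequence[Sequence[float]]) -> List[Vector]:
    """Build per-frame decoder input by concatenating quantized content,
    an utterance-level speaker vector, and frame-level prosody.
    (Hypothetical stand-in for the modules shared by TTS and VC.)"""
    return [c + list(speaker) + list(p) for c, p in zip(content, prosody)]


# Hypothetical toy values, not taken from the paper.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
text_content = [[0.1, -0.1], [0.9, 1.2]]    # stand-in for text encoder output (TTS)
speech_content = [[-0.2, 0.1], [1.1, 0.8]]  # stand-in for speech content encoder output (VC)
speaker = [0.5]                             # stand-in speaker vector
prosody = [[0.2], [0.3]]                    # stand-in per-frame prosody

# Only the content source differs; quantization and everything after it are shared.
tts_input = shared_decoder_input(quantize_sequence(text_content, codebook), speaker, prosody)
vc_input = shared_decoder_input(quantize_sequence(speech_content, codebook), speaker, prosody)
```

In this toy case the text-derived and speech-derived content vectors differ slightly but fall nearest to the same codebook entries, so both pipelines hand identical input to the shared modules. That is the sense in which a common codebook can narrow the gap between the two content domains; a trained system would learn the codebook and encoders jointly rather than fix them by hand.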

Please Choose The Closest Area That Your Submission Falls Into: Applications (e.g., speech processing, computer vision, NLP)

Supplementary Material: zip

Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/unifyspeech-a-unified-framework-for-zero-shot/code)
