Prosody-TTS: Self-Supervised Prosody Pretraining with Latent Diffusion for Text-to-Speech

22 Sept 2022 (modified: 13 Feb 2023), ICLR 2023 Conference Withdrawn Submission, Readers: Everyone
Keywords: Text-to-speech, Prosody modeling, Self-supervised learning, Diffusion probabilistic model
Abstract: Expressive text-to-speech aims to generate high-quality samples with rich and diverse prosody, a goal hampered by two major challenges: 1) given the one-to-many mapping problem, prosodic attributes in highly dynamic voices are difficult to capture and model without intonation; and 2) the TTS model should learn a diverse latent space and avoid producing dull samples with a collapsed prosodic distribution. This paper proposes Prosody-TTS, a two-stage TTS pipeline that improves prosody modeling and sampling by introducing two components: 1) a self-supervised learning model that derives the prosodic representation without relying on text transcriptions or local prosody ground truth, which ensures the model covers diverse speaking voices and prevents sub-optimal solutions and distribution collapse; and 2) a latent diffusion model that samples and produces diverse patterns within the learned prosodic space, which prevents TTS models from generating dull samples that regress to the mean of the distribution. Prosody-TTS achieves high-fidelity speech synthesis with rich and diverse prosodic attributes. Experimental results demonstrate that it surpasses state-of-the-art models in terms of audio quality and prosody naturalness. Downstream evaluation and ablation studies further demonstrate the effectiveness of each design. Audio samples are available at https://Prosody-TTS.github.io/.
Please Choose The Closest Area That Your Submission Falls Into: Applications (e.g., speech processing, computer vision, NLP)
TL;DR: We propose Prosody-TTS, a TTS model that enhances prosody modeling through self-supervised prosody pre-training and generative latent diffusion.