Contrastive Context-Speech Pretraining for Expressive Text-to-Speech Synthesis

Published: 20 Jul 2024 · Last Modified: 05 Aug 2024 · MM 2024 Poster · CC BY 4.0
Abstract: The latest Text-to-Speech (TTS) systems can produce speech with voice quality and naturalness comparable to human speech. Yet the demand for large amounts of high-quality data from target speakers remains a significant challenge. For long-form expressive reading in particular, training speech from the target speaker that covers rich contextual information is needed. In this paper, a novel context-aware speech pre-trained model is developed for expressive TTS based on contrastive learning. The model can be trained on abundant speech data without explicitly labelled speaker identities. It captures the intricate relationship between the speech expression of a spoken sentence and its contextual text information. By incorporating cross-modal text and speech features into the TTS model, it enables the generation of coherent and expressive speech, which is especially beneficial when target speaker data is scarce. The pre-trained model is evaluated first on the task of Context-Speech retrieval and then as an integral part of a zero-shot TTS system. Experimental results demonstrate that the pretraining framework effectively learns Context-Speech representations and significantly enhances the expressiveness of synthesized speech. Audio demos are available at: https://ccsp2024.github.io/demo/.
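To make the contrastive alignment concrete, below is a minimal sketch of a CLIP-style symmetric InfoNCE objective over paired context-text and speech embeddings, the standard formulation for this kind of cross-modal pretraining. The encoder outputs, embedding dimension, and temperature value are illustrative assumptions; the abstract does not specify CCSP's exact loss or architecture, so this should be read as a plausible instance of the technique, not the authors' implementation.

```python
# A sketch of symmetric InfoNCE contrastive alignment between context-text
# and speech embeddings (CLIP-style). Hyperparameters are assumptions, not
# values from the paper.
import torch
import torch.nn.functional as F

def contrastive_context_speech_loss(context_emb, speech_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    context_emb: (B, D) embeddings of the contextual text around a sentence.
    speech_emb:  (B, D) embeddings of the spoken sentence's expression.
    Matched (context, speech) pairs lie on the diagonal of the similarity
    matrix; all other pairs in the batch act as negatives.
    """
    context_emb = F.normalize(context_emb, dim=-1)
    speech_emb = F.normalize(speech_emb, dim=-1)

    # Cosine similarities scaled by temperature, shape (B, B).
    logits = context_emb @ speech_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Contrast in both directions: context -> speech and speech -> context.
    loss_c2s = F.cross_entropy(logits, targets)
    loss_s2c = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_c2s + loss_s2c)

# Toy usage with random tensors standing in for encoder outputs.
if __name__ == "__main__":
    B, D = 8, 256
    ctx = torch.randn(B, D)
    sph = torch.randn(B, D)
    print(contrastive_context_speech_loss(ctx, sph).item())
```

A trained model of this form supports the Context-Speech retrieval evaluation directly: ranking speech candidates by their similarity to a context embedding.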
Primary Subject Area: [Content] Media Interpretation
Secondary Subject Area: [Generation] Generative Multimedia
Relevance To Conference: The Contrastive Context-Speech Pretraining (CCSP) framework advances multimodal processing by enabling expressive Text-to-Speech (TTS) synthesis with enhanced context-awareness. It learns cross-modal representations that bridge text and speech, improving the naturalness and expressiveness of generated speech, particularly for long-form content. The framework uses contrastive learning to align context and speech, facilitating better understanding and integration of multiple modalities. CCSP allows TTS systems to generate context-aware voices without large amounts of contextual voice data from the target speaker, showcasing flexibility and data efficiency. This is critical for multimedia applications where large, high-quality datasets are limited. CCSP benefits applications such as audiobooks, conversational agents, and news reading. Overall, CCSP represents a significant contribution to multimodal multimedia processing, pushing the boundaries of how machines generate and interact with human language in a contextually rich manner.
Submission Number: 3507