UniStyle: Unified Style Modeling for Speaking Style Captioning and Stylistic Speech Synthesis

Published: 20 Jul 2024, Last Modified: 06 Aug 2024 · MM 2024 Oral · CC BY 4.0
Abstract: Understanding the speaking style of an interlocutor's speech, such as its emotion, and responding in an appropriate style is natural in human conversation. Technically, however, existing research on speech synthesis and on speaking style captioning has typically proceeded independently. In this work, we propose UniStyle, a framework that unifies speaking style captioning and style-controllable speech synthesis. Specifically, UniStyle consists of a UniConnector and a style prompt-based speech generator. The UniConnector bridges the gap between the two modalities, speech audio and text descriptions: it generates text descriptions from speech input, and it creates style representations from text descriptions that condition the speech generator. In addition, to overcome data scarcity, we propose a two-stage, semi-supervised training strategy that reduces data requirements while boosting performance. Extensive experiments on open-source corpora demonstrate that UniStyle achieves state-of-the-art performance in speaking style captioning and synthesizes expressive speech with various speaker timbres and speaking styles in a zero-shot manner.
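The core idea of the UniConnector, as described in the abstract, is that speech audio and text style descriptions are projected into one shared style space, so a single embedding can drive either a caption decoder or a style-prompted speech generator. The minimal sketch below illustrates that dual-pathway idea only; all names, dimensions, and the random linear projections are hypothetical stand-ins for the paper's learned encoders, not the actual UniStyle implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16  # size of the shared style-embedding space (illustrative)

# Hypothetical stand-ins for learned encoders: random linear maps that
# project each modality's features into the same style space.
W_speech = rng.normal(size=(32, DIM))  # speech features -> style embedding
W_text = rng.normal(size=(24, DIM))    # description features -> style embedding

def embed_speech(speech_feats: np.ndarray) -> np.ndarray:
    """Project speech-derived features into the shared style space."""
    v = speech_feats @ W_speech
    return v / np.linalg.norm(v)

def embed_text(text_feats: np.ndarray) -> np.ndarray:
    """Project text-description features into the same style space."""
    v = text_feats @ W_text
    return v / np.linalg.norm(v)

# Captioning direction: speech -> style embedding -> (caption decoder).
style_from_speech = embed_speech(rng.normal(size=32))

# Synthesis direction: description -> style embedding -> (speech generator).
style_from_text = embed_text(rng.normal(size=24))

# Both pathways land in the same space, so downstream modules can
# consume either embedding interchangeably.
assert style_from_speech.shape == style_from_text.shape == (DIM,)
```

In the actual system the two projections would be trained (e.g., to align paired speech and descriptions), whereas here they are frozen random matrices used purely to show the shared-space interface.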
Primary Subject Area: [Content] Media Interpretation
Secondary Subject Area: [Generation] Generative Multimedia
Relevance To Conference: This work advances multimodal processing by unifying speaking style captioning and stylistic speech synthesis within the UniStyle framework. Integrating these two tasks improves the ability to both understand and generate speaking styles across modalities. UniStyle achieves this through a UniConnector, which aligns the speaking styles present in speech signals with corresponding text descriptions, bridging the gap between spoken and textual representations of style. As a result, users can generate text descriptions that characterize a speaking style, or stylistic speech that expresses one, enabling richer communication and a deeper appreciation of the subtleties of spoken language.
Submission Number: 4103