PromotiCon: Prompt-based Emotion Controllable Text-to-Speech via Prompt Generation and Matching

Published: 01 Jan 2024 · Last Modified: 22 May 2025 · SMC 2024 · CC BY-SA 4.0
Abstract: Text-to-speech (TTS) technologies have recently expanded to incorporate natural language prompts for user-friendly control of speech styles, driven by significant advancements in language models. Prior prompt-based TTS research, however, typically relies on large-scale prompt collections that often require costly human annotation. To address this challenge, we propose PromotiCon, a model that leverages prompts generated without human annotation to control emotions in speech. Our model utilizes abundant prompts generated with a large language model. Additionally, we propose an emotion distance-based prompt-speech matching method to pair each generated prompt with the most closely matching speech data. To enhance speaker adaptation, we adopt a semi-supervised approach that allows the joint utilization of multi-speaker data without emotion labels. As a result, our model enables zero-shot emotional speech synthesis. Our experimental results confirm the effectiveness of our approach. Audio samples are available at https://promoticon.github.io/.
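To make the matching idea concrete, here is a minimal sketch of one plausible realization of emotion distance-based prompt-speech matching. It is our own illustration, not the paper's implementation: it assumes emotion embeddings for prompts and speech clips have already been extracted (the choice of encoders, the shared embedding space, and the Euclidean distance metric are all assumptions), and it pairs each prompt with the nearest speech clip in that space.

```python
import numpy as np

# Hypothetical sketch of emotion distance-based prompt-speech matching.
# Assumes precomputed emotion embeddings for each LLM-generated prompt and
# each speech clip, e.g. from a text emotion classifier and a speech emotion
# recognizer projected into a shared space (these encoders are assumptions,
# not specified by the abstract).

def match_prompts_to_speech(prompt_emb: np.ndarray,
                            speech_emb: np.ndarray) -> np.ndarray:
    """Pair each prompt with the speech clip at minimal emotion distance.

    prompt_emb: (P, D) emotion embeddings of P generated prompts.
    speech_emb: (S, D) emotion embeddings of S speech clips.
    Returns an index array of shape (P,): matched speech clip per prompt.
    """
    # L2-normalize so Euclidean distance reflects emotion similarity
    # rather than embedding magnitude.
    p = prompt_emb / np.linalg.norm(prompt_emb, axis=1, keepdims=True)
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    # Pairwise squared Euclidean distances, shape (P, S).
    dists = ((p[:, None, :] - s[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

# Toy usage with random vectors standing in for real encoder outputs.
rng = np.random.default_rng(0)
prompts = rng.normal(size=(4, 16))   # 4 generated prompts
speech = rng.normal(size=(10, 16))   # 10 unlabeled speech clips
print(match_prompts_to_speech(prompts, speech))  # index of matched clip per prompt
```

In practice the resulting prompt-speech pairs would serve as training data, so the unlabeled multi-speaker corpus mentioned in the abstract could be folded in without manual emotion annotation.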