Unlocking Fine-Grained and Within-Utterance Speaking Style Control in Prompt-Based Text-to-Speech Models
Keywords: Text-to-Speech, Style Conversion, Prompting
Abstract: While prompt-based text-to-speech (TTS) models enable speaking-style control through natural-language prompts, they typically offer only coarse control and apply a single global style across an entire utterance.
This restricts practical use cases that require continuous interpolation of style attributes across utterances and time-varying style transitions within a single utterance.
In this paper, we propose novel techniques to achieve both capabilities in existing prompt-based TTS models.
For inter-utterance style interpolation, we compute direction vectors between contrastive style prompts in the embedding space and linearly interpolate along them, enabling smooth transitions between style characteristics.
For intra-utterance style transition, we first identify a strong attention bias toward early tokens in autoregressive TTS decoders, which causes the initial audio realization to dominate subsequent generation.
To mitigate this effect, we introduce KV-cache swapping and sliding-window attention masking.
Experiments demonstrate that our proposed inter-utterance interpolation achieves a 99–100% success rate in gender conversion, pitch variation of up to 36 Hz, and speed changes of up to 1.6 syllables per second.
Our intra-utterance transition maintains a speaker similarity of 0.81–0.91 and achieves perceptual smoothness scores of 3.48–4.48.
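
The inter-utterance interpolation described in the abstract reduces to linear movement along a direction vector between two prompt embeddings. The sketch below illustrates the idea under assumptions: encode_style_prompt is a hypothetical stand-in for the model's actual prompt encoder, and the commented synthesis call is likewise assumed rather than taken from the paper.

```python
import numpy as np

def encode_style_prompt(prompt: str) -> np.ndarray:
    """Hypothetical stand-in for the TTS model's style-prompt encoder;
    returns a fixed-size style embedding (deterministic per prompt)."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal(256)

def interpolate_style(prompt_a: str, prompt_b: str, alpha: float) -> np.ndarray:
    """Move a fraction alpha of the way from prompt A's style toward
    prompt B's, along the direction vector between their embeddings."""
    e_a = encode_style_prompt(prompt_a)
    e_b = encode_style_prompt(prompt_b)
    direction = e_b - e_a            # contrastive direction in embedding space
    return e_a + alpha * direction   # simple linear interpolation

# Sweeping alpha yields a continuum of styles, e.g. slow -> fast speech.
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    emb = interpolate_style("a slow, calm voice", "a fast, excited voice", alpha)
    # tts_model.synthesize(text, style_embedding=emb)  # hypothetical API
```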
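For the intra-utterance side, the two mitigations named in the abstract can be sketched as follows. This is a minimal illustration under assumptions: the per-layer KV cache is taken to be a list of (key, value) tensors shaped [batch, heads, seq, head_dim], and where the mask is applied inside the decoder is not specified by the abstract.

```python
import torch

def swap_kv_prefix(kv_cache, donor_cache, prefix_len):
    """KV-cache swapping (sketch): overwrite the cached keys/values for the
    first `prefix_len` decoded positions with those from a generation run in
    the target style, weakening the pull of the original opening audio."""
    swapped = []
    for (k, v), (k_d, v_d) in zip(kv_cache, donor_cache):
        k, v = k.clone(), v.clone()
        k[:, :, :prefix_len] = k_d[:, :, :prefix_len]
        v[:, :, :prefix_len] = v_d[:, :, :prefix_len]
        swapped.append((k, v))
    return swapped

def sliding_window_causal_mask(seq_len, window):
    """Sliding-window attention mask (True = attend): causal, but each query
    sees only the most recent `window` keys, so tokens generated under the
    initial style eventually fall out of the attention span."""
    idx = torch.arange(seq_len)
    causal = idx[None, :] <= idx[:, None]            # no looking ahead
    recent = (idx[:, None] - idx[None, :]) < window  # keep only recent keys
    return causal & recent

print(sliding_window_causal_mask(6, 3).int())
```

In practice one would swap only the prefix generated before the intended style change and pass the mask to the decoder's self-attention; both functions are shown here purely to make the mechanics concrete, not as the paper's implementation.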
Paper Type: Long
Research Area: Speech Processing and Spoken Language Understanding
Research Area Keywords: Text-to-Speech
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 1714