Emotional Text-to-Speech via Style Decoder with Emotion Shared Styleformer Block and RoPE Prior Encoder
Abstract: Emotional Text-to-Speech (E-TTS) aims to generate speech that not only sounds natural but also conveys rich emotional expression. Unlike traditional TTS, E-TTS must capture variations in pitch, prosody, rhythm, and timbre to convey emotion accurately. Deep learning-based models such as Tacotron2, Transformer-TTS, FastSpeech2, and VITS have significantly improved speech synthesis quality. However, these models still face challenges such as alignment instability, strict duration constraints, and difficulty generalizing across emotions and styles. VITS, while capable of high-quality speech synthesis, struggles to integrate emotional information because of its complex architecture. To address this, we propose RoStyleVITS, an end-to-end emotional TTS model built on VITS. RoStyleVITS incorporates emotion-infused styleformer blocks and replaces the standard attention layer with a self-attention layer that uses Rotary Position Embedding (RoPE) to enhance text sequence modeling. Our method outperforms existing state-of-the-art emotional speech synthesis models in both subjective and objective evaluations, demonstrating improved emotional expression and synthesis quality.
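For readers unfamiliar with RoPE, the following is a minimal, illustrative sketch of single-head self-attention with rotary position embedding in plain Python. It is not the RoStyleVITS implementation, which the abstract does not describe in detail. The function names (`rope`, `dot`, `rope_self_attention`), the "rotate-half" pairing of feature dimensions, and the frequency base of 10000 are assumptions chosen for illustration.

```python
import math

def rope(x, base=10000.0):
    """Rotate each position's vector by position-dependent angles.

    x is a list of seq_len vectors (lists of floats) of even length.
    Feature i is paired with feature i + dim/2 (an assumed pairing scheme).
    """
    dim = len(x[0])
    assert dim % 2 == 0, "RoPE needs an even feature dimension"
    half = dim // 2
    out = []
    for pos, vec in enumerate(x):
        rotated = [0.0] * dim
        for i in range(half):
            # Lower-index pairs rotate faster, higher-index pairs slower.
            theta = pos * base ** (-i / half)
            c, s = math.cos(theta), math.sin(theta)
            a, b = vec[i], vec[i + half]
            rotated[i] = a * c - b * s
            rotated[i + half] = a * s + b * c
        out.append(rotated)
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rope_self_attention(q, k, v):
    """Scaled dot-product attention with RoPE applied to queries and keys."""
    q_r, k_r = rope(q), rope(k)
    scale = math.sqrt(len(q[0]))
    out = []
    for qi in q_r:
        scores = [dot(qi, kj) / scale for kj in k_r]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * vj[d] for w, vj in zip(weights, v)) / total
            for d in range(len(v[0]))
        ])
    return out
```

Because queries and keys are rotated by angles proportional to their positions, the attention score between positions m and n depends on the content vectors and on the offset m - n only, which is the property that makes RoPE attractive for sequence modeling.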
External IDs: dblp:conf/icann/YaoXXLCW25