Abstract: Text-to-speech (TTS) synthesis with large language models (LLMs) has demonstrated promising performance and has recently attracted significant attention. Despite their impressive naturalness, these methods often lack monotonic alignment constraints, leading to repetitions, omissions, and misalignments in the synthesized output. This paper introduces a stepwise monotonic attention algorithm tailored to LLM-based architectures that enhances the robustness of TTS synthesis and effectively addresses these issues. Compared with VALL-E R, the strongest existing baseline, the proposed approach requires no additional forced aligner and exhibits greater robustness on out-of-domain test sets. Furthermore, experiments show that the method scales well to large model sizes and large-scale training sets.
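The abstract names a stepwise monotonic attention constraint without specifying its form. As a rough illustration only, the sketch below shows one common hard variant of the idea: at each decoder step, the attended text position either stays put or advances by exactly one, which by construction rules out the skips and backward jumps that cause omissions and repetitions. The function name, the greedy hard selection, and the precomputed-energies interface are assumptions for illustration, not the paper's actual method, which may use a soft or probabilistic variant integrated into LLM decoding.

```python
import numpy as np

def stepwise_monotonic_align(energies):
    """Greedy stepwise-monotonic alignment over attention energies.

    Illustrative sketch, not the paper's algorithm.
    energies: (T_dec, T_enc) array of unnormalized decoder-to-text scores.
    At each decoder step the attended text position either stays put or
    advances by exactly one, ruling out skips (omissions) and backward
    jumps (repetitions).
    Returns the attended text index for each decoder step.
    """
    T_dec, T_enc = energies.shape
    pos, path = 0, []
    for t in range(T_dec):
        # Compare staying at the current token vs. moving one token forward.
        if pos + 1 < T_enc and energies[t, pos + 1] > energies[t, pos]:
            pos += 1
        path.append(pos)
    return np.array(path)


# Toy usage: random scores over 4 text tokens for 10 decoding steps.
rng = np.random.default_rng(0)
print(stepwise_monotonic_align(rng.standard_normal((10, 4))))
```

In an LLM-based TTS decoder, a constraint of this kind would restrict where cross-attention may place mass at each generation step, rather than being applied post hoc as above; the post-hoc formulation is chosen here only to keep the example self-contained.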
External IDs: dblp:conf/interspeech/ZhangLCWCM25