Abstract: Text-to-speech (TTS) synthesis with large language models (LLMs) has demonstrated promising performance and has recently attracted significant attention. Despite their impressive naturalness, these methods often lack monotonic alignment constraints, leading to repetitions, omissions, and misalignments in the synthesized output. This paper introduces a stepwise monotonic attention algorithm tailored to LLM-based architectures that enhances the robustness of TTS synthesis and effectively addresses these issues. Compared with VALL-E R, the strongest existing baseline, the proposed approach requires no additional forced aligner and exhibits greater robustness on out-of-domain test sets. Furthermore, experiments show that the method scales well to large model sizes and large-scale training sets.
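The abstract names a stepwise monotonic attention constraint without specifying its form. As a rough illustration only, the sketch below shows one common hard variant of the idea: at each decoder step, the attended text position either stays put or advances by exactly one, which by construction rules out the skips and backward jumps that cause omissions and repetitions. The function name, the greedy hard selection, and the precomputed-energies interface are assumptions for illustration, not the paper's actual method, which may use a soft or probabilistic variant integrated into LLM decoding.

```python
import numpy as np

def stepwise_monotonic_align(energies):
    """Greedy stepwise-monotonic alignment over attention energies.

    Illustrative sketch, not the paper's algorithm.
    energies: (T_dec, T_enc) array of unnormalized decoder-to-text scores.
    At each decoder step the attended text position either stays put or
    advances by exactly one, ruling out skips (omissions) and backward
    jumps (repetitions).
    Returns the attended text index for each decoder step.
    """
    T_dec, T_enc = energies.shape
    pos, path = 0, []
    for t in range(T_dec):
        # Compare staying at the current token vs. moving one token forward.
        if pos + 1 < T_enc and energies[t, pos + 1] > energies[t, pos]:
            pos += 1
        path.append(pos)
    return np.array(path)


# Toy usage: random scores over 4 text tokens for 10 decoding steps.
rng = np.random.default_rng(0)
print(stepwise_monotonic_align(rng.standard_normal((10, 4))))
```

In an LLM-based TTS decoder, a constraint of this kind would restrict where cross-attention may place mass at each generation step, rather than being applied post hoc as above; the post-hoc formulation is chosen here only to keep the example self-contained.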
External IDs: dblp:conf/interspeech/ZhangLCWCM25