Advancing Zero-shot Text-to-Speech Intelligibility across Diverse Domains via Preference Alignment

ACL ARR 2025 February Submission984 Authors

12 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Modern zero-shot text-to-speech (TTS) systems, despite extensive pre-training, often underperform in challenging scenarios such as tongue twisters, repeated words, code-switching, and cross-lingual synthesis, leading to intelligibility issues. This paper proposes preference alignment to address these challenges. Our approach leverages a newly proposed Intelligibility Preference Speech Dataset (INTP) and applies Direct Preference Optimization (DPO), along with our proposed extensions, to diverse TTS architectures. After INTP alignment, beyond intelligibility, we observe overall improvements in naturalness, similarity, and audio quality for multiple TTS models across diverse domains. Building on this, we also verify the weak-to-strong generalization ability of INTP for more intelligible models such as CosyVoice 2 and Ints. Moreover, we showcase the potential for further improvements through iterative alignment based on Ints. Audio samples are available at https://intalign.github.io/.
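For readers unfamiliar with DPO, the abstract's alignment step can be sketched as the standard DPO objective (Rafailov et al., 2023) applied to paired TTS samples: a "chosen" (intelligible) and a "rejected" (unintelligible) synthesis for the same text. This is an illustrative sketch, not the paper's implementation; the function and variable names are assumptions, and the paper's extensions to DPO are not shown.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Illustrative standard DPO loss for one preference pair.

    Each argument is a summed log-probability of a speech sample
    (chosen = intelligible, rejected = unintelligible) under either
    the trainable policy model or a frozen reference model.
    """
    # Implicit reward margin: how much more the policy prefers the
    # chosen sample over the rejected one, relative to the reference.
    margin = (policy_chosen_logp - ref_chosen_logp) \
           - (policy_rejected_logp - ref_rejected_logp)
    # -log sigmoid(beta * margin): small when the policy already
    # prefers the chosen sample, large otherwise.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

Minimizing this loss over many such pairs pushes the policy toward the intelligible outputs while the reference-model terms keep it from drifting far from its pre-trained behavior.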
Paper Type: Long
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: text to speech, preference alignment, reinforcement learning, large language model
Contribution Types: Approaches to low-resource settings, Data resources
Languages Studied: English, Chinese
Submission Number: 984