Keywords: robust text-to-speech synthesis, codec language models, chain-of-thought prompting
TL;DR: We propose RALL-E, a robust codec language modeling method for text-to-speech (TTS) synthesis that uses chain-of-thought prosody prompts and duration-guided masking to improve robustness.
Abstract: We present RALL-E, a robust codec language modeling method for text-to-speech (TTS) synthesis.
While previous codec language modeling methods have demonstrated impressive performance in zero-shot TTS, they often struggle with robustness issues, such as unstable prosody (irregular pitch and rhythm/duration) and high word error rates (WER), largely due to their autoregressive prediction style.
RALL-E addresses these issues through chain-of-thought (CoT) prompting, which breaks the task into simpler steps to improve the stability of TTS.
First, RALL-E predicts prosody tokens (pitch and duration) from the input text and uses them as intermediate conditions to guide the prediction of speech tokens in a CoT manner.
Second, RALL-E utilizes the predicted duration prompt to guide the computation of self-attention weights in the Transformer, forcing the model to focus on the corresponding phonemes and prosody tokens during speech token prediction.
Comprehensive objective and subjective evaluations show that RALL-E significantly improves robustness in zero-shot TTS compared to the baseline method VALL-E, reducing WER from $5.6\%$ to $2.5\%$ without reranking, and from $1.7\%$ to $1.0\%$ with reranking.
Furthermore, RALL-E outperforms several prior approaches aimed at improving the robustness of codec language models, and successfully synthesizes challenging sentences that VALL-E struggles with, lowering the error rate from $68\%$ to $4\%$.
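To illustrate the duration-guided masking idea mentioned in the abstract, here is a minimal sketch of how a predicted duration sequence could restrict each speech frame's attention to its aligned phoneme and a small neighborhood. This is illustrative only, not the authors' implementation; the function name, window size, and NumPy formulation are assumptions.

```python
import numpy as np

def duration_guided_mask(durations, window=1):
    """Build a frame-to-phoneme attention mask from predicted durations.

    durations: per-phoneme frame counts, e.g. [2, 3, 1].
    window: how many neighboring phonemes each frame may also attend to.
    Returns a boolean mask of shape (num_frames, num_phonemes) where
    True means attention is allowed.
    """
    num_phonemes = len(durations)
    # Phoneme index aligned to each speech frame, expanded from durations.
    frame_to_phoneme = np.repeat(np.arange(num_phonemes), durations)
    num_frames = frame_to_phoneme.shape[0]
    mask = np.zeros((num_frames, num_phonemes), dtype=bool)
    for t, p in enumerate(frame_to_phoneme):
        lo = max(0, p - window)
        hi = min(num_phonemes, p + window + 1)
        mask[t, lo:hi] = True
    return mask

# Example: 3 phonemes spanning 2, 3, and 1 frames respectively.
mask = duration_guided_mask([2, 3, 1], window=1)
```

In practice, disallowed entries of such a mask would be set to negative infinity in the attention logits before the softmax, so each speech token is predicted while attending chiefly to its own phoneme and prosody context.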
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6335