Keywords: Text-to-Speech; Speech Signal Processing; Speech Language Modeling; Audio Language Models
TL;DR: We propose a two-stage framework for audio language modeling: a semantically distilled split-RVQ codec paired with an ``Understand-then-Generate'' dual-Transformer for zero-shot TTS.
Abstract: Existing Large Language Model (LLM)-based autoregressive (AR) text-to-speech (TTS) systems achieve state-of-the-art quality but still face critical challenges. The foundation of this LLM-based paradigm is the discretization of the continuous speech waveform into a sequence of discrete tokens by a neural audio codec. However, single-codebook modeling, while well suited to text LLMs, suffers from significant information loss, whereas hierarchical acoustic tokens, typically produced via Residual Vector Quantization (RVQ), often lack explicit semantic structure, placing a heavy learning burden on the model. Furthermore, the autoregressive process is inherently susceptible to error accumulation, which can degrade generation stability. To address these limitations, we propose CaT-TTS, a novel framework for robust and semantically grounded zero-shot synthesis. First, we introduce S3Codec, a split-RVQ codec that injects explicit linguistic features into its primary codebook via semantic distillation from a state-of-the-art ASR model, providing a structured representation that simplifies the learning task. Second, we propose an ``Understand-then-Generate'' dual-Transformer architecture that decouples comprehension from rendering. An initial ``Understanding'' Transformer models the cross-modal relationship between the text and the prompt's semantic tokens to form a high-level utterance plan. A subsequent ``Generation'' Transformer then executes this plan, autoregressively synthesizing hierarchical acoustic tokens. Finally, to enhance generation stability, we introduce Masked Audio Parallel Inference (MAPI), a nearly parameter-free inference strategy that dynamically guides the decoding process to mitigate local errors. Extensive experiments demonstrate that the synergy of our principled architecture and semantically aware codec allows CaT-TTS to achieve new state-of-the-art performance in zero-shot voice cloning, with MAPI providing a measurable boost in generation robustness on benchmark datasets. Project page: \href{https://anonymous.4open.science/r/CaT-TTS-66A1/}{https://anonymous.4open.science/r/CaT-TTS-66A1}.
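To make the semantic-distillation idea behind S3Codec concrete, below is a minimal PyTorch sketch of one plausible reading of the abstract: the first VQ stage of a split-RVQ codec is trained with a cosine distillation loss that pulls its quantized features toward frozen ASR-teacher features. The class and function names, dimensions, the straight-through estimator, and the cosine objective are all illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of semantic distillation into the primary codebook of a
# split-RVQ codec (the S3Codec idea as described in the abstract). Names,
# sizes, and the cosine loss are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrimaryCodebook(nn.Module):
    """First VQ stage of a (hypothetical) split-RVQ codec."""
    def __init__(self, codebook_size=1024, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z):  # z: (B, T, dim) codec-encoder features
        # Nearest-neighbor lookup against the codebook (expanded per batch).
        codes = self.codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1)
        d = torch.cdist(z, codes)            # (B, T, K) pairwise distances
        idx = d.argmin(dim=-1)               # nearest code index per frame
        q = self.codebook(idx)               # quantized features
        q = z + (q - z).detach()             # straight-through estimator
        return q, idx

def semantic_distill_loss(q, teacher_feats, proj):
    """Cosine distillation: push primary-codebook features toward frozen
    ASR-teacher features (assumed pre-aligned to the same frame rate)."""
    s = proj(q)                              # project to teacher dimension
    return 1.0 - F.cosine_similarity(s, teacher_feats, dim=-1).mean()

# Toy usage with random tensors standing in for real encoder/teacher outputs.
B, T, D, D_T = 2, 50, 256, 512
vq, proj = PrimaryCodebook(dim=D), nn.Linear(D, D_T)
z = torch.randn(B, T, D)                     # codec encoder output
teacher = torch.randn(B, T, D_T)             # frozen ASR encoder features
q, idx = vq(z)
loss = semantic_distill_loss(q, teacher, proj)
loss.backward()
```

In this reading, the remaining residual codebooks would be left free to model purely acoustic detail, while the distillation term gives the primary codebook the explicit linguistic structure the abstract describes; the actual teacher model, alignment scheme, and loss weighting are left unspecified in the abstract.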
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 5123