TTL: Language-balanced Data Selection for LLM-based Text-to-Speech via Codec Token Scoring

ACL ARR 2026 January Submission 8091 Authors

06 Jan 2026 (modified: 20 Mar 2026) · CC BY 4.0
Keywords: Data Selection, LLM-based TTS
Abstract: Recent progress in large language model (LLM)-based text-to-speech (TTS) has enabled increasingly natural and versatile speech generation, including zero-shot voice synthesis and multilingual capability. However, these gains are often accompanied by a growing reliance on large-scale data, making training resource-intensive, costly, and time-consuming. In this paper, we introduce TTL, a novel data selection framework tailored for LLM-based TTS that leverages a probability gap computed on codec tokens. Our selection uses two scoring models of different sizes, each aggregating conditional probabilities of global and semantic tokens given the text. A larger score gap indicates more challenging and informative speech samples, which we prioritize to construct data-efficient training subsets. On top of that, we further show that naïve selection can severely bias the selected subset toward a high-resource language in multilingual settings, and propose a simple language-balanced strategy to mitigate this bias and effectively improve multilingual generalization. Extensive experiments on multiple benchmarks demonstrate consistent improvements in naturalness and intelligibility, with substantial gains in word error rate (WER) and strong multilingual performance under data-constrained settings. Source codes are available at \href{https://github.com/TTL-cell/TTL.git}.
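The gap-based scoring and language-balanced selection described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`gap_score`, `select_language_balanced`), the use of a mean absolute log-probability difference as the gap, and the even per-language budget split are all assumptions made for clarity.

```python
from collections import defaultdict

def gap_score(logp_small, logp_large):
    """Illustrative per-sample score: mean absolute difference between the
    per-token log-probabilities assigned by a small and a large scoring
    model. A larger gap is taken to indicate a harder, more informative
    sample. (The exact aggregation in the paper may differ.)"""
    assert len(logp_small) == len(logp_large) and len(logp_small) > 0
    return sum(abs(l - s) for s, l in zip(logp_small, logp_large)) / len(logp_small)

def select_language_balanced(samples, budget):
    """Hypothetical language-balanced selection: split the budget evenly
    across languages, then take the highest-gap samples within each
    language, preventing a high-resource language from dominating.

    samples: list of (language, score) pairs.
    budget:  total number of samples to select.
    """
    by_lang = defaultdict(list)
    for lang, score in samples:
        by_lang[lang].append((score, lang))
    per_lang = budget // len(by_lang)
    selected = []
    for items in by_lang.values():
        items.sort(reverse=True)          # highest gap first
        selected.extend(items[:per_lang])
    return selected
```

Without the per-language split, a plain top-k over all scores would pick whichever language happens to yield larger gaps, reproducing the bias the paper warns about.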
Paper Type: Long
Research Area: Speech Processing and Spoken Language Understanding
Research Area Keywords: Text-to-Speech and Spoken Language Understanding
Contribution Types: NLP engineering experiment, Approaches to low-compute settings / efficiency, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English, Chinese
Submission Number: 8091