Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

ACL ARR 2025 February Submission 7229 Authors

16 Feb 2025 (modified: 09 May 2025) · License: CC BY 4.0
Abstract: Recent advances in large language models (LLMs) have enabled remarkable progress in zero-shot text-to-speech (TTS) synthesis, yet existing foundation models face significant limitations. While these models excel at reproducing voices from reference audio, they lack fine-grained control over voice attributes and, in single-stream approaches, suffer from the entanglement of semantic and acoustic information within tokens. This entanglement makes independent manipulation of speech characteristics challenging and hinders the creation of entirely new voices. To address these limitations, we introduce Spark-TTS, a novel system built upon our proposed BiCodec, a single-stream speech codec that strategically decomposes speech into two complementary token types: low-bitrate semantic tokens for linguistic content and fixed-length global tokens for speaker-specific attributes. This disentangled representation, combined with the Qwen2.5 LLM and a chain-of-thought (CoT) generation approach, enables both coarse-grained attribute control (e.g., gender, speaking style) and fine-grained parameter adjustment (e.g., precise pitch values, speaking rate). To advance research in controllable TTS, we introduce VoxBox, a meticulously curated 100,000-hour dataset with comprehensive attribute annotations. Extensive experiments demonstrate that Spark-TTS not only achieves state-of-the-art performance in zero-shot voice cloning but also excels at generating novel, highly customizable voices that transcend the limitations of reference-based synthesis (source code and checkpoints will be released). Audio samples are available at https://spark-tts.github.io/.
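To make the decoupled single-stream representation in the abstract concrete, the sketch below shows how a fixed-length block of global (speaker) tokens and a variable-length stream of semantic (content) tokens could be serialized into one sequence for an autoregressive LM. This is a minimal illustrative sketch under assumed conventions: the function name `build_prompt`, the special markers (`<gst>`, `<sst>`, etc.), and the token counts are hypothetical and are not taken from the paper.

```python
# Hypothetical sketch: one single-stream sequence that interleaves text,
# fixed-length global (speaker) tokens, and low-bitrate semantic (content)
# tokens, as an LM prompt layout. Marker names and sizes are assumptions.

from typing import List


def build_prompt(text: str,
                 global_tokens: List[int],
                 semantic_tokens: List[int],
                 n_global: int = 32) -> List[str]:
    """Serialize text, speaker attributes, and content into one token stream."""
    assert len(global_tokens) == n_global, "global tokens are fixed-length"
    stream = ["<task:tts>"]                       # task tag (hypothetical)
    stream += [f"<text:{ch}>" for ch in text]     # input text characters
    # Fixed-length speaker block: controls timbre / voice attributes.
    stream += ["<gst>"] + [f"<g_{t}>" for t in global_tokens] + ["</gst>"]
    # Variable-length content block: what the LM would generate at inference.
    stream += ["<sst>"] + [f"<s_{t}>" for t in semantic_tokens] + ["</sst>"]
    return stream


if __name__ == "__main__":
    prompt = build_prompt(
        text="hi",
        global_tokens=list(range(32)),     # fixed-length speaker block
        semantic_tokens=[7, 42, 7, 13],    # low-bitrate content tokens
    )
    print(prompt[:4], "...", prompt[-3:])
```

In this layout, manipulating only the global block would change speaker identity while leaving linguistic content untouched, which is the kind of independent control the disentangled representation is meant to enable.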
Paper Type: Long
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: speech technologies
Contribution Types: Publicly available software and/or pre-trained models, Data resources
Languages Studied: English, Chinese
Submission Number: 7229