Keywords: Speech Tokenizer, Text-to-Speech, Speech Language Modeling, Diffusion Model
TL;DR: A novel speech tokenizer with an end-to-end diffusion autoencoder and text-aware decoding, operating at 6.25 Hz and 0.0875 kbps
Abstract: Speech tokenizers serve as foundational components for speech language models, yet current designs exhibit several limitations, including:
(1) dependence on multi-layer residual vector quantization structures or high frame rates,
(2) reliance on auxiliary pre-trained models for semantic distillation, and
(3) requirements for complex two-stage training processes.
In this work, we introduce the **T**ext-**a**ware **Di**ffusion Transformer Speech **Codec** (***TaDiCodec***), a novel approach designed to overcome these challenges.
TaDiCodec employs end-to-end optimization for quantization and reconstruction through a diffusion autoencoder, while integrating text guidance into the diffusion decoder to enhance reconstruction quality and achieve optimal compression.
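To make this design concrete, below is a minimal, hypothetical PyTorch sketch of the pipeline: a speech encoder, a single-layer codebook quantizer, and a text-conditioned decoder trained with a simplified diffusion-style (x0-prediction) objective. All module names, dimensions, and the simplified objective are our illustrative assumptions, not the authors' implementation (which uses a diffusion transformer).

```python
# Hypothetical sketch: encoder -> single-layer codebook -> text-conditioned
# decoder, trained end-to-end with one reconstruction objective.
import torch
import torch.nn as nn

class SingleCodebookVQ(nn.Module):
    """One vector-quantization layer; no residual VQ stack."""
    def __init__(self, codebook_size=16384, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z):                                  # z: (B, T, D)
        c = self.codebook.weight                           # (K, D)
        # Squared L2 distance to every codebook entry, shape (B, T, K).
        d = z.pow(2).sum(-1, keepdim=True) - 2 * z @ c.t() + c.pow(2).sum(-1)
        idx = d.argmin(-1)                                 # discrete tokens
        z_q = self.codebook(idx)
        # Straight-through estimator; commitment losses omitted for brevity.
        return z + (z_q - z).detach(), idx

class TaDiCodecSketch(nn.Module):
    def __init__(self, mel_dim=80, dim=256, text_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(mel_dim, dim), nn.GELU(),
                                     nn.Linear(dim, dim))
        self.vq = SingleCodebookVQ(dim=dim)
        # Stand-in for the diffusion transformer decoder: predicts clean
        # features from a noised input, conditioned on tokens, text, and t.
        self.decoder = nn.Sequential(
            nn.Linear(dim + text_dim + mel_dim + 1, dim), nn.GELU(),
            nn.Linear(dim, mel_dim))

    def forward(self, mel, text_emb):  # mel: (B, T, 80); text_emb: (B, T, 256)
        z_q, tokens = self.vq(self.encoder(mel))
        t = torch.rand(mel.size(0), 1, 1).expand(-1, mel.size(1), -1)
        noised = t * torch.randn_like(mel) + (1 - t) * mel
        pred = self.decoder(torch.cat([z_q, text_emb, noised, t], dim=-1))
        # A single loss trains encoder, quantizer, and decoder jointly.
        return nn.functional.mse_loss(pred, mel), tokens
```

The sketch aims to capture three points from the abstract: a single codebook rather than a residual VQ stack, text conditioning entering the diffusion decoder, and one reconstruction loss that optimizes quantization and reconstruction together.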
TaDiCodec achieves an extremely low frame rate of **6.25 Hz** and a corresponding bitrate of **0.0875 kbps** with a **single-layer codebook** for 24 kHz speech,
while maintaining superior performance on critical speech generation evaluation metrics such as Word Error Rate (WER), speaker similarity (SIM), and speech quality (UTMOS).
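As a sanity check on these numbers (the codebook size is not stated here; the $2^{14}$-entry figure below is inferred from the frame rate and bitrate, not quoted):

$$
\frac{87.5\ \text{bits/s}}{6.25\ \text{tokens/s}} = 14\ \text{bits/token} \;\Rightarrow\; |\mathcal{C}| = 2^{14} = 16384, \qquad \frac{24000\ \text{samples/s}}{6.25\ \text{tokens/s}} = 3840\ \text{samples/token}.
$$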
Notably, TaDiCodec employs a single-stage, end-to-end training paradigm, obviating the need for auxiliary pre-trained models.
We also validate the compatibility of TaDiCodec with language-model-based zero-shot text-to-speech, using both autoregressive modeling and masked generative modeling, demonstrating its effectiveness and efficiency for speech language modeling as well as a remarkably small *reconstruction-generation gap*.
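For the autoregressive case, the following is a hypothetical sketch of how such tokens plug into a decoder-only speech LM: text ids followed by speech ids over a shared vocabulary, with speech ids offset by the text vocabulary size. All sizes are illustrative; `SPEECH_VOCAB` assumes the 16384-entry codebook inferred above.

```python
# Hypothetical decoder-only LM over text tokens followed by speech tokens.
import torch
import torch.nn as nn

TEXT_VOCAB, SPEECH_VOCAB = 4096, 16384   # illustrative sizes

class ARSpeechLM(nn.Module):
    def __init__(self, dim=512, layers=6):
        super().__init__()
        self.embed = nn.Embedding(TEXT_VOCAB + SPEECH_VOCAB, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(dim, SPEECH_VOCAB)  # next speech-token logits

    def forward(self, ids):            # ids: (B, T), text ids then speech ids
        mask = nn.Transformer.generate_square_subsequent_mask(
            ids.size(1)).to(ids.device)
        return self.head(self.backbone(self.embed(ids), mask=mask))
```

At 6.25 Hz, ten seconds of speech is only about 63 tokens, which is what makes such a low frame rate attractive for autoregressive decoding.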
To facilitate reproducibility and further research, we release our source code and pre-trained checkpoints at https://github.com/AmphionTeam/TaDiCodec. Audio samples are available at https://tadicodec.github.io/.
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 19999