Keywords: Speech Tokenizer, Text-to-Speech, Speech Language Modeling, Diffusion Model
TL;DR: A novel speech tokenizer with an end-to-end diffusion autoencoder and text-aware decoding, operating at 6.25 Hz and 0.0875 kbps
Abstract: Speech tokenizers serve as foundational components for speech language models, yet current designs exhibit several limitations, including:
(1) dependence on multi-layer residual vector quantization structures or high frame rates,
(2) reliance on auxiliary pre-trained models for semantic distillation, and
(3) requirements for complex two-stage training processes.
In this work, we introduce the **T**ext-**a**ware **Di**ffusion Transformer Speech **Codec** (***TaDiCodec***), a novel approach designed to overcome these challenges.
TaDiCodec employs end-to-end optimization for quantization and reconstruction through a diffusion autoencoder, while integrating text guidance into the diffusion decoder to enhance reconstruction quality and achieve optimal compression.
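To make this design concrete, below is a minimal, hypothetical PyTorch sketch of the pipeline: a speech encoder, a single-layer codebook quantizer, and a text-conditioned decoder trained with a simplified diffusion-style (x0-prediction) objective. All module names, dimensions, and the simplified objective are our illustrative assumptions, not the authors' implementation (which uses a diffusion transformer).

```python
# Hypothetical sketch: encoder -> single-layer codebook -> text-conditioned
# decoder, trained end-to-end with one reconstruction objective.
import torch
import torch.nn as nn

class SingleCodebookVQ(nn.Module):
    """One vector-quantization layer; no residual VQ stack."""
    def __init__(self, codebook_size=16384, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z):                                  # z: (B, T, D)
        c = self.codebook.weight                           # (K, D)
        # Squared L2 distance to every codebook entry, shape (B, T, K).
        d = z.pow(2).sum(-1, keepdim=True) - 2 * z @ c.t() + c.pow(2).sum(-1)
        idx = d.argmin(-1)                                 # discrete tokens
        z_q = self.codebook(idx)
        # Straight-through estimator; commitment losses omitted for brevity.
        return z + (z_q - z).detach(), idx

class TaDiCodecSketch(nn.Module):
    def __init__(self, mel_dim=80, dim=256, text_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(mel_dim, dim), nn.GELU(),
                                     nn.Linear(dim, dim))
        self.vq = SingleCodebookVQ(dim=dim)
        # Stand-in for the diffusion transformer decoder: predicts clean
        # features from a noised input, conditioned on tokens, text, and t.
        self.decoder = nn.Sequential(
            nn.Linear(dim + text_dim + mel_dim + 1, dim), nn.GELU(),
            nn.Linear(dim, mel_dim))

    def forward(self, mel, text_emb):  # mel: (B, T, 80); text_emb: (B, T, 256)
        z_q, tokens = self.vq(self.encoder(mel))
        t = torch.rand(mel.size(0), 1, 1).expand(-1, mel.size(1), -1)
        noised = t * torch.randn_like(mel) + (1 - t) * mel
        pred = self.decoder(torch.cat([z_q, text_emb, noised, t], dim=-1))
        # A single loss trains encoder, quantizer, and decoder jointly.
        return nn.functional.mse_loss(pred, mel), tokens
```

The sketch aims to capture three points from the abstract: a single codebook rather than a residual VQ stack, text conditioning entering the diffusion decoder, and one reconstruction loss that optimizes quantization and reconstruction together.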
TaDiCodec achieves an extremely low frame rate of **6.25 Hz** and a corresponding bitrate of **0.0875 kbps** with a **single-layer codebook** for 24 kHz speech,
while maintaining superior performance on critical speech generation evaluation metrics such as Word Error Rate (WER), speaker similarity (SIM), and speech quality (UTMOS).
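As a sanity check on these numbers (the codebook size is not stated here; the $2^{14}$-entry figure below is inferred from the frame rate and bitrate, not quoted):

$$
\frac{87.5\ \text{bits/s}}{6.25\ \text{tokens/s}} = 14\ \text{bits/token} \;\Rightarrow\; |\mathcal{C}| = 2^{14} = 16384, \qquad \frac{24000\ \text{samples/s}}{6.25\ \text{tokens/s}} = 3840\ \text{samples/token}.
$$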
Notably, TaDiCodec employs a single-stage, end-to-end training paradigm, obviating the need for auxiliary pre-trained models.
We also validate the compatibility of TaDiCodec with language-model-based zero-shot text-to-speech, using both autoregressive modeling and masked generative modeling, demonstrating its effectiveness and efficiency for speech language modeling as well as a remarkably small *reconstruction-generation gap*.
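For the autoregressive case, the following is a hypothetical sketch of how such tokens plug into a decoder-only speech LM: text ids followed by speech ids over a shared vocabulary, with speech ids offset by the text vocabulary size. All sizes are illustrative; `SPEECH_VOCAB` assumes the 16384-entry codebook inferred above.

```python
# Hypothetical decoder-only LM over text tokens followed by speech tokens.
import torch
import torch.nn as nn

TEXT_VOCAB, SPEECH_VOCAB = 4096, 16384   # illustrative sizes

class ARSpeechLM(nn.Module):
    def __init__(self, dim=512, layers=6):
        super().__init__()
        self.embed = nn.Embedding(TEXT_VOCAB + SPEECH_VOCAB, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(dim, SPEECH_VOCAB)  # next speech-token logits

    def forward(self, ids):            # ids: (B, T), text ids then speech ids
        mask = nn.Transformer.generate_square_subsequent_mask(
            ids.size(1)).to(ids.device)
        return self.head(self.backbone(self.embed(ids), mask=mask))
```

At 6.25 Hz, ten seconds of speech is only about 63 tokens, which is what makes such a low frame rate attractive for autoregressive decoding.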
To facilitate reproducibility and further research, we release our source code and pre-trained checkpoints at https://github.com/AmphionTeam/TaDiCodec. Audio samples are available at https://tadicodec.github.io/.
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 19999