Keywords: speech tokenization, neural audio codec, disentangled speech representation, spoken language model
TL;DR: A speech tokenizer that produces compact, linguistically rich representations while enabling high-quality reconstruction.
Abstract: A good language model starts with a good tokenizer. Tokenization is especially important for speech modeling, which must handle noisy continuous speech recordings. A speech tokenizer should produce compact, linguistically rich representations while still enabling high-quality synthesis. We present Kanade, a tokenizer that realizes this ideal. Kanade separates out acoustic constants like speaker identity from the signal to create a single-stream discrete representation of speech that captures linguistic content, including suprasegmental features. Experiments show that Kanade achieves state-of-the-art speaker disentanglement and linguistic availability while maintaining competitive reconstruction quality.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 24559