Kanade: Compact Linguistically Rich Speech Tokens for Spoken Language Models

Zhijie Huang; Stephen McIntosh; Daisuke Saito; Nobuaki Minematsu

20 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: speech tokenization, neural audio codec, disentangled speech representation, spoken language model

TL;DR: A speech tokenizer that produces linguistically rich compact representations while enabling high-quality reconstruction.

Abstract: A good language model starts with a good tokenizer. Tokenization is especially important for speech modeling, which must handle noisy continuous speech recordings. A speech tokenizer should produce compact, linguistically rich representations while still enabling high-quality synthesis. We present Kanade, a tokenizer that realizes this ideal. Kanade separates acoustic constants such as speaker identity from the signal to create a single-stream discrete representation of speech that captures linguistic content, including suprasegmental features. Experiments show that Kanade achieves state-of-the-art speaker disentanglement and linguistic availability while maintaining competitive reconstruction quality.

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 24559
