SubAlign: Speech Tokenization Aligned with LLM Vocabularies for Spoken Language Modeling

Published: 26 Aug 2025, Last Modified: 26 Aug 2025
Venue: SpeechAI TTIC 2025 (Oral or Poster)
Readers: Everyone
License: CC BY 4.0
Keywords: speech tokenization, spoken language modeling, speech-subword alignment, LLM vocabulary, speech-text alignment
Presentation Preference: Yes
Abstract: One factor contributing to the performance discrepancy between large language models and spoken language models is the modality gap in their representations. To address this issue, we introduce SubAlign, the first speech tokenization framework to explicitly segment speech at the subword level corresponding to large language model vocabularies. Each resulting SubAlign unit is composed of the textual content, acoustic features, and duration associated with its respective subword. Building on this framework, we present SubAlign-SLM, a spoken language model trained on SubAlign units, and demonstrate the effectiveness of SubAlign on downstream tasks. Extensive automatic and human evaluations show that SubAlign-SLM surpasses baseline models, indicating the potential of SubAlign for speech processing applications.
Submission Number: 32