Abstract: Building speech processing models with Large Language Models (LLMs) has become an effective new paradigm. A key challenge in this approach is representing speech features so that they align well with LLMs. While continuous speech features from self-supervised learning (SSL) models capture rich information, they pose alignment challenges and incur high computational costs. Discrete tokenization using K-means improves efficiency but suffers from a fixed cluster count and limited adaptability to diverse speech signals. In this paper, we propose SED, a novel Structural Entropy-based Speech Discretization method that models speech features as graph nodes and performs adaptive clustering by minimizing two-dimensional (2D) structural entropy. SED automatically determines the optimal number of clusters and captures robust acoustic correlations, improving clustering quality. Experimental results demonstrate that SED achieves lower word error rates (WER) and higher clustering purity than K-means, highlighting its effectiveness for discrete token-based ASR.
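To make the core idea concrete, below is a minimal, illustrative sketch, not the paper's implementation: frame-level SSL features are linked into a k-nearest-neighbor similarity graph, and singleton clusters are greedily merged as long as the standard two-dimensional structural entropy of the partition decreases, so the number of clusters falls out of the stopping criterion rather than being fixed in advance. All function names, the RBF similarity weights, and the toy data are assumptions made for illustration.

```python
import numpy as np

def knn_graph(X, k=5):
    """Symmetric kNN similarity graph over feature vectors (RBF weights).
    Assumption: the paper's actual graph construction is not specified here."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    sigma = np.median(D)                      # simple bandwidth heuristic
    S = np.exp(-(D ** 2) / (2 * sigma ** 2))
    np.fill_diagonal(S, 0.0)                  # no self-loops
    A = np.zeros_like(S)
    nbrs = np.argsort(-S, axis=1)[:, :k]
    for i in range(len(X)):
        A[i, nbrs[i]] = S[i, nbrs[i]]
    return np.maximum(A, A.T)                 # symmetrize

def structural_entropy_2d(A, labels):
    """Standard 2D structural entropy of the partition `labels` on adjacency A."""
    deg = A.sum(axis=1)
    two_m = deg.sum()                         # total volume (2 * total edge weight)
    H = 0.0
    for c in np.unique(labels):
        idx = labels == c
        V = deg[idx].sum()                    # cluster volume
        g = A[idx][:, ~idx].sum()             # weight of edges leaving the cluster
        H -= np.sum(deg[idx] / two_m * np.log2(deg[idx] / V))  # node-level term
        H -= g / two_m * np.log2(V / two_m)                    # cluster-level term
    return H

def adaptive_cluster(A):
    """Greedy agglomerative merging: stop when no merge lowers the entropy,
    so the number of clusters is determined automatically."""
    labels = np.arange(A.shape[0])            # start from singleton clusters
    while True:
        clusters = np.unique(labels)
        best_h, best_pair = structural_entropy_2d(A, labels), None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                trial = labels.copy()
                trial[trial == clusters[j]] = clusters[i]
                h = structural_entropy_2d(A, trial)
                if h < best_h:
                    best_h, best_pair = h, (clusters[i], clusters[j])
        if best_pair is None:
            return labels
        labels[labels == best_pair[1]] = best_pair[0]

# Toy usage: three well-separated Gaussian blobs stand in for SSL frame features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.3, size=(20, 8)) for m in (0.0, 2.0, 4.0)])
labels = adaptive_cluster(knn_graph(X, k=5))
print("clusters found:", len(np.unique(labels)))
```

This brute-force sketch only illustrates the objective and the stopping rule; a practical system would need a scalable optimizer over the graph and a way to assign unseen frames to the learned clusters to produce discrete tokens.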
Paper Type: Long
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: Speech Recognition, Discrete Tokens, Structural Entropy, LLM
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 2117