Abstract: Building speech processing models with Large Language Models (LLMs) has become an effective new paradigm. A key challenge in this approach is representing speech features so that they align well with LLMs. While continuous speech features from self-supervised learning (SSL) models capture rich information, they pose alignment challenges and incur high computational costs. Discrete tokenization using K-means improves efficiency but suffers from a fixed cluster count and limited adaptability to diverse speech signals. In this paper, we propose SED, a novel Structural Entropy-based Speech Discretization method that models speech features as graph nodes and performs adaptive clustering by minimizing two-dimensional (2D) structural entropy. SED automatically determines the optimal number of clusters and captures robust acoustic correlations, improving clustering quality. Experimental results demonstrate that SED achieves lower word error rates (WER) and higher clustering purity than K-means, highlighting its effectiveness for discrete token-based ASR.
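To make the core idea concrete, below is a minimal, illustrative sketch, not the paper's implementation: frame-level SSL features are linked into a k-nearest-neighbor similarity graph, and singleton clusters are greedily merged as long as the standard two-dimensional structural entropy of the partition decreases, so the number of clusters falls out of the stopping criterion rather than being fixed in advance. All function names, the RBF similarity weights, and the toy data are assumptions made for illustration.

```python
import numpy as np

def knn_graph(X, k=5):
    """Symmetric kNN similarity graph over feature vectors (RBF weights).
    Assumption: the paper's actual graph construction is not specified here."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    sigma = np.median(D)                      # simple bandwidth heuristic
    S = np.exp(-(D ** 2) / (2 * sigma ** 2))
    np.fill_diagonal(S, 0.0)                  # no self-loops
    A = np.zeros_like(S)
    nbrs = np.argsort(-S, axis=1)[:, :k]
    for i in range(len(X)):
        A[i, nbrs[i]] = S[i, nbrs[i]]
    return np.maximum(A, A.T)                 # symmetrize

def structural_entropy_2d(A, labels):
    """Standard 2D structural entropy of the partition `labels` on adjacency A."""
    deg = A.sum(axis=1)
    two_m = deg.sum()                         # total volume (2 * total edge weight)
    H = 0.0
    for c in np.unique(labels):
        idx = labels == c
        V = deg[idx].sum()                    # cluster volume
        g = A[idx][:, ~idx].sum()             # weight of edges leaving the cluster
        H -= np.sum(deg[idx] / two_m * np.log2(deg[idx] / V))  # node-level term
        H -= g / two_m * np.log2(V / two_m)                    # cluster-level term
    return H

def adaptive_cluster(A):
    """Greedy agglomerative merging: stop when no merge lowers the entropy,
    so the number of clusters is determined automatically."""
    labels = np.arange(A.shape[0])            # start from singleton clusters
    while True:
        clusters = np.unique(labels)
        best_h, best_pair = structural_entropy_2d(A, labels), None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                trial = labels.copy()
                trial[trial == clusters[j]] = clusters[i]
                h = structural_entropy_2d(A, trial)
                if h < best_h:
                    best_h, best_pair = h, (clusters[i], clusters[j])
        if best_pair is None:
            return labels
        labels[labels == best_pair[1]] = best_pair[0]

# Toy usage: three well-separated Gaussian blobs stand in for SSL frame features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.3, size=(20, 8)) for m in (0.0, 2.0, 4.0)])
labels = adaptive_cluster(knn_graph(X, k=5))
print("clusters found:", len(np.unique(labels)))
```

This brute-force sketch only illustrates the objective and the stopping rule; a practical system would need a scalable optimizer over the graph and a way to assign unseen frames to the learned clusters to produce discrete tokens.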
Paper Type: Long
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: Speech Recognition, Discrete Tokens, Structural Entropy, LLM
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 2117