HH-Codec: High Compression High-fidelity Discrete Neural Codec for Spoken Language Modeling

ICML 2025 Workshop TokShop Submission 21

Published: 10 Jun 2025, Last Modified: 13 Jun 2025
License: CC BY 4.0
Archiving Submission: No (non-archival)
Previous Venue If Non Archival: N/A
Keywords: Speech Codec, Discrete Speech Tokenization
Abstract: Discrete speech tokenization is a fundamental component in speech codecs. However, in large-scale speech-to-speech systems, the complexity of parallel streams from multiple quantizers and the computational cost of high-time-dimensional codecs pose significant challenges. In this paper, we introduce HH-Codec, a neural codec that achieves extreme compression at 24 tokens per second for 24 kHz audio while relying on single-quantizer inference. Our approach involves a carefully designed Vector Quantization space for Spoken Language Modeling, optimizing compression efficiency while minimizing information loss. Building on this, we propose an asymmetric encoder-decoder architecture (Audio-VQ-Mel-Audio) that leverages dual supervision and progressive training to enhance reconstruction stability and fidelity. HH-Codec achieves state-of-the-art performance in speech reconstruction with an ultra-low bandwidth of 0.3 kbps. We further evaluate its effectiveness in codebook utilization and generative model adaptation, with extensive ablations validating the necessity of each module.
Submission Number: 21
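For context on how the abstract's headline numbers relate, the 0.3 kbps bandwidth follows arithmetically from the single-quantizer token rate. Below is a minimal back-of-the-envelope sketch in Python; the codebook size of 8192 entries is an assumption for illustration (it is not stated in the abstract), chosen because 13 bits/token at 24 tokens/s reproduces the reported figure.

```python
import math

def codec_bitrate_bps(tokens_per_second: float, codebook_size: int) -> float:
    """Bitrate of a single-codebook codec: each token carries log2(K) bits."""
    return tokens_per_second * math.log2(codebook_size)

# HH-Codec emits 24 tokens/s (from the abstract). The codebook size is an
# assumed value: 8192 entries -> 13 bits/token -> 24 * 13 = 312 bps ~ 0.3 kbps.
if __name__ == "__main__":
    bps = codec_bitrate_bps(tokens_per_second=24, codebook_size=8192)
    print(f"{bps:.0f} bps = {bps / 1000:.2f} kbps")
```

The same formula explains why a single quantizer is attractive for spoken language modeling: with one stream, a downstream language model predicts one token per step rather than several parallel codebook indices per frame.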