LLM-Codec: Neural Audio Codec Meets Language Model Objectives

ACL ARR 2026 January Submission 9960 Authors

06 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission. License: CC BY 4.0
Keywords: spoken language model, neural audio codec, speech tokenization, multi-step prediction, semantic alignment, speech generation
Abstract: Neural audio codecs are widely used as tokenizers for spoken language models, but they are optimized for waveform reconstruction rather than autoregressive prediction. This mismatch injects acoustically driven uncertainty into the discrete token space and increases language-model perplexity. We propose LLM-Codec, which trains the codec encoder with language-model-facing objectives while keeping both codec and LLM architectures unchanged. LLM-Codec introduces (i) future token prediction with Medusa-style multi-step heads to encourage multi-step predictability, and (ii) semantic alignment that matches audio and text representations via a memory-bank contrastive loss. A differentiable Gumbel bridge enables end-to-end gradients from these objectives to the codec encoder. On SALMon speech coherence, token LMs trained on LLM-Codec reach 62.3\% accuracy (+13.0 points over AUV) while reducing perplexity 34$\times$. On Codec-SUPERB-tiny, LLM-Codec improves speech Mel distance by 10.4\%.
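The "differentiable Gumbel bridge" in the abstract plausibly refers to Gumbel-softmax relaxation of the codec's discrete token selection, which lets the language-model-facing losses backpropagate into the codec encoder. The sketch below shows only the generic Gumbel-softmax sampling step in numpy (forward pass only); the function name, temperature value, and the straight-through detail noted in the comments are assumptions, not the paper's exact formulation.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Sample a relaxed (soft) one-hot vector via the Gumbel-softmax trick.

    A minimal sketch of the kind of differentiable bridge the abstract
    describes; the actual LLM-Codec formulation may differ.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1)
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + gumbel) / tau
    y = np.exp(y - y.max())          # numerically stable softmax
    return y / y.sum()

# Example: three codebook logits, low temperature pushes the sample
# toward a near one-hot vector. In training, a straight-through
# estimator would pass gradients from the hard codebook index back
# to the codec encoder.
rng = np.random.default_rng(0)
probs = gumbel_softmax(np.array([2.0, 0.5, -1.0]), tau=0.5, rng=rng)
```

At low temperature the relaxed sample approaches a hard codebook assignment, while remaining differentiable with respect to the encoder's logits — which is what allows end-to-end gradients from the prediction and alignment objectives.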
Paper Type: Long
Research Area: Speech Processing and Spoken Language Understanding
Research Area Keywords: speech technologies, spoken language understanding, spoken language grounding
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 9960