Keywords: spoken language model, neural audio codec, speech tokenization, multi-step prediction, semantic alignment, speech generation
Abstract: Neural audio codecs are widely used as tokenizers for spoken language models, but they are optimized for waveform reconstruction rather than autoregressive prediction.
This mismatch injects acoustically driven uncertainty into the discrete token space and increases language-model perplexity.
We propose LLM-Codec, which trains the codec encoder with language-model-facing objectives while keeping both codec and LLM architectures unchanged.
LLM-Codec introduces (i) future token prediction with Medusa-style multi-step heads to encourage multi-step predictability, and (ii) semantic alignment that matches audio and text representations via a memory-bank contrastive loss.
A differentiable Gumbel bridge enables end-to-end gradients from these objectives to the codec encoder.
On SALMon speech coherence, language models trained on LLM-Codec tokens reach 62.3\% accuracy (+13.0 points over AUV) while reducing perplexity 34$\times$.
On Codec-SUPERB-tiny, LLM-Codec improves speech Mel distance by 10.4\%.
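The abstract's "differentiable Gumbel bridge" presumably relaxes the codec's discrete token selection so that gradients from the LM-facing losses can reach the encoder. As a minimal sketch of that idea (the Gumbel-softmax relaxation of Jang et al.; the function name, temperature value, and toy logits below are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Soft, differentiable approximation of sampling a discrete code index.

    Gumbel noise perturbs the logits, and a temperature-scaled softmax
    replaces the hard argmax, so in an autograd framework gradients from
    downstream (LM-facing) losses could flow back through the codec encoder.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Sample standard Gumbel noise: -log(-log(U)), U ~ Uniform(0, 1)
    g = -np.log(-np.log(rng.uniform(1e-10, 1.0, size=logits.shape)))
    y = (logits + g) / tau
    # Numerically stable softmax over the codebook dimension
    y = np.exp(y - y.max(axis=-1, keepdims=True))
    return y / y.sum(axis=-1, keepdims=True)

# Toy example: similarity scores against a 4-entry codebook become logits.
logits = np.array([2.0, 0.5, -1.0, 0.1])
soft = gumbel_softmax(logits, tau=0.5, rng=np.random.default_rng(0))
assert np.isclose(soft.sum(), 1.0)  # a valid soft assignment over codes
```

At low temperature the soft weights concentrate on one codebook entry, approaching a hard token choice; a straight-through variant (hard one-hot forward, soft gradient backward) is a common design choice for keeping the discrete token interface intact.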
Paper Type: Long
Research Area: Speech Processing and Spoken Language Understanding
Research Area Keywords: speech technologies, spoken language understanding, spoken language grounding
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 9960