Keywords: speech dialogue
Abstract: End-to-end speech dialogue models typically inherit knowledge from pretrained text LLMs by continually learning to model speech tokens. Thus, alignment between speech tokens and text tokens becomes critical: effective alignment enables more efficient transfer of LLM knowledge from text to speech. A well-known challenge lies in the sequence length mismatch between speech tokens and text tokens. But is matching them on average (i.e., making one second of speech map to the same number of tokens as the corresponding text) truly optimal? The answer remains unknown.
In this paper, we aim to discover the optimal speech sequence length—equivalently, the frame rate of speech tokens—for inheriting text LLM knowledge. First, we observe that when forcing speech length to approach text length under a standard LLM architecture (single VQ + single LM head), information loss becomes a bottleneck. To address this, we propose a new method that enables scaling the VQ codebook capacity up to nearly 300 bits per frame, coupled with an efficient audio LM head. This design preserves sufficient speech information under aggressive downsampling while aligning sequence lengths more closely with those of text tokens, all without modifying the base text LLM backbone. Next, we explore three alignment tasks—speech→text, text→speech, and speech→speech—under different speech token frame rates, while keeping the text LLM parameters frozen. Furthermore, we find that speech and text representations still occupy distinct latent subspaces in the LLM. To mitigate this gap, we introduce a representation alignment objective to further strengthen cross-modal alignment.
Experiments show that with alignment alone, a frozen text LLM can already perform pure speech-to-speech QA, achieving competitive results on speech QA benchmarks.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 204