AlignChat: Endowing LLMs with End-to-End Speech-to-Text Chat Capability through Token-Level Representation Alignment

Published: 08 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · License: CC BY 4.0
Keywords: speech language model, large multimodal model, speech processing, representation alignment
TL;DR: AlignChat is an efficient yet robust framework that equips frozen LLM backbones with end-to-end speech understanding capability through token-level representation alignment.
Abstract: The advent of large multimodal models (LMMs) such as GPT-4o has intensified interest in equipping large language models (LLMs) with end-to-end speech understanding capabilities. Existing methods typically employ encoder-based audio tokenizers to map speech into audio tokens that serve as LLM inputs. While effective, the frequency discrepancy between audio and text tokens demands large quantities of speech data and costly LLM finetuning to achieve cross-modality alignment, and can degrade the original capabilities of the LLM backbone. In this work, we introduce *AlignChat*, a simple yet effective framework that bridges the speech and text modalities via a speech tokenizer with an encoder–decoder Transformer architecture, ensuring precise one-to-one token-level alignment and efficient cross-modality knowledge transfer without finetuning the LLM backbone. AlignChat adopts a two-stage training scheme. The computation-efficient pretraining stage requires only the speech tokenizer and the LLM's embeddings for preliminary cross-modality alignment, while the instruction-tuning stage uses self-generated speech-instruction-response pairs to ensure consistency between the speech- and text-conditioned behavior of AlignChat. Experiments demonstrate that AlignChat achieves strong performance on speech-to-text chat benchmarks with only ~1/20 of the speech data used by previous methods.
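The abstract's pretraining stage, which aligns speech token representations one-to-one with the frozen LLM's text-token embeddings, can be illustrated with a minimal sketch. This is not the paper's implementation: the speech tokenizer is reduced to a single linear projection, the data are random placeholders, and all names, shapes, and the MSE objective are illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch of token-level representation alignment: each speech
# token representation is trained to match the frozen LLM embedding of the
# corresponding text token (one-to-one). The "speech tokenizer" here is just
# a linear projection W fitted with an MSE loss via gradient descent.
rng = np.random.default_rng(0)
T, d_speech, d_llm = 8, 16, 32                 # sequence length, feature dims

speech_feats = rng.normal(size=(T, d_speech))  # stand-in tokenizer outputs
text_embeds = rng.normal(size=(T, d_llm))      # stand-in frozen LLM embeddings
W = np.zeros((d_speech, d_llm))                # the only trainable parameter

def mse(a, b):
    return float(np.mean((a - b) ** 2))

lr = 0.05
losses = []
for _ in range(200):
    pred = speech_feats @ W                    # projected speech tokens
    # gradient of mean((XW - Y)^2) with respect to W
    grad = 2.0 * speech_feats.T @ (pred - text_embeds) / (T * d_llm)
    W -= lr * grad
    losses.append(mse(speech_feats @ W, text_embeds))

print(f"alignment loss  start: {losses[0]:.3f}  end: {losses[-1]:.3f}")
```

Because the sequences are aligned one-to-one (same length T on both sides), the loss is a plain per-token distance; no length-mismatch handling is needed, which is the efficiency argument the abstract makes against frequency-mismatched audio tokenizers.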
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 2926