Keywords: multimodal large language model, large language model, speech language model
TL;DR: We present a true speech-to-speech LLM that understands and generates speech directly, without text intermediates, achieving state-of-the-art spoken QA.
Abstract: Spoken dialogue systems often rely on cascaded pipelines that transcribe, process, and resynthesize speech. While effective, this design discards paralinguistic cues and limits expressivity. Recent end-to-end methods reduce latency and better preserve these cues, yet still rely on text intermediates, creating a fundamental bottleneck. We present a true speech-to-speech large language model that directly understands and generates speech without relying on text guidance. Our approach combines a modality-based layer-splitting architecture with a frozen pre-training strategy, preserving the reasoning and knowledge of pretrained text LLMs while adding native speech capabilities. Experiments show that our model achieves state-of-the-art results in spoken question answering and delivers speech-to-speech performance comparable to that of existing text-guided systems, while maintaining competitive text performance. By narrowing the gap between text-guided and direct speech generation, our work establishes a new paradigm for expressive and efficient end-to-end speech interaction. We will release our code and models to support further research in true speech-to-speech foundation models.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 11530