Mini-Omni-Reasoner: Token-Level Thinking-in-Speaking in Large Speech Models

ICLR 2026 Conference Submission 13254 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Large Speech Models; Real-time Spoken Interaction; Token-level Interleaved Generation; Thinking-in-Speaking; Reasoning Models
TL;DR: MINI-OMNI-REASONER introduces Token-level Thinking-in-Speaking to Large Speech Models, achieving zero latency for complex real-time reasoning.
Abstract: Reasoning is essential for effective communication and decision-making. While recent advances in large language models (LLMs) and multimodal large language models (MLLMs) have shown that incorporating explicit reasoning significantly improves understanding and generalization, reasoning in large speech models (LSMs) remains in a nascent stage. Early efforts attempt to transfer the “thinking-before-speaking” paradigm from textual models to speech. However, this sequential formulation introduces notable latency, as spoken responses are delayed until reasoning is fully completed, impairing real-time interaction and communication efficiency. To address this, we propose Mini-Omni-Reasoner, a framework that enables reasoning within speech via a novel “thinking-in-speaking” formulation. Rather than completing reasoning before producing any verbal output, Mini-Omni-Reasoner interleaves silent reasoning tokens with spoken response tokens at the token level. This design allows continuous speech generation while embedding structured internal reasoning, leveraging the model’s high-frequency token processing capability. Although the two token streams are interleaved, local semantic alignment is enforced so that each response token is informed by its preceding reasoning. To support this framework, we introduce SPOKEN-MATH-PROBLEMS-3M, a large-scale dataset tailored for interleaved reasoning and response. The dataset ensures that verbal tokens consistently follow relevant reasoning content, enabling accurate and efficient learning of speech-coupled reasoning. Built on a hierarchical Thinker–Talker architecture, Mini-Omni-Reasoner delivers fluent yet logically grounded spoken responses, maintaining both naturalness and precision. On the Spoken-MQA benchmark, it achieves a +19.1% gain in arithmetic reasoning and +6.4% in contextual understanding, with shorter outputs and zero decoding latency.
These results demonstrate that high-quality reasoning and real-time spoken interaction can be effectively unified in a single framework.
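To make the “thinking-in-speaking” formulation concrete, the sketch below interleaves a silent reasoning stream with a spoken response stream at the token level, emitting a fixed number of reasoning tokens before each spoken token so that every response token is preceded by relevant reasoning. This is an illustrative assumption: the function name, the fixed `think_per_speak` ratio, and the toy token lists are hypothetical and do not reflect the paper's actual scheduling policy or its Thinker–Talker architecture.

```python
from typing import List, Tuple


def interleave_thinking_in_speaking(
    reasoning: List[str],
    response: List[str],
    think_per_speak: int = 2,
) -> List[Tuple[str, str]]:
    """Merge a reasoning stream and a response stream into one token
    sequence, tagging each token with its channel: 'think' tokens are
    silent, 'speak' tokens are rendered as audio.

    Each spoken token is preceded by up to `think_per_speak` reasoning
    tokens, a simple stand-in for the local semantic alignment described
    in the abstract (hypothetical fixed-ratio schedule).
    """
    out: List[Tuple[str, str]] = []
    r = 0
    for tok in response:
        # Emit silent reasoning tokens first, so the upcoming spoken
        # token is informed by its preceding reasoning.
        for _ in range(think_per_speak):
            if r < len(reasoning):
                out.append(("think", reasoning[r]))
                r += 1
        out.append(("speak", tok))
    # Flush any remaining reasoning tokens after speech ends.
    out.extend(("think", t) for t in reasoning[r:])
    return out
```

Because spoken tokens appear from the very first interleaving step rather than after the full reasoning trace, audio output can begin immediately, which is the intuition behind the zero-latency claim.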
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 13254