Keywords: Spoken language model, chain-of-thought reasoning
Abstract: Current large language models (LLMs) and spoken language models (SLMs) begin thinking and taking action only \textit{after} the user has finished their turn.
This introduces high latency, since the system must wait for the model to finish thinking before it can respond.
Consequently, thinking only \textit{after} receiving the full input is ill-suited to speech-to-speech interaction, where real-time, low-latency responses are essential.
We address this issue by drawing inspiration from the fact that humans naturally \textit{``think while listening''}.
In this paper, we propose \textbf{Shanks}, a general inference framework that enables SLMs to generate unspoken chain-of-thought reasoning while listening to the user input.
Shanks streams the input speech in fixed-duration chunks and, as soon as a chunk is received, generates unspoken reasoning conditioned on all previous speech and reasoning, while the user is still speaking.
Shanks uses this unspoken reasoning to perform intermediate calculations, make API calls to complete the task, and decide whether to interrupt the user.
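To make the chunked think-while-listening loop concrete, the following is a minimal Python sketch; the chunk duration and the \texttt{slm.reason}, \texttt{slm.should\_interrupt}, and \texttt{slm.speak} interfaces are hypothetical, introduced only for illustration and not the paper's actual implementation.

```python
# Minimal sketch of the think-while-listening loop (illustrative only).
# `audio_stream`, `slm.reason`, `slm.should_interrupt`, and `slm.speak` are
# hypothetical interfaces assumed for this example, not the actual SHANKS API.

CHUNK_SECONDS = 2.0  # fixed-duration input chunks (value chosen for illustration)


def listen_and_think(audio_stream, slm):
    speech_chunks = []    # all user speech received so far
    reasoning_steps = []  # unspoken chain-of-thought generated so far

    for chunk in audio_stream.chunks(duration=CHUNK_SECONDS):
        speech_chunks.append(chunk)

        # After each chunk, generate unspoken reasoning conditioned on
        # all previous speech and all previous reasoning.
        thought = slm.reason(speech=speech_chunks, prior_reasoning=reasoning_steps)
        reasoning_steps.append(thought)

        # The reasoning can trigger actions while the user is still speaking,
        # e.g. an API call or an interruption when a mistake is detected.
        if slm.should_interrupt(thought):
            return slm.speak(reasoning_steps)

    # User finished their turn; respond using the accumulated reasoning.
    return slm.speak(reasoning_steps)
```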
We demonstrate that Shanks enhances the real-time user-SLM interaction in two scenarios:
(1) When the user is presenting their solution to a math problem, Shanks can listen to and reason over the user's speech and interrupt when the user makes a mistake.
Shanks interrupts the user with 37.1% higher accuracy than a baseline that interrupts without thinking.
(2) In a task-oriented dialogue setting, where the user's request needs to be completed by calling hotel and flight booking APIs, Shanks can complete 63.2% of the API calls before the user even ends their turn.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 17696