Keywords: LLM Reasoning; LLM Inference; Speculative Decoding
TL;DR: A training-free framework that enables large reasoning models to guide smaller ones during inference at the reasoning level, in contrast to speculative decoding, which operates at the token level
Abstract: Recent advances leverage post-training to enhance model reasoning performance, but such pipelines are costly to train and still produce inefficient, overly lengthy outputs. We introduce **Speculative Thinking**, a training-free framework that enables large reasoning models to guide smaller ones during inference at the reasoning level, distinct from speculative decoding, which operates at the token level. Our approach is based on two observations: (1) reasoning-supportive tokens such as "wait" frequently appear after structural delimiters like "\n\n", serving as signals for reflection or continuation; and (2) larger models exhibit stronger control over reflective behavior, reducing unnecessary backtracking while improving reasoning quality. By strategically delegating reflective steps to a more capable model, our method significantly boosts the accuracy of small reasoning models while shortening their outputs. With the assistance of a 32B reasoning model, the 1.5B model's accuracy on MATH500 increases from 83.2% to 89.4%, an absolute improvement of 6.2 percentage points. Simultaneously, the average output length drops from 5439 tokens to 4583 tokens, a 15.7% decrease. Moreover, when applied to a non-reasoning model (Qwen-2.5-7B-Instruct), our framework boosts its accuracy on the same benchmark from 74.0% to 81.8%, an absolute improvement of 7.8 percentage points.
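The delegation loop the abstract describes can be sketched in a few lines. The Python below is a minimal illustration under stated assumptions, not the authors' implementation: `small_generate`, `large_generate`, the `REFLECTION_CUES` list, and the stop condition are all hypothetical stand-ins for the paper's actual segmentation and handoff logic.

```python
# Minimal sketch of reasoning-level speculation: the small model drafts
# segments delimited by "\n\n"; segments that open with a reflection cue
# (e.g. "wait") are handed off to the large model. All names are hypothetical.

REFLECTION_CUES = ("wait", "hmm", "alternatively")  # illustrative cue tokens

def speculative_thinking(prompt,
                         small_generate,   # callable: text -> next segment (small model)
                         large_generate,   # callable: text -> next segment (large model)
                         max_segments=64):
    text = prompt
    for _ in range(max_segments):
        segment = small_generate(text)  # draft up to the next structural delimiter
        if segment.strip().lower().startswith(REFLECTION_CUES):
            # Reflective step: delegate to the stronger model so the
            # reflection resolves with less backtracking.
            segment = large_generate(text)
        text += segment
        if not segment:
            break  # hypothetical stop condition: generator signals completion
    return text
```

A toy usage with stub generators (a real deployment would wrap actual model-serving calls):

```python
drafts = iter(["Let x = 4.\n\n", "Wait, check that.\n\n", "So x = 4.\n\n", ""])
out = speculative_thinking("Solve: 2x = 8.\n\n",
                           small_generate=lambda t: next(drafts),
                           large_generate=lambda t: "Recheck: 2*4 = 8, so x = 4.\n\n")
print(out)
```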
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 636