Keywords: Large Language Model, Reasoning, Generation, Speculative Decoding
TL;DR: We systematically analyze the token distribution misalignment for large reasoning models and propose a novel collaborative decoding method.
Abstract: Large reasoning models (LRMs) achieve strong reasoning performance by emitting long chains of thought (CoT), yet these verbose traces slow down inference and often drift into unnecessary detail, a behavior known as the overthinking phenomenon. To better understand LRMs' decoding behavior, we systematically analyze token distribution misalignment for recent capable LRMs. We observe a similar superficial alignment phenomenon, in which the misaligned tokens are mostly stylistic tokens tied to thinking patterns that typically occur at the beginning of sentences, which in turn gives rise to a novel \textit{sentence-level misalignment diminishing} phenomenon. Exploiting this insight, we propose FoReaL-Decoding, a collaborative fast-slow thinking decoding method for the cost-quality trade-off, in which a Leading model generates the first few tokens of each sentence and a weaker Drafting model then completes the remaining tokens to the end of the sentence, with the handoff controlled by a stochastic gate. FoReaL-Decoding thus smoothly interpolates between the small and the large model. On four popular math-reasoning benchmarks (AIME24, GPQA-Diamond, MATH500, AMC23), FoReaL-Decoding cuts theoretical FLOPs by 30–50% and trims CoT length by up to 40%, while preserving 86–100% of model performance. These results establish FoReaL-Decoding as a simple, plug-and-play route to controllable cost-quality trade-offs in reasoning-centric tasks.
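To make the decoding loop described above concrete, here is a minimal, hypothetical Python sketch of the per-sentence fast-slow handoff. Everything in it is an illustrative assumption rather than the paper's actual implementation: the callables `lead_step` and `draft_step` stand in for one greedy decoding step of the Leading and Drafting models, the sentence-boundary token set is a toy placeholder, and `k` (leading-token budget per sentence) and `gate_p` (stochastic-gate probability) are assumed parameter names.

```python
import random

SENTENCE_END = {".", "\n"}  # hypothetical sentence-boundary tokens

def foreal_decode(lead_step, draft_step, prompt_tokens,
                  k=8, gate_p=0.5, max_tokens=256):
    """Sketch of collaborative fast-slow decoding under the stated assumptions.

    lead_step / draft_step: callables mapping the current token list to the
    next token (stand-ins for one decoding step of the Leading / Drafting model)
    k: number of sentence-leading tokens taken from the Leading model (assumed)
    gate_p: per-sentence probability that the Leading model leads (assumed)
    """
    tokens = list(prompt_tokens)
    while len(tokens) - len(prompt_tokens) < max_tokens:
        # Stochastic gate: decide once per sentence whether the Leading
        # model supplies the first k tokens of this sentence.
        use_lead = random.random() < gate_p
        pos_in_sentence = 0
        while True:
            step = lead_step if (use_lead and pos_in_sentence < k) else draft_step
            tok = step(tokens)
            tokens.append(tok)
            pos_in_sentence += 1
            # Hand control back at sentence end, or stop at the token budget.
            if tok in SENTENCE_END or len(tokens) - len(prompt_tokens) >= max_tokens:
                break
    return tokens

# Toy usage with stand-in next-token functions (not real models):
lead = lambda toks: "think" if len(toks) % 7 else "."
draft = lambda toks: "step" if len(toks) % 5 else "."
out = foreal_decode(lead, draft, ["Q:", "solve"], k=3, gate_p=0.7, max_tokens=40)
```

Setting `gate_p` to 0 or 1 recovers pure Drafting-model or mostly Leading-model decoding, which is one plausible way the method could interpolate between the small and the large model.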
Primary Area: generative models
Submission Number: 4882