Keywords: Large Language Model, Reasoning, Generation, Speculative Decoding
TL;DR: We systematically analyze the token distribution misalignment for large reasoning models and propose a novel collaborative decoding method.
Abstract: Large reasoning models (LRMs) achieve strong reasoning performance by emitting long chains of thought (CoT), yet these verbose traces slow down inference and often drift into unnecessary detail, a behavior known as the overthinking phenomenon. To better understand LRMs' decoding behavior, we systematically analyze token distribution misalignment for recent capable LRMs. We observe a similar superficial alignment phenomenon, in which the misaligned tokens are mostly stylistic tokens tied to thinking patterns that typically occur at the beginning of sentences, which in turn gives rise to a novel \textit{sentence-level misalignment diminishing} phenomenon. Exploiting this insight, we propose FoReaL-Decoding, a collaborative fast-slow thinking decoding method for the cost-quality trade-off, in which a Leading model generates the first few tokens of each sentence and a weaker Drafting model then completes the remaining tokens to the end of the sentence, with the handoff controlled by a stochastic gate. FoReaL-Decoding thus smoothly interpolates between the small and the large model. On four popular math-reasoning benchmarks (AIME24, GPQA-Diamond, MATH500, AMC23), FoReaL-Decoding cuts theoretical FLOPs by 30–50% and trims CoT length by up to 40%, while preserving 86–100% of model performance. These results establish FoReaL-Decoding as a simple, plug-and-play route to controllable cost-quality trade-offs in reasoning-centric tasks.
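To make the decoding loop described above concrete, here is a minimal, hypothetical Python sketch of the per-sentence fast-slow handoff. Everything in it is an illustrative assumption rather than the paper's actual implementation: the callables `lead_step` and `draft_step` stand in for one greedy decoding step of the Leading and Drafting models, the sentence-boundary token set is a toy placeholder, and `k` (leading-token budget per sentence) and `gate_p` (stochastic-gate probability) are assumed parameter names.

```python
import random

SENTENCE_END = {".", "\n"}  # hypothetical sentence-boundary tokens

def foreal_decode(lead_step, draft_step, prompt_tokens,
                  k=8, gate_p=0.5, max_tokens=256):
    """Sketch of collaborative fast-slow decoding under the stated assumptions.

    lead_step / draft_step: callables mapping the current token list to the
    next token (stand-ins for one decoding step of the Leading / Drafting model)
    k: number of sentence-leading tokens taken from the Leading model (assumed)
    gate_p: per-sentence probability that the Leading model leads (assumed)
    """
    tokens = list(prompt_tokens)
    while len(tokens) - len(prompt_tokens) < max_tokens:
        # Stochastic gate: decide once per sentence whether the Leading
        # model supplies the first k tokens of this sentence.
        use_lead = random.random() < gate_p
        pos_in_sentence = 0
        while True:
            step = lead_step if (use_lead and pos_in_sentence < k) else draft_step
            tok = step(tokens)
            tokens.append(tok)
            pos_in_sentence += 1
            # Hand control back at sentence end, or stop at the token budget.
            if tok in SENTENCE_END or len(tokens) - len(prompt_tokens) >= max_tokens:
                break
    return tokens

# Toy usage with stand-in next-token functions (not real models):
lead = lambda toks: "think" if len(toks) % 7 else "."
draft = lambda toks: "step" if len(toks) % 5 else "."
out = foreal_decode(lead, draft, ["Q:", "solve"], k=3, gate_p=0.7, max_tokens=40)
```

Setting `gate_p` to 0 or 1 recovers pure Drafting-model or mostly Leading-model decoding, which is one plausible way the method could interpolate between the small and the large model.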
Primary Area: generative models
Submission Number: 4882