Keywords: Test Time Scaling, Efficient Decoding
TL;DR: We use sub-argmax probability spikes of control tokens (e.g., the end-of-thinking token, EOT) to dynamically determine optimal stopping in LLMs. This boosts reasoning accuracy and efficiency on complex math benchmarks such as AIME-2025 and MATH-500.
Abstract: We introduce a framework of Adaptive Control Token Sampling (ACTS) policies that leverage probability signals from specific tokens in the LLM vocabulary to dynamically regulate optimal stopping in the generation process. Specifically, ACTS combats over-thinking and under-thinking in LLMs by leveraging adaptive signals about the generation trace at test time, offering superior test-time scaling properties. Our experiments show that ACTS effectively mitigates under-thinking on complex reasoning tasks using adaptive stopping-time policies. Furthermore, we propose an \textbf{Adaptive Self-Critique Sampler} that uses end-of-thinking spikes as triggers for self-evaluation, boosting reasoning accuracy by up to $\sim 9.8\%$ on MATH-500. On instruction-following tasks, ACTS leverages end-of-sequence spikes to improve the quality-efficiency trade-off. Finally, we use spikes to propose a novel parallel sampling technique that intelligently initiates high-quality parallel reasoning trajectories from a shared, sequentially generated thinking trace. Our work establishes control token probabilities as a powerful, untapped signal for creating more robust and efficient inference policies, offering a new paradigm for controlling test-time scaling.
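To make the core mechanism concrete, here is a minimal, hypothetical sketch of a spike-triggered stopping rule over per-step token probabilities. The function names, threshold value, and toy trace below are our own illustrative assumptions, not the paper's implementation; the key idea it mirrors is that an EOT "spike" can fire even when EOT is not the argmax token.

```python
# Hypothetical sketch of spike-triggered adaptive stopping (assumed API,
# not the paper's code). An "EOT spike" is a decoding step where the
# end-of-thinking token's probability exceeds a threshold, even if EOT
# is not the argmax (most likely) token at that step.

def detect_eot_spike(step_probs, eot_id, threshold=0.2):
    """Return True if the EOT token's probability exceeds `threshold`.

    step_probs: per-token probabilities for one decoding step.
    eot_id: vocabulary index of the end-of-thinking control token.
    """
    return step_probs[eot_id] >= threshold

def adaptive_stop(trace_probs, eot_id, threshold=0.2):
    """Scan a generation trace and return the first step index at which
    an EOT spike fires, or None if generation should continue."""
    for t, step_probs in enumerate(trace_probs):
        if detect_eot_spike(step_probs, eot_id, threshold):
            return t
    return None

# Toy trace over a 3-token vocabulary; EOT is token index 2.
trace = [
    [0.70, 0.25, 0.05],
    [0.60, 0.35, 0.05],
    [0.50, 0.40, 0.10],
    [0.40, 0.30, 0.30],  # sub-argmax spike: EOT is not argmax, but >= 0.2
]
print(adaptive_stop(trace, eot_id=2))  # → 3
```

In practice this stopping signal could equally trigger other actions the abstract mentions, such as launching a self-critique pass or forking parallel trajectories, rather than terminating generation outright.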
Submission Number: 125