SPECS: Faster Test-Time Scaling through Speculative Drafts and Dynamic Switching

06 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Large language model, test time compute, speculative decoding
Abstract: Scaling test-time compute has driven the recent advances in the reasoning capabilities of large language models (LLMs). However, increased compute often comes at the expense of higher user-facing latency, directly impacting user experience. Current test-time scaling methods primarily optimize for accuracy based on total compute resources (FLOPs), often overlooking latency constraints. To address this gap, we propose SPECS, a latency-aware test-time scaling method. SPECS builds upon beam search, which generates multiple reasoning traces for each step with a reasoning model, and selects one to continue from based on the scores from a dedicated reward model. Inspired by speculative decoding, SPECS uses a smaller, faster model to generate candidate traces efficiently, and evaluates these candidates with both the reasoning model and the reward model. We design novel strategies to select candidate drafts using these model evaluations, including reward-guided soft verification, and a dynamic switching mechanism to defer to the larger model on harder steps. Empirical results on MATH500, AMC23 and OlympiadBench datasets show that SPECS matches or surpasses the accuracy of beam search while reducing latency by up to $\sim$18\%. Our theoretical analysis shows that our algorithm converges to the solution of a KL-regularized reinforcement learning objective as the beam width grows.
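The per-step loop the abstract describes can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the candidate generators, the reward function, and the `accept_threshold` parameter are all stand-ins, and the real method additionally uses reasoning-model scores and reward-guided soft verification when selecting among drafts.

```python
# Hypothetical sketch of one SPECS-style step (not the authors' code).
# A small draft model proposes candidate reasoning steps, a reward model
# scores them, and when the best draft scores below a threshold we defer
# to the larger target model for that step (dynamic switching).
from typing import Callable, List, Tuple

def specs_step(
    draft_candidates: Callable[[str, int], List[str]],   # small model: (prefix, k) -> k candidate steps
    target_candidates: Callable[[str, int], List[str]],  # large model: (prefix, k) -> k candidate steps
    reward: Callable[[str, str], float],                 # reward model: (prefix, step) -> score
    prefix: str,
    beam_width: int = 4,
    accept_threshold: float = 0.5,                       # assumed switching rule for illustration
) -> Tuple[str, bool]:
    """Return (chosen step, whether the large model was invoked)."""
    drafts = draft_candidates(prefix, beam_width)
    best_score, best_step = max((reward(prefix, s), s) for s in drafts)
    if best_score >= accept_threshold:
        return best_step, False  # cheap draft accepted
    # Harder step: defer to the larger model and re-score its candidates.
    _, best_step = max((reward(prefix, s), s) for s in target_candidates(prefix, beam_width))
    return best_step, True
```

Latency savings come from the accepted branch: most steps are served by the fast draft model, and the expensive target model runs only when the reward signal flags a hard step.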
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 2494