SPECS: Faster Test-Time Scaling through Speculative Drafts and Dynamic Switching

06 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Large language model, test time compute, speculative decoding
Abstract: Scaling test-time compute has driven the recent advances in the reasoning capabilities of large language models (LLMs). However, increased compute often comes at the expense of higher user-facing latency, directly impacting user experience. Current test-time scaling methods primarily optimize for accuracy based on total compute resources (FLOPs), often overlooking latency constraints. To address this gap, we propose SPECS, a latency-aware test-time scaling method. SPECS builds upon beam search, which generates multiple reasoning traces for each step with a reasoning model, and selects one to continue from based on the scores from a dedicated reward model. Inspired by speculative decoding, SPECS uses a smaller, faster model to generate candidate traces efficiently, and evaluates these candidates with both the reasoning model and the reward model. We design novel strategies to select candidate drafts using these model evaluations, including reward-guided soft verification, and a dynamic switching mechanism to defer to the larger model on harder steps. Empirical results on MATH500, AMC23 and OlympiadBench datasets show that SPECS matches or surpasses the accuracy of beam search while reducing latency by up to $\sim$18\%. Our theoretical analysis shows that our algorithm converges to the solution of a KL-regularized reinforcement learning objective as the beam width grows.
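The per-step loop the abstract describes can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the candidate generators, the reward function, and the `accept_threshold` parameter are all stand-ins, and the real method additionally uses reasoning-model scores and reward-guided soft verification when selecting among drafts.

```python
# Hypothetical sketch of one SPECS-style step (not the authors' code).
# A small draft model proposes candidate reasoning steps, a reward model
# scores them, and when the best draft scores below a threshold we defer
# to the larger target model for that step (dynamic switching).
from typing import Callable, List, Tuple

def specs_step(
    draft_candidates: Callable[[str, int], List[str]],   # small model: (prefix, k) -> k candidate steps
    target_candidates: Callable[[str, int], List[str]],  # large model: (prefix, k) -> k candidate steps
    reward: Callable[[str, str], float],                 # reward model: (prefix, step) -> score
    prefix: str,
    beam_width: int = 4,
    accept_threshold: float = 0.5,                       # assumed switching rule for illustration
) -> Tuple[str, bool]:
    """Return (chosen step, whether the large model was invoked)."""
    drafts = draft_candidates(prefix, beam_width)
    best_score, best_step = max((reward(prefix, s), s) for s in drafts)
    if best_score >= accept_threshold:
        return best_step, False  # cheap draft accepted
    # Harder step: defer to the larger model and re-score its candidates.
    _, best_step = max((reward(prefix, s), s) for s in target_candidates(prefix, beam_width))
    return best_step, True
```

Latency savings come from the accepted branch: most steps are served by the fast draft model, and the expensive target model runs only when the reward signal flags a hard step.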
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 2494