SPECS: Faster Test-Time Scaling through Speculative Drafts

Published: 11 Jun 2025, Last Modified: 10 Jul 2025 · ES-FoMo III · CC BY 4.0
Keywords: test-time scaling, speculative decoding, beam search, inference-time alignment
TL;DR: We introduce SPECS, a fast test-time scaling method that accelerates LLM reasoning by using a small model to draft candidate sequences and a larger model, together with a reward model, to evaluate them, reducing latency while maintaining comparable accuracy.
Abstract: Scaling test-time compute has driven the recent advances in the reasoning capabilities of large language models (LLMs), typically by allocating additional computation for more thorough exploration. However, increased compute often comes at the expense of higher user-facing latency, directly impacting the user experience. Current test-time scaling methods primarily optimize accuracy as a function of total compute (FLOPs), often overlooking latency constraints. To address this gap, we propose SPECS, a latency-aware test-time scaling method inspired by speculative decoding. SPECS uses a smaller, faster model to generate candidate sequences efficiently, and evaluates these candidates using signals from both a larger target model and a dedicated reward model. We introduce new integration strategies, including reward-guided soft verification and a reward-based deferral mechanism. Empirical results on the MATH-500 and AMC23 datasets show that SPECS matches or surpasses beam search accuracy while reducing latency by up to 15.3%. Our theoretical analysis shows that our algorithm converges to the solution of a KL-regularized reinforcement learning objective as the beam width increases.
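The draft-then-verify loop described in the abstract can be illustrated with a minimal toy sketch. This is not the paper's implementation: `draft_candidates`, `target_logprob`, and `reward` are hypothetical stand-ins for the small draft LLM, the larger target LLM, and the reward model, and the scoring rule is only a schematic of reward-guided soft verification with a reward-based deferral.

```python
# Toy sketch of one SPECS-style draft-then-verify step.
# All three "models" below are hypothetical placeholders, not the
# actual components used in the paper.

def draft_candidates(prefix: str, k: int) -> list[str]:
    # Small, fast draft model proposes k candidate continuations.
    return [f"{prefix}+d{i}" for i in range(k)]

def target_logprob(seq: str) -> float:
    # Larger target model scores a candidate (toy: longer = less likely).
    return -0.1 * len(seq)

def reward(seq: str) -> float:
    # Reward model signal (toy: prefer even-length sequences).
    return 1.0 if len(seq) % 2 == 0 else -1.0

def specs_step(prefix: str, k: int = 4, beta: float = 1.0,
               defer_threshold: float = -0.5) -> str:
    """One step: soft verification combines the target log-prob with the
    reward; if every candidate's reward is poor, defer to the target."""
    cands = draft_candidates(prefix, k)
    if max(reward(c) for c in cands) < defer_threshold:
        # Deferral: drafts look bad, so in the real algorithm the slower
        # target model would generate this block itself.
        return prefix + "+target"
    scored = [(target_logprob(c) + beta * reward(c), c) for c in cands]
    _, best = max(scored)
    return best

print(specs_step("x"))
```

Because verification scores a whole drafted block at once rather than every token with the large model, most decoding steps run at the draft model's latency, which is the source of the speedup the abstract reports.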
Submission Number: 141