Scaling Test-Time Compute Without Verification or RL is Suboptimal

Published: 06 Mar 2025, Last Modified: 11 Apr 2025. ICLR 2025 Workshop VerifAI Oral. License: CC BY 4.0
Keywords: test-time compute, LLM, reasoning, RL, RL theory, verification-generation gap
TL;DR: We show that methods using a verification signal scale test-time compute better than verifier-free methods, asymptotically.
Abstract: Despite substantial improvements in LLM capabilities from scaling test-time compute, an ongoing debate in the community is how it should be scaled so as to enable continued and efficient improvements. There are largely two approaches: first, distilling successful search procedures; and second, using verification (e.g., 0/1 correctness rewards, or trained reward models and verifiers) to guide reinforcement learning (RL) and search algorithms. In this paper, we prove that finetuning LLMs with verifier-based (VB) methods built on RL or search is far superior to verifier-free (VF) approaches based on distilling or cloning search traces, given an equal data budget. Concretely, we show that the suboptimality of VF methods scales poorly with the test-time compute budget (measured as output token length or horizon) compared to VB methods when the base pre-trained LLM presents a heterogeneous distribution over correct solution traces (e.g., different lengths or styles) and admits a non-sharp distribution over rewards on traces sampled from it. We formalize this condition using anti-concentration [Erdős, 1945]. This implies the stronger result that VB methods scale better asymptotically, with the performance gap between VB and VF methods widening as the test-time compute budget grows. We corroborate our theoretical results empirically on both didactic and math reasoning problems with 3B/8B-sized pre-trained LLMs, where we find verification is crucial for scaling test-time compute.
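To make the contrast concrete, here is a minimal, purely illustrative sketch (not code from the paper): a base model whose correct-trace distribution is heterogeneous and non-sharp, a verifier-free policy that simply clones that distribution and emits a single trace, and a verifier-based best-of-n procedure that filters samples with a 0/1 correctness signal. The per-style accuracies and mixture weights below are hypothetical and chosen only to show the scaling trend.

```python
# Illustrative sketch only (assumptions, not the paper's method or numbers):
# compares a verifier-free (VF) single sample from a cloned base distribution
# against verifier-based (VB) best-of-n selection using a 0/1 verifier.

import random

random.seed(0)

# Hypothetical base pre-trained model: a mixture of trace "styles", each with
# its own probability of yielding a correct answer. The resulting reward
# distribution over sampled traces is deliberately non-sharp (anti-concentrated).
STYLES = [0.05, 0.2, 0.6]          # per-style probability of a correct trace
STYLE_WEIGHTS = [0.5, 0.3, 0.2]    # how often the base model uses each style


def sample_trace_correct() -> bool:
    """Sample one trace from the base model; return whether it is correct."""
    p = random.choices(STYLES, weights=STYLE_WEIGHTS, k=1)[0]
    return random.random() < p


def vf_success() -> bool:
    """Verifier-free: clone the base distribution and emit a single trace."""
    return sample_trace_correct()


def vb_success(n: int) -> bool:
    """Verifier-based best-of-n: sample n traces, keep any the verifier accepts."""
    return any(sample_trace_correct() for _ in range(n))


def estimate(fn, trials: int = 20000) -> float:
    """Monte Carlo estimate of the success rate of fn."""
    return sum(fn() for _ in range(trials)) / trials


if __name__ == "__main__":
    print(f"VF (1 sample):    {estimate(vf_success):.3f}")
    for n in (4, 16, 64):
        print(f"VB (best-of-{n:>2}): {estimate(lambda: vb_success(n)):.3f}")
```

Under these assumptions the VF success rate stays flat as compute grows, while the VB best-of-n rate climbs toward the best style's accuracy as n increases, mirroring the widening asymptotic gap described in the abstract.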
Submission Number: 12