TL;DR: We show that using a verification signal scales test-time compute asymptotically better than verification-free methods.
Abstract: Despite substantial advances in scaling test-time compute, an ongoing debate in the community is how it should be scaled to enable continued and efficient improvements. There are broadly two approaches: (i) distilling successful search or thinking traces; and (ii) using verification (e.g., 0/1 outcome rewards or verifiers) to guide reinforcement learning (RL) and search algorithms. In this paper, we prove that finetuning LLMs with verifier-based (VB) methods built on RL or search is far superior to verifier-free (VF) approaches based on distilling or cloning search traces, given a fixed compute/data budget. Further, we show that as we scale test-time compute (measured as the output token length) and training data, the suboptimality of VF methods scales poorly compared to that of VB methods when the base pre-trained LLM presents a heterogeneous distribution over correct solution traces (e.g., traces of different lengths and styles) and admits a non-sharp distribution over rewards on traces sampled from it. We formalize this condition using anti-concentration [Erdős 1945], which implies the stronger result that VB methods scale better asymptotically, with the performance gap between VB and VF widening as the test-time budget grows.
We corroborate our theory empirically on didactic and math reasoning problems with 3/8/32B-sized pre-trained LLMs, where we find verification is crucial for scaling test-time compute.
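As a rough illustration of the anti-concentration condition referenced above (the notation below is schematic and ours, not the paper's formal statement), the requirement is that trace-level rewards under the base model place a constant fraction of probability mass some margin above their mean:

```latex
% Schematic only; symbols here are illustrative assumptions, not the paper's notation:
%   \pi_b   : base pre-trained LLM
%   \tau    : a solution trace sampled from \pi_b
%   r(\tau) : trace-level reward (e.g., a 0/1 outcome reward)
\Pr_{\tau \sim \pi_b}\!\left[\, r(\tau) \,\ge\, \mathbb{E}_{\tau' \sim \pi_b}\!\left[ r(\tau') \right] + \delta \,\right] \;\ge\; c_0
\qquad \text{for some margin } \delta > 0 \text{ and constant } c_0 > 0.
```

Intuitively, this is the “non-sharp distribution over rewards” requirement above: the base model already assigns non-trivial mass to noticeably better-than-average traces, which verification can select and reinforce.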
Lay Summary: Modern language models can “think longer” at test time by generating many lines of reasoning before deciding on a final answer. But does simply letting a model think for twice as many tokens actually make it twice as smart?
We show that the answer depends on how the model was trained. Training approaches that just copy expert solutions (“supervised fine-tuning (SFT)” or “imitation learning”) struggle to turn extra tokens into better answers: their error rate shrinks slowly, even for very large models, because they suffer from expert heterogeneity, i.e., they try to mimic every style and length of expert reasoning they see.
In contrast, training that verifies each attempt (using a simple checker that says “right” or “wrong” and then rewards the model accordingly) scales far better. Our theory proves that, as you increase both (i) the model’s test-time budget of tokens H and (ii) the amount of training data n, the performance gap between verification-based and imitation-based methods widens roughly like √H.
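To make the contrast between the two training signals concrete, here is a minimal toy sketch (not the paper’s algorithm or experimental setup; the reasoning “styles”, their success rates q, and all sample counts below are invented for illustration): verifier-free training clones a heterogeneous expert corpus, while verifier-based training keeps only attempts a 0/1 checker accepts and refits to them.

```python
# Toy sketch only -- NOT the paper's method. It contrasts the two training
# signals from the summary above on a categorical "policy over reasoning
# styles"; all numbers and names below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# q[s]: how often *this* model succeeds when it commits to reasoning style s
# (styles stand in for the heterogeneous trace lengths/formats in the paper).
q = np.array([0.85, 0.10, 0.60, 0.05, 0.30, 0.02])
n_styles = len(q)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

base_policy = softmax(rng.normal(size=n_styles))   # pre-trained preferences

# Verifier-free (SFT / imitation): the expert corpus contains every style,
# so cloning it spreads mass roughly uniformly ("expert heterogeneity"),
# including onto styles this model executes poorly.
vf_policy = np.full(n_styles, 1.0 / n_styles)

# Verifier-based: sample attempts from the base model, keep only the ones a
# 0/1 verifier accepts, and refit to the kept attempts (a crude stand-in for
# RL / rejection-sampling finetuning).
attempts = rng.choice(n_styles, size=2000, p=base_policy)
accepted = attempts[rng.random(attempts.size) < q[attempts]]
vb_policy = np.bincount(accepted, minlength=n_styles) / max(accepted.size, 1)

for name, p in [("base", base_policy), ("VF (clone experts)", vf_policy),
                ("VB (verify + refit)", vb_policy)]:
    print(f"{name:22s} expected single-attempt success = {float(p @ q):.3f}")
```

In this toy, the verifier-filtered policy concentrates on the styles the model can actually execute, whereas cloning the full heterogeneous corpus dilutes it across all of them; the paper’s theory quantifies how this gap grows with the test-time budget H and dataset size n.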
Experiments on math benchmarks with 3/8/32B-parameter models confirm the prediction: verification-trained models keep improving when given longer to think, while imitation-trained (SFT) models plateau. The takeaway is simple: if we want bigger models to keep getting smarter by “thinking longer,” we must train them with explicit feedback, not just imitation.
Primary Area: Deep Learning->Large Language Models
Keywords: LLMs, test-time compute, verification, RL, finetuning, reasoning, RL theory
Submission Number: 12195