TL;DR: We study the scaling trends governing sampling-based search, a test-time compute scaling paradigm.
Abstract: Sampling-based search, a simple paradigm for utilizing test-time compute, involves generating multiple candidate responses and selecting the best one, typically by verifying each response for correctness. In this paper, we study the scaling trends governing sampling-based search. Among our findings is that simply scaling up a minimalist implementation that uses only random sampling and direct self-verification yields sustained performance improvements that, for example, elevate the Gemini v1.5 Pro model's reasoning capabilities past those of o1-Preview on popular benchmarks. We partially attribute the scalability of sampling-based search to a phenomenon of implicit scaling, where sampling a larger pool of responses in turn improves verification accuracy. We further identify two useful principles for improving self-verification capabilities with test-time compute: (1) comparing across responses provides helpful signals about the locations of errors and hallucinations, and (2) different model output styles suit different contexts: chains of thought are useful for reasoning but harder to verify. We also find that, although accurate verification can be elicited, frontier models demonstrate remarkably weak out-of-the-box verification capabilities, and we introduce a benchmark to measure progress on these deficiencies.
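To make the paradigm concrete, here is a minimal sketch of sampling-based search with random sampling and direct self-verification, in the spirit of the minimalist implementation the abstract describes. The `generate(prompt, temperature)` API is a hypothetical placeholder for a language model call; the paper's actual prompts, sample counts, and selection rule may differ.

```python
from typing import Callable

def sampling_based_search(
    question: str,
    generate: Callable[[str, float], str],  # hypothetical model API: (prompt, temperature) -> text
    n_candidates: int = 16,
    n_verifications: int = 8,
) -> str:
    """Minimal sketch: random sampling plus direct self-verification.

    Draws n_candidates responses at nonzero temperature, then asks the
    same model to verify each one several times, and returns the response
    with the highest fraction of 'correct' verdicts.
    """
    candidates = [generate(question, 1.0) for _ in range(n_candidates)]

    def verification_score(response: str) -> float:
        verify_prompt = (
            f"Question:\n{question}\n\nProposed answer:\n{response}\n\n"
            "Carefully check each step. Is this answer correct? Reply YES or NO."
        )
        verdicts = [generate(verify_prompt, 1.0) for _ in range(n_verifications)]
        return sum("YES" in v.upper() for v in verdicts) / n_verifications

    # Select the candidate the model itself judges most likely correct.
    return max(candidates, key=verification_score)
```

Scaling either knob (`n_candidates` or `n_verifications`) spends more test-time compute; the paper's finding is that doing so yields sustained accuracy gains.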
Lay Summary: In sampling-based search (also known as parallel test-time compute scaling), language models generate many candidate responses in parallel. The hope is that choosing from a large pool of responses is better than sampling only one. However, the utility of this approach is bottlenecked by verification: just because the pool contains a good response (e.g., we know Pass@k often scales nicely) does not mean that you'll be able to pick it out.
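For reference, Pass@k here is the standard metric: the probability that at least one of k sampled responses is correct. A common unbiased estimator (introduced by Chen et al., 2021, and assumed here; the paper's own evaluation code may differ) computes it from n samples of which c are correct:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: probability that at least one of k
    responses drawn without replacement from n samples (c of them
    correct) is correct, i.e. 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Pass@k measures whether a good response exists in the pool; the selection problem studied in this paper is whether self-verification can actually find it.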
We study the scaling trends for sampling-based search when models need to pick out good responses through self-verification. Contrary to the common belief that model self-verification is insufficient and demands interventions like custom reward models, PRMs, reinforcement learning, etc., we show that simply scaling self-verification in a principled manner is remarkably effective, able to boost non-reasoning models to o1-level performance without finetuning, RL, or distillation. On the AIME exam, for example, self-verification is able to pick out correct answers even when <1% of generated responses are correct. We identify a counterintuitive trend behind these observations: self-verification becomes easier as the pool of candidate responses grows, contrary to the intuition that choosing from a larger pool is a harder selection problem.
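One way to apply the abstract's first principle (comparing across responses localizes errors and hallucinations) is to break ties among candidates that score equally under direct self-verification. The sketch below is a hypothetical illustration, not the paper's exact procedure; `generate` is again a placeholder model API.

```python
from typing import Callable, List

def break_ties_by_comparison(
    question: str,
    tied: List[str],
    generate: Callable[[str], str],  # hypothetical model API: prompt -> text
) -> str:
    """Hypothetical sketch of cross-response comparison: when several
    candidates receive equal self-verification scores, show the model
    pairs of disagreeing responses side by side, since their differences
    point to where errors and hallucinations may lie."""
    best = tied[0]
    for challenger in tied[1:]:
        prompt = (
            f"Question:\n{question}\n\n"
            f"Response A:\n{best}\n\nResponse B:\n{challenger}\n\n"
            "These responses disagree, so at least one contains an error. "
            "Compare them step by step and state which is correct: A or B."
        )
        if generate(prompt).strip().upper().startswith("B"):
            best = challenger
    return best
```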
Primary Area: Deep Learning->Large Language Models
Keywords: reasoning, search, verification
Submission Number: 14201