Keywords: inference-time compute scaling, beam search, verifiers, OVM, PRM
TL;DR: Scaling flaws limit the effectiveness of verifier-guided beam search as the sample size grows.
Abstract: Large language models (LLMs) struggle with multi-step mathematical reasoning, for which inference-time scaling—via sequential or parallel scaling—has emerged as a promising strategy. While recent advances have focused on sequential scaling, we revisit the less-explored parallel scaling approach, verifier-guided beam search, to examine its limitations. In this paper, we argue that its strength is, paradoxically, also its limitation: verifiers can boost performance at limited sample sizes by elevating promising reasoning paths, yet the same mechanism can also hide or cut off the valid paths that lead to correct answers. Empirically, we uncover a systematic issue, which we call scaling flaws, in verifier-guided beam search, across models, benchmarks (GSM8K, MATH, AIME25), and verifier types (outcome value models, process reward models). Specifically, the search outperforms repeated sampling at small sample sizes, but its advantage diminishes—and ultimately reverses—as the sample size grows. We attribute this to verifier failures: imperfect verifiers misrank candidates and can erroneously prune all valid paths, and these effects are exacerbated in more challenging scenarios. To mitigate verifier failures, we explore reducing reliance on verifiers and conduct preliminary investigations using two simple methods.
Overall, our findings expose fundamental limitations of verifier-guided beam search and explain why this line of work has struggled to realize its potential.
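The failure mode the abstract describes can be made concrete with a toy sketch. The code below is an illustrative assumption, not the paper's implementation: all function names, the binary step space, and the scoring rule are hypothetical. It contrasts verifier-guided beam search, where an imperfect verifier's rankings decide which partial reasoning paths survive each step, with repeated sampling, which keeps every sampled path. When the verifier misranks candidates, beam search can prune every path that would reach the correct answer, while repeated sampling can still find one.

```python
import heapq

def beam_search(expand, verifier, init, beam_width, depth):
    """Verifier-guided beam search (toy sketch): at each depth, extend every
    surviving path with all candidate steps, then keep only the `beam_width`
    paths the verifier scores highest. An imperfect verifier can thus
    erroneously prune all valid paths."""
    beam = [init]
    for _ in range(depth):
        candidates = [path + [step] for path in beam for step in expand(path)]
        beam = heapq.nlargest(beam_width, candidates, key=verifier)
    return beam

def repeated_sampling(expand, init, depth, n, rng):
    """Repeated sampling (toy sketch): draw n independent paths with no
    verifier guidance; nothing is pruned along the way."""
    paths = []
    for _ in range(n):
        path = init
        for _ in range(depth):
            path = path + [rng.choice(expand(path))]
        paths.append(path)
    return paths
```

As a usage sketch: with steps `[0, 1]`, suppose the correct answer requires the path `[1, 1, 1]`, but a misranking verifier scores paths with more zeros higher (e.g. `verifier = lambda p: -sum(p)`). With `beam_width=1` and `depth=3`, `beam_search` returns only `[[0, 0, 0]]`, having cut off the valid path at the first step, whereas `repeated_sampling` retains every path it draws and can still surface `[1, 1, 1]`.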
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 24173