Keywords: generative verification, large language model, test-time scaling
TL;DR: We study the factors that influence LLM-based generative verification and apply our findings to verifier-based test-time scaling.
Abstract: Recent advances in large language models (LLMs) have produced increasingly capable generators that can solve complex problems across diverse domains. Evaluating these generators' outputs has shifted from human assessment to automated verification using LLMs as verifiers. In this paradigm, verifier models assess the correctness of solutions produced by generator models, a framework now central to applications such as test-time scaling (TTS). In this work, we study generative verifiers, which perform verification as a next-token prediction task by generating chain-of-thought (CoT) reasoning followed by a binary verdict. We systematically analyze verification dynamics across three dimensions: problem difficulty, generator capability, and the verifier's generation capability, conducting empirical studies on 2.3k mathematical problems using 14 open-source models (2B to 72B parameters) and GPT-4o. Our experiments reveal three key findings about verification effectiveness: (1) Problem difficulty affects the verifier's ability to recognize correct responses; (2) Weak generators produce errors that are easier to detect than those of strong generators; (3) Verification ability is generally correlated with the verifier's generation ability, but the strength of this correlation varies with problem difficulty. These findings enable cost-effective strategies in TTS applications. Specifically, we identify two patterns in which weak models can substitute for strong ones. First, given the same verifier, weak generators can nearly match stronger generators in post-verification TTS performance (e.g., a 9B model matches a 27B model). Second, weak verifiers can approximate strong verifiers in regimes where both achieve similar verification performance.
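The generative-verification setup described in the abstract (CoT reasoning followed by a binary verdict, used to select among sampled candidates in TTS) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the callables `generate_solution` and `generate_verdict`, the prompt wording, and the "Verdict:" output convention are all assumptions for illustration.

```python
def generative_verify(generate_verdict, problem: str, solution: str) -> bool:
    """Ask a verifier LLM to reason step by step, then emit a binary verdict.

    `generate_verdict` is any text-in/text-out LLM call (hypothetical here).
    """
    prompt = (
        "You are a verifier. Check whether the solution below correctly "
        "solves the problem. Reason step by step, then answer on the last "
        "line with 'Verdict: correct' or 'Verdict: incorrect'.\n\n"
        f"Problem:\n{problem}\n\nSolution:\n{solution}\n"
    )
    # The verifier produces CoT reasoning followed by a final verdict line.
    cot_and_verdict = generate_verdict(prompt)
    last_line = cot_and_verdict.strip().splitlines()[-1].lower()
    return "verdict: correct" in last_line


def verifier_based_tts(generate_solution, generate_verdict, problem: str, n: int = 8) -> str:
    """Best-of-n selection: sample n candidates, return one the verifier accepts.

    Falls back to the first candidate if the verifier rejects all of them.
    """
    candidates = [generate_solution(problem) for _ in range(n)]
    accepted = [c for c in candidates
                if generative_verify(generate_verdict, problem, c)]
    return accepted[0] if accepted else candidates[0]
```

In this sketch the generator and verifier are independent callables, which mirrors the paper's setting of pairing weak or strong generators with weak or strong verifiers.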
Submission Number: 155