Keywords: Large language models, Evaluation, Verification, Self-Improvement
TL;DR: We systematically examine solver–verifier interactions across 37 LLMs and 9 datasets, introducing Verifier Gain as a metric to analyze how model size, family, and post-training shape the effectiveness of LLM-based verification.
Abstract: Large language models (LLMs) can act as both problem solvers and solution verifiers, with verifiers selecting high-quality answers from a pool of solver-generated candidates. This raises the question of under what conditions verification pays off in solver–verifier systems. Prior work has examined the factors influencing verification performance only narrowly, focusing primarily on self-verification and largely ignoring both the relationship between solver and verifier model families and the role of post-training. To rectify this, we present a systematic study of 37 models spanning multiple families, sizes, and base vs. post-trained variants, evaluated on 9 benchmarks covering logical reasoning, structured puzzles, symbolic computation, mathematics, commonsense, factual recall, and domain knowledge. To support our analysis, we introduce and empirically validate verifier gain, a metric that predicts the performance improvement from test-time verifier-based rejection sampling. Our experiments find that 1) verification across model families is more effective than either self-verification or verification within the same family, and more generally that the benefits of verification decrease as the solver and verifier become more similar, 2) post-training weakens self-improvement abilities but strengthens cross-family improvement, and 3) some tasks, particularly mathematical and logical ones, are inherently more amenable to improvement through verification.
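The verifier-based rejection sampling the abstract refers to can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: `solve` and `verify` are hypothetical stand-ins for calls to the solver and verifier LLMs, and the candidate count is an arbitrary assumption.

```python
def rejection_sample(question, solve, verify, n_candidates=8):
    """Test-time rejection sampling: draw several candidate answers from the
    solver, score each with the verifier, and return the highest-scoring one.

    `solve(question) -> str` and `verify(question, answer) -> float` are
    hypothetical interfaces to the solver and verifier models.
    """
    candidates = [solve(question) for _ in range(n_candidates)]
    scores = [verify(question, c) for c in candidates]
    # Keep the candidate the verifier rates highest.
    return max(zip(scores, candidates), key=lambda sc: sc[0])[1]
```

Under this scheme, a verifier only helps to the extent that its scores separate correct from incorrect candidates; the paper's verifier gain metric is intended to predict how large that improvement over the solver's unassisted accuracy will be.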
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 59