Keywords: Large language models, Evaluation, Verification, Self-Improvement
TL;DR: We systematically examine solver–verifier interactions across 37 LLMs and 9 datasets, introducing Verifier Gain as a metric to analyze how model size, family, and post-training shape the effectiveness of LLM-based verification.
Abstract: Large language models (LLMs) can act as both problem solvers and solution verifiers, with verifiers improving solver performance by selecting high-quality answers from a pool of candidates. However, prior studies of solver–verifier interactions have been limited, focusing mainly on self-verification and rarely examining how verifiers judge outputs from models in their own or in another model family. Modern LLMs also undergo extensive post-training, but its effect on verification remains unclear. We present a systematic study across 37 models spanning multiple families, sizes, and base vs. post-trained variants, evaluated on 9 benchmarks covering logical reasoning, structured puzzles, symbolic computation, mathematics, commonsense, factual recall, and domain knowledge. We compare self-verification with verification within the same family and across different families. To support this, we introduce and empirically validate *verifier gain*, a metric that predicts the performance improvements from *test-time verifier-based rejection sampling*. We analyze how metrics like verifier gain and false positive rate scale with model size and post-training, and characterize differences in dataset verifiability. Our findings show that cross-family verification is especially effective; post-training reduces self-improvement but strengthens cross-family improvement; and mathematical and logical tasks exhibit the highest inherent verifiability.
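The abstract does not spell out the selection procedure or the verifier-gain formula, but the idea of test-time verifier-based rejection sampling can be sketched as: sample several candidate answers from a solver, score each with a verifier, and keep the highest-scored one; verifier gain is then (under one plausible reading) the accuracy of the verifier-selected answer minus the solver's baseline accuracy. The `solver` and `verifier` below are hypothetical toy stand-ins for LLM calls, not the paper's models.

```python
import random

def solver(problem, n_samples=8, rng=None):
    # Hypothetical stand-in for an LLM solver: emits n candidate answers,
    # each a noisy guess at the true answer (here, the problem value itself).
    rng = rng or random.Random(0)
    return [problem + rng.randint(-2, 2) for _ in range(n_samples)]

def verifier(problem, answer):
    # Hypothetical stand-in for an LLM verifier: scores a candidate
    # (higher is better; an exact answer scores 0).
    return -abs(answer - problem)

def rejection_sample(problem, n_samples=8, rng=None):
    # Test-time verifier-based rejection sampling:
    # keep the candidate the verifier scores highest.
    candidates = solver(problem, n_samples, rng)
    return max(candidates, key=lambda a: verifier(problem, a))

def empirical_verifier_gain(problems, n_samples=8, seed=0):
    # Illustrative verifier gain: accuracy of verifier-selected answers
    # minus accuracy of a uniformly random candidate (solver baseline).
    rng = random.Random(seed)
    selected_acc = baseline_acc = 0.0
    for p in problems:
        cands = solver(p, n_samples, random.Random(rng.randint(0, 10**6)))
        chosen = max(cands, key=lambda a: verifier(p, a))
        selected_acc += (chosen == p)
        baseline_acc += (rng.choice(cands) == p)
    n = len(problems)
    return selected_acc / n - baseline_acc / n
```

With a useful verifier the gain is positive; with a verifier no better than chance it hovers near zero, which is what makes the metric a convenient lens on solver–verifier pairings.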
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23267