Rethinking LLMs as Verifiers: When Verification is Harder Than Solving

Published: 01 Apr 2026, Last Modified: 25 Apr 2026
Venue: ICLR 2026 Workshop LLM Reasoning
License: CC BY 4.0
Track: long paper (up to 10 pages)
Keywords: LLM-as-a-Judge, LLM Reasoning
Abstract: Large Language Models (LLMs) are increasingly used as evaluators of model outputs, a paradigm commonly referred to as LLM-as-a-judge. A natural assumption underlying this paradigm is that verification should be easier than solving: given a candidate solution, a model should reliably determine its correctness. In this work, we empirically test this assumption. Across multiple benchmarks and model families, we find that LLMs are often less accurate at verification than at solving the same tasks. This gap persists across domains, including multiple-choice reasoning, program synthesis, and multi-step problem solving. To understand this failure, we study verification along three axes. First, we identify \emph{epistemic bias}, where models are more reliable at accepting correct solutions than rejecting incorrect ones. Second, we show \emph{perturbation insensitivity}, where models fail to detect localized errors in near-correct solutions. Third, we demonstrate that verification accuracy improves with \emph{rubric conditioning}, highlighting the role of explicit evaluation criteria. Our results show that LLM-based evaluation is not a straightforward proxy for correctness. Instead, it exhibits systematic failure modes that must be accounted for when using LLMs as evaluators in post-training and benchmarking pipelines.
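To make the central comparison concrete, the sketch below shows one way the solve-vs-verify gap described in the abstract could be operationalized: the same model is scored once on answering questions and once on judging labeled candidate answers. This is an illustrative sketch, not the paper's code; `query_model` is a hypothetical stand-in for any LLM completion call, and the prompts are assumptions.

```python
# Illustrative sketch (not the paper's code): measuring the gap between
# solving accuracy and verification accuracy on the same task set.
# `query_model` is a hypothetical stand-in for any LLM completion call.

from typing import Callable, List, Tuple

def solve_accuracy(query_model: Callable[[str], str],
                   tasks: List[Tuple[str, str]]) -> float:
    """Fraction of (question, gold answer) pairs the model solves correctly."""
    correct = 0
    for question, gold in tasks:
        answer = query_model(f"Solve and answer concisely:\n{question}")
        correct += answer.strip() == gold
    return correct / len(tasks)

def verify_accuracy(query_model: Callable[[str], str],
                    labeled: List[Tuple[str, str, bool]]) -> float:
    """Fraction of (question, candidate, is_correct) triples the model
    judges correctly when asked to verify a given candidate solution."""
    correct = 0
    for question, candidate, is_correct in labeled:
        verdict = query_model(
            f"Question:\n{question}\nCandidate answer:\n{candidate}\n"
            "Is the candidate answer correct? Reply YES or NO."
        )
        predicted = verdict.strip().upper().startswith("YES")
        correct += predicted == is_correct
    return correct / len(labeled)

if __name__ == "__main__":
    # Toy stub standing in for a real model, just to show the harness runs.
    stub = lambda prompt: "YES" if "Candidate" in prompt else "4"
    tasks = [("What is 2 + 2?", "4")]
    labeled = [("What is 2 + 2?", "4", True), ("What is 2 + 2?", "5", False)]
    print("solve accuracy: ", solve_accuracy(stub, tasks))
    print("verify accuracy:", verify_accuracy(stub, labeled))
```

Under this framing, the paper's rubric-conditioning intervention would correspond to prepending explicit evaluation criteria to the verification prompt, and perturbation insensitivity could be probed by supplying near-correct candidates with a single localized error.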
Presenter: ~Varul_Srivastava1
Format: Maybe: the presenting author will attend in person, contingent on other factors that still need to be determined (e.g., visa, funding).
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 182