Benchmarking Code Verification Strategies with LLMs-as-a-judge

Published: 02 Mar 2026, Last Modified: 11 Mar 2026 | ICLR 2026 Workshop VerifAI-2 | CC BY 4.0
Track: tiny / short paper (up to 4 pages)
Keywords: Code Generation, LLM-as-a-judge
Abstract: Code generation has attracted attention because its outputs are verifiable: candidate solutions can be run against unit tests, mimicking test-driven development, to check correctness. Since obtaining human-written unit tests is expensive, most methods rely on some form of automatic evaluation for rejection sampling or reinforcement learning. In this work, we conduct a comprehensive benchmark of LLM-as-a-judge methods, evaluating their effectiveness and limitations in verifying the correctness of generated code. We show that common approaches to LLM code judges, such as unit test generation or correctness prediction, struggle on harder coding problems. To address this, we propose an approach that combines implicit verification with test generation and consistently outperforms either approach alone. Moreover, for verification based on unit test suites, we compare independent and auto-regressive test generation and show that the latter yields more accurate and more diverse test suites.
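To make the verification setting concrete, the following is a minimal sketch of scoring candidate solutions with a generated test suite and a judge's correctness estimate. The abstract does not say how the two signals are combined, so the weighted average here is purely an illustrative assumption, and the names `run_test`, `pass_rate`, `combined_score`, `select_best`, and the `judge` callable are hypothetical rather than taken from the paper.

```python
from typing import Callable, List


def run_test(solution_src: str, test_src: str) -> bool:
    """Return True if the test's assertions pass against the solution.

    Note: exec on untrusted code is unsafe; a real setup would sandbox this.
    """
    namespace: dict = {}
    try:
        exec(solution_src, namespace)  # define the candidate's functions
        exec(test_src, namespace)      # run the assertions against them
        return True
    except Exception:
        return False


def pass_rate(solution_src: str, tests: List[str]) -> float:
    """Fraction of tests in the suite that the solution passes."""
    if not tests:
        return 0.0
    return sum(run_test(solution_src, t) for t in tests) / len(tests)


def combined_score(
    solution_src: str,
    tests: List[str],
    judge: Callable[[str], float],
    weight: float = 0.5,
) -> float:
    """Assumed combination: weighted average of test pass rate and a
    judge's correctness estimate in [0, 1]."""
    return weight * pass_rate(solution_src, tests) + (1 - weight) * judge(solution_src)


def select_best(
    candidates: List[str],
    tests: List[str],
    judge: Callable[[str], float],
    weight: float = 0.5,
) -> str:
    """Rejection-sampling style selection: keep the highest-scoring candidate."""
    return max(candidates, key=lambda c: combined_score(c, tests, judge, weight))


# Toy data: two candidates and a tiny test suite (stand-ins for LLM outputs).
wrong = "def add(a, b):\n    return a - b\n"
right = "def add(a, b):\n    return a + b\n"
tests = ["assert add(1, 2) == 3", "assert add(0, 0) == 0"]
constant_judge = lambda src: 0.5  # placeholder for an LLM correctness prediction

best = select_best([wrong, right], tests, constant_judge)
```

In this toy run the judge is uninformative, so the test pass rate alone separates the two candidates; with an actual LLM judge, the `weight` parameter would trade off the two signals.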
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 59