VeriBench: End-to-End Formal Verification Benchmark for AI Code Generation in Lean 4

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: formal verification, Lean 4, language models, benchmark, code synthesis, correctness proofs, software reliability, generative AI, secure programming, trustworthy AI
TL;DR: VeriBench introduces a benchmark for evaluating language models on end-to-end Lean 4 code verification, showing current models struggle while self-optimizing agents demonstrate meaningful progress toward provably correct software generation.
Abstract: Formal code verification offers a path to provably correct software, but evaluating language models' capabilities in this domain requires comprehensive benchmarks. Provably correct code would eliminate entire classes of vulnerabilities, mitigate critical system failures, and potentially transform software engineering through inherently trustworthy implementation practices. We present \textsc{VeriBench}, a benchmark for assessing \textbf{end-to-end} formal code verification in Lean 4: models must generate complete program implementations, together with tests, specifications/theorems, and machine-checked proofs, from Python references. The benchmark comprises 140 tasks across five difficulty levels: 56 HumanEval problems, 41 foundational programming exercises, 10 classical algorithms, 28 security-critical programs adapted from real-world vulnerabilities, and 5 programs from the Python standard library. To enable comprehensive capability assessment, we establish four hierarchical evaluation subtasks with explicit metrics: (1) Lean 4 compilation success, (2) proportion of unit tests passing, (3) correctness-theorem synthesis quality, and (4) proof success rate (pass@1). Evaluation reveals significant limitations in current models: Claude 3.7 Sonnet achieves only 35.0\% compilation success (with 40.6\% of unit tests passing), while LLaMA-70B fails to compile any program despite 50 feedback-guided attempts on a previous version of \textsc{VeriBench}. Models demonstrate similar performance on theorem evaluation, reaching a theorem accuracy of 0.615 as measured by an LLM judge. Proof generation remains particularly challenging: our DSP (Draft Sketch Proof) proving agent achieves only 28.9\% pass@1. In contrast, our trace-based self-debug agent architecture achieves 49.3\% compilation success, demonstrating the potential of iterative, feedback-driven approaches.
To enable scalable evaluation, we introduce a novel methodology for certifying the trustworthiness of LLM judges: we validate our theorem/specification judge by verifying its adherence to fundamental logical properties, such as consistency and monotonicity, thereby enabling reliable, automated assessment of generated theorems and specifications. \textsc{VeriBench} establishes a rigorous foundation for developing AI systems capable of synthesizing provably correct, bug-free code, thereby advancing the trajectory toward more secure and dependable software infrastructure.
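To make the end-to-end task format concrete, here is a minimal hypothetical sketch in the spirit of the benchmark (not an actual VeriBench task): a Lean 4 implementation of list append, a compile-time unit test, and a machine-checked correctness theorem. The names `myAppend` and `length_myAppend` are illustrative, not from the paper.

```lean
/-- Implementation: append two lists (a stand-in for code
    translated from a Python reference). -/
def myAppend : List α → List α → List α
  | [],      ys => ys
  | x :: xs, ys => x :: myAppend xs ys

-- Unit test, checked by the Lean compiler.
example : myAppend [1, 2] [3] = [1, 2, 3] := rfl

/-- Specification/theorem: appending preserves total length. -/
theorem length_myAppend (xs ys : List α) :
    (myAppend xs ys).length = xs.length + ys.length := by
  induction xs with
  | nil => simp [myAppend]
  | cons x xs ih =>
    simp only [myAppend, List.length_cons, ih]
    omega
```

A submission of this shape exercises all four subtasks: it must compile, pass its embedded tests, state a meaningful correctness theorem, and discharge the proof.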
Supplementary Material: pdf
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Submission Number: 22996