VeriBench: An End-to-End Formal Verification Benchmark for AI Coding Agents in Lean 4

Published: 16 Jun 2026, Last Modified: 16 Jun 2026
Venue: ICML 2026 Workshop DL4C
License: CC BY-NC 4.0
Keywords: formal verification, Lean 4, language models, benchmark, code synthesis, correctness proofs, software reliability, generative AI, secure programming, trustworthy AI
TL;DR: VeriBench introduces a benchmark for evaluating language models on end-to-end Lean 4 code verification, showing current models struggle while self-optimizing agents demonstrate meaningful progress toward provably correct software generation.
Abstract: Test-based coding benchmarks have repeatedly required strengthening passes: HumanEval gave way to HumanEval+, and SWE-Bench to SWE-Bench Verified to SWE-Bench+, with each release exposing test-invisible bugs in code that passed the prior suite. This reflects a structural limitation of finite testing: passing tests routinely leave behaviorally important bugs unobserved. Existing formal-verification benchmarks (e.g., VERINA, FVAPPS, CLEVER, DafnyBench) provide stronger machine-checkable signals, but many focus on proof completion, scaffolded verified generation, or isolated formal subtasks rather than the full path from developer-written source code to a verified formal artifact. We argue that trustworthy code-verification benchmarks must be end-to-end and agentic, scoring full Python-to-Lean autoformalization under verifier feedback, and must aggregate verification stages conjunctively, so that weakness at any stage penalizes the composite, while preserving smooth evaluation signals that retain discriminative power across models still far from the frontier. We introduce VeriBench, a 452-task end-to-end Python-to-Lean 4 autoformalization benchmark spanning HumanEval-style programs, classical algorithms, Python standard-library functions, security examples, and 282 high-assurance-inspired tasks across 14 domains such as cryptography, aerospace, medical devices, and compilers. We score agents with the Smooth Conjunctive Score for Code verification (SCSC), a log-domain geometric mean over five per-task factors $f_1, \dots, f_5$: $\mathrm{SCSC} = \exp\!\bigl(\tfrac{1}{5}\sum_{i=1}^{5} \log f_i\bigr)$. The factors capture (i) whether the agent's Lean file typechecks, (ii) whether the agent's theorems verify without \textsf{sorry}, (iii) whether the agent's theorems semantically cover the gold theorems, and (iv, v) two gold-side benchmark-validity gates ($D_1, D_2$) ensuring that the gold reference itself compiles and proves cleanly before it is used to score agents.
Under an agentic verifier-feedback loop, Codex, Claude Code, and Leanstral-v2 reach SCSC scores of only 0.42, 0.36, and 0.23, respectively. Iterative self-correction adds 14.3\% over single-shot baselines, yet theorem--gold coverage stalls uniformly at $\leq 0.11$ across all three agents. This specification gap is validated by an LLM judge calibrated against five independent human raters (Pearson $r=0.70$, $p<10^{-11}$). VeriBench reframes code-verification evaluation from isolated proof search to end-to-end conjunctive grounding, surfacing specification synthesis as a bottleneck at least as severe as proof search and offering a measurable target for the next generation of verifiable AI coding agents.
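
To make the conjunctive behavior of the SCSC formula concrete, the following is a minimal Python sketch of a log-domain geometric mean over five factors. It is an illustration under stated assumptions, not the benchmark's implementation: the function name `scsc`, the factor ordering, and in particular the `eps` floor (used here so that a zero factor does not produce `log(0)`) are hypothetical choices that the abstract does not specify.

```python
import math

def scsc(factors, eps=1e-6):
    """Log-domain geometric mean of five per-task factors in [0, 1].

    Assumption: each factor is clipped to [eps, 1] before taking the log.
    The abstract does not say how zero-valued factors are handled; the
    eps floor is a hypothetical choice made only for this sketch.
    """
    if len(factors) != 5:
        raise ValueError("SCSC is defined over exactly five factors")
    clipped = [min(1.0, max(eps, f)) for f in factors]
    # exp of the mean log is the geometric mean of the clipped factors
    return math.exp(sum(math.log(f) for f in clipped) / len(clipped))

# Hypothetical factor values: typecheck, sorry-free proofs, gold coverage, D1, D2
example = [1.0, 0.8, 0.1, 1.0, 1.0]
geometric = scsc(example)
arithmetic = sum(example) / len(example)
# One weak factor (coverage = 0.1) pulls the geometric mean below the arithmetic mean
print(f"SCSC = {geometric:.3f}, arithmetic mean = {arithmetic:.3f}")
```

With these made-up inputs, the low coverage factor drags the composite well below what a plain average would report, which is the conjunctive penalty the abstract describes, while the score still varies smoothly as any single factor improves.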
Submission Number: 88