Keywords: formal verification, formal mathematics, AI math, reasoning, code verification, theorem proving, formal methods, Lean 4, program synthesis, large language models, agentic reasoning, agents, self-improving agents, software security
TL;DR: VeriBench is a Lean 4 benchmark that tests whether LLM-powered agents can translate real Python code, including security-critical programs, into fully proved, machine-checkable implementations, revealing large gaps in today’s models.
Abstract: Formal verification of software is a promising and potentially transformative application of generative AI.
Provably correct code would eliminate entire classes of vulnerabilities, mitigate critical system failures, and could transform software engineering practice by making implementations trustworthy by construction.
To advance this domain, we present **VeriBench**, a carefully curated benchmark for evaluating language-model capabilities in end-to-end code verification, requiring the generation of complete Lean 4 programs—implementations, unit tests, correctness theorems, and formal proofs—derived from reference Python functions or their docstrings.
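To make the task format concrete, here is a minimal sketch of the kind of Lean 4 artifact a solution must produce, assuming a toy reference function `def double(n): return n * 2`; the names and the property proved are illustrative and are not an actual VeriBench task.

```lean
-- Hypothetical task output (illustrative only, not a VeriBench item).
-- Implementation translated from the Python reference `def double(n): return n * 2`.
def double (n : Nat) : Nat := n * 2

-- Unit test mirroring a Python assertion such as `assert double(3) == 6`.
example : double 3 = 6 := rfl

-- Correctness theorem with a machine-checked proof: the result is always even.
theorem double_even (n : Nat) : ∃ k, double n = 2 * k :=
  ⟨n, Nat.mul_comm n 2⟩
```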
Our evaluation on the 113-task suite—51 HumanEval problems, 42 easy exercises, 10 classical algorithms, and 11 security challenges—shows that current frontier models compile only a small fraction of programs.
Claude 3.7 Sonnet achieves a compilation rate of only 12.5%, while LLaMA-70B fails to compile any program in the Lean 4 HumanEval subset, even with 50 feedback-guided attempts.
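The feedback-guided setting can be read as a simple generate/compile/repair loop. The Python sketch below only illustrates that loop and is not the paper's actual harness: `query_model` is a placeholder for whatever LLM call is used, and invoking the bare `lean` binary assumes the candidate file depends only on core Lean.

```python
import subprocess
import tempfile
from pathlib import Path

MAX_ATTEMPTS = 50  # mirrors the 50 feedback-guided attempts reported above


def compile_lean(source: str) -> tuple[bool, str]:
    """Type-check a candidate Lean 4 program and return (success, compiler output)."""
    with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
        f.write(source)
        path = Path(f.name)
    result = subprocess.run(["lean", str(path)], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr


def feedback_guided_attempts(task_prompt: str, query_model) -> str | None:
    """Repeatedly ask the model for a Lean program, feeding compiler errors back in.

    `query_model(prompt) -> str` is a hypothetical stand-in for the LLM call.
    """
    prompt = task_prompt
    for _ in range(MAX_ATTEMPTS):
        candidate = query_model(prompt)
        ok, log = compile_lean(candidate)
        if ok:
            return candidate  # compiled: implementation, tests, theorems, proofs
        # Append the Lean error log so the next attempt can repair the failure.
        prompt = f"{task_prompt}\n\nPrevious attempt:\n{candidate}\n\nLean errors:\n{log}"
    return None  # no compiling program within the attempt budget
```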
Notably, among the evaluated approaches, a self-optimizing Trace agent architecture achieves compilation rates approaching 60%.
**VeriBench** establishes a rigorous foundation for developing AI systems capable of synthesizing provably correct, bug-free code, thereby advancing the trajectory toward more secure and dependable software infrastructure.
Submission Number: 128