VeriBench-FTP: A Formal Theorem Proving Benchmark in Lean 4 for Code Verification

Published: 17 Oct 2025, Last Modified: 21 Nov 2025 · MATH-AI 2025 Poster · CC BY 4.0
Keywords: Formal Verification, Code Verification, Theorem Proving, Lean 4, Benchmark, Large Language Models (LLMs), Automated Theorem Proving (ATP), DeepSeek-Prover, Software Verification, Pass@k
TL;DR: We present VeriBench-FTP, a Lean 4 benchmark showing AI provers excel at math but fail at code verification, revealing a critical gap in their practical reasoning skills.
Abstract: Theorem proving in Lean 4 offers a promising avenue for advancing the reasoning capabilities of large language models. Meaningfully evaluating current provers has become difficult, as many achieve near-perfect accuracy on existing benchmarks such as MiniF2F, which highlights the need for novel evaluation tasks. We introduce VeriBench-FTP, a benchmark designed to assess formal theorem proving in Lean 4 through code verification. The task requires models to generate proofs for theorems that capture key aspects of program verification. Our benchmark consists of 857 theorems derived from 140 problems across five difficulty levels: 56 HumanEval problems, 41 foundational programming exercises, 10 classical algorithms, 28 security-critical programs adapted from real-world vulnerabilities, and 5 problems from the Python standard library. On our benchmark, Goedel-Prover V2-8B achieves only 39.56% Pass@32, underscoring the difficulty of the tasks. VeriBench-FTP provides a rigorous alternative to existing datasets, enabling more realistic evaluation of formal provers in Lean 4. By translating theorem-proving ability into a measurable route toward trustworthy code, it advances progress toward secure, dependable software infrastructure.
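To give a flavor of the task format, here is a hypothetical illustration (not drawn from the benchmark itself) of a Lean 4 theorem stating a correctness property of a small program, the kind of program-verification obligation the abstract describes:

```lean
-- Hypothetical example: a simple function and a theorem asserting
-- one of its correctness properties. `myMax` and `myMax_ge_left`
-- are illustrative names, not identifiers from VeriBench-FTP.
def myMax (a b : Nat) : Nat :=
  if a ≥ b then a else b

-- The result of `myMax` is at least its first argument.
theorem myMax_ge_left (a b : Nat) : myMax a b ≥ a := by
  unfold myMax
  split <;> omega
```

A prover evaluated on such a task would receive the definition and theorem statement and must produce a proof term or tactic script that Lean accepts.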
Supplementary Material: pdf
Submission Number: 163