FormalProofBench: Can Models Write Graduate Level Math Proofs That Are Formally Verified?

Published: 02 Mar 2026, Last Modified: 11 Mar 2026, ICLR 2026 Workshop VerifAI-2, CC BY 4.0
Track: long paper (up to 8 pages)
Keywords: formal theorem-proving, math benchmark, evaluation, Lean, formal verification
TL;DR: A challenging agentic benchmark of frontier models' graduate-level formal theorem-proving capability
Abstract: We present FormalProofBench, a private benchmark designed to evaluate whether AI models can produce formally verified mathematical proofs at the graduate level. Each task pairs a natural-language problem with a Lean 4 formal statement, and a model must output a Lean proof accepted by the Lean 4 checker. FormalProofBench targets advanced undergraduate and graduate mathematics, with problems drawn from qualifying exams and standard textbooks across topics including analysis, algebra, probability, and logic. We evaluate a range of frontier models with an agentic harness and find that the best-performing foundation model achieves 33.5% accuracy, with accuracy dropping off sharply for the remaining models. Beyond accuracy numbers, we provide empirical analysis of tool use, failure modes, cost, and latency, giving a thorough evaluation of the formal theorem-proving abilities of frontier models.
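To make the task format concrete, a minimal sketch of the kind of statement/proof pair described above is shown below. This is an illustrative example only, not an actual FormalProofBench problem; the theorem name and natural-language phrasing are assumptions for illustration.

```lean
import Mathlib

-- Natural-language problem (hypothetical): "Show that every group
-- homomorphism maps the identity to the identity."
-- The benchmark supplies the formal statement; the model must supply
-- a proof term (or tactic script) that the Lean 4 checker accepts.
theorem hom_maps_one {G H : Type*} [Group G] [Group H] (f : G →* H) :
    f 1 = 1 :=
  map_one f
```

A submission is scored as correct only if the Lean 4 checker accepts the proof against the given statement, so partial or informal arguments receive no credit.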
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 49