VerifyThisBench: Generating Code, Specifications, and Proofs All at Once

ICLR 2026 Conference Submission 13450 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Large Language Models (LLMs), Formal Verification, Program Synthesis, Machine Learning for Formal Methods
Abstract: Large language models (LLMs) have demonstrated remarkable progress in code generation, but many existing benchmarks are approaching saturation and offer little guarantee of the trustworthiness of the generated programs. To improve visibility into model reasoning about formal correctness, we introduce $VerifyThisBench$, a new benchmark that evaluates end-to-end program verification from natural language descriptions: models must (i) extract formal specifications, (ii) implement the program in a verification-aware language, and (iii) construct machine-checkable proofs. Our evaluation reveals that even state-of-the-art (SOTA) models, such as o3-mini, achieve a pass rate of less than 4\%, with many outputs failing to compile. To isolate sources of difficulty, we further propose $VerifyThisBenchXS$, a relaxed variant in which partial implementations or proofs are provided. Across nine models and seven verification tools on both benchmarks, we observe consistent gains with feedback-driven refinement, but overall pass rates remain low, underscoring substantial gaps in formal reasoning. We release the benchmark and the unified evaluation environment to catalyze progress on the verification capabilities of future models.
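To make the three-part task concrete, the sketch below shows the kind of artifact a model would be asked to produce: an implementation, a formal specification stated as a theorem, and a proof checked mechanically. This is a hypothetical illustration written in Lean 4 for concreteness; the benchmark's actual problems, languages, and verification tools are not specified here, and `append'` / `append'_length` are invented names for the example only.

```lean
-- Hypothetical illustration (not taken from the paper) of the three artifacts
-- the task asks for, written in Lean 4 for concreteness.

-- (ii) implementation in a verification-aware language
def append' : List Nat → List Nat → List Nat
  | [],      ys => ys
  | x :: xs, ys => x :: append' xs ys

-- (i) formal specification, stated as a theorem, and
-- (iii) a machine-checkable proof accepted by the Lean kernel
theorem append'_length (xs ys : List Nat) :
    (append' xs ys).length = xs.length + ys.length := by
  induction xs with
  | nil => simp only [append', List.length_nil, Nat.zero_add]
  | cons x xs ih =>
    simp only [append', List.length_cons, ih]
    omega
```

An end-to-end attempt fails if any of the three pieces is missing or rejected by the checker, which is why compilation errors alone account for many of the failures reported in the abstract.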
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 13450