A Benchmark for Vericoding: Formally Verified Program Synthesis

Sergiu Bursuc; Theodore Ehrenborg; Shaowei Lin; Lacramioara Astefanoaei; Ionel Emilian Chiosa; Jure Kukovec; Alok Singh; Oliver Butterley; Adem Bizid; Quinn Dougherty; Miranda Zhao; Max Tan; Max Tegmark

15 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0

Keywords: Formal Verification, Program Synthesis, Benchmark, LLM, Verus, Dafny, Lean

TL;DR: We present and test a benchmark for *vericoding*, the AI generation of formally verified code from formal specifications, in contrast to *vibe coding*, which generates potentially buggy code from a natural-language description.

Abstract: We present and test the largest benchmark for *vericoding*, the LLM generation of formally verified code from formal specifications, in contrast to *vibe coding*, which generates potentially buggy code from a natural-language description. Our benchmark contains 12,504 formal specifications: 3,029 in Dafny, 2,334 in Verus/Rust, and 7,141 in Lean. Of these, 6,174 are new, previously unseen problems. Using off-the-shelf LLMs, we find vericoding success rates of 27% in Lean, 44% in Verus/Rust, and 82% in Dafny. Adding natural-language descriptions does not significantly improve performance. We also find that advances in LLMs have raised the success rate on pure Dafny verification from 68% to 96% over the past year.
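
To make the notion of vericoding concrete, here is a minimal Lean 4 sketch of what such a task might look like. It is a hypothetical illustration, not a problem from the benchmark: the names `maxSpec`, `myMax`, and `myMax_meets_spec` are invented here. The formal specification is given up front, and the task is to supply an implementation together with a machine-checked proof that it meets the specification.

```lean
-- Hypothetical example, not taken from the benchmark.
-- Formal specification: r is an upper bound of a and b,
-- and r is one of the two inputs.
def maxSpec (a b r : Nat) : Prop :=
  a ≤ r ∧ b ≤ r ∧ (r = a ∨ r = b)

-- Candidate implementation (the part a model would generate).
def myMax (a b : Nat) : Nat :=
  if a ≤ b then b else a

-- Proof that the implementation satisfies the specification.
-- The Lean kernel accepts this only if the proof is complete.
theorem myMax_meets_spec (a b : Nat) : maxSpec a b (myMax a b) := by
  unfold maxSpec myMax
  -- Case split on the branch condition, simplify, and let `omega`
  -- discharge any remaining linear-arithmetic goal.
  by_cases h : a ≤ b <;> simp [h] <;> omega
```

In this sketch, a submission counts as successful only when the checker accepts the theorem without placeholders such as `sorry`, which is the sense in which the generated code is verified rather than merely plausible.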

Supplementary Material: zip

Primary Area: datasets and benchmarks

Submission Number: 6312
