A Benchmark for Vericoding: Formally Verified Program Synthesis

15 Sept 2025 (modified: 11 Feb 2026), Submitted to ICLR 2026, CC BY 4.0
Keywords: Formal Verification, Program Synthesis, Benchmark, LLM, Verus, Dafny, Lean
TL;DR: We present and test a benchmark for *vericoding*, the AI generation of formally verified code from formal specifications, in contrast to *vibe coding*, which generates potentially buggy code from a natural-language description.
Abstract: We present and test the largest benchmark for *vericoding*, the LLM generation of formally verified code from formal specifications, in contrast to *vibe coding*, which generates potentially buggy code from a natural-language description. Our benchmark contains 12,504 formal specifications: 3,029 in Dafny, 2,334 in Verus/Rust, and 7,141 in Lean. Of these, 6,174 are new, previously unseen problems. Using off-the-shelf LLMs, we find vericoding success rates of 27\% in Lean, 44\% in Verus/Rust, and 82\% in Dafny. Adding natural-language descriptions does not significantly improve performance. We also find that advances in LLMs have raised success rates on pure Dafny verification from 68\% to 96\% over the past year.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 6312