Abstract: We present and test the largest benchmark for vericoding, the LLM generation of formally verified code from formal specifications, in contrast to vibe coding, which generates potentially buggy code from a natural-language description. Our benchmark contains 12,504 formal specifications: 3,029 in Dafny, 2,334 in Verus/Rust, and 7,141 in Lean. Of these, 6,174 are new, unseen problems. Using off-the-shelf LLMs, we find vericoding success rates of 27% in Lean, 44% in Verus/Rust, and 82% in Dafny. Adding natural-language descriptions does not significantly improve performance. We also find that progress in LLMs has improved performance on pure Dafny verification from 68% to 97% over the past year.
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Serguei_Barannikov1
Submission Number: 7696