Keywords: Formal Verification, Program Synthesis, Language Models, Formal Methods, Interactive Theorem Provers, Neural Theorem Proving, Code Generation, Verified Code Generation, Proof-Guided Generation, Formal Specification Mining, Lean Theorem Prover, Benchmark Design, End-to-End Verification, Natural Language to Code, Automated Software Verification
TL;DR: We introduce CLEVER, a hand-curated benchmark for verified code generation in Lean. It requires full formal specs and proofs. No few-shot method solves all stages, making it a strong testbed for synthesis and formal reasoning.
Abstract: We introduce ${\rm C{\small LEVER}}$, a high-quality, manually curated benchmark of 161 problems for end-to-end verified code generation in Lean. Each problem consists of (1) generating a specification that matches a held-out ground-truth specification, and (2) generating a Lean implementation that provably satisfies this specification. Unlike prior benchmarks, ${\rm C{\small LEVER}}$ avoids test-case supervision, LLM-generated annotations, and specifications that leak implementation logic or allow vacuous solutions. All outputs are verified post-hoc using Lean's type checker to ensure machine-checkable correctness. We use ${\rm C{\small LEVER}}$ to evaluate several few-shot and agentic approaches based on state-of-the-art language models. All of these methods struggle to achieve full verification, establishing ${\rm C{\small LEVER}}$ as a challenging frontier benchmark for program synthesis and formal reasoning. Our benchmark can be found on [Anonymized Repository](https://anonymous.4open.science/r/clever-330A). All our evaluation code is also available [online](https://anonymous.4open.science/r/clever-prover-DD52).
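To make the two-stage task concrete, the sketch below shows what a CLEVER-style problem might look like in Lean 4. The names (`ground_truth_spec`, `proposed_spec`, `impl`) and the overall layout are illustrative assumptions rather than the benchmark's actual format; the sketch only conveys that a held-out specification exists, that the model must propose a matching specification and an implementation, and that Lean's type checker certifies the resulting proof.

```lean
-- Hypothetical sketch of a CLEVER-style problem (names and layout are
-- illustrative assumptions, not the benchmark's actual format).

-- Held-out ground-truth specification: `r` is the absolute value of `x`.
def ground_truth_spec (x r : Int) : Prop :=
  0 ≤ r ∧ (r = x ∨ r = -x)

-- Stage 1: the model proposes a specification, which must be shown to
-- match the held-out ground truth (e.g., via a logical-equivalence proof).
def proposed_spec (x r : Int) : Prop :=
  r = if x < 0 then -x else x

-- Stage 2: the model generates an implementation ...
def impl (x : Int) : Int :=
  if x < 0 then -x else x

-- ... together with a proof, checked post-hoc by Lean's type checker,
-- that the implementation satisfies the proposed specification.
theorem impl_satisfies_spec (x : Int) : proposed_spec x (impl x) := by
  simp [proposed_spec, impl]
```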
Submission Number: 150