Keywords: Safe code generation, formal verification, neural theorem proving, LLMs
Abstract: AI agents have shown initial promise in automating mathematical theorem proving in proof assistants such as Lean. The same proof assistants can be used to verify the correctness of code by pairing code with specifications and proofs that the specifications hold. Automating the writing of code, specifications, and proofs could lower the cost of verification, or, more ambitiously, enable an AI agent to output safe, provably correct code. However, it remains unclear whether current neural theorem provers can automatically verify even relatively simple programs. We present $\texttt{miniCodeProps}$, a benchmark of 201 program specifications in the Lean proof assistant, aimed at the subproblem of automatically generating a proof for a provided program and specification. $\texttt{miniCodeProps}$ contains specifications of varied proof difficulty about simple, self-contained programs operating on familiar data structures (e.g., lists, natural numbers, binary trees). Despite its simplicity, $\texttt{miniCodeProps}$ is sufficient to break current LLM-based provers: state-of-the-art methods show promise on the easy properties but fail to prove nearly all of the medium and hard properties. We publicly release $\texttt{miniCodeProps}$ as a benchmark for furthering automated theorem proving in the context of formally verified code.
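To illustrate the kind of task the abstract describes, the following Lean 4 sketch pairs a small program with a specification and a proof. It is a hypothetical example in the benchmark's style, not an entry from $\texttt{miniCodeProps}$; the names (rev, rev_length) are illustrative.

-- Illustrative sketch only (not taken from miniCodeProps): a simple list
-- program, a specification about it, and a machine-checked proof.
def rev {α : Type} : List α → List α
  | []      => []
  | x :: xs => rev xs ++ [x]

-- Specification: reversal preserves list length.
theorem rev_length {α : Type} (xs : List α) : (rev xs).length = xs.length := by
  induction xs with
  | nil => simp [rev]
  | cons x xs ih => simp [rev, List.length_append, ih]

Given the program and the specification, producing a proof of this kind (of varied difficulty across the benchmark) is what $\texttt{miniCodeProps}$ asks an automated prover to do.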
Submission Number: 220