Keywords: code generation, formal verification, verifiable code generation, AI for math, theorem proving, AI for code
Abstract: Large language models (LLMs) are increasingly integrated into software development, but ensuring the correctness of LLM-generated code remains challenging and often requires manual review. Verifiable code generation---jointly generating code, specifications, and proofs of code-specification alignment---offers a promising path to address this limitation and further unleash the benefits of LLMs in programming. Yet there is a significant gap in evaluation: current benchmarks often lack support for end-to-end verifiable code generation. In this paper, we introduce VERINA (Verifiable Code Generation Arena), a high-quality benchmark enabling comprehensive and modular evaluation of code, specification, and proof generation, as well as their composition. VERINA consists of 189 manually curated coding tasks in Lean, each with a detailed problem description, a reference implementation, a formal specification, and an extensive test suite. Our comprehensive evaluation of state-of-the-art LLMs reveals significant challenges for verifiable code generation, especially in proof generation, underscoring the need to improve LLM-based theorem provers in verification domains. The best model, OpenAI o4-mini, achieves only 61.4% correct code, 51.0% sound and complete specifications, and 3.6% successful proofs. We hope VERINA will catalyze progress in verifiable code generation by providing a rigorous and comprehensive benchmark.
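For readers unfamiliar with the setting, the following is a minimal illustrative sketch, in Lean 4, of the three artifacts the abstract refers to: an implementation, a formal specification, and a proof that the implementation satisfies the specification. The task, names (`myMax`, `myMaxSpec`), and proof script are hypothetical and not drawn from VERINA; the sketch assumes a recent Lean 4 toolchain that provides the built-in `omega` tactic.

```lean
-- Hypothetical toy task (not a VERINA problem): return the maximum of two naturals.
def myMax (a b : Nat) : Nat :=
  if a ≥ b then a else b

-- Specification: the result bounds both inputs and equals one of them.
def myMaxSpec (a b r : Nat) : Prop :=
  a ≤ r ∧ b ≤ r ∧ (r = a ∨ r = b)

-- Proof of code-specification alignment: case-split on the branch,
-- then discharge the resulting linear-arithmetic goals.
theorem myMax_correct (a b : Nat) : myMaxSpec a b (myMax a b) := by
  unfold myMax myMaxSpec
  split <;> omega
```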
Submission Number: 75