Guided Proof Search Using Large Language Models and Lemma Extraction in Coq

Published: 06 Mar 2025, Last Modified: 20 Mar 2025 · ICLR 2025 Workshop VerifAI Poster · CC BY 4.0
Keywords: Formal Verification, Automated Theorem Proving, Proof Assistants, Interactive Theorem Proving, Large Language Models
TL;DR: We introduce a novel lemma extraction approach to LLM-based formal proof generation that allows breaking theorems down into easier subproblems, and demonstrate that our method leads to significant improvements over baseline whole-proof generation.
Abstract: Interactive theorem provers are powerful tools for formalizing mathematics and verifying the correctness of software. However, they require significant background and effort to use, owing to the tedious nature of writing formal proofs, and have not seen widespread adoption. Recent advances in machine learning have enabled formal proofs to be auto-generated; most existing approaches that use language models, however, generate proofs step by step, relying on an expensive search algorithm to find complete proofs. In 2023, First et al. introduced Baldur, a system that uses language models to generate entire proofs at once for theorems in the Isabelle proof assistant. Our work studies the feasibility of a similar whole-proof generation procedure for Coq and introduces a novel approach to automated theorem proving that recursively extracts lemmas at failure points in the proof generation process, allowing the system to break complex theorems down into simpler subproblems. We evaluate these approaches on a dataset of 724 theorems from the Software Foundations textbook and show that GPT-4 can generate whole proofs for 66.44% of the theorems. Additionally, when augmented with our lemma extraction method, GPT-4 sees a 19.54% improvement, achieving a success rate of 79.42% and thus marginally outperforming CoqHammer—a state-of-the-art automated reasoning tool—which proves 78.73% of the theorems. We also evaluate the much smaller open-source model Phind CodeLlama, which achieves a 103.23% improvement over its baseline when utilizing lemma extraction. We release our Coq playground, which contains an implementation of this procedure, along with the dataset and evaluation results through an open-source repository to encourage further research in this area.
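The recursive loop the abstract describes—attempt a whole proof, and on failure extract a lemma at the failure point, prove it, and retry with the lemma in scope—can be sketched as follows. This is a minimal illustration, not the paper's implementation: the callables `generate`, `check`, and `extract` are hypothetical stand-ins for the LLM prompt, the Coq type checker, and the lemma-extraction step, and the depth bound is an assumed safeguard against unbounded recursion.

```python
def attempt_theorem(stmt, generate, check, extract, lemmas=None, depth=2):
    """Hedged sketch of recursive lemma extraction.

    generate(stmt, lemmas) -> candidate whole proof (e.g. from an LLM)
    check(stmt, proof, lemmas) -> (ok, failure_info) from the proof checker
    extract(stmt, failure_info) -> lemma statement at the failure point
    All three are hypothetical interfaces assumed for illustration.
    """
    lemmas = list(lemmas or [])
    proof = generate(stmt, lemmas)
    ok, failure = check(stmt, proof, lemmas)
    if ok:
        return proof
    if depth == 0 or failure is None:
        return None  # give up: no budget left or no recoverable subgoal
    # Extract the stuck subgoal as a lemma and try to prove it recursively.
    lemma = extract(stmt, failure)
    lemma_proof = attempt_theorem(lemma, generate, check, extract,
                                  lemmas, depth - 1)
    if lemma_proof is None:
        return None
    # Retry the original theorem with the proved lemma now available.
    lemmas.append((lemma, lemma_proof))
    return attempt_theorem(stmt, generate, check, extract, lemmas, depth - 1)


# Toy stand-ins: theorem "T" only checks once helper lemma "L" is in scope.
def toy_generate(stmt, lemmas):
    return f"proof-of-{stmt}-with-{len(lemmas)}-lemmas"

def toy_check(stmt, proof, lemmas):
    if stmt == "T" and not lemmas:
        return False, "stuck-subgoal"
    return True, None

def toy_extract(stmt, failure):
    return "L"

print(attempt_theorem("T", toy_generate, toy_check, toy_extract))
# → proof-of-T-with-1-lemmas
```

In a real system the checker would be the Coq kernel (invoked, for example, through a serialization layer) and the failure information would be the unproved goal at the first failing tactic; the toy stubs here exist only to make the control flow executable.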
Submission Number: 6
