Lean4trace: Data augmentation for neural theorem proving in Lean

Vasilii Nesterov; Yermek Kapushev; Mikhail Burtsev

Lean4trace: Data augmentation for neural theorem proving in Lean

Vasilii Nesterov, Yermek Kapushev, Mikhail Burtsev

Published: 13 Jun 2024, Last Modified: 08 Jul 2024ICML 2024 Workshop AI4MATH PosterEveryoneRevisionsBibTeXCC BY 4.0

Keywords: Theorem Proving, Automated theorem proving, Data augmentation, AI for math

Abstract: Integrating large language models as proof assistants with theorem provers has shown great promise. However, one of the major challenges in this field is the scarcity of training data. To address this, we release a new open-source tool, *Lean4trace*, for training data extraction from Lean 4 sources. Unlike previous approaches, *Lean4trace* is deeply integrated into the Lean elaborator, allowing us to modify proofs on-the-fly. Leveraging this feature, we propose two methods of data augmentation in Lean: (1) decomposing composite proof steps into multiple simpler steps; (2) testing existing proof automation tactics at each proof state and collecting the successful ones. Models trained on this augmented data are capable of proving 58.0% of theorems from a hold-out subset of Mathlib and 35.6% of the test subset of the MiniF2F benchmark.

Submission Number: 25

Loading