Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data

Published: 10 Oct 2024, Last Modified: 31 Oct 2024, MATH-AI 24, CC BY 4.0
Keywords: Neural Theorem Proving, Language Model, Autoformalization, Formal Math, miniF2F
Abstract: Proof assistants like Lean have revolutionized mathematical proof verification by providing high levels of accuracy and reliability. Although large language models (LLMs) have demonstrated potential in mathematical reasoning, their advancement in formal theorem proving is hindered by the scarcity of large, high-quality training datasets. To address this challenge, we present a novel approach to generate extensive Lean 4 proof data from natural language mathematical problems at the high school and undergraduate levels. Specifically, we synthesize 8 million formal statements with corresponding proofs, leveraging this dataset to fine-tune the DeepSeekMath 7B model. The resulting model, DeepSeek-Prover, achieves a pass rate of 50% on the Lean 4 miniF2F benchmark, surpassing the previous state-of-the-art result of 41.0%. These findings underscore the potential of large-scale synthetic data in significantly enhancing the theorem-proving capabilities of LLMs.
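For readers unfamiliar with the target format, the following is a minimal, hypothetical illustration (not drawn from the paper's dataset) of the kind of Lean 4 statement-proof pair the abstract describes, i.e., a high-school-level natural language problem autoformalized into a Lean theorem with a machine-checkable proof. It assumes Mathlib is available for the `Even` predicate and the `Even.add` lemma.

```lean
import Mathlib

-- Hypothetical example of an autoformalized statement-proof pair.
-- Natural language problem: "The sum of two even integers is even."
theorem sum_of_evens_is_even (a b : ℤ) (ha : Even a) (hb : Even b) :
    Even (a + b) :=
  -- Mathlib's Even.add closes the goal directly.
  Even.add ha hb
```

In the pipeline sketched by the abstract, pairs of this shape (statement plus verified proof) would serve as fine-tuning examples, with the Lean checker guaranteeing that every synthetic proof in the training set is correct.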
Concurrent Submissions: NeurIPS 2024 Conference Submission
Submission Number: 52