DeepTheorem: Advancing LLM Reasoning for Theorem Proving Through Natural Language and Reinforcement Learning
Keywords: natural language theorem proving, large language models, reasoning, reinforcement learning
TL;DR: We introduce DeepTheorem, the first comprehensive suite for advancing LLMs' informal theorem proving, comprising a large-scale dataset, an adaptation of the RL-Zero training method, a benchmark, and comprehensive evaluation metrics.
Abstract: Theorem proving serves as a major testbed for evaluating complex reasoning abilities in large language models (LLMs). However, traditional automated theorem proving (ATP) approaches rely heavily on formal proof systems that align poorly with LLMs' strengths, which derive from the informal, natural-language knowledge acquired during pre-training. In this work, we introduce DeepTheorem, a comprehensive informal theorem-proving suite that exploits natural language to enhance LLM mathematical reasoning. DeepTheorem includes 1) a large-scale dataset of 121K high-quality IMO-level informal theorems and proofs spanning diverse mathematical domains, rigorously annotated for correctness, difficulty, and topic categories, accompanied by systematically constructed verifiable theorem variants; 2) an adaptation of RL-Zero explicitly to informal theorem proving, leveraging the verified theorem variants to incentivize robust mathematical inference; 3) comprehensive outcome and process evaluation metrics examining proof correctness and the quality of reasoning steps; and 4) a new informal theorem-proving benchmark consolidated from three established math competitions, formatted for automatic evaluation. Extensive experimental analyses demonstrate that DeepTheorem significantly improves LLM theorem-proving performance compared to existing datasets and supervised fine-tuning protocols, achieving state-of-the-art accuracy and reasoning quality. Our findings highlight DeepTheorem's potential to fundamentally advance automated informal theorem proving and mathematical exploration.
Primary Area: datasets and benchmarks
Submission Number: 17319