Keywords: Neural Theorem Proving; Reinforcement Learning; Large Language Models
TL;DR: We propose ProofAug+, an RL training pipeline for LLM theorem provers that boosts training performance by acquiring more positive samples during rollout via the proof repair technique ProofAug, and by optimizing with PLPO, a novel PPO-variant algorithm.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) often suffers from the scarcity of positive samples on challenging tasks such as formal theorem proving.
In this work, we propose ProofAug+, an RL training pipeline for LLM theorem provers that improves training performance by acquiring more positive samples during rollout through ProofAug, a previously developed inference-time proof repair technique.
The design of ProofAug+ is guided by two principles, progress guarantee and variance reduction, to address the performance degradation and policy collapse issues observed when integrating ProofAug into GRPO via naive direct replacement.
These principles first lead to a novel LLM RLVR algorithm, Proximal Language Modeling Policy Optimization (PLPO), in which each iteration optimizes the exact objective rather than the surrogate objectives used in TRPO/PPO and employs a gradient rejection mechanism to suppress large policy updates.
We then integrate ProofAug into PLPO in a constrained way, balancing the exploitation of additional positive reward signals against the distribution shift that could violate the progress guarantee principle.
Experiments show that PLPO achieves better stability than baseline GRPO-like algorithms while maintaining higher entropy during training. Building on PLPO, the resulting ProofAug+ pipeline further yields significant performance gains.
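For a concrete picture of the two PLPO ingredients mentioned in the abstract (exact objective plus gradient rejection), here is a minimal PyTorch-style sketch. It is an illustrative reading only: the function name `plpo_loss_sketch`, the tolerance `ratio_tol`, and the acceptance rule are assumptions, not the paper's implementation.

```python
import torch

def plpo_loss_sketch(logp_new, logp_old, advantages, ratio_tol=0.2):
    """Hypothetical PLPO-style per-token loss (illustrative assumptions only).

    logp_new:   per-token log-probs under the current policy (requires grad)
    logp_old:   per-token log-probs under the rollout policy (no grad)
    advantages: per-token advantage estimates (broadcastable to logp_new)
    """
    # Exact objective: advantage-weighted log-likelihood under the current
    # policy, rather than a clipped importance-sampling surrogate as in PPO.
    per_token_obj = advantages * logp_new

    # Gradient rejection (assumed form): drop the gradient contribution of
    # tokens whose policy ratio has drifted beyond the tolerance, instead of
    # clipping the surrogate objective.
    ratio = torch.exp(logp_new - logp_old).detach()
    accept = (ratio - 1.0).abs() <= ratio_tol

    # Maximize the objective -> minimize its negative, averaged over accepted tokens.
    return -(per_token_obj * accept).sum() / accept.sum().clamp(min=1)
```

Compared with PPO-style clipping, rejecting a token removes it from the gradient entirely, which is one plausible way to realize the "suppress large policy updates" behavior described above.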
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 25130