Advancing Formal Mathematical Reasoning with Explorative Reinforcement Learning

09 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Formal Reasoning, LLMs, RL
Abstract: Reinforcement learning with verifiable rewards is a promising direction for training large language models (LLMs) in formal reasoning. However, current approaches such as GRPO and expert iteration, which generate multiple solution candidates per problem and assign Pass@1 rewards to each candidate independently, struggle to balance exploration and exploitation, and thus fail to acquire new proving patterns (e.g., proof by contradiction, case analysis, mathematical induction). The resulting Pass@1-trained policies tend to over-rely on conservative actions inherited from pretraining and supervised fine-tuning (SFT), such as completing Lean 4 proofs with the sorry placeholder, thereby reinforcing misplaced confidence in these shortcuts during inference-time scaling. To address this limitation, we introduce T-RL, the first exploration-aware RL method for formal reasoning that leverages compiler rewards aligned with the Pass@K effect to directly improve grouped Lean 4 proof completion and self-improvement. Empirically, T-RL improves exploration by increasing the average number of tactics per proof and by encouraging the use of more diverse mathematical techniques. Our T-RL-trained prover, based on Qwen2.5-1.5B, outperforms DeepSeek-Prover-V1.5-7B on both MiniF2F and FormalMATH-Lite. Specifically, it achieves 70.1% on MiniF2F with a sampling budget of only 1 × 32 × 4, surpassing DeepSeek-Prover-V1.5-RL with Pass@16 × 6400 via MCTS. T-RL is an early reinforcement learning algorithm with explicit exploration-based learning objectives; it demonstrates promising preliminary results and highlights a potential direction for future research in formal reasoning.
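The abstract contrasts per-candidate Pass@1 credit with a group-level, Pass@K-aligned reward. The paper's exact T-RL reward is not given in this abstract, so the sketch below is only a hypothetical illustration of the distinction: it uses the standard unbiased Pass@K estimator and shares that group-level signal across the verified candidates in a group, rather than rewarding each candidate in isolation. All function names here are illustrative, not from the paper.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimator: probability that at least one of k
    samples drawn (without replacement) from n candidates, c of which
    are verified correct, passes the compiler check."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def rewards_pass1(verified: list[bool]) -> list[float]:
    # Conventional scheme: each candidate is rewarded independently,
    # which favors conservative, already-known proof patterns.
    return [1.0 if ok else 0.0 for ok in verified]

def rewards_passk_aligned(verified: list[bool], k: int) -> list[float]:
    # Hypothetical Pass@K-aligned scheme: verified candidates share the
    # group-level Pass@K signal, so a group that finds even one novel
    # correct proof is credited for its exploration as a whole.
    n, c = len(verified), sum(verified)
    group_signal = pass_at_k(n, c, k)
    return [group_signal if ok else 0.0 for ok in verified]
```

For example, with one correct proof among four candidates and k = 2, the group signal is 1 − C(3,2)/C(4,2) = 0.5, so the correct candidate receives 0.5 instead of the full 1.0 a Pass@1 scheme would assign; how T-RL actually shapes and normalizes this signal is detailed in the paper, not here.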
Supplementary Material: zip
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Submission Number: 3305