Keywords: Formal Reasoning, Theorem Proving with LLMs, Lean4, Reinforcement Learning
TL;DR: We demonstrate that the Lean proof assistant itself can serve as a symbolic process oracle, supplying both outcome-level and fine-grained tactic-level verified rewards during RL.
Abstract: Current reinforcement learning from verifiable rewards (RLVR) relies on sparse binary outcomes, whereas symbolic proof assistants can provide fine-grained, structured feedback. In this work, we demonstrate that the Lean proof assistant itself can serve as a symbolic process oracle, providing verified rewards at both the outcome level and the fine-grained tactic level during training. Proof attempts are parsed into tactic sequences, and Lean's elaboration marks both locally sound steps and the earliest failing step, yielding dense, verifier-grounded credit signals rooted in type theory. We incorporate these structured signals into a GRPO-style reinforcement learning objective, using first-error propagation and first-token credit methods that balance outcome- and process-level advantages. Experiments with STP-Lean and DeepSeek-Prover-V1.5 show that tactic-level supervision outperforms outcome-only baselines in most settings, delivering improvements on benchmarks such as MiniF2F and ProofNet. Beyond empirical gains, our study highlights a broader perspective: symbolic proof assistants are not only verifiers at evaluation time, but can also act as process-level reward oracles during training. This opens a path toward reinforcement learning frameworks that combine the scalability of language models with the reliability of symbolic verification for formal reasoning.
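To make the reward scheme described in the abstract concrete, the sketch below shows one way tactic-level feedback from Lean could be blended with an outcome reward into GRPO-style group-normalized advantages. This is a minimal illustration, not the paper's implementation: the data fields, reward values, mixing weight `alpha`, and the exact handling of first-error propagation and first-token credit are assumptions.

```python
# Minimal sketch (assumed, not the authors' code): per-tactic rewards from
# Lean's elaboration outcome, blended with a binary outcome reward into
# GRPO-style group-normalized advantages.
from dataclasses import dataclass
from typing import List, Optional
import statistics


@dataclass
class ProofAttempt:
    tactics: List[str]           # tactic sequence parsed from the model's proof
    first_error: Optional[int]   # index of the earliest failing tactic (None if proof closes)
    proof_complete: bool         # outcome-level signal: Lean accepts the whole proof


def tactic_rewards(attempt: ProofAttempt) -> List[float]:
    """Dense per-tactic rewards: +1 for steps Lean elaborates before the first
    failure, -1 at the first failing step, 0 afterwards (later steps are not
    credited once the proof state is broken)."""
    rewards = []
    for i, _ in enumerate(attempt.tactics):
        if attempt.first_error is None or i < attempt.first_error:
            rewards.append(1.0)
        elif i == attempt.first_error:
            rewards.append(-1.0)
        else:
            rewards.append(0.0)
    return rewards


def blended_advantages(group: List[ProofAttempt], alpha: float = 0.5) -> List[float]:
    """GRPO-style advantages over a sampled group: mix the outcome reward with
    each attempt's mean tactic-level reward (alpha is an assumed weight), then
    normalize by the group mean and standard deviation."""
    outcome = [1.0 if a.proof_complete else 0.0 for a in group]
    process = [statistics.fmean(tactic_rewards(a)) for a in group]
    scores = [(1 - alpha) * o + alpha * p for o, p in zip(outcome, process)]
    mu = statistics.fmean(scores)
    sigma = statistics.pstdev(scores) or 1.0  # avoid division by zero
    return [(s - mu) / sigma for s in scores]
```

In a training loop, the resulting per-attempt advantage would then be assigned back to tokens (e.g., concentrated on each tactic's first token under a first-token credit scheme), but how that assignment is done in the paper is not specified here.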
Submission Number: 70