Keywords: Reinforcement Learning, Large Language Models, Formal Verification, Mathematical Reasoning, Verifiable Rewards
TL;DR: We introduce JURY-RL, a 'votes propose, proofs dispose' paradigm for label-free reinforcement learning that robustly aligns LLM reasoning with formally verifiable correctness, without human-annotated labels.
Abstract: Reinforcement learning with verifiable rewards (RLVR) enhances the reasoning of large language models (LLMs), but its scalability is hampered by the high cost of human-annotated labels. Label-free alternatives, such as majority voting or LLM-as-a-judge, are susceptible to false positives that lead to reward hacking and training collapse. We introduce JURY-RL, a label-free RLVR framework that separates answer proposal from reward disposal: votes from model rollouts propose a consensus answer, while a formal theorem prover disposes the final reward. Specifically, a rollout is rewarded only if the majority-voted answer is formally verified by a Lean prover. When verification is inconclusive, we activate our proposed ResZero (Residual-Zero) reward: it drops the unverifiable majority proposal and assigns a zero-mean, variance-preserving reward to the remaining (residual) answers. This design maintains a stable optimization gradient for RL algorithms without reinforcing spurious consensus. Experiments across mathematical reasoning, code generation, and multi-task benchmarks show that JURY-RL not only achieves more stable training but also consistently outperforms label-free baselines and even matches or surpasses supervised training with ground-truth rewards across pass@1 and pass@k.
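To make the reward logic concrete, below is a minimal sketch of the 'votes propose, proofs dispose' assignment with the ResZero fallback. The `lean_verify` hook is a hypothetical stand-in for the actual Lean verification pipeline, and the particular zero-mean, variance-preserving construction used for the residual answers is an illustrative assumption, not the paper's exact formula.

```python
import statistics
from collections import Counter
from typing import Callable, List, Optional


def jury_rl_rewards(
    answers: List[str],
    lean_verify: Callable[[str], Optional[bool]],
) -> List[float]:
    """Reward a group of rollout answers under 'votes propose, proofs dispose'.

    `lean_verify` is a hypothetical stand-in for the Lean prover hook:
    True if the candidate answer is formally verified, None (or False)
    when verification is inconclusive or fails.
    """
    # Votes propose: the majority-voted answer is the consensus candidate.
    majority = Counter(answers).most_common(1)[0][0]

    if lean_verify(majority) is True:
        # Proofs dispose: reward only rollouts matching the verified answer.
        return [1.0 if a == majority else 0.0 for a in answers]

    # Verification inconclusive: ResZero fallback.
    # Drop the unverifiable majority proposal; give the residual answers a
    # zero-mean reward with nonzero variance so the policy gradient retains
    # a signal without reinforcing the spurious consensus.
    rewards = [0.0] * len(answers)
    residual = [i for i, a in enumerate(answers) if a != majority]
    if len(residual) < 2:
        return rewards  # too few residual answers to spread a zero-mean signal

    # Illustrative construction (an assumption, not the paper's formula):
    # score residual answers by their vote share among residuals, then
    # center to zero mean and rescale to unit variance.
    counts = Counter(answers[i] for i in residual)
    raw = [float(counts[answers[i]]) for i in residual]
    mean = sum(raw) / len(raw)
    std = statistics.pstdev(raw)
    for i, r in zip(residual, raw):
        rewards[i] = (r - mean) / std if std > 0 else 0.0
    return rewards


# Example: the majority answer "4" cannot be verified, so it receives zero
# reward while the residual answers get a centered, nonzero-variance signal.
if __name__ == "__main__":
    group = ["4", "4", "4", "7", "7", "9"]
    print(jury_rl_rewards(group, lambda ans: None))
```

The variance-preserving part matters because group-normalized policy-gradient methods (e.g., GRPO-style advantages) yield zero learning signal when every rollout in a group receives the same reward; ResZero presumably keeps the gradient alive in the inconclusive case without crediting the unverified consensus.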
Primary Area: foundation or frontier models, including LLMs
Submission Number: 15756