Keywords: Reinforcement Learning, Large Language Models, Formal Verification, Mathematical Reasoning, Verifiable Rewards
TL;DR: We introduce JURY-RL, a 'votes propose, proofs dispose' paradigm for label-free reinforcement learning that robustly aligns LLM reasoning with formally verifiable correctness, without human-annotated labels.
Abstract: Reinforcement learning with verifiable rewards (RLVR) enhances the reasoning of large language models (LLMs), but its scalability is hampered by the high cost of human-annotated labels. Label-free alternatives, such as majority voting or LLM-as-a-judge, are susceptible to false positives that lead to reward hacking and training collapse. We introduce JURY-RL, a label-free RLVR framework that separates answer proposal from reward disposal: votes from model rollouts propose a consensus answer, while a formal theorem prover disposes the final reward. Specifically, a rollout is rewarded only if the majority-voted answer is formally verified by a Lean prover. When verification is inconclusive, we activate our proposed ResZero (Residual-Zero) reward: it drops the unverifiable majority proposal and assigns a zero-mean, variance-preserving reward to the remaining (residual) answers. This design maintains a stable optimization gradient for RL algorithms without reinforcing spurious consensus. Experiments across mathematical reasoning, code generation, and multi-task benchmarks show that JURY-RL not only achieves more stable training but also consistently outperforms label-free baselines and even matches or surpasses supervised training with ground-truth rewards across pass@1 and pass@k.
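To make the reward logic concrete, below is a minimal sketch of the 'votes propose, proofs dispose' assignment with the ResZero fallback. The `lean_verify` hook is a hypothetical stand-in for the actual Lean verification pipeline, and the particular zero-mean, variance-preserving construction used for the residual answers is an illustrative assumption, not the paper's exact formula.

```python
import statistics
from collections import Counter
from typing import Callable, List, Optional


def jury_rl_rewards(
    answers: List[str],
    lean_verify: Callable[[str], Optional[bool]],
) -> List[float]:
    """Reward a group of rollout answers under 'votes propose, proofs dispose'.

    `lean_verify` is a hypothetical stand-in for the Lean prover hook:
    True if the candidate answer is formally verified, None (or False)
    when verification is inconclusive or fails.
    """
    # Votes propose: the majority-voted answer is the consensus candidate.
    majority = Counter(answers).most_common(1)[0][0]

    if lean_verify(majority) is True:
        # Proofs dispose: reward only rollouts matching the verified answer.
        return [1.0 if a == majority else 0.0 for a in answers]

    # Verification inconclusive: ResZero fallback.
    # Drop the unverifiable majority proposal; give the residual answers a
    # zero-mean reward with nonzero variance so the policy gradient retains
    # a signal without reinforcing the spurious consensus.
    rewards = [0.0] * len(answers)
    residual = [i for i, a in enumerate(answers) if a != majority]
    if len(residual) < 2:
        return rewards  # too few residual answers to spread a zero-mean signal

    # Illustrative construction (an assumption, not the paper's formula):
    # score residual answers by their vote share among residuals, then
    # center to zero mean and rescale to unit variance.
    counts = Counter(answers[i] for i in residual)
    raw = [float(counts[answers[i]]) for i in residual]
    mean = sum(raw) / len(raw)
    std = statistics.pstdev(raw)
    for i, r in zip(residual, raw):
        rewards[i] = (r - mean) / std if std > 0 else 0.0
    return rewards


# Example: the majority answer "4" cannot be verified, so it receives zero
# reward while the residual answers get a centered, nonzero-variance signal.
if __name__ == "__main__":
    group = ["4", "4", "4", "7", "7", "9"]
    print(jury_rl_rewards(group, lambda ans: None))
```

The variance-preserving part matters because group-normalized policy-gradient methods (e.g., GRPO-style advantages) yield zero learning signal when every rollout in a group receives the same reward; ResZero presumably keeps the gradient alive in the inconclusive case without crediting the unverified consensus.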
Primary Area: foundation or frontier models, including LLMs
Submission Number: 15756