Breaking the Invisible Leash: Support Expansion in RLVR via Off-Policy Transport

20 Sept 2025 (modified: 25 Sept 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Reinforcement Learning, RLVR, LLM Reasoning
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) reliably raises pass@1, yet standard on-policy updates often amplify a base model’s existing modes, curbing exploration and missing correct traces with negligible prior mass. We introduce an off-policy transport objective that evaluates guided rollouts at the plain prompt via a token-wise transport ratio. Under mild conditions, we prove a strictly positive drift in the plain-prompt log-likelihood of correct traces, implying support expansion in expectation. We optimize this objective with a GRPO-compatible pipeline that draws guidance from either an external policy or the model itself. The analysis explains why stepping beyond purely on-policy RLVR expands empirical support, unifying recent observations across off-policy guidance, replay-based optimization, and stepwise hint scaffolding under a single transport mechanism.
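
The abstract does not spell out the loss, so the following is only a minimal sketch of how a token-wise transport ratio could plug into a GRPO-style update: guided rollouts are scored under the current policy at the plain prompt and under the guidance distribution at the guided prompt, and the ratio reweights a group-relative, clipped surrogate. All names (`logp_plain`, `logp_guide`, `transport_grpo_loss`, the clipping constant) are hypothetical, not the paper's API.

```python
# Hypothetical sketch: token-wise transport ratio inside a GRPO-style objective.
# Assumes per-token log-probabilities have already been computed elsewhere.
import torch

def transport_grpo_loss(
    logp_plain: torch.Tensor,   # [G, T] log pi_theta(y_t | plain prompt, y_<t), requires grad
    logp_guide: torch.Tensor,   # [G, T] log pi_guide(y_t | guided prompt, y_<t), detached
    mask: torch.Tensor,         # [G, T] 1 for response tokens, 0 for padding
    rewards: torch.Tensor,      # [G]    verifiable rewards (e.g. 0/1) for the G rollouts
    clip_eps: float = 0.2,      # illustrative clipping constant
) -> torch.Tensor:
    mask = mask.float()

    # Group-relative advantage (GRPO): standardize rewards within the rollout group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)      # [G]
    adv = adv.unsqueeze(-1)                                        # [G, 1]

    # Token-wise transport ratio: guided rollouts re-evaluated at the plain prompt.
    ratio = torch.exp(logp_plain - logp_guide.detach())            # [G, T]

    # Clipped surrogate applied per token, then masked-averaged; negate for minimization.
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    per_token = torch.minimum(unclipped, clipped)
    return -(per_token * mask).sum() / mask.sum()
```

Under this reading, drawing the G rollouts from a hinted prompt or from an external guide, while backpropagating only through `logp_plain`, is one plausible way to realize "guidance from either an external policy or the model itself"; the paper's actual formulation may differ.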
Primary Area: reinforcement learning
Submission Number: 25213