Breaking the Invisible Leash: Support Expansion in RLVR via Off-Policy Transport

20 Sept 2025 (modified: 25 Sept 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Reinforcement Learning, RLVR, LLM Reasoning
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) reliably raises pass@1, yet standard on-policy updates often amplify a base model’s existing modes, curbing exploration and missing correct traces with negligible prior mass. We introduce an off-policy transport objective that evaluates guided rollouts at the plain prompt via a token-wise transport ratio. Under mild conditions, we prove a strictly positive drift in the plain-prompt log-likelihood of correct traces, implying support expansion in expectation. We optimize this objective with a GRPO-compatible pipeline that draws guidance from either an external policy or the model itself. The analysis explains why stepping beyond purely on-policy RLVR expands empirical support, unifying recent observations across off-policy guidance, replay-based optimization, and stepwise hint scaffolding under a single transport mechanism.
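
The abstract does not spell out the loss, so the following is only a minimal sketch of how a token-wise transport ratio could plug into a GRPO-style update: guided rollouts are scored under the current policy at the plain prompt and under the guidance distribution at the guided prompt, and the ratio reweights a group-relative, clipped surrogate. All names (`logp_plain`, `logp_guide`, `transport_grpo_loss`, the clipping constant) are hypothetical, not the paper's API.

```python
# Hypothetical sketch: token-wise transport ratio inside a GRPO-style objective.
# Assumes per-token log-probabilities have already been computed elsewhere.
import torch

def transport_grpo_loss(
    logp_plain: torch.Tensor,   # [G, T] log pi_theta(y_t | plain prompt, y_<t), requires grad
    logp_guide: torch.Tensor,   # [G, T] log pi_guide(y_t | guided prompt, y_<t), detached
    mask: torch.Tensor,         # [G, T] 1 for response tokens, 0 for padding
    rewards: torch.Tensor,      # [G]    verifiable rewards (e.g. 0/1) for the G rollouts
    clip_eps: float = 0.2,      # illustrative clipping constant
) -> torch.Tensor:
    mask = mask.float()

    # Group-relative advantage (GRPO): standardize rewards within the rollout group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)      # [G]
    adv = adv.unsqueeze(-1)                                        # [G, 1]

    # Token-wise transport ratio: guided rollouts re-evaluated at the plain prompt.
    ratio = torch.exp(logp_plain - logp_guide.detach())            # [G, T]

    # Clipped surrogate applied per token, then masked-averaged; negate for minimization.
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    per_token = torch.minimum(unclipped, clipped)
    return -(per_token * mask).sum() / mask.sum()
```

Under this reading, drawing the G rollouts from a hinted prompt or from an external guide, while backpropagating only through `logp_plain`, is one plausible way to realize "guidance from either an external policy or the model itself"; the paper's actual formulation may differ.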
Primary Area: reinforcement learning
Submission Number: 25213