RLPIR: Reinforcement Learning with Prefix and Intrinsic Reward

ICLR 2026 Conference Submission 11062 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Unsupervised Training, Reinforcement Learning, Efficiency, Low-Cost
TL;DR: RLPIR is a verifier-free RL framework using intra-group consistency rewards and prefix rollouts to match RLVR performance without ground truth, reducing training time by $6.96\times$ and reasoning length by $45\%$.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) for large language models faces two critical limitations: (i) reliance on verifiable rewards restricts applicability to domains with accessible ground truth answers; (ii) training demands long rollouts (e.g., 16K tokens for complex math problems). We propose \textbf{R}einforcement \textbf{L}earning with \textbf{P}refix and \textbf{I}ntrinsic \textbf{R}eward (\textbf{RLPIR}), a verifier-free reinforcement learning framework that learns from intrinsic rewards while reducing compute. RLPIR includes (1) a \textbf{prefix rollout} paradigm that avoids long rollouts by optimizing only the first $L$ tokens, and (2) an \textbf{intra-group consistency reward} that eliminates reliance on verifiable rewards by measuring consistency among multiple sampled outputs. Across mathematical and general benchmarks, \textbf{RLPIR} matches RLVR's performance without ground truth while reducing training time by $6.96\times$. Moreover, \textbf{RLPIR} shortens reasoning sequences by 45\%, improving the reasoning efficiency of LLMs.
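The abstract does not specify the exact form of the intra-group consistency reward or the prefix rollout; the sketch below is an illustrative assumption of how such a verifier-free signal could be computed, rewarding each sampled output by how often its final answer agrees with the rest of its group and truncating rollouts to the first $L$ tokens. The helper names (`consistency_rewards`, `extract_answer`, `prefix_rollout`) are hypothetical and not the authors' implementation.

```python
# Illustrative sketch only; the paper's exact reward and rollout code is not published.
from collections import Counter
from typing import Callable, List


def consistency_rewards(
    completions: List[str],
    extract_answer: Callable[[str], str],  # hypothetical answer parser
) -> List[float]:
    """Reward each completion by the fraction of other group members
    whose extracted final answer matches its own (no ground truth used)."""
    answers = [extract_answer(c) for c in completions]
    counts = Counter(answers)
    group = len(answers)
    return [(counts[a] - 1) / max(group - 1, 1) for a in answers]


def prefix_rollout(completion: str, prefix_len: int) -> str:
    """Keep only the first `prefix_len` tokens for optimization,
    avoiding full-length (e.g., 16K-token) rollouts."""
    tokens = completion.split()  # stand-in for a real tokenizer
    return " ".join(tokens[:prefix_len])


if __name__ == "__main__":
    group = ["... so the answer is 42", "... therefore we get 42", "... answer: 7"]
    rewards = consistency_rewards(group, extract_answer=lambda s: s.split()[-1])
    print(rewards)  # [0.5, 0.5, 0.0]
```

Under this reading, agreement within a sampled group substitutes for a verifier, and only the prefix tokens enter the policy update; both choices are assumptions consistent with, but not confirmed by, the abstract.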
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 11062