Keywords: RLVR, reasoning
Abstract: Large reasoning models (LRMs) trained with *Reinforcement Learning with Verifiable Rewards* (RLVR) have achieved remarkable progress on complex reasoning tasks. However, RLVR heavily relies on on-policy rollout generation, whose cost grows rapidly with rollout length and model size, eventually becoming the training bottleneck. Our empirical analysis reveals that independent rollouts for the same query often share similar early steps, indicating substantial redundancy. To address this, we propose **Pros** (**P**refix **R**euse for **O**n-policy **S**ampling), a paradigm that reuses promising prefixes of historical rollouts in RLVR training. **Pros** appends these self-generated partial rollouts to the original queries to form *Augmented Queries*, which are then used as regular training inputs in subsequent iterations, thereby reducing redundant computation. To select training batches from the pool of augmented queries, **Pros** adopts a hierarchical Bayesian model to estimate their pass rates and prioritizes those with the highest reward uncertainty. Experiments across diverse settings show that **Pros** consistently improves training efficiency and achieves higher accuracy than strong baselines. These results highlight **Pros** as a practical path toward scalable and compute-efficient RLVR.
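The sketch below illustrates, under stated assumptions, the two mechanisms the abstract describes: forming an augmented query by appending a reused rollout prefix to the original query, and selecting the queries whose estimated pass rate is most uncertain. The class and function names (`AugmentedQuery`, `select_batch`) are hypothetical, and the shared Beta prior stands in as a much simplified placeholder for the paper's hierarchical Bayesian estimator, whose actual form is not given in the abstract.

```python
from dataclasses import dataclass


@dataclass
class AugmentedQuery:
    """A query concatenated with a reused rollout prefix (hypothetical structure)."""
    query: str
    prefix: str          # promising prefix taken from a historical rollout
    successes: int = 0   # verified-correct completions observed so far
    failures: int = 0    # verified-incorrect completions observed so far

    @property
    def prompt(self) -> str:
        # Augmented query = original query followed by the reused partial rollout.
        return self.query + self.prefix


def posterior_variance(successes: int, failures: int,
                       alpha0: float = 1.0, beta0: float = 1.0) -> float:
    """Variance of a Beta posterior over the pass rate.

    A Beta(alpha0, beta0) prior shared across queries plays the role of the
    top level of a (greatly simplified) hierarchical model; the estimator in
    the paper may differ.
    """
    a = alpha0 + successes
    b = beta0 + failures
    return (a * b) / ((a + b) ** 2 * (a + b + 1))


def select_batch(candidates: list[AugmentedQuery],
                 batch_size: int) -> list[AugmentedQuery]:
    """Pick the augmented queries whose estimated pass rate is most uncertain."""
    return sorted(candidates,
                  key=lambda q: posterior_variance(q.successes, q.failures),
                  reverse=True)[:batch_size]


if __name__ == "__main__":
    pool = [
        AugmentedQuery("Prove that ...", " Step 1: ...", successes=3, failures=1),
        AugmentedQuery("Compute ...", " First, note ...", successes=0, failures=4),
        AugmentedQuery("Show that ...", " Let x = ...", successes=2, failures=2),
    ]
    for q in select_batch(pool, batch_size=2):
        print(q.prompt[:30], posterior_variance(q.successes, q.failures))
```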
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 18293