Keywords: GRPO, down-sampling, reinforcement learning, large language models, RLVR
TL;DR: We propose a method that makes GRPO training significantly faster by generating many rollouts in parallel but only training on the most informative subset.
Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as the leading approach for enhancing reasoning capabilities in large language models. However, it faces a fundamental compute and memory asymmetry: rollout generation is embarrassingly parallel and memory-light, whereas policy updates are communication-heavy and memory-intensive. To address this, we introduce **PODS** (**P**olicy **O**ptimization with **D**own-**S**ampling), which decouples rollout generation from policy updates by training only on a strategically selected subset of rollouts, maintaining learning quality while dramatically reducing update costs. We propose a principled subset selection criterion—*max-variance down-sampling*—that maximizes reward diversity, and provide an efficient $O(n\log n)$ implementation. Empirically, Group Relative Policy Optimization (GRPO) with PODS achieves the peak test accuracy of vanilla GRPO at least $\mathbf{1.7\times}$ **faster** across the reasoning benchmarks and hardware configurations we tested.
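For illustration, below is a minimal sketch of one way a max-variance, size-$k$ selection could be implemented in $O(n\log n)$; the function name, the NumPy-based code, and the extremes-then-scan strategy are our illustrative assumptions, not the authors' released implementation. The idea is that a variance-maximizing subset can be assembled from the largest and smallest rewards, so one sort followed by a scan over the $k+1$ possible splits suffices; the scan is linear, so sorting dominates the cost.

```python
import numpy as np


def max_variance_downsample(rewards: np.ndarray, k: int) -> np.ndarray:
    """Return indices of a size-k subset of `rewards` with (approximately) maximal variance.

    Illustrative sketch: assumes a variance-maximizing subset consists of the
    i largest and (k - i) smallest rewards for some i, so we sort once
    (O(n log n)) and scan the k + 1 candidate splits in O(k).
    """
    n = len(rewards)
    order = np.argsort(rewards)                   # indices in ascending reward order
    sorted_r = rewards[order]

    # Prefix sums of rewards and squared rewards for O(1) variance per split.
    prefix = np.concatenate([[0.0], np.cumsum(sorted_r)])
    prefix_sq = np.concatenate([[0.0], np.cumsum(sorted_r ** 2)])

    best_var, best_i = -1.0, 0
    for i in range(k + 1):                        # i rollouts taken from the top
        lo = k - i                                # remaining rollouts from the bottom
        s = prefix[lo] + (prefix[n] - prefix[n - i])
        sq = prefix_sq[lo] + (prefix_sq[n] - prefix_sq[n - i])
        var = sq / k - (s / k) ** 2               # population variance of the candidate subset
        if var > best_var:
            best_var, best_i = var, i

    # Indices of the chosen rollouts: (k - best_i) lowest-reward plus best_i highest-reward.
    return np.concatenate([order[: k - best_i], order[n - best_i:]])


# Hypothetical usage: keep the 4 most reward-diverse rollouts out of 16.
# rewards = np.random.rand(16); keep = max_variance_downsample(rewards, k=4)
```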
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 13161