Variance-Reduced Reinforcement Learning for Large Reasoning Models via James-Stein Baselines

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Large Reasoning Model, Reinforcement Learning, Variance Reduction
TL;DR: By applying James-Stein shrinkage to the baseline, we reduce the variance of the policy gradient and improve reinforcement learning for large reasoning models.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is becoming an impactful paradigm for large reasoning model (LRM) post-training. To stabilize training, control variates (baselines) are commonly introduced, canonically chosen to approximate the value function. Popular approaches such as RLOO and GRPO estimate baselines with per-prompt empirical averages of generated responses, which can exhibit high variance under limited rollout budgets. Recognizing that value functions must be estimated simultaneously across all prompts in a batch, we propose a James-Stein estimator as the baseline. This approach leverages statistical shrinkage to reduce the mean squared error of the overall value function estimate, incurring no additional computational overhead while preserving the unbiasedness of the policy gradient estimator. We provide theoretical justification for James-Stein baselines and validate them empirically. Across diverse models, tasks, and rollout budgets, our approach consistently outperforms existing baselines, demonstrating robust variance reduction and improved training stability.
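The page does not include code, so the following is a minimal sketch of the general idea: replacing per-prompt empirical-mean baselines (as in RLOO/GRPO) with a positive-part James-Stein estimator that shrinks each prompt's mean reward toward the batch grand mean. The function name, array shapes, and shared-variance assumption are illustrative choices, not the authors' implementation; the paper's exact formulation may differ.

```python
import numpy as np

def james_stein_baselines(rewards: np.ndarray) -> np.ndarray:
    """Shrink per-prompt mean rewards toward the batch grand mean.

    rewards: array of shape (num_prompts, num_rollouts), where
    rewards[i, j] is the verifiable reward of rollout j for prompt i.
    Returns one baseline value per prompt.
    """
    num_prompts, num_rollouts = rewards.shape
    # Classical James-Stein shrinkage needs more than three coordinates.
    assert num_prompts >= 4, "need at least 4 prompts per batch"

    # Naive per-prompt baselines: empirical means over rollouts.
    prompt_means = rewards.mean(axis=1)
    grand_mean = prompt_means.mean()

    # Estimate the (assumed shared) variance of each per-prompt mean
    # from the within-prompt sample variance of the rollout rewards.
    mean_var = rewards.var(axis=1, ddof=1).mean() / num_rollouts

    # Positive-part James-Stein shrinkage toward the grand mean.
    deviations = prompt_means - grand_mean
    sum_sq = np.sum(deviations ** 2)
    shrink = max(0.0, 1.0 - (num_prompts - 3) * mean_var / max(sum_sq, 1e-12))
    return grand_mean + shrink * deviations
```

Under these assumptions, the shrunken baselines could then be subtracted from each rollout's reward in place of the per-prompt empirical mean when forming the policy gradient advantages; since the baseline for a prompt remains independent of (or can be made leave-one-out with respect to) that prompt's rollouts, the gradient estimator stays unbiased.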
Primary Area: foundation or frontier models, including LLMs
Submission Number: 12574