Keywords: Scaling, LLMs, Reasoning
TL;DR: We study compute scaling properties of RL methods on LLMs
Abstract: Reinforcement learning (RL) has become central to training large language models (LLMs), yet the field lacks predictive scaling methodologies comparable to those established for pre-training.
Despite rapidly rising compute budgets, there is no principled understanding of
how to evaluate algorithmic improvements for scaling RL compute.
We present the first large-scale systematic study, amounting to more than 400,000 GPU-hours, that defines a principled framework for analyzing and predicting RL scaling in LLMs.
We fit sigmoidal compute-performance curves for RL training and ablate a wide range of common design choices to analyze their effects on asymptotic performance and compute efficiency. We observe:
(1) Not all recipes yield similar asymptotic performance, (2) Details such as loss aggregation, normalization, curriculum, and off-policy algorithm primarily modulate compute efficiency without materially shifting the asymptote, and (3) Stable, scalable recipes follow predictable scaling trajectories, enabling extrapolation from smaller-scale runs.
Combining these insights, we propose a _best-practice_ recipe, ScaleRL, and demonstrate its effectiveness by scaling a single RL run up to 100,000 GPU-hours while successfully predicting its validation performance.
Our work provides both a _scientific framework_ for analyzing scaling in RL and a practical recipe that brings RL training closer to the predictability long achieved in pre-training.
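For concreteness, here is a minimal sketch of fitting a saturating sigmoidal compute-performance curve of the kind described above and extrapolating from small-scale runs. The specific functional form, parameter names (`A`, `B`, `C_mid`), and data below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid_perf(compute, A, B, C_mid):
    # Assumed saturating sigmoid in compute:
    #   A     -> asymptotic performance as compute grows without bound
    #   B     -> compute-efficiency exponent (steepness of the rise)
    #   C_mid -> compute at which performance reaches A / 2
    return A / (1.0 + (C_mid / compute) ** B)

# Synthetic stand-in for (GPU-hours, validation pass rate) measurements.
compute = np.array([1e2, 3e2, 1e3, 3e3, 1e4, 3e4, 1e5])
perf = np.array([0.08, 0.15, 0.28, 0.42, 0.52, 0.58, 0.61])

# Fit the curve on smaller-scale runs, then extrapolate to a larger budget.
popt, _ = curve_fit(sigmoid_perf, compute, perf, p0=[0.7, 0.5, 2e3])
A_hat, B_hat, C_hat = popt
print(f"asymptote={A_hat:.3f}, efficiency={B_hat:.3f}, midpoint={C_hat:.0f} GPU-hours")
print(f"predicted performance at 4e5 GPU-hours: {sigmoid_perf(4e5, *popt):.3f}")
```

Under this parameterization, recipe comparisons separate cleanly: changes to the asymptote show up in `A`, while efficiency-only changes shift `B` and `C_mid`.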
Primary Area: foundation or frontier models, including LLMs
Submission Number: 7941