Keywords: Scaling, LLMs, Reasoning
TL;DR: We study compute scaling properties of RL methods on LLMs
Abstract: Reinforcement learning (RL) has become central to training large language models (LLMs), yet the field lacks predictive scaling methodologies comparable to those established for pre-training.
Despite rapidly rising compute budgets, there is no principled understanding of
how to evaluate algorithmic improvements for scaling RL compute.
We present the first large-scale systematic study, amounting to more than 400,000 GPU-hours, that defines a principled framework for analyzing and predicting RL scaling in LLMs.
We fit sigmoidal compute-performance curves for RL training and ablate a wide range of common design choices to analyze their effects on asymptotic performance and compute efficiency. We observe:
(1) Not all recipes yield similar asymptotic performance, (2) Details such as loss aggregation, normalization, curriculum, and off-policy algorithm primarily modulate compute efficiency without materially shifting the asymptote, and (3) Stable, scalable recipes follow predictable scaling trajectories, enabling extrapolation from smaller-scale runs.
Combining these insights, we propose a _best-practice_ recipe, ScaleRL, and demonstrate its effectiveness by scaling a single RL run up to 100,000 GPU-hours while successfully predicting its validation performance.
Our work provides both a _scientific framework_ for analyzing scaling in RL and a practical recipe that brings RL training closer to the predictability long achieved in pre-training.
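For concreteness, here is a minimal sketch of fitting a saturating sigmoidal compute-performance curve of the kind described above and extrapolating from small-scale runs. The specific functional form, parameter names (`A`, `B`, `C_mid`), and data below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid_perf(compute, A, B, C_mid):
    # Assumed saturating sigmoid in compute:
    #   A     -> asymptotic performance as compute grows without bound
    #   B     -> compute-efficiency exponent (steepness of the rise)
    #   C_mid -> compute at which performance reaches A / 2
    return A / (1.0 + (C_mid / compute) ** B)

# Synthetic stand-in for (GPU-hours, validation pass rate) measurements.
compute = np.array([1e2, 3e2, 1e3, 3e3, 1e4, 3e4, 1e5])
perf = np.array([0.08, 0.15, 0.28, 0.42, 0.52, 0.58, 0.61])

# Fit the curve on smaller-scale runs, then extrapolate to a larger budget.
popt, _ = curve_fit(sigmoid_perf, compute, perf, p0=[0.7, 0.5, 2e3])
A_hat, B_hat, C_hat = popt
print(f"asymptote={A_hat:.3f}, efficiency={B_hat:.3f}, midpoint={C_hat:.0f} GPU-hours")
print(f"predicted performance at 4e5 GPU-hours: {sigmoid_perf(4e5, *popt):.3f}")
```

Under this parameterization, recipe comparisons separate cleanly: changes to the asymptote show up in `A`, while efficiency-only changes shift `B` and `C_mid`.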
Primary Area: foundation or frontier models, including LLMs
Submission Number: 7941