Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study in Mathematical Reasoning
Keywords: large language model, neural scaling law, reinforcement learning, LLM reasoning
Abstract: While scaling laws for large language models (LLMs) during pre-training have been extensively studied, their behavior under reinforcement learning (RL) post-training remains largely unexplored. This paper presents a systematic empirical investigation of scaling behaviors in RL-based post-training, with a particular focus on mathematical reasoning. Based on a set of experiments across the full Qwen2.5 dense model series (0.5B to 72B), we characterize how model scale, data volume, and computational budget interact to shape performance. Our analysis leads to four key findings: (1) Under a fixed computational budget, larger models trained for fewer steps consistently outperform smaller models trained for more steps. (2) Given a fixed amount of training data, larger models achieve superior sample efficiency, yielding lower loss. (3) In data-constrained regimes, repeated reuse of high-quality data proves highly effective, as final performance is primarily governed by the total number of optimization steps rather than the uniqueness of samples. (4) These scaling behaviors are robust across both base and instruction-tuned models, which share similar learning dynamics (e.g., larger models show faster convergence) even while differing in absolute accuracy. We further show that the relationship between test loss, compute, and data can be modeled by a predictive power law with an analytic learning-efficiency term k(N) that demonstrates an efficiency saturation effect as model size increases. Collectively, these results provide a principled foundation and practical guidelines for efficiently scaling the reasoning capabilities of LLMs through RL post-training.
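The abstract does not specify the functional form of the power law or of k(N), so the sketch below is purely illustrative: it assumes a hypothetical saturating efficiency term k(N) = k_max · N / (N + N_half) and a loss curve L = L_inf + A / (k(N) · C)^alpha, and fits these parameters to synthetic data with scipy.optimize.curve_fit to show what fitting a power law with an efficiency-saturation term of this kind might look like.

```python
# Minimal illustrative sketch only. The abstract does not give the exact functional
# form, so everything below (the saturating k(N), the loss parameterization, and the
# synthetic data) is a hypothetical stand-in, not the paper's actual fit.
import numpy as np
from scipy.optimize import curve_fit

def k_of_N(N, k_max, N_half):
    # Hypothetical saturating learning-efficiency term: k(N) -> k_max as N grows.
    return k_max * N / (N + N_half)

def loss_model(X, L_inf, A, alpha, k_max, N_half):
    # Power-law test loss in (model size N, RL post-training compute C).
    N, C = X
    return L_inf + A / (k_of_N(N, k_max, N_half) * C) ** alpha

# Synthetic (N, C, loss) points standing in for measured RL post-training curves.
rng = np.random.default_rng(0)
N = rng.choice([0.5e9, 3e9, 7e9, 32e9, 72e9], size=200)   # model sizes (parameters)
C = 10 ** rng.uniform(18, 22, size=200)                   # compute (FLOPs)
true = dict(L_inf=0.2, A=50.0, alpha=0.15, k_max=1.0, N_half=5e9)
loss = loss_model((N, C), **true) + rng.normal(0.0, 0.01, size=200)

# Fit the five parameters; positive bounds keep the power law well defined.
popt, _ = curve_fit(loss_model, (N, C), loss,
                    p0=[0.1, 10.0, 0.2, 0.5, 1e9],
                    bounds=(1e-8, np.inf), maxfev=20000)
print(dict(zip(["L_inf", "A", "alpha", "k_max", "N_half"], popt)))
```

In a fit of this shape, efficiency saturation shows up as k(N) flattening toward k_max once N greatly exceeds N_half, so further increases in model size yield diminishing gains in effective compute.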
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 3479