Efficient and Stable Scaling of Reinforcement Learning for LLMs via Dynamic Allocation and Gradient Modulation
Keywords: Large Language Models, Reinforcement Learning, Gradient Stabilization
Abstract: Post-training Large Language Models (LLMs) via Reinforcement Learning with Verifiable Rewards (RLVR) is a compute-intensive process in which efficiency and stability are paramount for scaling. Current methods suffer from two problems: suboptimal resource allocation, since rollout budgets are distributed uniformly regardless of problem difficulty, and token-level optimization instability caused by the softmax policy structure. We propose \textbf{DynaMO}, a dual-pronged framework designed to scale RLVR effectively. At the sequence level, we introduce a variance-minimizing dynamic rollout allocation that concentrates compute on highly informative problems. At the token level, we develop gradient-aware advantage modulation that compensates for gradient attenuation in high-confidence actions while stabilizing excessive updates. Experiments on Qwen2.5-Math (1.5B, 7B) and Qwen3 (14B) across six benchmarks demonstrate that DynaMO significantly improves performance and training stability, offering a scalable pathway for reasoning optimization.
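The abstract does not spell out the allocation rule, so the following is a minimal Python sketch of one plausible variance-driven scheme: under a fixed rollout budget, sample counts are weighted by the Bernoulli reward variance p(1-p) of each problem's empirical pass rate p, so nearly-always-solved and nearly-never-solved problems receive fewer rollouts. The function name `allocate_rollouts`, the per-problem floor `n_min`, and the proportional rule itself are assumptions for illustration, not the authors' algorithm.

```python
import numpy as np

def allocate_rollouts(pass_rates, total_budget, n_min=2):
    # Hypothetical allocator (not the paper's): weight each problem by the
    # Bernoulli reward variance p * (1 - p), which peaks at p = 0.5 and
    # vanishes for problems the policy always or never solves.
    p = np.asarray(pass_rates, dtype=float)
    var = p * (1.0 - p)
    # Fall back to uniform weights when every problem is fully solved/unsolved.
    w = var / var.sum() if var.sum() > 0 else np.full(len(p), 1.0 / len(p))
    # Guarantee a small floor of rollouts per problem, then spend the
    # remaining budget in proportion to the variance weights.
    alloc = np.full(len(p), n_min, dtype=int)
    remaining = total_budget - alloc.sum()  # assumes budget >= n_min * len(p)
    alloc += np.floor(w * remaining).astype(int)
    # Hand leftover rollouts (from flooring) to the highest-variance problems.
    for i in np.argsort(-w)[: total_budget - alloc.sum()]:
        alloc[i] += 1
    return alloc
```

For example, `allocate_rollouts([0.1, 0.5, 0.95], total_budget=32)` returns `[8, 19, 5]`, concentrating samples on the maximally uncertain p = 0.5 problem.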
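Likewise, the exact modulation rule is not given here; the sketch below only illustrates the softmax effect the abstract refers to, with one hedged compensation. For a softmax policy, the gradient of log pi(a) with respect to the sampled token's own logit is 1 - pi(a), so high-confidence tokens (pi near 1) contribute vanishing gradients; rescaling the advantage by 1/(1 - pi + eps) counteracts that attenuation, and the clamp bounds the multiplier so low-confidence tokens do not trigger excessive updates. The names `modulated_advantage`, `eps`, and `clip` are illustrative.

```python
import torch

def modulated_advantage(advantages: torch.Tensor,
                        token_probs: torch.Tensor,
                        eps: float = 1e-3,
                        clip: float = 5.0) -> torch.Tensor:
    # For a softmax policy, d log pi(a) / d logit(a) = 1 - pi(a): the
    # policy-gradient signal on a token is attenuated as its probability
    # approaches 1. Dividing the advantage by (1 - pi + eps) compensates.
    scale = 1.0 / (1.0 - token_probs + eps)
    # Clamp the multiplier so rare (low-probability) tokens, whose raw
    # gradients are already large, do not produce destabilizing updates.
    return advantages * torch.clamp(scale, max=clip)
```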
Submission Number: 102