Efficient and Stable Scaling of Reinforcement Learning for LLMs via Dynamic Allocation and Gradient Modulation
Keywords: Large Language Models, Reinforcement Learning, Gradient Stabilization
Abstract: Post-training Large Language Models (LLMs) via Reinforcement Learning with Verifiable Rewards (RLVR) is a compute-intensive process in which efficiency and stability are paramount for scaling. Current methods suffer from two problems: suboptimal resource allocation, since rollout budgets are distributed uniformly regardless of problem difficulty, and token-level optimization instability caused by the softmax policy structure. We propose \textbf{DynaMO}, a dual-pronged framework designed to scale RLVR effectively. At the sequence level, we introduce a variance-minimizing dynamic rollout allocation that concentrates compute on highly informative problems. At the token level, we develop gradient-aware advantage modulation that compensates for gradient attenuation in high-confidence actions while stabilizing excessive updates. Experiments on Qwen2.5-Math (1.5B, 7B) and Qwen3 (14B) across six benchmarks demonstrate that DynaMO significantly improves performance and training stability, offering a scalable pathway for reasoning optimization.
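The abstract does not spell out the allocation rule, so the following is a minimal Python sketch of one plausible variance-driven scheme: under a fixed rollout budget, sample counts are weighted by the Bernoulli reward variance p(1-p) of each problem's empirical pass rate p, so nearly-always-solved and nearly-never-solved problems receive fewer rollouts. The function name `allocate_rollouts`, the per-problem floor `n_min`, and the proportional rule itself are assumptions for illustration, not the authors' algorithm.

```python
import numpy as np

def allocate_rollouts(pass_rates, total_budget, n_min=2):
    # Hypothetical allocator (not the paper's): weight each problem by the
    # Bernoulli reward variance p * (1 - p), which peaks at p = 0.5 and
    # vanishes for problems the policy always or never solves.
    p = np.asarray(pass_rates, dtype=float)
    var = p * (1.0 - p)
    # Fall back to uniform weights when every problem is fully solved/unsolved.
    w = var / var.sum() if var.sum() > 0 else np.full(len(p), 1.0 / len(p))
    # Guarantee a small floor of rollouts per problem, then spend the
    # remaining budget in proportion to the variance weights.
    alloc = np.full(len(p), n_min, dtype=int)
    remaining = total_budget - alloc.sum()  # assumes budget >= n_min * len(p)
    alloc += np.floor(w * remaining).astype(int)
    # Hand leftover rollouts (from flooring) to the highest-variance problems.
    for i in np.argsort(-w)[: total_budget - alloc.sum()]:
        alloc[i] += 1
    return alloc
```

For example, `allocate_rollouts([0.1, 0.5, 0.95], total_budget=32)` returns `[8, 19, 5]`, concentrating samples on the maximally uncertain p = 0.5 problem.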
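Likewise, the exact modulation rule is not given here; the sketch below only illustrates the softmax effect the abstract refers to, with one hedged compensation. For a softmax policy, the gradient of log pi(a) with respect to the sampled token's own logit is 1 - pi(a), so high-confidence tokens (pi near 1) contribute vanishing gradients; rescaling the advantage by 1/(1 - pi + eps) counteracts that attenuation, and the clamp bounds the multiplier so low-confidence tokens do not trigger excessive updates. The names `modulated_advantage`, `eps`, and `clip` are illustrative.

```python
import torch

def modulated_advantage(advantages: torch.Tensor,
                        token_probs: torch.Tensor,
                        eps: float = 1e-3,
                        clip: float = 5.0) -> torch.Tensor:
    # For a softmax policy, d log pi(a) / d logit(a) = 1 - pi(a): the
    # policy-gradient signal on a token is attenuated as its probability
    # approaches 1. Dividing the advantage by (1 - pi + eps) compensates.
    scale = 1.0 / (1.0 - token_probs + eps)
    # Clamp the multiplier so rare (low-probability) tokens, whose raw
    # gradients are already large, do not produce destabilizing updates.
    return advantages * torch.clamp(scale, max=clip)
```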
Submission Number: 102