Keywords: Large Language Models, LLM Reasoning, Reinforcement Learning with Verifiable Rewards
TL;DR: DGPO stabilizes RLVR training and promotes exploration by utilizing probability gradients and a decoupled decay mechanism instead of traditional hard clipping.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed a leap in Large Language Model (LLM) reasoning, yet its optimization dynamics remain fragile. Standard algorithms such as GRPO enforce stability via "hard clipping", which inadvertently stifles exploration by discarding the gradients of tokens outside the trust region. Recent "soft clipping" methods attempt to recover these gradients, but they face a critical challenge: relying on the *log-probability gradient* ($\nabla_\theta \log \pi_\theta$) yields divergent weights as probabilities vanish, which destabilizes LLM training. We rethink this convention by establishing the *probability gradient* ($\nabla_\theta \pi_\theta$) as the superior optimization primitive. Accordingly, we propose **D**ecoupled **G**radient **P**olicy **O**ptimization (DGPO), which employs a decoupled decay mechanism based on importance sampling ratios. By applying asymmetric, continuous decay to boundary tokens, DGPO resolves the conflict between stability and sustained exploration. Extensive experiments across the DeepSeek-R1-Distill-Qwen model series (1.5B/7B/14B) demonstrate that DGPO consistently outperforms strong baselines on various mathematical benchmarks, offering a robust solution for RLVR.
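The two failure modes named in the abstract can be illustrated with a small sketch. It relies only on the identity $\nabla_\theta \log \pi_\theta = \nabla_\theta \pi_\theta / \pi_\theta$ and on the standard PPO/GRPO-style clipped surrogate $\min(rA, \mathrm{clip}(r, 1-\epsilon, 1+\epsilon)A)$. The function names and the value `eps=0.2` are illustrative assumptions, not taken from the paper, and the sketch does not implement DGPO's decay mechanism, which the abstract does not specify.

```python
def log_prob_grad_weight(p):
    """Factor that scales grad(pi) inside grad(log pi): d log(p) / dp = 1 / p."""
    return 1.0 / p


def hard_clip_grad_mask(ratio, advantage, eps=0.2):
    """Return 1.0 if the clipped surrogate min(r*A, clip(r, 1-eps, 1+eps)*A)
    still passes a gradient for this token, else 0.0.

    eps=0.2 is an assumed, commonly used clip range (not from the paper).
    """
    # Positive advantage: the surrogate is constant once r exceeds 1 + eps.
    if advantage > 0 and ratio > 1.0 + eps:
        return 0.0
    # Negative advantage: the surrogate is constant once r drops below 1 - eps.
    if advantage < 0 and ratio < 1.0 - eps:
        return 0.0
    return 1.0


# The 1/p weight grows without bound as the token probability vanishes.
weights = [log_prob_grad_weight(p) for p in (1e-1, 1e-3, 1e-6)]

# With a positive advantage, the token at ratio 1.5 lies outside the trust
# region and its gradient is discarded entirely.
masks = [hard_clip_grad_mask(r, 1.0) for r in (0.9, 1.1, 1.5)]

print(weights)
print(masks)
```

The mask is a step function of the importance sampling ratio, which is the behavior that "hard clipping" refers to; a continuous decay would replace this step with a smooth function of the ratio.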
Submission Number: 108