DCPO: Dynamic Clipping Policy Optimization

03 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Reinforcement Learning from Verifiable Rewards, large language models, Reinforcement Learning
TL;DR: DCPO (Dynamic Clipping Policy Optimization) solves RL‑LLM zero‑gradient issues by using token‑wise adaptive clipping and smooth cumulative advantage standardization, boosting response utilization and achieving SOTA results on benchmarks.
Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a promising framework for enhancing the reasoning capabilities of large language models. However, existing approaches such as GRPO often suffer from zero gradients, mainly because of (i) fixed clipping bounds for token-level probability ratios and (ii) the standardization of identical rewards, both of which can lead to ineffective gradient updates and underutilization of generated responses. In this work, we propose **D**ynamic **C**lipping **P**olicy **O**ptimization (**DCPO**). DCPO (i) introduces a dynamic clipping strategy that adaptively adjusts clipping bounds based on token-specific prior probabilities to enhance token-level exploration, and (ii) employs a smooth advantage standardization technique that standardizes rewards across cumulative training steps to improve the response-level utilization of generated responses. DCPO achieves state-of-the-art performance on four benchmarks across four different models. Specifically, on the AIME24 benchmark with Qwen2.5-Math-7B, DCPO reaches an Avg@1 of 46.7 (greedy decoding) and an Avg@32 of 38.8 (32-sample decoding), surpassing DAPO (36.7/31.6), GRPO (36.7/32.1), and GSPO (40.0/34.9). On the AIME25 benchmark with Qwen2.5-14B, DCPO achieves 23.3/19.0, surpassing GRPO (13.3/10.5), DAPO (20.0/15.3), and GSPO (16.7/9.9). Furthermore, DCPO achieves an average 28% improvement in the nonzero-advantage ratio over GRPO across the four models, doubles the training efficiency relative to DAPO, and reduces the token clipping ratio by an order of magnitude compared to both GRPO and DAPO, while achieving superior performance. These results demonstrate DCPO's effectiveness in leveraging generated data more efficiently for reinforcement learning in large language models.
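The two mechanisms in the abstract can be sketched in a few lines. This is an illustrative approximation only, not DCPO's actual formulas: the bound schedule `dynamic_clip_bounds` and the Welford-based running standardizer are hypothetical stand-ins that merely show (i) token-wise clipping bounds that widen for low-prior-probability tokens and (ii) reward standardization against statistics accumulated over training steps, which keeps the advantage nonzero even when all rewards in a single group are identical (the case where per-batch standardization in GRPO yields zero gradients).

```python
import math

def dynamic_clip_bounds(prior_prob, base_eps=0.2):
    # Hypothetical schedule: widen the PPO-style clipping range for
    # low-probability tokens (more exploration) and keep it tighter
    # for high-probability tokens. DCPO's exact formula differs.
    scale = 1.0 + (1.0 - prior_prob)      # scale in [1, 2]
    eps = base_eps * scale
    return 1.0 - eps, 1.0 + eps

class CumulativeStandardizer:
    """Sketch of smooth advantage standardization: normalize rewards
    against mean/std accumulated over all training steps so far
    (Welford's online algorithm), rather than within one batch."""
    def __init__(self, eps=1e-8):
        self.n, self.mean, self.m2, self.eps = 0, 0.0, 0.0, eps

    def update(self, rewards):
        for r in rewards:
            self.n += 1
            d = r - self.mean
            self.mean += d / self.n
            self.m2 += d * (r - self.mean)

    def standardize(self, rewards):
        std = math.sqrt(self.m2 / max(self.n - 1, 1)) + self.eps
        return [(r - self.mean) / std for r in rewards]

# A group of identical rewards: batch-wise standardization would give
# all-zero advantages, but cumulative statistics do not.
stats = CumulativeStandardizer()
stats.update([1.0, 0.0, 1.0, 0.0])        # rewards from earlier steps
advantages = stats.standardize([1.0, 1.0])  # identical-reward group
```

With cumulative statistics (mean 0.5 here), the identical rewards of 1.0 still map to positive advantages, so the group contributes a gradient; under per-group standardization their advantage would be exactly zero.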
Primary Area: reinforcement learning
Submission Number: 1505