Keywords: LLMs, post-training
TL;DR: We propose a novel rule-based RL algorithm that addresses the training instability of existing RL methods.
Abstract: Recent advances in rule-based reinforcement learning (RL), which rely on rule-based rewards, have significantly improved the reasoning capabilities of language models (LMs). However, existing RL methods---such as GRPO, REINFORCE++, and RLOO---often suffer from training instability, where large policy updates and improper clipping can lead to training collapse. To address this issue, we propose Clipped Policy Gradient Optimization with Policy Drift (CPGD), a novel algorithm designed to stabilize policy learning in LMs. CPGD introduces a policy drift constraint based on KL divergence to dynamically regularize policy updates, and leverages a clipping mechanism on the logarithm of the policy ratio to prevent excessive updates. We provide theoretical justification for CPGD and demonstrate through empirical analysis that it mitigates the instability observed in prior approaches. Furthermore, we show that CPGD significantly improves performance while maintaining training stability. Our implementation balances theoretical rigor with practical usability, offering a robust alternative for RL in the post-training of LMs.
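The sketch below illustrates the two mechanisms named in the abstract: clipping the logarithm of the importance ratio (rather than the ratio itself) and adding a KL-based policy drift penalty. It is a minimal, assumption-based illustration, not the authors' reference implementation; the function name `cpgd_loss`, the hyperparameters `clip_eps` and `drift_coef`, and the use of the standard k3 estimator for the KL term are all choices made here for concreteness.

```python
import torch

def cpgd_loss(logp_new, logp_old, advantages, clip_eps=0.2, drift_coef=0.1):
    """Illustrative CPGD-style objective (hypothetical sketch, not the paper's code).

    logp_new:   log-probabilities of sampled tokens under the current policy
    logp_old:   log-probabilities of the same tokens under the behavior (old) policy
    advantages: per-token or per-sequence advantage estimates (e.g., rule-based rewards)
    """
    # Clip the *logarithm* of the importance ratio, which bounds the update
    # even when the raw ratio pi_new / pi_old becomes very large or very small.
    log_ratio = logp_new - logp_old
    clipped_log_ratio = torch.clamp(log_ratio, -clip_eps, clip_eps)

    # Policy-gradient surrogate built on the clipped log-ratio.
    pg_term = clipped_log_ratio * advantages

    # Policy drift: a KL(pi_old || pi_new) penalty estimated with the k3 estimator
    # exp(x) - 1 - x (with x = log_ratio), which is unbiased and non-negative.
    drift = torch.exp(log_ratio) - 1.0 - log_ratio

    # Maximize the surrogate while penalizing drift away from the old policy.
    return -(pg_term - drift_coef * drift).mean()
```

Under these assumptions, the drift term plays the role of the dynamic regularizer described in the abstract: as the new policy moves away from the behavior policy, the penalty grows and counteracts overly large updates.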
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 12164