Keywords: Reinforcement Learning, Group Centering, Unbiased and Consistent Estimates, Theoretical Convergence Guarantees
Abstract: Large language models (LLMs) perform strongly across diverse tasks but require post-training alignment, in which reinforcement learning plays a key role. Existing methods such as proximal policy optimization (PPO) and direct preference optimization (DPO) suffer from limitations such as high computational overhead and overfitting. Group relative policy optimization (GRPO) addresses some of these issues, but its weighted negative log-likelihood objective lacks theoretical convergence guarantees. Mirror descent policy optimization (MDPO) is more stable but requires computationally expensive partition function estimation. To overcome these challenges, we introduce centered mirror descent policy optimization (CMDPO), a policy optimization framework that uses group centering to eliminate the need for explicit partition function estimation. CMDPO yields unbiased and consistent estimates and comes with theoretical convergence guarantees. We additionally provide two optional, lightweight utilities for improved stability: dynamic reward weighting, which balances heterogeneous rewards, and token-level discriminative learning, which reduces shared-segment dominance. Experiments on multiple benchmark datasets demonstrate the effectiveness and robustness of CMDPO, and our theoretical analysis supports it as a promising approach for LLM post-training. The code is available at https://anonymous.4open.science/r/CMDPO-0C26.
Primary Area: reinforcement learning
Submission Number: 8964