Keywords: Large language models, Post-training
TL;DR: We propose CoDaPO, a confidence- and difficulty-adaptive policy optimization framework for language models
Abstract: Reinforcement learning (RL) post-training strengthens reasoning in large language models (LLMs), yet the prevailing GRPO algorithm exhibits persistent issues. Using a PRAG lens (Probability, Reward, Advantage, Gradient), we diagnose three mechanisms: (i) _probability inflation_, where clipping induces one-way confidence drift with weak KL correction, collapsing entropy; (ii) _advantage contraction_, where group normalization dulls update signals as accuracy rises; and (iii) _hierarchical convergence_, where easy questions improve quickly while hard ones advance slowly via rare discoveries. We then introduce _CoDaPO_, a confidence- and difficulty-adaptive policy optimization framework that rescales per-trajectory advantages by confidence (curbing overconfidence and drift) and difficulty (sustaining learning on hard questions). Across seven mathematical reasoning benchmarks, CoDaPO delivers consistent improvements for small and mid-scale models.
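The abstract describes rescaling GRPO's group-normalized, per-trajectory advantages by confidence and difficulty, but does not give the exact scaling functions. The sketch below is a minimal illustration of that idea, assuming sequence log-probability as the confidence proxy and the group failure rate as the difficulty proxy; the names `confidence_weight` and `difficulty_weight` and their functional forms are illustrative assumptions, not CoDaPO's actual definitions.

```python
# Hedged sketch of confidence- and difficulty-adaptive advantage rescaling.
# The exact CoDaPO scaling functions are not specified in the abstract; the
# weighting forms below are assumptions chosen only to make the idea concrete.
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Standard GRPO group-normalized advantages for one question's G rollouts."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def codapo_advantages(rewards: np.ndarray, seq_logprobs: np.ndarray) -> np.ndarray:
    """Rescale per-trajectory advantages by confidence and difficulty (illustrative).

    rewards:      shape (G,), scalar rewards for G sampled trajectories.
    seq_logprobs: shape (G,), mean per-token log-probability of each trajectory
                  under the current policy (used here as a confidence proxy).
    """
    adv = grpo_advantages(rewards)

    # Confidence term: damp updates on trajectories the model already assigns
    # high probability to, curbing overconfidence and one-way probability drift.
    confidence = np.exp(seq_logprobs)        # in (0, 1]
    confidence_weight = 1.0 - confidence     # assumption: linear damping

    # Difficulty term: the group's failure rate as a difficulty proxy, keeping
    # the update signal alive on hard questions where few rollouts succeed.
    difficulty = 1.0 - rewards.mean()        # assumption: 1 - pass rate
    difficulty_weight = 0.5 + difficulty     # assumption: affine boost

    return adv * confidence_weight * difficulty_weight
```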
Submission Number: 212