Keywords: Post-training, Large language models
TL;DR: We propose CoDaPO, a method that focuses on correctness-based rewards and adaptive exploration to stabilize training and improve performance.
Abstract: Large language models (LLMs) increasingly rely on reinforcement learning (RL) post-training to improve step-by-step reasoning. In this setting, Group Relative Policy Optimization (GRPO) has emerged as a prevailing approach because it avoids the need for fully supervised reasoning traces. However, GRPO can struggle with high-difficulty tasks, overfit to easy problems, and suffer from sensitivity to reward design. To diagnose these weaknesses, we introduce a general analysis framework that maps training trajectories onto an advantage-confidence plane, revealing three critical phenomena: (1) advantage contraction: reward-normalized advantages collapse as accuracy improves; (2) confidence saturation: policies become overconfident even on incorrect outputs; and (3) hierarchical convergence: easy problems are quickly mastered while harder ones lag. Based on these insights, we propose CoDaPO (Confidence- and Difficulty-Adaptive Policy Optimization), an RL algorithm that uses correctness-based rewards and reweights advantages with respect to confidence and difficulty. Experiments on several benchmarks demonstrate that CoDaPO achieves higher reasoning accuracy and better generalization than existing RL approaches.
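To make the abstract's description concrete, the following is a minimal Python sketch, not the paper's exact formulation: it computes GRPO-style group-normalized advantages from binary correctness rewards, then applies a hypothetical reweighting that tempers confidence saturation and up-weights harder (low-accuracy) problems. The function names, the confidence estimate, and the `alpha`/`beta` exponents are illustrative assumptions, not details taken from the submission.

```python
# Minimal sketch (not CoDaPO's exact formulation): GRPO-style group-normalized
# advantages from correctness-based rewards, followed by a hypothetical
# confidence/difficulty reweighting in the spirit of the abstract.
import numpy as np

def group_advantages(correct: np.ndarray) -> np.ndarray:
    """GRPO-style advantage: normalize binary correctness rewards within one group."""
    rewards = correct.astype(float)            # correctness-based reward: 1 if correct, else 0
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def reweight(adv: np.ndarray, confidence: np.ndarray, group_accuracy: float,
             alpha: float = 1.0, beta: float = 1.0) -> np.ndarray:
    """Hypothetical reweighting: down-weight overconfident samples (confidence
    saturation) and up-weight low-accuracy problems (hierarchical convergence)."""
    difficulty = 1.0 - group_accuracy          # harder problems -> larger weight
    conf_weight = (1.0 - confidence) ** alpha  # shrink updates for overconfident outputs
    diff_weight = (difficulty + 1e-6) ** beta  # emphasize problems the policy still fails
    return adv * conf_weight * diff_weight

# Example: a group of 4 sampled responses to one prompt.
correct = np.array([1, 0, 1, 1])                 # verifier-judged correctness per response
confidence = np.array([0.95, 0.90, 0.60, 0.70])  # e.g., mean per-token probability of each response
adv = group_advantages(correct)
adv = reweight(adv, confidence, group_accuracy=correct.mean())
print(adv)
```

The reweighted advantages would then replace the standard group-normalized advantages in the policy-gradient update; this sketch only illustrates the weighting idea, not the full training loop.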
Submission Number: 97