CoDaPO: Confidence and Difficulty-Adaptive Policy Optimization for LLM Reasoning
Keywords: Reinforcement learning, Large language models, Reasoning
Abstract: RL with verifiable rewards can substantially improve LLM reasoning, yet standard GRPO-style training often uses uniform sampling and near-uniform weighting, leading to inefficient computation allocation. We study GRPO by tracking token log-probabilities, group-normalized advantages, and induced token-level update weights. This reveals three recurring dynamics: probability inflation, advantage contraction as accuracy rises, and hierarchical convergence, where easy questions quickly saturate while hard questions remain discovery-limited due to rare correct rollouts. These findings imply that the benefit of each update depends strongly on both question difficulty and the model’s current competence. Motivated by this, we propose Confidence and Difficulty-adaptive Policy Optimization (CoDaPO), which assigns each question a bounded value from rollout confidence and empirical difficulty, then uses it to reweight policy updates and resample high-value questions within minibatches to increase discovery under a fixed compute budget. Across seven benchmarks, CoDaPO consistently improves accuracy over other RL methods.
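The abstract's core mechanism, assigning each question a bounded value from rollout confidence and empirical difficulty, then reweighting and resampling, can be sketched as follows. This is a hypothetical illustration: the function names, the mean-probability confidence measure, the linear mixing coefficient `alpha`, and proportional resampling are all assumptions, not the paper's actual formulation.

```python
import numpy as np

def question_value(rollout_logprobs, num_correct, num_rollouts, alpha=0.5):
    """Hypothetical per-question value in the spirit of CoDaPO.

    Combines rollout confidence (mean token probability) with empirical
    difficulty (1 - accuracy). The combination rule and alpha are
    assumptions for illustration; the result is bounded to [0, 1].
    """
    confidence = float(np.mean(np.exp(rollout_logprobs)))   # in (0, 1]
    difficulty = 1.0 - num_correct / num_rollouts           # in [0, 1]
    value = alpha * confidence + (1.0 - alpha) * difficulty
    return min(max(value, 0.0), 1.0)

def resample_minibatch(values, batch_size, rng=None):
    """Sample question indices in proportion to their values, so
    high-value questions receive more rollouts under a fixed budget."""
    rng = rng or np.random.default_rng(0)
    p = np.asarray(values, dtype=float)
    p = p / p.sum()
    return rng.choice(len(values), size=batch_size, p=p)
```

The same per-question value could also serve as a multiplicative weight on each question's policy-gradient term, which is how the abstract's "reweight policy updates" step would plug in.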
Submission Number: 93