CoDaPO: Confidence and Difficulty-Adaptive Policy Optimization for LLM Reasoning
Keywords: Reinforcement learning, Large language models, Reasoning
Abstract: RL with verifiable rewards can substantially improve LLM reasoning, yet standard GRPO-style training often uses uniform sampling and near-uniform weighting, leading to inefficient computation allocation. We study GRPO by tracking token log-probabilities, group-normalized advantages, and induced token-level update weights. This reveals three recurring dynamics: probability inflation, advantage contraction as accuracy rises, and hierarchical convergence, where easy questions quickly saturate while hard questions remain discovery-limited due to rare correct rollouts. These findings imply that the benefit of each update depends strongly on both question difficulty and the model’s current competence. Motivated by this, we propose Confidence and Difficulty-adaptive Policy Optimization (CoDaPO), which assigns each question a bounded value from rollout confidence and empirical difficulty, then uses it to reweight policy updates and resample high-value questions within minibatches to increase discovery under a fixed compute budget. Across seven benchmarks, CoDaPO consistently improves accuracy over other RL methods.
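The abstract's core mechanism, assigning each question a bounded value from rollout confidence and empirical difficulty, then reweighting and resampling, can be sketched as follows. This is a hypothetical illustration: the function names, the mean-probability confidence measure, the linear mixing coefficient `alpha`, and proportional resampling are all assumptions, not the paper's actual formulation.

```python
import numpy as np

def question_value(rollout_logprobs, num_correct, num_rollouts, alpha=0.5):
    """Hypothetical per-question value in the spirit of CoDaPO.

    Combines rollout confidence (mean token probability) with empirical
    difficulty (1 - accuracy). The combination rule and alpha are
    assumptions for illustration; the result is bounded to [0, 1].
    """
    confidence = float(np.mean(np.exp(rollout_logprobs)))   # in (0, 1]
    difficulty = 1.0 - num_correct / num_rollouts           # in [0, 1]
    value = alpha * confidence + (1.0 - alpha) * difficulty
    return min(max(value, 0.0), 1.0)

def resample_minibatch(values, batch_size, rng=None):
    """Sample question indices in proportion to their values, so
    high-value questions receive more rollouts under a fixed budget."""
    rng = rng or np.random.default_rng(0)
    p = np.asarray(values, dtype=float)
    p = p / p.sum()
    return rng.choice(len(values), size=batch_size, p=p)
```

The same per-question value could also serve as a multiplicative weight on each question's policy-gradient term, which is how the abstract's "reweight policy updates" step would plug in.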
Submission Number: 93