Keywords: Post-training, Large language models
TL;DR: We propose CoDaPO, a method that focuses on correctness-based rewards and adaptive exploration to stabilize training and improve performance.
Abstract: Large language models (LLMs) increasingly rely on reinforcement learning (RL) post-training to improve step-by-step reasoning. In this setting, Group Relative Policy Optimization (GRPO) has emerged as a prevailing approach because it avoids the need for fully supervised reasoning traces. However, GRPO can struggle with high-difficulty tasks, overfit to easy problems, and suffer from sensitivity to reward design. To diagnose these weaknesses, we introduce a general analysis framework that maps training trajectories onto an advantage-confidence plane, revealing three critical phenomena: (1) advantage contraction: reward-normalized advantages collapse as accuracy improves; (2) confidence saturation: policies become overconfident even on incorrect outputs; and (3) hierarchical convergence: easy problems are quickly mastered while harder ones lag. Based on these insights, we propose CoDaPO (Confidence- and Difficulty-Adaptive Policy Optimization), an RL algorithm that uses correctness-based rewards and reweights advantages with respect to confidence and difficulty. Experiments on several benchmarks demonstrate that CoDaPO achieves higher reasoning accuracy and better generalization than existing RL approaches.
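To make the abstract's description concrete, the following is a minimal Python sketch, not the paper's exact formulation: it computes GRPO-style group-normalized advantages from binary correctness rewards, then applies a hypothetical reweighting that tempers confidence saturation and up-weights harder (low-accuracy) problems. The function names, the confidence estimate, and the `alpha`/`beta` exponents are illustrative assumptions, not details taken from the submission.

```python
# Minimal sketch (not CoDaPO's exact formulation): GRPO-style group-normalized
# advantages from correctness-based rewards, followed by a hypothetical
# confidence/difficulty reweighting in the spirit of the abstract.
import numpy as np

def group_advantages(correct: np.ndarray) -> np.ndarray:
    """GRPO-style advantage: normalize binary correctness rewards within one group."""
    rewards = correct.astype(float)            # correctness-based reward: 1 if correct, else 0
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def reweight(adv: np.ndarray, confidence: np.ndarray, group_accuracy: float,
             alpha: float = 1.0, beta: float = 1.0) -> np.ndarray:
    """Hypothetical reweighting: down-weight overconfident samples (confidence
    saturation) and up-weight low-accuracy problems (hierarchical convergence)."""
    difficulty = 1.0 - group_accuracy          # harder problems -> larger weight
    conf_weight = (1.0 - confidence) ** alpha  # shrink updates for overconfident outputs
    diff_weight = (difficulty + 1e-6) ** beta  # emphasize problems the policy still fails
    return adv * conf_weight * diff_weight

# Example: a group of 4 sampled responses to one prompt.
correct = np.array([1, 0, 1, 1])                 # verifier-judged correctness per response
confidence = np.array([0.95, 0.90, 0.60, 0.70])  # e.g., mean per-token probability of each response
adv = group_advantages(correct)
adv = reweight(adv, confidence, group_accuracy=correct.mean())
print(adv)
```

The reweighted advantages would then replace the standard group-normalized advantages in the policy-gradient update; this sketch only illustrates the weighting idea, not the full training loop.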
Submission Number: 97