UCPO: Uncertainty-Aware Policy Optimization
Abstract: The key to building trustworthy large language
models (LLMs) lies in endowing them with inherent uncertainty expression capabilities, thereby
mitigating overconfident errors in high-stakes applications. However, existing RL paradigms such
as GRPO often suffer from Advantage Bias due
to binary decision spaces and static uncertainty
rewards, inducing either excessive conservatism
or overconfidence. To tackle this challenge, we
identify the root causes of reward hacking
and overconfidence in current RL paradigms that incorporate uncertainty-based rewards and,
building on this analysis, propose the UnCertainty-Aware Policy
Optimization (UCPO) framework. UCPO employs Ternary Advantage Decoupling to separate
and independently normalize deterministic and
uncertain rollouts, thereby eliminating advantage
bias. Furthermore, a Dynamic Uncertainty Reward Adjustment mechanism adapts uncertainty
weights in real time according to model evolution and instance difficulty. Experimental results on mathematical reasoning and general tasks
demonstrate that UCPO effectively resolves the
reward imbalance, significantly improving the model's reliability beyond its knowledge
boundaries. The code is available at https://github.com/xzhouzeng/ucpo.
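
As a rough illustration of the idea behind separating and independently normalizing deterministic and uncertain rollouts, the sketch below contrasts joint GRPO-style group normalization with per-subgroup normalization. It is only a minimal sketch under stated assumptions, not the authors' implementation: the function names `normalize` and `decoupled_advantages`, the reward values (1 for correct, 0 for wrong, a static 0.5 for an uncertain answer), and the use of population standard deviation are all hypothetical choices made for this example.

```python
from statistics import mean, pstdev


def normalize(rewards, eps=1e-6):
    """GRPO-style group normalization: (r - mean) / (std + eps)."""
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def decoupled_advantages(rewards, is_uncertain, eps=1e-6):
    """Normalize deterministic and uncertain rollouts in separate subgroups.

    Hypothetical helper: each rollout's advantage is computed only against
    rollouts of the same kind, so a fixed uncertainty reward cannot shift
    the baseline used for correct and wrong answers.
    """
    advantages = [0.0] * len(rewards)
    for flag in (False, True):
        idx = [i for i, u in enumerate(is_uncertain) if u == flag]
        subgroup = normalize([rewards[i] for i in idx], eps)
        for i, a in zip(idx, subgroup):
            advantages[i] = a
    return advantages


# Assumed toy group: one correct, three wrong, two uncertain rollouts.
rewards = [1.0, 0.0, 0.0, 0.0, 0.5, 0.5]
is_uncertain = [False, False, False, False, True, True]

joint = normalize(rewards)
decoupled = decoupled_advantages(rewards, is_uncertain)

print("joint:    ", [round(a, 3) for a in joint])
print("decoupled:", [round(a, 3) for a in decoupled])
```

In this toy group, joint normalization gives the uncertain rollouts a positive advantage simply because most deterministic answers are wrong, which is one way a static uncertainty reward could push a policy toward excessive conservatism. With separate normalization the deterministic rollouts are scored only against each other. The sketch does not model the Dynamic Uncertainty Reward Adjustment described in the abstract.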