Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards
Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) significantly enhances the reasoning of large language models (LLMs) but suffers from severe calibration degeneration, in which models become excessively over-confident in incorrect answers.
Previous studies have focused on directly incorporating a calibration objective into the existing optimization target.
However, our theoretical analysis demonstrates a fundamental gradient conflict between maximizing policy accuracy and minimizing calibration error.
Building on this insight, we propose DCPO, a simple yet effective framework that systematically decouples reasoning and calibration objectives.
Extensive experiments demonstrate that DCPO not only preserves accuracy on par with GRPO but also achieves the best calibration performance and substantially mitigates the over-confidence issue.
Our study provides valuable insights and a practical solution for more reliable LLM deployment.
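
To make "calibration error" and "over-confidence" concrete, the sketch below computes expected calibration error (ECE) with equal-width confidence bins, along with a signed over-confidence gap. This is a common way to measure calibration, shown here only as an illustration: the abstract does not state which metric or binning the paper uses, and the function names `expected_calibration_error` and `overconfidence_gap` are hypothetical.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[int],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: the bin-size-weighted mean of |accuracy - mean confidence|.

    Assumption: confidences lie in [0, 1] and correct holds 0/1 labels.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Group (confidence, correctness) pairs by confidence bin.
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))

    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(accuracy - mean_conf)
    return ece


def overconfidence_gap(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Mean confidence minus accuracy; positive values indicate over-confidence."""
    n = len(confidences)
    return sum(confidences) / n - sum(correct) / n


# A model that reports 0.95 confidence but is right only 1 time in 4
# is badly over-confident: both values come out at roughly 0.70.
print(expected_calibration_error([0.95] * 4, [1, 0, 0, 0]))
print(overconfidence_gap([0.95] * 4, [1, 0, 0, 0]))
```

A well-calibrated model would score near zero on both quantities, for example a model that reports 0.5 confidence and is correct half of the time.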
Lay Summary: Modern AI systems trained with reinforcement learning are becoming increasingly capable of solving complex problems. However, these systems often become overly confident in their answers, even when they are wrong. This overconfidence is especially concerning in high-stakes applications such as healthcare, finance, and law, where users may rely on AI predictions without realizing their uncertainty.
In this work, we study why reinforcement learning causes this problem and show that improving reasoning accuracy and improving confidence reliability can directly interfere with each other during training. Based on this insight, we propose a new training framework called DCPO that separates these two objectives instead of optimizing them together.
Our method teaches models to generate both an answer and an explicit confidence estimate, while training reasoning and confidence signals independently. This allows the model to maintain strong reasoning ability while producing more trustworthy confidence estimates.
Experiments on mathematical reasoning and code generation tasks show that DCPO substantially reduces overconfidence while preserving the performance gains brought by reinforcement learning. Our findings contribute toward building more reliable and trustworthy AI systems.
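
The idea of keeping reasoning and confidence signals separate, as described in the summary above, can be illustrated with a toy example. The sketch below is not DCPO's actual reward design, which this page does not specify. It assumes a hypothetical output format in which the model ends its response with a line such as `Confidence: 0.9`, and it uses a negative Brier score as a stand-in confidence signal. The names `CONF_PATTERN` and `split_rewards` are invented for illustration.

```python
import re
from typing import Optional, Tuple

# Hypothetical output format (an assumption, not taken from the paper):
# the response contains a line like "Confidence: 0.9".
CONF_PATTERN = re.compile(r"Confidence:\s*([01](?:\.\d+)?)")


def split_rewards(response: str, is_correct: bool) -> Tuple[float, Optional[float]]:
    """Return (reasoning_reward, confidence_reward) as two separate signals.

    The reasoning reward depends only on whether the answer is verified correct.
    The confidence reward is a negative Brier score, which depends only on how
    close the stated confidence is to the 0/1 correctness outcome.
    The two values are deliberately NOT summed into one scalar.
    """
    reasoning_reward = 1.0 if is_correct else 0.0

    match = CONF_PATTERN.search(response)
    if match is None:
        # No parsable confidence: no confidence signal is available.
        return reasoning_reward, None

    confidence = min(max(float(match.group(1)), 0.0), 1.0)  # clamp to [0, 1]
    confidence_reward = -((confidence - reasoning_reward) ** 2)
    return reasoning_reward, confidence_reward


# A wrong answer stated with high confidence gets a strongly negative
# confidence signal, while the reasoning signal is simply 0.
print(split_rewards("Answer: 42\nConfidence: 0.9", is_correct=False))
```

Returning the two values separately leaves the training loop free to route each signal to a different objective, rather than mixing them into a single reward where they could pull against each other.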
Originally Submitted Supplementary Material: zip
Link To Code: https://github.com/icip-cas/DCPO
Primary Area: Deep Learning->Large Language Models
Keywords: RLVR, Model Calibration, Over-confidence, Decoupled Estimation
Originally Submitted PDF: pdf
Submission Number: 21631