Keywords: Calibration, reinforcement learning, language models, reasoning
TL;DR: RL that jointly optimizes accuracy and calibrated confidence estimation improves calibration without sacrificing accuracy across a variety of datasets
Abstract: Large language models (LLMs) perform well when trained with reinforcement learning (RL) to produce textual reasoning chains. However, almost all successful applications of RL for reasoning use reward functions that are simple binary correctness checks. Because these rewards do not penalize guessing or low-confidence outputs, they often degrade calibration and increase hallucination as a side effect. We propose a new RL framework that jointly improves accuracy and calibrated confidence estimation by combining the correctness reward with the Brier score, a proper scoring rule that incentivizes truthful confidence reporting.
The resulting reward provably encourages models to produce predictions that are both accurate and well calibrated.
Across a variety of datasets, both in-domain and out-of-domain, our method dramatically improves calibration at no cost in accuracy, outperforming both standard RL training and classifiers trained solely to assign confidence scores. Whereas ordinary RL hurts calibration, our approach improves it. These results highlight the potential of calibrated RL for building more reliable and interpretable reasoning models.
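A minimal sketch of the combined reward described in the abstract, assuming a single rollout with correctness indicator $y \in \{0,1\}$, a verbalized confidence $c \in [0,1]$ read from the model's output, and a weighting coefficient $\lambda > 0$ (the symbols and $\lambda$ are assumptions; the exact form is not given on this page):

$$R(y, c) = \underbrace{y}_{\text{correctness reward}} \;-\; \lambda \underbrace{(c - y)^2}_{\text{Brier score}}$$

Because the Brier score is a proper scoring rule, the expected penalty $\mathbb{E}\big[(c - y)^2\big]$ is minimized when $c$ equals the model's true probability of being correct, so maximizing $R$ in expectation rewards both accuracy and truthful confidence reporting.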
Paper Published: No
Paper Category: Short Paper
Demography: No, I do not identify with any of these affinity groups
Academic: Year > 2 PhD Student
Submission Number: 26