CARE-RFT: Confidence-Anchored Reinforcement Finetuning for Reliable Reasoning in Large Language Models
Keywords: Large language models, Hallucination mitigation, Reinforcement finetuning, Trustworthy AI, Divergence constraints
TL;DR: We introduce CARE-RFT, a confidence-anchored reinforcement finetuning method that uses a skew reverse KL divergence to preserve calibration and reduce hallucination while maintaining strong reasoning in large language models.
Abstract: Reinforcement finetuning (RFT) has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models. However, we identify a critical trade-off: unconstrained RFT achieves strong reasoning performance but severely compromises model trustworthiness by amplifying hallucination and worsening calibration, whereas reverse KL (RKL)-constrained RFT preserves trustworthiness but limits reasoning gains because of its unbounded penalty on exploratory deviations. To resolve this tension, we introduce CARE-RFT (Confidence-Anchored Regularized Reinforcement Finetuning), a method that replaces the standard reverse KL regularizer with a skew reverse KL divergence. CARE-RFT imposes a confidence-sensitive penalty: bounded for confident, consistently rewarded explorations, which enables reasoning, and unbounded elsewhere, which preserves calibration. Extensive experiments across multiple model scales and RFT algorithms show that CARE-RFT achieves a superior balance, matching the reasoning performance of unconstrained RFT while recovering the trustworthiness and calibration of the base model. Our work establishes that careful, confidence-aware regularization is key to building reasoning models that are both capable and trustworthy.
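For concreteness, below is a minimal sketch of a skew reverse KL regularizer in its standard form, assuming a fixed mixing weight $\alpha \in (0,1)$ as a hypothetical hyperparameter; the paper's confidence-anchored variant presumably modulates this skew by confidence, so that the bound applies only to confident, consistently rewarded explorations, as described in the abstract.

```latex
% Standard reverse KL regularizer between the finetuned policy \pi_\theta
% and the reference model \pi_{\mathrm{ref}} (unbounded in general):
D_{\mathrm{RKL}}(\pi_\theta \,\|\, \pi_{\mathrm{ref}})
  = \mathbb{E}_{y \sim \pi_\theta}\!\left[
      \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \right].

% Skew reverse KL with mixing weight \alpha \in (0,1): the reference is
% replaced by a mixture of the two distributions. The log-ratio never
% exceeds \log(1/\alpha), so this penalty is bounded.
D^{(\alpha)}_{\mathrm{SRKL}}(\pi_\theta \,\|\, \pi_{\mathrm{ref}})
  = D_{\mathrm{KL}}\!\bigl(\pi_\theta \,\|\,
      \alpha\, \pi_\theta + (1-\alpha)\, \pi_{\mathrm{ref}}\bigr)
  = \mathbb{E}_{y \sim \pi_\theta}\!\left[
      \log \frac{\pi_\theta(y \mid x)}
                {\alpha\, \pi_\theta(y \mid x)
                 + (1-\alpha)\, \pi_{\mathrm{ref}}(y \mid x)} \right]
  \le \log \tfrac{1}{\alpha}.
```

The boundedness follows because the ratio inside the logarithm is at most $1/\alpha$; setting $\alpha = 0$ recovers the standard reverse KL, whose penalty is unbounded.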
Primary Area: foundation or frontier models, including LLMs
Submission Number: 7397