Teaching LLMs to Abstain via Fine-Grained Semantic Confidence Reward

Hao An; Yang Xu

Submitted to ICLR 2026, 15 Sept 2025 (modified: 11 Feb 2026). License: CC BY 4.0

Keywords: Hallucination, Abstention, Reinforcement Learning

Abstract: Mitigating hallucinations in Large Language Models (LLMs) is critical for their reliable deployment. Existing methods typically fine-tune LLMs to abstain from answering questions beyond their knowledge scope. However, these methods often rely on coarse-grained signals to guide abstention, such as overall confidence or uncertainty scores computed over multiple sampled answers, which can leave the model with an imprecise awareness of its own knowledge boundaries. To address this, we propose a novel reinforcement learning framework built on a Fine-Grained Semantic Confidence Reward (FSCR), which guides LLMs to abstain via sample-specific confidence. Specifically, our method samples multiple candidate answers and clusters them semantically, then trains the LLM to retain answers within high-confidence clusters and discard those within low-confidence ones, thereby promoting accurate post-hoc abstention. Additionally, we propose a comprehensive metric for evaluating the reliability of abstention fine-tuning. Experimental results demonstrate that our method significantly enhances reliability on both in-domain and out-of-distribution benchmarks.
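The sample, cluster, and reward loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the reward formula, the clustering procedure, or the confidence threshold. The sketch assumes that a cluster's confidence is the fraction of sampled answers falling in it, that a normalized string match stands in for semantic equivalence, and that the reward is +1 or -1. The names `normalize`, `cluster_confidences`, `sample_rewards`, and `threshold` are all hypothetical.

```python
from collections import Counter
from typing import Callable, List


def normalize(answer: str) -> str:
    """Toy stand-in for semantic equivalence (assumption): answers match
    if they are equal after lowercasing and collapsing whitespace."""
    return " ".join(answer.lower().split())


def cluster_confidences(
    answers: List[str], key: Callable[[str], str] = normalize
) -> List[float]:
    """For each sampled answer, return the share of samples in its cluster."""
    keys = [key(a) for a in answers]
    counts = Counter(keys)
    return [counts[k] / len(answers) for k in keys]


def sample_rewards(
    answers: List[str], abstained: List[bool], threshold: float = 0.5
) -> List[float]:
    """Hypothetical per-sample reward: +1.0 when the keep/abstain decision
    agrees with the cluster confidence, -1.0 otherwise."""
    rewards = []
    for conf, did_abstain in zip(cluster_confidences(answers), abstained):
        should_keep = conf >= threshold
        # Correct decisions: keep a high-confidence answer,
        # or abstain on a low-confidence one.
        rewards.append(1.0 if should_keep != did_abstain else -1.0)
    return rewards


if __name__ == "__main__":
    answers = ["Paris", "paris ", "Lyon", "Paris"]
    abstained = [False, True, True, False]
    print(cluster_confidences(answers))
    print(sample_rewards(answers, abstained))
```

In this toy run, three of the four samples fall in the "paris" cluster, so abstaining on one of them is penalized while abstaining on the lone "Lyon" sample is rewarded. A real system would presumably replace `normalize` with a learned or entailment-based equivalence check.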

Supplementary Material: zip

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 5965
