Learn to be Honest: Mitigate LLMs' Overconfidence for Improving Hallucination Detection with Self-Hesitation Activation

09 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Hallucination Detection, Uncertainty Estimation, Overconfidence, Self-Hesitation Activation
Abstract: While Large Language Models (LLMs) have demonstrated strong performance across a wide range of natural language processing tasks, they still inevitably generate plausible but unfaithful content, a phenomenon known as factual hallucination. Previous approaches to hallucination detection include classifier training and uncertainty estimation. However, LLMs are widely observed to be overconfident, rationalizing their incorrect outputs and thereby misaligning their expressed uncertainty with their actual knowledge boundaries. This misalignment significantly undermines the effectiveness of existing hallucination detection methods. We study the correlation between hallucination and overconfidence, arguing that overtraining under conventional training strategies makes the two systematically inseparable. To address this, we conduct a series of analyses and propose Self-Hesitation Activation Fine-Tuning (SHAFT), which aligns a model's uncertainty with the factual correctness of its outputs, making LLMs more honest. Experiments demonstrate that our approach significantly mitigates the overconfidence of LLMs and decouples overconfidence from hallucination, making nonfactual instances more distinguishable. Furthermore, evaluations on three benchmarks show that SHAFT substantially improves a variety of pre-generation hallucination detection methods, consistently demonstrating its generalizability and computational efficiency.
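For context, the sketch below illustrates the kind of uncertainty-estimation baseline the abstract refers to: scoring an answer by the model's average token-level predictive entropy, so that low entropy on a wrong answer exhibits the overconfidence misalignment SHAFT targets. This is a minimal illustrative sketch, not the paper's SHAFT method; the model name, prompt, and thresholding strategy are assumptions for demonstration only.

```python
# Minimal sketch of an uncertainty-based hallucination score (not SHAFT):
# average predictive entropy over the answer tokens of a causal LM.
# "gpt2" is a hypothetical stand-in; any causal LM checkpoint works.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model, assumption for this sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def mean_token_entropy(prompt: str, answer: str) -> float:
    """Average entropy of the model's next-token distributions over the
    answer span. Higher values signal hesitation; an overconfident model
    assigns low entropy even to hallucinated answers, which is the
    misalignment described in the abstract."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits  # (1, seq_len, vocab_size)
    # Logits at position i predict token i+1, so the distributions that
    # generate the answer tokens start at the last prompt position.
    n_prompt = prompt_ids.shape[1]
    answer_logits = logits[0, n_prompt - 1 : full_ids.shape[1] - 1]
    probs = torch.softmax(answer_logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=-1)
    return entropy.mean().item()

score = mean_token_entropy("Q: Who wrote Hamlet?\nA:", " William Shakespeare.")
# In a detection pipeline one would flag answers whose score exceeds a
# threshold tuned on held-out data (threshold choice is an assumption here).
print(f"mean token entropy: {score:.3f}")
```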
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Submission Number: 3326