Fact or Hallucination? An Entropy-Based Framework for Attention-Wise Usable Information in LLMs

19 Sept 2025 (modified: 27 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: LLMs, Hallucination detection
Abstract: Large language models (LLMs) often generate confident yet inaccurate outputs, posing serious risks in safety-critical applications. Existing hallucination detection methods typically rely on final-layer logits or post-hoc textual checks, which can obscure the rich semantic signals encoded across model layers. We therefore propose **Shapley NEAR** (Norm-basEd Attention-wise usable infoRmation), a principled, entropy-based attribution framework grounded in Shapley values that assigns a confidence score indicating whether an LLM output is hallucinatory. Unlike prior approaches, Shapley NEAR decomposes attention-driven information flow across all layers and heads of the model, where higher scores correspond to lower hallucination risk. It further distinguishes between two hallucination types: parametric hallucinations, caused by the model’s pre-trained knowledge overriding the context, and context-induced hallucinations, where misleading context fragments spuriously reduce uncertainty. To mitigate parametric hallucinations, we introduce a test-time head clipping technique that prunes attention heads contributing to overconfident, context-agnostic outputs. Empirical results on four QA benchmarks (CoQA, QuAC, SQuAD, and TriviaQA), using Qwen2.5-3B, LLaMA3.1-8B, and OPT-6.7B, demonstrate that Shapley NEAR outperforms strong baselines without requiring additional training, prompting, or architectural modifications.
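
The abstract outlines three ingredients: a per-head Shapley attribution of an entropy-based confidence score, an aggregate NEAR-style score, and test-time head clipping. Below is a minimal, self-contained sketch of that general recipe, assuming a toy stand-in for the score function. `NUM_HEADS`, `head_weights`, `confidence_score`, and the quantile-based clipping rule are hypothetical placeholders, not the paper's actual formulation, which would ablate real attention heads in the LLM and measure an entropy-based usable-information score.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_HEADS = 8  # hypothetical: total attention heads considered (layers x heads)

# Toy per-head effects standing in for the paper's norm-based, attention-wise
# usable information; in practice the score would come from the LLM itself.
head_weights = rng.normal(size=NUM_HEADS)

def confidence_score(active_mask: np.ndarray) -> float:
    """Toy confidence score for a subset of active heads (higher = lower
    hallucination risk). A real implementation would ablate the inactive
    heads and measure, e.g., negative predictive entropy of the answer."""
    return float(head_weights[active_mask].sum())

def shapley_attribution(num_heads: int, num_permutations: int = 2000) -> np.ndarray:
    """Monte-Carlo permutation estimate of each head's Shapley value with
    respect to the confidence score."""
    values = np.zeros(num_heads)
    for _ in range(num_permutations):
        order = rng.permutation(num_heads)
        active = np.zeros(num_heads, dtype=bool)
        prev = confidence_score(active)
        for h in order:
            active[h] = True
            curr = confidence_score(active)
            values[h] += curr - prev  # marginal contribution of head h
            prev = curr
    return values / num_permutations

phi = shapley_attribution(NUM_HEADS)
near_score = phi.sum()  # aggregate score; low values would flag hallucination

# Test-time head clipping (illustrative): prune the heads with the most
# negative attributions, mimicking the removal of overconfident,
# context-agnostic heads. The quantile threshold is a hypothetical choice.
clip_mask = phi >= np.quantile(phi, 0.25)
print("Per-head Shapley values:", np.round(phi, 3))
print("NEAR-style score:", round(near_score, 3))
print("Heads kept after clipping:", np.flatnonzero(clip_mask))
```

Because the toy score is additive, the Monte-Carlo estimates converge to the per-head weights and their sum equals the full-model score, which illustrates the efficiency property that makes Shapley values a natural way to decompose an aggregate confidence score across heads.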
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 21361