Attention Head Entropy of LLMs Predicts Answer Correctness

09 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · CC BY 4.0
Keywords: Entropy, LLM, Attention Heads, Correctness Prediction, Uncertainty
TL;DR: We use attention head entropy as a measure to predict the correctness of an LLM's answer.
Abstract: Large language models (LLMs) generate plausible yet possibly incorrect answers, posing risks in safety-critical settings such as medical advice. Although both LLM-as-judge and human evaluations are useful, human evaluation is expensive, whereas LLM-as-judge approaches risk introducing additional hidden errors. To address this, we introduce Head Entropy, a white-box and scalable method that uses the attention patterns inside the model itself to estimate, during generation, the likelihood that an answer is correct. Our key insight is that certain attention heads exhibit distinct entropy patterns when the model gives correct versus incorrect answers. Using a sparse logistic regression classifier on per-head entropies, Head Entropy achieves 0.07–0.15 AUROC improvements over baselines on 5 instruction-tuned LLMs and 3 QA datasets spanning general knowledge, multi-hop reasoning, and medicine. Through Shapley value analysis, we demonstrate that middle-layer attention heads contribute the most to prediction accuracy, providing mechanistic insight into model failure modes. Head Entropy offers a practical, interpretable, and computationally efficient approach for real-time correctness estimation during LLM deployment.
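The pipeline the abstract describes, Shannon entropy of each head's attention distribution fed into a sparse (L1-penalized) logistic regression, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Dirichlet-sampled attention rows, the head/sequence sizes, and the regularization strength are all illustrative assumptions, standing in for attention weights extracted from a real model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def head_entropy(attn):
    """Shannon entropy of each head's attention distribution.

    attn: array of shape (..., n_heads, seq_len), where each row is an
    attention distribution summing to 1. Returns shape (..., n_heads).
    """
    eps = 1e-12  # avoid log(0) for fully peaked heads
    return -(attn * np.log(attn + eps)).sum(axis=-1)

# Synthetic stand-in for extracted attention: peaked (low-entropy)
# heads for correct answers, diffuse (high-entropy) heads for wrong ones.
rng = np.random.default_rng(0)
n_heads, seq_len, n = 8, 16, 200

def sample_attn(alpha, size):
    # Dirichlet rows: small alpha -> peaked, large alpha -> diffuse.
    return rng.dirichlet([alpha] * seq_len, size=(size, n_heads))

X_correct = head_entropy(sample_attn(0.2, n))  # (n, n_heads)
X_wrong = head_entropy(sample_attn(5.0, n))
X = np.vstack([X_correct, X_wrong])
y = np.array([1] * n + [0] * n)  # 1 = correct answer

# Sparse classifier over per-head entropies, as in the abstract.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(X, y)
print(f"train accuracy: {clf.score(X, y):.2f}")
```

In practice the features would come from the model's own attention tensors at answer-generation time, and the L1 penalty selects the small subset of heads (per the paper, concentrated in middle layers) that carry the correctness signal.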
Primary Area: interpretability and explainable AI
Submission Number: 3345