Attention Head Entropy of LLMs Predicts Answer Correctness

09 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · CC BY 4.0
Keywords: Entropy, LLM, Attention Heads, Correctness Prediction, Uncertainty
TL;DR: We use attention head entropy as a measure to predict the correctness of an LLM's answer.
Abstract: Large language models (LLMs) generate plausible yet possibly incorrect answers, posing risks in safety-critical settings such as medical advice. Although both LLM-as-judge and human evaluations are useful, human evaluation is expensive, whereas LLM-as-judge approaches risk introducing additional hidden errors. To address this, we introduce Head Entropy, a white-box and scalable method that uses the attention patterns inside the model itself to estimate, during generation, the likelihood that an answer is correct. Our key insight is that certain attention heads exhibit distinct entropy patterns when the model gives correct versus incorrect answers. Using a sparse logistic regression classifier on per-head entropies, Head Entropy achieves 0.07–0.15 AUROC improvements over baselines on 5 instruction-tuned LLMs and 3 QA datasets spanning general knowledge, multi-hop reasoning, and medicine. Through Shapley value analysis, we demonstrate that middle-layer attention heads contribute the most to prediction accuracy, providing mechanistic insight into model failure modes. Head Entropy offers a practical, interpretable, and computationally efficient approach for real-time correctness estimation during LLM deployment.
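The pipeline the abstract describes, Shannon entropy of each head's attention distribution fed into a sparse (L1-penalized) logistic regression, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Dirichlet-sampled attention rows, the head/sequence sizes, and the regularization strength are all illustrative assumptions, standing in for attention weights extracted from a real model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def head_entropy(attn):
    """Shannon entropy of each head's attention distribution.

    attn: array of shape (..., n_heads, seq_len), where each row is an
    attention distribution summing to 1. Returns shape (..., n_heads).
    """
    eps = 1e-12  # avoid log(0) for fully peaked heads
    return -(attn * np.log(attn + eps)).sum(axis=-1)

# Synthetic stand-in for extracted attention: peaked (low-entropy)
# heads for correct answers, diffuse (high-entropy) heads for wrong ones.
rng = np.random.default_rng(0)
n_heads, seq_len, n = 8, 16, 200

def sample_attn(alpha, size):
    # Dirichlet rows: small alpha -> peaked, large alpha -> diffuse.
    return rng.dirichlet([alpha] * seq_len, size=(size, n_heads))

X_correct = head_entropy(sample_attn(0.2, n))  # (n, n_heads)
X_wrong = head_entropy(sample_attn(5.0, n))
X = np.vstack([X_correct, X_wrong])
y = np.array([1] * n + [0] * n)  # 1 = correct answer

# Sparse classifier over per-head entropies, as in the abstract.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(X, y)
print(f"train accuracy: {clf.score(X, y):.2f}")
```

In practice the features would come from the model's own attention tensors at answer-generation time, and the L1 penalty selects the small subset of heads (per the paper, concentrated in middle layers) that carry the correctness signal.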
Primary Area: interpretability and explainable AI
Submission Number: 3345