LLM Microscope: What Model Internals Reveal About Answer Correctness and Context Utilization

ICLR 2026 Conference Submission 21428 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Calibration, Uncertainty, Knowledge Tracing, Probing
TL;DR: The paper presents methods to predict the correctness of LLM outputs and the efficacy of external context by analyzing model-internal activations, enabling early detection of errors and better use of retrieval-augmented generation.
Abstract: Retrieval-augmented generation (RAG) has become a popular approach to improving large language models (LLMs), yet trustworthiness remains a central challenge: models may produce fluent but incorrect answers, and retrieved context can amplify errors when irrelevant or misleading. To address this, we study how model internals reflect the interplay between parametric knowledge and external context during generation. Specifically, we ask: (1) can the correctness of a model’s output be inferred directly from its internal activations, and (2) do these internals reveal whether external context is helpful, harmful, or irrelevant? We introduce metrics grounded in intermediate activations to capture both dimensions. Across six models, a simple classifier trained on hidden states of the first output token predicts output correctness with nearly 75% accuracy, enabling early auditing. Moreover, our internals-based metric substantially outperforms prompting baselines at distinguishing between correct and incorrect context, guarding against polluted retrieval. These findings highlight model activations as a promising lens for understanding and improving the reliability of RAG systems.
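The abstract's correctness-prediction result describes a probing setup: a simple classifier trained on the hidden state at the position where the model emits its first output token. The sketch below illustrates one plausible way to build such a probe; the model name, layer choice, probe type (logistic regression), and toy labeled data are all assumptions for illustration, not the authors' actual configuration.

```python
# Minimal sketch of a correctness probe on first-output-token hidden states.
# All specifics (model, layer, probe, data) are placeholders, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "gpt2"  # placeholder; the paper evaluates six LLMs
LAYER = -1           # which hidden layer to probe (assumption)

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def first_token_hidden_state(prompt: str) -> torch.Tensor:
    """Hidden state at the position that generates the first answer token."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # The state at the last prompt position is the one used to emit the first output token.
    return out.hidden_states[LAYER][0, -1]

# Hypothetical labeled data: prompts paired with whether the model answered correctly.
prompts = [
    "Q: What is the capital of France? A:",
    "Q: What is the capital of Australia? A:",
]
labels = [1, 0]  # toy correctness labels, for illustration only

X = torch.stack([first_token_hidden_state(p) for p in prompts]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print("train accuracy:", probe.score(X, labels))
```

In practice such a probe would be trained and evaluated on a held-out set of question-answer pairs with gold correctness labels; the key design choice the abstract highlights is that only the first output token's activations are needed, which enables auditing before the full answer is generated.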
Primary Area: interpretability and explainable AI
Submission Number: 21428