LLM Microscope: What Model Internals Reveal About Answer Correctness and Context Utilization

Published: 24 Sept 2025 · Last Modified: 25 Sept 2025 · INTERPLAY · CC BY 4.0
Keywords: calibration/uncertainty, knowledge tracing/discovering/inducing, probing
TL;DR: The paper presents methods to predict the correctness of LLM outputs and the efficacy of external context by analyzing model-internal activations, enabling early detection of errors and better use of retrieval-augmented generation.
Abstract: Although large language models (LLMs) have tremendous utility, trustworthiness is still a chief concern: models often generate incorrect information with high confidence. While contextual information can help guide generation, identifying when a query would benefit from retrieved context and assessing the effectiveness of that context remain challenging. In this work, we operationalize interpretability methods to ascertain whether we can predict the correctness of model outputs from the model’s activations alone. We also explore whether model internals contain signals about the efficacy of external context. We consider correct, incorrect, and irrelevant context and introduce metrics to distinguish among them. Experiments on six different models reveal that a simple classifier trained on intermediate layer activations of the first output token can predict output correctness with nearly 80% accuracy, enabling early auditing. Our model-internals-based metric significantly outperforms prompting baselines at distinguishing between correct and incorrect context, guarding against inaccuracies introduced by polluted context. These findings offer a lens to better understand the underlying decision-making processes of LLMs.
Public: Yes
Track: Main-Long
Submission Number: 31
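To make the abstract's probing setup concrete, below is a minimal sketch of a correctness probe trained on intermediate-layer activations at the position that produces the first output token. The model name, layer index, and toy labeled examples are illustrative assumptions, not the paper's actual experimental setup.

```python
# Sketch: linear probe for answer correctness from model-internal activations.
# Assumptions (not from the paper): model choice, probed layer, and toy labels.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # assumed model; any causal LM works
LAYER = 16                                        # assumed intermediate layer index

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, output_hidden_states=True
)
model.eval()

@torch.no_grad()
def first_token_activation(prompt: str) -> torch.Tensor:
    """Hidden state at the chosen layer for the position that predicts the
    first output token (approximated here by the last prompt position)."""
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model(**inputs)
    # hidden_states is a tuple of (num_layers + 1) tensors of shape [batch, seq, dim]
    return out.hidden_states[LAYER][0, -1].float()

# Hypothetical labeled data: prompts plus whether the model answered correctly.
prompts = [
    "Q: What is the capital of France? A:",
    "Q: Who wrote Hamlet? A:",
]
labels = [1, 0]  # 1 = correct answer, 0 = incorrect (toy labels for illustration)

features = torch.stack([first_token_activation(p) for p in prompts]).numpy()
probe = LogisticRegression(max_iter=1000).fit(features, labels)

# Early auditing: estimate correctness for a new query before reading the generation.
new_x = first_token_activation("Q: What year did WWII end? A:").numpy()[None, :]
print("P(correct) =", probe.predict_proba(new_x)[0, 1])
```

In practice such a probe would be fit on a much larger set of queries with gold correctness labels; the sketch only illustrates the activation-extraction and classification steps.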