Keywords: Large Language Models, Correctness Prediction, Privileged Knowledge, Introspection, Factual Knowledge, Mathematical Reasoning, Hallucination, Probing
Abstract: Humans use introspection to evaluate their understanding through private internal states inaccessible to external observers. We investigate whether Large Language Models possess similar privileged knowledge: information unavailable through external observation. Specifically, we ask whether a model has unique signals about the correctness of its own answers given only the question. We train correctness classifiers on question representations drawn both from a model's own hidden states and from external models, testing whether self-representations provide a performance advantage. Standard evaluations show no advantage: self-probes perform comparably to peer-model probes, which we attribute to high inter-model agreement. To isolate genuine privileged knowledge, we evaluate on disagreement subsets where models produce conflicting predictions. Here, self-representations consistently outperform peer representations on factual knowledge tasks but show no advantage on mathematical reasoning. Our findings reveal domain-specific privileged knowledge: models possess genuine self-signals about factual correctness, while mathematical reasoning correctness appears universally observable. We explore potential mechanisms underlying this distinction, finding evidence that factual correctness relies on entity-driven memory retrieval, whereas mathematical correctness may involve more universal computational patterns.
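A minimal sketch of the probing setup the abstract describes, using synthetic arrays as stand-ins for hidden-state question representations; the variable names, feature dimensions, and use of a logistic-regression probe are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch: train linear correctness probes on "self" vs. "peer" question
# representations and compare held-out performance. All data is synthetic;
# in the paper, features would be extracted from model hidden states.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical stand-ins: representations of 1,000 questions (d = 256)
# from the answering model itself and from an external peer model.
n, d = 1000, 256
self_reprs = rng.normal(size=(n, d))
peer_reprs = rng.normal(size=(n, d))
# Binary labels: did the answering model get each question right?
correct = rng.integers(0, 2, size=n)

def probe_auc(features: np.ndarray, labels: np.ndarray) -> float:
    """Train a linear correctness probe and report held-out ROC-AUC."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.2, random_state=0
    )
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print(f"self-probe AUC: {probe_auc(self_reprs, correct):.3f}")
print(f"peer-probe AUC: {probe_auc(peer_reprs, correct):.3f}")
```

With real hidden states, the same comparison would be repeated on disagreement subsets (questions where the two models' predicted answers conflict) to isolate any self-representation advantage.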
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Interpretability, Probing, Evaluation, Question Answering, Reasoning
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 6297