Keywords: Large Language Models, Correctness Prediction, Privileged Knowledge, Introspection, Factual Knowledge, Mathematical Reasoning, Hallucination, Probing
Abstract: Humans use introspection to evaluate their understanding through private internal states inaccessible to external observers. We investigate whether Large Language Models possess similar privileged knowledge: information unavailable through external observation. Specifically, we ask whether a model has unique signals about the correctness of its own answers given only the question. We train correctness classifiers on question representations drawn both from a model's own hidden states and from external models, testing whether self-representations provide a performance advantage. Standard evaluations show no advantage: self-probes perform comparably to peer-model probes, which we attribute to high inter-model agreement. To isolate genuine privileged knowledge, we evaluate on disagreement subsets where models produce conflicting predictions. Here, self-representations consistently outperform peer representations on factual knowledge tasks but show no advantage on mathematical reasoning. Our findings reveal domain-specific privileged knowledge: models possess genuine self-signals about factual correctness, while mathematical reasoning correctness appears universally observable. We explore potential mechanisms underlying this distinction, finding evidence that factual correctness relies on entity-driven memory retrieval, whereas mathematical correctness may involve more universal computational patterns.
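A minimal sketch of the probing setup the abstract describes, using synthetic arrays as stand-ins for hidden-state question representations; the variable names, feature dimensions, and use of a logistic-regression probe are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch: train linear correctness probes on "self" vs. "peer" question
# representations and compare held-out performance. All data is synthetic;
# in the paper, features would be extracted from model hidden states.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical stand-ins: representations of 1,000 questions (d = 256)
# from the answering model itself and from an external peer model.
n, d = 1000, 256
self_reprs = rng.normal(size=(n, d))
peer_reprs = rng.normal(size=(n, d))
# Binary labels: did the answering model get each question right?
correct = rng.integers(0, 2, size=n)

def probe_auc(features: np.ndarray, labels: np.ndarray) -> float:
    """Train a linear correctness probe and report held-out ROC-AUC."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.2, random_state=0
    )
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print(f"self-probe AUC: {probe_auc(self_reprs, correct):.3f}")
print(f"peer-probe AUC: {probe_auc(peer_reprs, correct):.3f}")
```

With real hidden states, the same comparison would be repeated on disagreement subsets (questions where the two models' predicted answers conflict) to isolate any self-representation advantage.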
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Interpretability, Probing, Evaluation, Question Answering, Reasoning
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 6297