Self-Knowledge Without a Self? Learning Calibrated and Model-Agnostic Correctness Predictors from Historical Patterns
Keywords: confidence calibration, correctness prediction, self-knowledge, introspection, large language model uncertainty
Abstract: Generating reliable, calibrated confidence estimates is critical for deploying LLMs in high-stakes or user-facing applications, and remains an open challenge. Prior research has often framed confidence estimation as a problem of eliciting a model’s “self-knowledge”, i.e., the ability of an LLM to judge whether its own answers are correct; this approach implicitly assumes that the model has privileged access to information about its answers’ correctness. However, our experiments reveal that this assumption does not hold. Whether trained or training-free, an LLM attempting to predict the correctness of its own outputs generally performs no better than an unrelated model attempting the same task. In other words, LLMs have negligible self-knowledge for the purposes of correctness prediction. Instead, we hypothesize that a key factor in predicting model correctness, i.e., building a “Correctness Model” (CM), is exposure to a target model’s historical predictions. We propose multiple methods to inject this historical correctness information, including training an LLM to predict the confidences of many other LLMs, i.e., creating a Generalized Correctness Model (GCM). We first show that GCMs can be trained on the correctness of historical predictions from many LLMs, learning patterns and strategies for correctness prediction that transfer across datasets and models. We then use CMs as a lens to study the source of this generalization and correctness-prediction ability, adjusting their training data and finding that answer phrasing is a strong predictor of correctness. Moreover, our results suggest that a CM’s ability to leverage world knowledge about answers for correctness prediction is a key enabler of generalization. We further explore alternative methods of injecting history without training an LLM, finding that including history as in-context examples can improve correctness prediction and that post-hoc calibration provides composable reductions in calibration error. We evaluate GCMs based on Qwen3-8B across five model families and the MMLU and TriviaQA datasets, as well as on a downstream selective prediction task, finding that reliable LLM confidence estimation is a generalizable, model-agnostic skill learned by systematically encoding correctness history rather than a model-specific skill reliant on self-introspection.
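To make the core idea concrete, the sketch below illustrates one plausible way the historical correctness information described in the abstract could be turned into supervised training examples for a GCM. It is a minimal, hypothetical reconstruction: the record fields, prompt template, and model names are illustrative assumptions, not the paper's actual data format or code.

```python
# Hypothetical sketch: pooling historical (question, answer, correctness) records
# from many target LLMs into training examples for a Generalized Correctness
# Model (GCM). Field names and the prompt wording are assumptions for illustration.

from dataclasses import dataclass
from typing import List


@dataclass
class HistoricalPrediction:
    """One past question-answer pair from some target model, with its graded correctness."""
    target_model: str   # e.g. "model_a"; any model family, not only the GCM's own
    question: str
    answer: str
    correct: bool       # ground-truth grade of the target model's answer


def to_gcm_example(record: HistoricalPrediction) -> dict:
    """Convert one historical prediction into a supervised correctness-prediction example.

    The GCM (e.g. a Qwen3-8B finetune) reads a question plus a candidate answer
    from any model and predicts whether that answer is correct, so the learned
    skill is model-agnostic rather than introspective.
    """
    prompt = (
        f"Question: {record.question}\n"
        f"Proposed answer: {record.answer}\n"
        "Is the proposed answer correct? Reply 'yes' or 'no'."
    )
    return {"prompt": prompt, "label": "yes" if record.correct else "no"}


def build_training_set(history: List[HistoricalPrediction]) -> List[dict]:
    """Pool correctness history across many target models into one GCM training set."""
    return [to_gcm_example(r) for r in history]


if __name__ == "__main__":
    history = [
        HistoricalPrediction("model_a", "What is the capital of France?", "Paris", True),
        HistoricalPrediction("model_b", "What is the capital of France?", "Lyon", False),
    ]
    for example in build_training_set(history):
        print(example)
```

Under this reading, the same pooled-history construction also supports the training-free variants mentioned in the abstract, e.g., by formatting a few historical records as in-context examples instead of finetuning data.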
Primary Area: foundation or frontier models, including LLMs
Submission Number: 20263