Keywords: Self-Evaluation Capacity, Introspective Reliability, Uncertainty Calibration, Probabilistic VC (PVC), Calibration-Aware PVC (C-PVC), Sample Complexity
TL;DR: We propose a calibration-aware probabilistic VC framework to measure LLMs' self-evaluation capacity, determine when they can reliably trust their own answers, and enable targeted self-improvement.
Abstract: As large language models (LLMs) find increasing use in critical applications, evaluating their ability to assess their own outputs has become crucial. We present a theoretical and empirical framework for examining whether LLMs can distinguish correct from incorrect solutions while maintaining properly calibrated confidence. Building on classical Vapnik-Chervonenkis (VC) dimension theory, we adapt it to probabilistic predictors through two new complexity metrics: Probabilistic VC (PVC), which measures a model's ability to confidently classify across problem types, and Calibration-aware PVC (C-PVC), which additionally requires alignment between confidence scores and actual success rates. Unlike traditional metrics such as Expected Calibration Error (ECE) and Actual Error (AE), these measures jointly assess self-evaluation expressiveness and calibration, yielding sample complexity bounds and generalization guarantees analogous to those of classical VC theory. We evaluate eleven models (7-8B parameters) across three diverse benchmarks: 360 mathematical reasoning problems, TruthfulQA for factual accuracy, and CommonsenseQA for commonsense reasoning. Each model chooses between two of its own generated solutions and reports a confidence level, a direct test of self-evaluation capability, with ground truth determined by a larger model ensemble. Our experiments reveal a systematic inverse relationship: models with greater self-evaluation expressiveness are consistently less well calibrated. For example, s1.1-7B and Qwen2.5-7B-Instruct achieve high PVC-VUS scores, indicating strong discriminative self-assessment capacity, while JiuZhang3.0-7B shows superior calibration with the lowest ECE and the smallest PVC-VUS gap. We also observe domain-specific variation in self-evaluation ability: some models perform better on mathematical reasoning tasks, while others excel in factual or commonsense domains. Our analysis suggests that factors beyond the training approach alone shape a model's ability to accurately assess its own outputs. The trade-off between calibration and expressiveness persists across architectures, training paradigms, and task domains, pointing to a fundamental challenge in developing self-reflective LLMs. Our framework offers practical tools for identifying and addressing these limitations, helping create LLMs that can not only tackle complex problems but also recognize when they might be wrong, an essential capability for safe deployment and meaningful self-improvement in autonomous systems.
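As a concrete illustration of the calibration side of this protocol, below is a minimal sketch of the standard binned ECE computation over hypothetical self-evaluation records (confidence in the chosen solution versus agreement with the ensemble label). The PVC and C-PVC metrics are defined in the paper itself and are not reproduced here; the function name, toy data, and binning choices are illustrative assumptions, not the authors' code.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: the bin-weighted mean absolute gap between
    average confidence and empirical accuracy within each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        # First bin is closed on the left so a confidence of 0.0 is counted.
        in_bin = (confidences >= lo if i == 0 else confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        weight = in_bin.mean()  # fraction of samples falling in this bin
        gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
        ece += weight * gap
    return ece

# Hypothetical records: each model chose one of its two candidate
# solutions and reported a confidence; 'hit' marks agreement with the
# larger-ensemble ground-truth label.
conf = [0.90, 0.70, 0.95, 0.60, 0.80]
hit = [1, 1, 0, 0, 1]
print(f"ECE = {expected_calibration_error(conf, hit):.3f}")
```

NumPy is used only for convenience; any binning scheme with the same weighted-gap aggregation yields the standard ECE estimate that the abstract contrasts with PVC and C-PVC.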
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 12178