Keywords: Self-Evaluation Capacity, Introspective Reliability, Uncertainty Calibration, Probabilistic VC (PVC), Calibration-Aware PVC (C-PVC), Sample Complexity
TL;DR: We propose a calibration-aware probabilistic VC framework to measure LLMs' self-evaluation capacity, determine when they can reliably trust their own answers, and enable targeted self-improvement.
Abstract: As large language models (LLMs) find increasing use in critical applications, evaluating their ability to assess their own outputs has become crucial. We present a theoretical and empirical framework for examining whether LLMs can distinguish correct from incorrect solutions while maintaining properly calibrated confidence. Building on classical Vapnik-Chervonenkis (VC) dimension theory, we adapt it to probabilistic predictors through two new complexity metrics: Probabilistic VC (PVC), which measures a model's ability to confidently classify across problem types, and Calibration-aware PVC (C-PVC), which additionally requires alignment between confidence scores and actual success rates. Unlike traditional metrics such as Expected Calibration Error (ECE) and Actual Error (AE), these measures jointly assess self-evaluation expressiveness and calibration, yielding sample complexity bounds and generalization guarantees analogous to those of classical VC theory. We evaluate eleven models (7-8B parameters) across three diverse benchmarks: 360 mathematical reasoning problems, TruthfulQA for factual accuracy, and CommonsenseQA for commonsense reasoning. Each model chooses between two of its own generated solutions and reports a confidence level, a direct test of self-evaluation capability, with ground truth determined by a larger model ensemble. Our experiments reveal a systematic inverse relationship: models with greater self-evaluation expressiveness are consistently less well calibrated. For example, s1.1-7B and Qwen2.5-7B-Instruct achieve high PVC-VUS scores, indicating strong discriminative self-assessment capacity, while JiuZhang3.0-7B shows superior calibration with the lowest ECE and the smallest PVC-VUS gap. We also observe domain-specific variation in self-evaluation ability: some models perform better on mathematical reasoning tasks, while others excel in factual or commonsense domains. Our analysis suggests that factors beyond the training approach alone shape a model's ability to accurately assess its own outputs. The trade-off between calibration and expressiveness persists across architectures, training paradigms, and task domains, pointing to a fundamental challenge in developing self-reflective LLMs. Our framework offers practical tools for identifying and addressing these limitations, helping create LLMs that can not only tackle complex problems but also recognize when they might be wrong, an essential capability for safe deployment and meaningful self-improvement in autonomous systems.
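As a concrete illustration of the calibration side of this protocol, below is a minimal sketch of the standard binned ECE computation over hypothetical self-evaluation records (confidence in the chosen solution versus agreement with the ensemble label). The PVC and C-PVC metrics are defined in the paper itself and are not reproduced here; the function name, toy data, and binning choices are illustrative assumptions, not the authors' code.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: the bin-weighted mean absolute gap between
    average confidence and empirical accuracy within each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        # First bin is closed on the left so a confidence of 0.0 is counted.
        in_bin = (confidences >= lo if i == 0 else confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        weight = in_bin.mean()  # fraction of samples falling in this bin
        gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
        ece += weight * gap
    return ece

# Hypothetical records: each model chose one of its two candidate
# solutions and reported a confidence; 'hit' marks agreement with the
# larger-ensemble ground-truth label.
conf = [0.90, 0.70, 0.95, 0.60, 0.80]
hit = [1, 1, 0, 0, 1]
print(f"ECE = {expected_calibration_error(conf, hit):.3f}")
```

NumPy is used only for convenience; any binning scheme with the same weighted-gap aggregation yields the standard ECE estimate that the abstract contrasts with PVC and C-PVC.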
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 12178