Keywords: Confidence estimation, Trustworthy AI, Deep neural networks, Post-hoc methods, Instance-level confidence
Abstract: Reliable confidence estimation is crucial for safety-critical and human-in-the-loop applications, where users must trust individual AI decisions under uncertainty. Existing methods, ranging from softmax-based probabilities to uncertainty quantification techniques, fall short in providing reliable confidence scores. These scores often suffer from poor calibration, high computational costs, or lack of interpretability, making them less effective at the instance level. To overcome these limitations, we introduce $\textbf{KDE Trust}$, a post-hoc method that models class-conditional distributions of correct and incorrect predictions in the deep neural network's representation space. This yields a direct estimate of the posterior probability of correctness, $P(\text{correct}|\text{predicted}=c)$, without requiring architectural modifications. We also introduce a novel evaluation metric called Polarisation, which quantifies how close confidence scores are to 1 for correct classifications and 0 for incorrect ones, enabling human-centric interpretability.
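To make the idea concrete, below is a minimal sketch of a KDE-based posterior-of-correctness estimator as described in the abstract. The function names, the use of scikit-learn's KernelDensity with a Gaussian kernel, and the fixed bandwidth are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch: estimate P(correct | x, predicted = c) from per-class KDEs fit on a
# held-out set of network representations. Assumptions (not from the paper):
# Gaussian kernel, shared fixed bandwidth, scikit-learn's KernelDensity.
import numpy as np
from sklearn.neighbors import KernelDensity


def fit_kde_trust(features, predictions, labels, bandwidth=1.0):
    """Fit, for each predicted class, KDEs over representations of correct
    and incorrect predictions, plus the empirical prior P(correct | predicted=c)."""
    models = {}
    for c in np.unique(predictions):
        mask = predictions == c
        correct = features[mask & (predictions == labels)]
        incorrect = features[mask & (predictions != labels)]
        if len(correct) == 0 or len(incorrect) == 0:
            continue  # class lacks evidence of one kind; skipped in this sketch
        models[c] = {
            "kde_correct": KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(correct),
            "kde_incorrect": KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(incorrect),
            "prior_correct": len(correct) / mask.sum(),
        }
    return models


def confidence(models, x, predicted_class):
    """Posterior probability of correctness via Bayes' rule on the two densities."""
    m = models[predicted_class]
    x = np.atleast_2d(x)
    p_corr = np.exp(m["kde_correct"].score_samples(x)) * m["prior_correct"]
    p_inc = np.exp(m["kde_incorrect"].score_samples(x)) * (1.0 - m["prior_correct"])
    return float(p_corr / (p_corr + p_inc + 1e-12))
```

Because the estimator only consumes representations and predictions from an already-trained network, it is post-hoc in the sense used above and requires no architectural changes.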
Across four datasets spanning high- and low-accuracy regimes, reflecting the suboptimal model performance common in real-world scenarios, KDE Trust matches baseline performance in high-accuracy settings while achieving up to +14.2\% AUROC improvement in degraded scenarios. Precision–percentile curves show higher retained precision with reduced variance, and distributional metrics reveal polarised scores.
Primary Area: interpretability and explainable AI
Submission Number: 17191