Keywords: LLM, Calibration, Adversarial, Faithfulness, Robustness
TL;DR: We introduce CaliDist, a novel framework that calibrates LLM confidence by measuring the model's behavioral robustness to semantic distractions.
Abstract: For Large Language Models (LLMs) to be trusted in high-stakes applications, it is paramount that their confidence scores are well-calibrated. However, existing calibration methods often overlook a critical dimension of trustworthiness: a model's behavioral robustness to irrelevant or misleading information. In this paper, we argue that a model's true confidence should reflect its stability under cognitive pressure. We introduce \textsc{CaliDist}, a novel, post-hoc calibration framework that directly measures and penalizes a model's susceptibility to distraction. \textsc{CaliDist} quantifies how an LLM's predictions and certainty change when its input prompt is perturbed with semantic \textit{distractors}. This instability signal is then used to adaptively scale the model's initial confidence score.
Our extensive experiments on seven Natural Language Understanding (NLU) classification benchmarks using six distinct LLMs show that \textsc{CaliDist} consistently achieves lower Expected Calibration Error (ECE) than several baselines. Remarkably, our method reduces the average ECE from 19\% to 11\% (a relative improvement of roughly 42\%), demonstrating that behavioral stability is a powerful and practical signal for calibration.
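To make the behavioral-stability idea concrete, below is a minimal Python sketch of distractor-based confidence scaling together with the ECE metric used for evaluation. The `classify` interface, the distractor-prepending perturbation, the instability formula, and the `alpha` penalty strength are illustrative assumptions, not the paper's actual \textsc{CaliDist} procedure.

```python
# Minimal sketch of distraction-based confidence scaling (illustrative only;
# the classify() interface, perturbation scheme, and scaling rule below are
# assumptions, not the authors' CaliDist implementation).
from typing import Callable, List, Tuple

Prediction = Tuple[str, float]  # (predicted label, raw confidence in [0, 1])


def instability(
    classify: Callable[[str], Prediction],
    prompt: str,
    distractors: List[str],
) -> float:
    """Measure label flips and confidence drift under distractor perturbations."""
    base_label, base_conf = classify(prompt)
    flips, conf_shift = 0, 0.0
    for d in distractors:
        # Perturb the input by prepending an irrelevant/misleading distractor.
        label, conf = classify(f"{d}\n\n{prompt}")
        flips += int(label != base_label)
        conf_shift += abs(conf - base_conf)
    n = max(len(distractors), 1)
    # Combine flip rate and mean confidence shift into one score in [0, 1].
    return 0.5 * (flips / n) + 0.5 * (conf_shift / n)


def calibrated_confidence(
    classify: Callable[[str], Prediction],
    prompt: str,
    distractors: List[str],
    alpha: float = 1.0,  # assumed penalty strength; would be tuned in practice
) -> Prediction:
    """Scale the raw confidence down in proportion to measured instability."""
    label, conf = classify(prompt)
    s = instability(classify, prompt, distractors)
    return label, conf * (1.0 - alpha * s)


def expected_calibration_error(
    confs: List[float], correct: List[bool], bins: int = 10
) -> float:
    """Standard ECE: confidence-vs-accuracy gap, weighted by bin size."""
    ece, n = 0.0, len(confs)
    for b in range(bins):
        lo, hi = b / bins, (b + 1) / bins
        idx = [i for i, c in enumerate(confs) if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        avg_conf = sum(confs[i] for i in idx) / len(idx)
        acc = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(avg_conf - acc)
    return ece
```

Under these assumptions, a prediction whose label or confidence shifts substantially when distractors are prepended receives a correspondingly reduced confidence, which is the post-hoc, adaptive scaling behavior the abstract describes.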
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 22625