Keywords: LLM, Calibration, Adversarial, Faithfulness, Robustness
TL;DR: We introduce CaliDist, a novel framework that calibrates LLM confidence by measuring the model's behavioral robustness to semantic distractions.
Abstract: For Large Language Models (LLMs) to be trusted in high-stakes applications, it is paramount that their confidence scores are well-calibrated. However, existing calibration methods often overlook a critical dimension of trustworthiness: a model's behavioral robustness to irrelevant or misleading information. In this paper, we argue that a model's true confidence should reflect its stability under cognitive pressure. We introduce \textsc{CaliDist}, a novel, post-hoc calibration framework that directly measures and penalizes a model's susceptibility to distraction. \textsc{CaliDist} quantifies how an LLM's predictions and certainty change when its input prompt is perturbed with semantic \textit{distractors}. This instability signal is then used to adaptively scale the model's initial confidence score.
Our extensive experiments on seven Natural Language Understanding (NLU) classification benchmarks using six distinct LLMs show that \textsc{CaliDist} consistently achieves lower Expected Calibration Error (ECE) than several baselines. Remarkably, our method reduces the average ECE from 19\% to 11\% (a relative improvement of roughly 42\%), demonstrating that behavioral stability is a powerful and practical signal for calibration.
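To make the behavioral-stability idea concrete, below is a minimal Python sketch of distractor-based confidence scaling together with the ECE metric used for evaluation. The `classify` interface, the distractor-prepending perturbation, the instability formula, and the `alpha` penalty strength are illustrative assumptions, not the paper's actual \textsc{CaliDist} procedure.

```python
# Minimal sketch of distraction-based confidence scaling (illustrative only;
# the classify() interface, perturbation scheme, and scaling rule below are
# assumptions, not the authors' CaliDist implementation).
from typing import Callable, List, Tuple

Prediction = Tuple[str, float]  # (predicted label, raw confidence in [0, 1])


def instability(
    classify: Callable[[str], Prediction],
    prompt: str,
    distractors: List[str],
) -> float:
    """Measure label flips and confidence drift under distractor perturbations."""
    base_label, base_conf = classify(prompt)
    flips, conf_shift = 0, 0.0
    for d in distractors:
        # Perturb the input by prepending an irrelevant/misleading distractor.
        label, conf = classify(f"{d}\n\n{prompt}")
        flips += int(label != base_label)
        conf_shift += abs(conf - base_conf)
    n = max(len(distractors), 1)
    # Combine flip rate and mean confidence shift into one score in [0, 1].
    return 0.5 * (flips / n) + 0.5 * (conf_shift / n)


def calibrated_confidence(
    classify: Callable[[str], Prediction],
    prompt: str,
    distractors: List[str],
    alpha: float = 1.0,  # assumed penalty strength; would be tuned in practice
) -> Prediction:
    """Scale the raw confidence down in proportion to measured instability."""
    label, conf = classify(prompt)
    s = instability(classify, prompt, distractors)
    return label, conf * (1.0 - alpha * s)


def expected_calibration_error(
    confs: List[float], correct: List[bool], bins: int = 10
) -> float:
    """Standard ECE: confidence-vs-accuracy gap, weighted by bin size."""
    ece, n = 0.0, len(confs)
    for b in range(bins):
        lo, hi = b / bins, (b + 1) / bins
        idx = [i for i, c in enumerate(confs) if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        avg_conf = sum(confs[i] for i in idx) / len(idx)
        acc = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(avg_conf - acc)
    return ece
```

Under these assumptions, a prediction whose label or confidence shifts substantially when distractors are prepended receives a correspondingly reduced confidence, which is the post-hoc, adaptive scaling behavior the abstract describes.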
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 22625