Keywords: Transfer learning; Question answering; Cross-domain calibration
TL;DR: A calibration method for LLMs under both model and domain shift.
Abstract: A central barrier to deploying Large Language Models (LLMs) in safety-critical applications is hallucination, where models generate non-factual content with high confidence. Detecting hallucinations requires well-calibrated confidence estimates, yet calibration is brittle under domain and model shifts. The former renders confidence estimates unreliable in a new environment, while the latter arises because different LLMs exhibit distinct confidence scales, so calibration learned for one model often fails to transfer when a different model is deployed for efficiency or privacy reasons. Addressing this vulnerability is critical for robust model generalization, as failure to reliably calibrate confidence values across domains and models undermines trust in LLMs at deployment. While existing prompting-based approaches are label-free and flexible, they perform poorly when a model's domain knowledge is limited. In contrast, explicit calibration for a specialized domain achieves strong in-domain results but fails to generalize to novel domains. This work discovers that, although absolute confidence values often fail to transfer across shifts, their relative rankings, which depend only on the relative reliability among samples within a dataset, remain robust across shifts. Based on this key insight, we propose a two-stage framework, Rank-preserving Adaptive Pseudo-Calibration (RAPCal). In the source-calibration stage, an Expectation-Maximization procedure converts one-hot correctness labels into soft supervision via bin-wise accuracy estimates, enabling fine-grained calibration. In the target-calibration stage, the preserved ranking of confidence scores is used to construct pseudo soft labels, enabling unsupervised cross-domain calibration adaptation without ground-truth labels in the test domain. Experiments show that RAPCal reduces ECE by 6.15\% without sacrificing task performance, advancing the reliability of LLMs in label-scarce settings. Code is provided in the supplementary materials.
Supplementary Material: zip
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 4819
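As an informal illustration of the two stages described in the abstract, here is a minimal Python sketch. This is not the authors' released code: the function names, the equal-width binning, the single-pass bin-accuracy estimate (in place of the paper's full EM refinement), and the quantile-matching reading of "rank-preserving" pseudo-label construction are all assumptions.

```python
import numpy as np

def binwise_soft_labels(confidences, correctness, n_bins=10):
    """Source stage (sketch): replace one-hot correctness labels with
    bin-wise accuracy estimates. Assumption: equal-width confidence bins;
    the EM refinement is reduced to a single accuracy-estimation pass.
    """
    bins = np.clip((confidences * n_bins).astype(int), 0, n_bins - 1)
    soft = np.zeros_like(confidences, dtype=float)
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            # empirical accuracy of the bin becomes the soft target
            soft[mask] = correctness[mask].mean()
    return soft

def rank_preserving_pseudo_labels(source_soft, target_confidences):
    """Target stage (sketch): build pseudo soft labels for an unlabeled
    target domain by mapping each target sample's confidence *rank* onto
    the sorted source soft labels. Assumption: this quantile matching is
    one plausible instantiation of rank-preserving adaptation.
    """
    src_sorted = np.sort(source_soft)
    ranks = target_confidences.argsort().argsort()  # 0..n-1 rank per sample
    quantiles = ranks / max(len(target_confidences) - 1, 1)
    idx = (quantiles * (len(src_sorted) - 1)).round().astype(int)
    return src_sorted[idx]
```

The point of the sketch is that the target stage consumes only confidence rankings, never target-domain correctness labels, which is what makes the adaptation unsupervised.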