Unlocking the Pre-Trained Model as a Dual-Alignment Calibrator for Post-Trained LLMs

17 Sept 2025 (modified: 27 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: LLM Calibration
Abstract: Post-training boosts the performance of large language models (LLMs) but systematically degrades their confidence calibration, making them frequently overconfident. Recent post-hoc LLM calibration methods circumvent the challenge by aligning the post-trained language model with its pre-trained counterpart; however, they treat calibration as a static output-distribution matching problem and thus fail to capture the complex dynamics that post-training induces in calibration. Our investigation into these dynamics reveals that calibration errors stem from two distinct regimes: (i) output drift, where final confidence is inflated while the intermediate decision process remains consistent, and (ii) process drift, where the intermediate pathways themselves diverge. Based on this diagnosis, we propose DUAL-ALIGN, a dynamic unsupervised framework performing dual alignment for LLM confidence calibration. It applies output alignment to correct output drift by matching the final output distributions. For process drift, it introduces a novel process alignment technique that first identifies the specific layer where the models' inference paths diverge and then realigns the stability of their subsequent trajectories. This dual strategy enables learning a temperature parameter that corrects both types of calibration error that arise during post-training. Experimental results demonstrate that our method brings consistent improvement over representative baselines, reducing calibration error and approaching the performance of a supervised oracle.
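The abstract's output-alignment component can be pictured with a minimal sketch: learn a single temperature for the post-trained model so that its temperature-scaled output distribution matches the pre-trained model's, without labels. The loss choice (KL divergence), optimizer, function name, and input tensors below are illustrative assumptions based on the abstract, not the paper's exact formulation, and the process-alignment component is not shown.

```python
# Minimal sketch (assumed formulation): unsupervised temperature learning by
# matching the post-trained model's output distribution to the pre-trained one.
import torch
import torch.nn.functional as F


def learn_output_alignment_temperature(post_logits, pre_logits,
                                       steps=200, lr=0.05):
    """Fit a scalar temperature T so that softmax(post_logits / T)
    approaches softmax(pre_logits) in KL divergence.

    post_logits, pre_logits: [num_examples, vocab_size] tensors of
    final-token logits from the post-trained and pre-trained models
    (hypothetical inputs; producing them is outside this sketch).
    """
    log_T = torch.zeros(1, requires_grad=True)  # optimize log T so T stays positive
    optimizer = torch.optim.Adam([log_T], lr=lr)
    target = F.softmax(pre_logits, dim=-1).detach()

    for _ in range(steps):
        optimizer.zero_grad()
        T = log_T.exp()
        scaled_log_probs = F.log_softmax(post_logits / T, dim=-1)
        # Penalize deviation of the scaled post-trained distribution from the
        # pre-trained reference distribution (output alignment).
        loss = F.kl_div(scaled_log_probs, target, reduction="batchmean")
        loss.backward()
        optimizer.step()

    return log_T.exp().item()


if __name__ == "__main__":
    # Toy demonstration with random logits; the post-trained logits are
    # sharpened (x3) to mimic overconfidence, so the learned T should be > 1.
    torch.manual_seed(0)
    pre = torch.randn(64, 1000)
    post = pre * 3.0
    print("learned temperature:", learn_output_alignment_temperature(post, pre))
```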
Primary Area: foundation or frontier models, including LLMs
Submission Number: 8736