BUDDY: Blending Training and Deployment Data with Weighted Expert Ensembles for Post-hoc LLM Calibration

Published: 02 Mar 2026, Last Modified: 02 Mar 2026 · ICLR 2026 Trustworthy AI · CC BY 4.0
Keywords: medical QA, calibration, distribution shift
TL;DR: We introduce a post-hoc calibration method for LLMs that uses mixtures of experts and explicitly accounts for distribution shift at test time.
Abstract: Large language models (LLMs) are increasingly deployed in high-stakes domains such as healthcare, but poor calibration and an inability to robustly handle distribution shift during deployment limit their utility. While existing approaches, including fine-tuning and conformal prediction, can improve calibration, they are often computationally expensive or unreliable in low-data target settings. We propose a post-hoc calibration framework that leverages data from both the training and deployment distributions. Our method learns an ensemble calibrator on the training distribution, applies group-conditional temperature scaling to correct structured miscalibration, and then performs lightweight target adaptation by optimizing a discriminator-weighted, tilted empirical risk that emphasizes target-like examples. We provide finite-sample guarantees for groupwise temperature estimation and the resulting groupwise calibration error; furthermore, we motivate the re-weighting step as a principled way to emphasize target-like examples under distribution shift. Empirically, our method achieves robust calibration under distribution shift even when target data are scarce. Across multiple-choice and open-ended question answering (QA) datasets, it consistently improves calibration over existing baselines. Overall, our results demonstrate that ensembling multiple sources of expertise and explicitly accounting for distribution shift yields more reliable confidence estimates, thus supporting safer deployment of LLMs in high-stakes settings.
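The two adaptation steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the grid-search temperature fit, the odds-ratio discriminator weights, and the exponential-tilting form of the risk are all assumptions chosen to make the ideas concrete. `fit_group_temperatures` learns one temperature per group by minimizing held-out negative log-likelihood, and `tilted_weighted_risk` up-weights target-like, high-loss examples.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fit_group_temperatures(logits, labels, groups,
                           grid=np.linspace(0.25, 4.0, 76)):
    """Group-conditional temperature scaling (hypothetical sketch):
    for each group g, pick the temperature T minimizing NLL of
    softmax(logits / T) on that group's held-out examples."""
    temps = {}
    for g in np.unique(groups):
        m = groups == g
        best_T, best_nll = 1.0, np.inf
        for T in grid:
            p = softmax(logits[m] / T)
            nll = -np.log(p[np.arange(m.sum()), labels[m]] + 1e-12).mean()
            if nll < best_nll:
                best_nll, best_T = nll, T
        temps[g] = best_T
    return temps

def tilted_weighted_risk(losses, disc_probs, tilt=1.0):
    """Discriminator-weighted tilted empirical risk (assumed form):
    disc_probs is a domain discriminator's P(target | x); the
    odds ratio p/(1-p) approximates the density ratio, and the
    exponential tilt emphasizes target-like, high-loss examples."""
    w = disc_probs / np.clip(1.0 - disc_probs, 1e-6, None)
    w = w / w.sum()
    return np.log(np.sum(w * np.exp(tilt * losses))) / tilt
```

As `tilt` goes to zero the objective approaches the plain weighted average loss; larger `tilt` values shift emphasis toward the worst target-like examples, which matches the abstract's goal of robustness under shift.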
Submission Number: 257