Improving Semantic Uncertainty Quantification in Language Models via Token-Level Temperature Scaling
TL;DR: This paper introduces new semantic confidence measures and shows that simple token-level temperature optimisation improves calibration, discrimination, and entropy-based uncertainty in LLMs, outperforming heuristic and complex calibration methods.
Abstract: Calibration is central to reliable semantic uncertainty quantification in language models, yet prior work has largely focused on the discriminative use of semantic uncertainty, neglecting calibration. In this paper, we address this gap in the literature and study both semantic calibration and discrimination across a broad set of semantic confidence measures. We conduct a careful empirical evaluation and find that optimising a single, token-level temperature parameter is a simple and effective method for improving semantic uncertainty quantification. Across semantic confidence measures, models, and QA datasets, token-level temperature optimisation consistently improves semantic calibration, discrimination, and semantic entropy. Notably, uncertainty-focused temperature optimisation outperforms both widely-used fixed-temperature baselines and more sophisticated calibration methods for semantic uncertainty quantification.
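The core idea described in the abstract, fitting a single scalar temperature on token logits, can be illustrated with a minimal sketch. The paper's exact objective and optimiser are not given here; this hypothetical example simply picks the temperature that minimises token-level negative log-likelihood on held-out data via a grid search, which is one common way such a parameter is fit. All function names and the toy data are illustrative, not the authors' implementation.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def token_nll(logits, labels, T):
    # mean negative log-likelihood of the true tokens at temperature T
    probs = softmax(logits / T)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def fit_temperature(logits, labels, grid=np.linspace(0.25, 4.0, 376)):
    # choose the single scalar T that minimises validation NLL
    losses = [token_nll(logits, labels, T) for T in grid]
    return float(grid[int(np.argmin(losses))])

# toy validation set: overconfident logits (scaled up by 3x)
rng = np.random.default_rng(0)
labels = rng.integers(0, 5, size=200)
logits = rng.normal(size=(200, 5))
logits[np.arange(200), labels] += 2.0  # correct class gets a boost
logits *= 3.0                          # inflate confidence
T = fit_temperature(logits, labels)    # expect T > 1 for overconfident logits
```

In practice the fitted temperature would then be applied to the model's token distributions before computing downstream semantic confidence measures (e.g. sequence probabilities or semantic entropy), which is what makes a single parameter affect calibration, discrimination, and entropy-based uncertainty jointly.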
Submission Number: 144