Semantic-Level Confidence Calibration of Language Models via Temperature Scaling

Published: 19 Mar 2025 · Last Modified: 25 Apr 2025 · AABI 2025 Workshop Track · CC BY 4.0
Keywords: semantic calibration; temperature scaling; LLMs
TL;DR: We introduce semantic-level confidence measures for LLMs and, through an extensive empirical investigation, show that scalar temperature scaling serves as a simple method for semantic calibration.
Abstract: Calibration of language models is typically studied at the token level, with scalar temperature scaling serving as the primary approach for recalibrating models. Recent multi-sampling techniques make it possible to elicit semantic uncertainty measures from language models. However, these techniques focus on summary statistics of the semantic confidence distributions induced by existing measures rather than on how well calibrated those distributions are, a crucial factor in ensuring that the resulting semantic likelihoods are both meaningful and reliable. In this paper, we investigate whether and how temperature scaling, which directly influences generative diversity and token-level calibration, affects semantic calibration. We address these questions by investigating semantic-level calibration in both pre-trained and fine-tuned models. In particular, we introduce a framework for assessing semantic confidence that incorporates both existing and novel confidence measures, comparing them to a single-generation confidence measure. Furthermore, we investigate various temperature scaling methods and their effect on semantic calibration. Our experiments span both open-book and closed-book question answering datasets. Our empirical findings demonstrate that scalar temperature scaling, when appropriately applied, provides a simple, widely applicable, and effective method for improving semantic calibration in language models.
Submission Number: 7
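
To make the two ingredients of the abstract concrete, below is a minimal, illustrative sketch of (i) scalar temperature scaling of token logits and (ii) a multi-sampling semantic confidence estimate. The function names (`temperature_scale`, `semantic_confidence`) and the string-normalization stand-in for semantic equivalence are assumptions for illustration, not the paper's actual implementation, which would typically use an entailment-based clustering of generations.

```python
import numpy as np
from collections import Counter

def temperature_scale(logits, T):
    """Apply scalar temperature scaling to a vector of token logits.

    T > 1 flattens the distribution (less confident, more diverse
    sampling); T < 1 sharpens it. T = 1 leaves it unchanged.
    """
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                      # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def semantic_confidence(samples, equivalent):
    """Estimate a semantic-level confidence from multiple sampled answers.

    `samples` is a list of generated answers; `equivalent` maps each
    answer to a canonical meaning (a stand-in here for a proper
    semantic-equivalence check, e.g. via NLI). The confidence in the
    majority meaning is its empirical frequency across samples.
    """
    meanings = Counter(equivalent(s) for s in samples)
    top_meaning, count = meanings.most_common(1)[0]
    return top_meaning, count / len(samples)

# Toy usage: scale token logits, then aggregate sampled answers.
probs = temperature_scale([2.0, 1.0, 0.1], T=1.5)
answers = ["Paris", "paris", "Lyon", "Paris."]
meaning, conf = semantic_confidence(
    answers, equivalent=lambda a: a.strip(".").lower()
)
print(probs, meaning, conf)   # meaning='paris', conf=0.75
```

In this toy setup, the temperature chosen for sampling changes both the diversity of `answers` and the resulting semantic confidence, which is the interaction the paper studies empirically.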