Improving the Calibration of Confidence Scores in Text Generation Using the Output Distribution's Characteristics

ACL ARR 2025 February Submission755 Authors

11 Feb 2025 (modified: 09 May 2025), ACL ARR 2025 February Submission, CC BY 4.0
Abstract: Well-calibrated model confidence scores can improve the usefulness of text generation models. For example, users can be prompted to review low-confidence predictions, so that bad or potentially dangerous outputs are not accepted unchecked. To be practically useful, these scores need to be well calibrated with the quality of the output. However, confidence metrics are not always well calibrated in text generation. One reason is that generation tasks often admit many valid answers, which previous methods do not always account for: a confident model may spread probability over many sequences because they are all valid, not because it is unsure how to perform the task. We propose task-agnostic confidence metrics suited to generation, which rely solely on model probabilities without the need for further fine-tuning or heuristics. Using these, we improve the calibration of BART and Flan-T5 on summarization, translation, and question answering datasets.
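
To make the distinction the abstract draws concrete, the minimal sketch below contrasts a standard sequence-probability confidence with a score built from characteristics of the per-step output distribution, so that probability spread across several plausible continuations is not automatically read as low confidence. This is an illustrative assumption, not the paper's actual metrics; the function names and the top-k mass heuristic are hypothetical.

# Hypothetical illustration (not the authors' metrics): two ways to score
# confidence from a model's per-step logits for a generated sequence.
import torch

def mean_logprob_confidence(token_logits: torch.Tensor, token_ids: torch.Tensor) -> float:
    # Baseline: average log-probability of the tokens that were generated.
    # Penalizes outputs whenever probability mass is shared with alternatives,
    # even if those alternatives are equally valid answers.
    log_probs = torch.log_softmax(token_logits, dim=-1)                # (T, V)
    chosen = log_probs.gather(1, token_ids.unsqueeze(1)).squeeze(1)    # (T,)
    return chosen.mean().item()

def distribution_shape_confidence(token_logits: torch.Tensor, top_k: int = 5) -> float:
    # Assumed alternative: score each step by how much probability mass sits
    # in its top-k tokens. Mass concentrated on a few candidates is treated as
    # confident even when no single token dominates.
    probs = torch.softmax(token_logits, dim=-1)                        # (T, V)
    topk_mass = probs.topk(top_k, dim=-1).values.sum(dim=-1)           # (T,)
    return topk_mass.mean().item()

if __name__ == "__main__":
    # Dummy logits for a 4-token output over a 10-token vocabulary.
    logits = torch.randn(4, 10)
    ids = logits.argmax(dim=-1)
    print(mean_logprob_confidence(logits, ids), distribution_shape_confidence(logits))
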
Paper Type: Short
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: calibration, uncertainty
Contribution Types: Model analysis & interpretability
Languages Studied: English, German, Russian, Filipino
Submission Number: 755