Calibrated on Average, but not Within Each Slice: Few-shot Calibration for All Slices of a Distribution

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Supplementary Material: pdf
Primary Area: general machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: calibration, language models, LMs, domains
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: While LMs appear to be well-calibrated on broad distributions, they remain miscalibrated on meaningful slices of that broader distribution. We propose a few-shot recalibration method to recalibrate the LM for each slice.
Abstract: Recent work has uncovered promising ways to extract well-calibrated confidence estimates from language models (LMs), in which the model's confidence accurately reflects the probability that its answer is correct. However, while a model may be well-calibrated on average over some input distribution, the same model can be significantly miscalibrated within narrower slices of the full distribution. For example, we find that a model may be well-calibrated over multiple-choice exam questions, but this calibration is the result of systematic overconfidence in one subject (e.g., math) being balanced out by systematic underconfidence in another subject (e.g., history). In practice, being calibrated within narrower slices of a distribution is important because the full distribution is often formed from the queries of individual users, each of whom only cares about a narrower slice. In this work, we propose a new framework for calibrating models on any given slice of a distribution, using just a few unlabeled samples from that slice. Specifically, we train a model that approximates the precision-threshold curve for any given slice by using its few-shot samples to predict the LM's empirical precision at various confidence thresholds. This allows us to directly identify slice-specific thresholds above which the LM's predictions can be trusted (e.g., for a target precision of 90%) and below which it should abstain. We also show that the precision curve can be mapped back to the classic calibration curve, which can guide adjustments to the LM's confidence that lower calibration error. Experiments show that our few-shot recalibrator consistently outperforms existing calibration methods, for instance reducing calibration error for PaLM2-Large on MMLU by 16% compared to temperature scaling.
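To make the quantities in the abstract concrete, the sketch below computes an empirical precision-threshold curve for a single labeled slice and reads off the slice-specific threshold that meets a target precision (e.g., 90%). Note this is only an illustration of the curve itself: the paper's contribution is *predicting* this curve from a few unlabeled samples of the slice, which this sketch does not do. All function names here are hypothetical, not from the paper's code.

```python
import numpy as np

def precision_at_threshold(confidences, correct, tau):
    """Empirical precision of the predictions whose confidence is >= tau."""
    kept = confidences >= tau
    if kept.sum() == 0:
        return 1.0  # nothing retained above tau; vacuously precise
    return correct[kept].mean()

def threshold_for_target_precision(confidences, correct, target=0.9):
    """Smallest confidence threshold whose empirical precision meets the target.

    Above the returned threshold the LM's predictions can be trusted (at the
    target precision on this slice); below it, the LM should abstain.
    """
    for tau in np.unique(confidences):  # candidate thresholds, sorted ascending
        if precision_at_threshold(confidences, correct, tau) >= target:
            return tau
    return 1.0  # target precision unreachable: abstain everywhere

# Hypothetical slice: model confidences and whether each answer was correct.
conf = np.array([0.95, 0.9, 0.85, 0.8, 0.7, 0.6, 0.55, 0.5])
corr = np.array([1, 1, 1, 0, 1, 0, 0, 1], dtype=bool)
print(threshold_for_target_precision(conf, corr, target=0.9))  # 0.85
```

In this toy slice, overall precision at threshold 0 would be 5/8, but restricting to predictions with confidence >= 0.85 yields 3/3 correct, so 0.85 is the slice-specific threshold for a 90% precision target. The few-shot recalibrator's job is to estimate this curve, and hence this threshold, for a new slice from a handful of unlabeled queries.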
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4936