A Study on the Calibration of In-context Learning

Hanlin Zhang; YiFan Zhang; Yaodong Yu; Dhruv Madeka; Dean Foster; Eric P. Xing; Himabindu Lakkaraju; Sham M. Kakade

Published: 27 Oct 2023, Last Modified: 24 Apr 2024

Venue: ICBINB 2023

Keywords: Calibration, Language Models, In-context Learning

Abstract: Modern auto-regressive models are trained to minimize log loss by predicting the next token. As a result, they are expected to produce calibrated answers when problems are framed as next-token prediction tasks. We study this for in-context learning (ICL), a widely used way to adapt frozen large language models (LLMs) by crafting prompts, and investigate the trade-offs between performance and calibration on a wide range of natural language understanding and reasoning tasks. We conduct extensive experiments showing that these trade-offs may worsen as we increase model size, incorporate more ICL examples, and fine-tune models using instruction or dialog tuning on carefully curated datasets. Furthermore, we find that common, widely effective recalibration techniques such as temperature scaling may provide only limited reductions in calibration error, suggesting that new methods may be required for settings where models are expected to be reliable.
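To make the two quantities in the abstract concrete, the following is a minimal sketch of how calibration error and temperature scaling are commonly computed. It is not taken from the paper: the abstract does not say which calibration metric is used, so expected calibration error (ECE) with equal-width confidence bins is an assumption here, and the function names (`softmax`, `expected_calibration_error`, `fit_temperature`) and the grid search over temperatures are hypothetical choices made for illustration.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Row-wise softmax of logits divided by a scalar temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)


def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin-weighted average of |accuracy - mean confidence| (assumed metric)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    confidence = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidence > lo) & (confidence <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidence[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)


def fit_temperature(logits, labels, grid=None):
    """Pick the temperature minimizing held-out log loss via a simple grid search."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    if grid is None:
        grid = np.linspace(0.25, 5.0, 96)  # hypothetical search range

    def nll(t):
        p = softmax(logits, t)
        return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

    return float(min(grid, key=nll))


if __name__ == "__main__":
    # Toy, overconfident logits over two candidate answer tokens.
    logits = np.array([[4.0, 0.0]] * 4)
    labels = np.array([0, 0, 0, 1])
    t = fit_temperature(logits, labels)
    print("ECE before:", expected_calibration_error(softmax(logits), labels))
    print("ECE after: ", expected_calibration_error(softmax(logits, t), labels))
```

In an ICL setting, `logits` would typically be the model's scores for the candidate answer tokens given the prompt. Temperature scaling rescales confidence without changing the argmax prediction, so it can reduce calibration error but cannot change accuracy.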

Submission Number: 8
