Keywords: Applications of interpretability, Understanding high-level properties of models, Other
Other Keywords: calibration, multilinguality
TL;DR: Multilingual calibration is worse than English calibration; intermediate representations improve multilingual calibration.
Abstract: Confidence calibration, the alignment between a model's predicted confidence and its empirical correctness, is crucial for the trustworthiness of Large Language Models (LLMs). Previous studies on multilingual calibration mainly use machine-translated data and are limited to a small number of languages. In this work, we present the first systematic evaluation of multilingual calibration across 3 high-quality datasets, over 100 languages, and 7 model families. Our analysis reveals that LLMs exhibit significant disparities across languages, particularly underperforming in low-resource and non-Latin-script settings.
To understand the source of this miscalibration, we conduct a layer-wise analysis and uncover a consistent pattern: intermediate layers often yield better-calibrated outputs than final layers, especially for low-resource languages.
Inspired by this observation, we propose leveraging intermediate representations to enhance multilingual calibration. Our methods significantly improve Expected Calibration Error (ECE), Brier Score, and AUROC, outperforming final-layer baselines by large margins. Importantly, our approach is orthogonal to existing calibration methods, and combining them leads to further improvements. This work challenges the conventional reliance on final-layer decoding and opens a new direction for achieving robust and equitable multilingual calibration.
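For reference, the calibration metrics named in the abstract (ECE and Brier Score) follow standard definitions. Below is a minimal sketch of both, assuming per-example confidence scores and 0/1 correctness labels; the function names and binning choices are illustrative and do not reflect the authors' implementation.

```python
import numpy as np

def expected_calibration_error(confidences, correctness, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - confidence| over equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correctness = np.asarray(correctness, dtype=float)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(bin_edges[:-1], bin_edges[1:])):
        # Include the lower edge only for the first bin so every confidence falls in exactly one bin.
        if i == 0:
            mask = (confidences >= lo) & (confidences <= hi)
        else:
            mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        bin_conf = confidences[mask].mean()   # average predicted confidence in the bin
        bin_acc = correctness[mask].mean()    # empirical accuracy in the bin
        ece += mask.mean() * abs(bin_acc - bin_conf)
    return float(ece)

def brier_score(confidences, correctness):
    """Mean squared error between predicted confidence and binary correctness."""
    confidences = np.asarray(confidences, dtype=float)
    correctness = np.asarray(correctness, dtype=float)
    return float(np.mean((confidences - correctness) ** 2))
```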
Submission Number: 56