Keywords: Language Models, Calibration, Multi-Turn Debates, Finetuning.
TL;DR: We uncover overconfidence in debate judge models and address the issue by finetuning on the left tail of a Gumbel distribution fitted to ratings.
Abstract: Multi-turn debates have gained attention as language evaluation tasks for subject matter comprehension, critical reasoning and long-form responses. Language Models (LMs) play the role of judges for obtaining subjective ratings as a cheap alternative to human labor. However, similar to humans, LM judges may remain unsure of their ratings and rate debate arguments either under- or over-confidently. We empirically study judge calibration in multi-turn self debates, wherein a single LM debater debates with itself, and uncover that LM judges are often overconfident in their judgements. Model confidence ratings increase over debate rounds even as rated scores may decrease, and judge confidence exceeds score ratings for both frontier and small models. We further show that while naive finetuning may improve calibration by increasing scores, it hurts the model's ability to provide faithful ratings by inducing mode collapse: overfitted judges assign scores identical to their confidence and rate different arguments indistinguishably. Based on our empirical analysis and observations, we propose a practical finetuning strategy to calibrate LM judges. Since lower confidences and scores form the tail of the dataset and are most desirable from a judge's perspective, we fit a Gumbel distribution to the expected ratings of debate arguments, rejection sample from the tail of the distribution, and finetune models on the sampled data to make calibrated judgements. Our strategy, termed Gumbel Finetuning (GFT), balances model confidence with scores while learning an expressive multi-modal distribution over ratings, compared to naive ratings and Supervised Finetuning (SFT). Debate datasets and code will be released as part of the final version.
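The abstract describes GFT as fitting a Gumbel distribution to expected ratings, rejection sampling from its tail, and finetuning on the retained examples. Below is a minimal, illustrative sketch of that selection step; the toy ratings, the 20th-percentile cutoff, and the use of scipy's right-skewed Gumbel variant are all assumptions for illustration, not the paper's actual procedure.

```python
import numpy as np
from scipy.stats import gumbel_r

# Hypothetical expected ratings of debate arguments on a 1-10 scale
# (illustrative values, not the paper's data).
ratings = np.array([7.5, 8.0, 6.5, 9.0, 5.5, 8.5, 7.0, 4.5, 8.0, 6.0])

# Fit a Gumbel distribution to the expected ratings.
loc, scale = gumbel_r.fit(ratings)

# Choose a left-tail cutoff; the 20th percentile here is an assumed threshold.
tail_cutoff = gumbel_r.ppf(0.2, loc=loc, scale=scale)

def select_tail_examples(examples, example_ratings, cutoff):
    """Retain examples whose expected rating falls in the left tail."""
    return [ex for ex, r in zip(examples, example_ratings) if r <= cutoff]

# The retained subset would then serve as the finetuning set for the judge.
```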
Primary Area: interpretability and explainable AI
Submission Number: 5928