Understanding Judge Calibration in Multi-Turn Debates

01 Mar 2026 (modified: 29 Apr 2026) · Under review for TMLR · CC BY 4.0
Abstract: Multi-turn debates have gained attention as language evaluation tasks for subject-matter comprehension, critical reasoning, and long-form responses. Language Models (LMs) act as judges to obtain subjective ratings as a cheap alternative to human labor. However, like humans, LM judges may be unsure of their ratings and rate debate arguments either under- or over-confidently. We empirically study judge calibration in multi-turn self-debates, in which a single LM debater debates with itself, and find that LM judges are often overconfident in their judgements. Miscalibration arises because model confidence ratings increase over debate rounds while rated scores may decrease, and judge confidence exceeds score ratings for both frontier and open-source models. We further show that while naive finetuning may improve calibration by increasing scores, it does not necessarily reduce overconfidence: finetuned overconfident judges assign ratings similar to their confidence and rate different arguments indistinguishably. Our empirical analysis yields an observation that helps mitigate overconfidence. Since lower confidences and scores form the left tail of the rating distribution and are most desirable from a judge's perspective, sampling from this tail should calibrate confidence. We therefore fit a mixture of Gumbel distributions to the expected ratings of debate arguments and rejection-sample from its left tail to finetune judge models. Compared to naive ratings and Supervised Finetuning (SFT), sampling from the mixture of Gumbels lowers judge confidence and yields well-calibrated judges while learning an expressive multi-modal distribution over ratings. Debate datasets and code will be released as part of the final version.
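The core recipe in the abstract — fit a mixture of Gumbels to rating data, then keep only the left tail for finetuning — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the synthetic ratings, the 2-component EM fit with a moment-matching M-step, and the 0.2-quantile tail threshold are all assumptions made for the example.

```python
import numpy as np
from scipy.stats import gumbel_r

rng = np.random.default_rng(0)

# Synthetic stand-in for expected ratings of debate arguments (illustrative only).
data = np.concatenate([
    gumbel_r.rvs(loc=6.0, scale=0.8, size=600, random_state=rng),
    gumbel_r.rvs(loc=8.5, scale=0.5, size=400, random_state=rng),
])

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def fit_gumbel_mixture(x, k=2, n_iter=200):
    """EM for a k-component Gumbel mixture. The M-step uses weighted
    moment matching (mean = mu + gamma*beta, var = pi^2 beta^2 / 6)
    as a simple approximation to weighted MLE."""
    mu = np.quantile(x, np.linspace(0.2, 0.8, k))
    beta = np.full(k, x.std())
    w = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: per-component responsibilities for each point.
        dens = np.stack([w[j] * gumbel_r.pdf(x, loc=mu[j], scale=beta[j])
                         for j in range(k)])
        r = dens / dens.sum(axis=0, keepdims=True)
        # M-step: match weighted mean/variance to Gumbel moments.
        for j in range(k):
            m = np.average(x, weights=r[j])
            v = np.average((x - m) ** 2, weights=r[j])
            beta[j] = np.sqrt(6.0 * v) / np.pi
            mu[j] = m - EULER_GAMMA * beta[j]
        w = r.mean(axis=1)
    return mu, beta, w

def mixture_cdf(x, mu, beta, w):
    return sum(wj * gumbel_r.cdf(x, loc=mj, scale=bj)
               for mj, bj, wj in zip(mu, beta, w))

mu, beta, w = fit_gumbel_mixture(data)

# Left-tail threshold: the q-quantile of the fitted mixture, via bisection.
q = 0.2
lo, hi = data.min() - 5.0, data.max() + 5.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if mixture_cdf(mid, mu, beta, w) < q:
        lo = mid
    else:
        hi = mid
threshold = 0.5 * (lo + hi)

# Rejection sampling: keep only ratings that fall in the mixture's left tail;
# these low-confidence/low-score examples would form the finetuning set.
tail_samples = data[data <= threshold]
```

In practice `data` would be replaced by the judge's expected ratings, and `tail_samples` would seed the finetuning set; the quantile `q` controls how aggressively the tail is emphasized.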
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Charles_Xu1
Submission Number: 7721