Distribution-Calibrated Inference Time Compute for Thinking LLM-as-a-Judge

ICLR 2026 Conference Submission21552 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Thinking LLM, LLM-as-a-Judge, Inference Time Compute
TL;DR: Thinking LLM judges are noisy; we allocate inference-time compute to sample $n$ ratings per item and aggregate them with a distribution-calibrated model, matching or exceeding standard baselines and individual human raters.
Abstract: Thinking Large Language Models (LLMs) used as judges for pairwise preferences remain noisy at the single-sample level, and common aggregation rules (majority vote, soft self-consistency, or instruction-based self-aggregation) are inconsistent when ties are allowed. We study inference-time compute (ITC) for evaluators that generate $n$ independent thinking–rating samples per item, and propose a principled, distribution-calibrated aggregation scheme. Our method models three-way preferences with a Bradley–Terry–Davidson formulation on rating counts, leveraging both polarity (the margin among non-tie votes) and decisiveness (the non-tie rate) to distinguish narrow margins from strong consensus. Across diverse evaluation benchmarks, our approach consistently reduces mean absolute error (MAE) and increases pairwise accuracy versus standard baselines, and when evaluated against human-consensus meta-labels, matches or exceeds individual human raters. These results show that carefully allocating ITC and aggregating with distribution-aware methods turns noisy individual model judgments into reliable ratings for evaluation.
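The submission does not include code, but the aggregation idea in the abstract can be sketched. Assuming each item receives $n$ sampled verdicts summarized as counts $(a, b, t)$ of A-wins, B-wins, and ties, the Davidson extension of Bradley–Terry assigns $P(A) \propto p_A$, $P(B) \propto p_B$, and $P(\text{tie}) \propto \nu\sqrt{p_A p_B}$. The sketch below fits these parameters per pair by a coarse grid-search MLE; the function names, grid ranges, and the grid-search choice itself are illustrative assumptions, not the authors' implementation.

```python
import math

def davidson_probs(lam, nu):
    """Three-way probabilities under the Davidson tie-extended Bradley-Terry
    model, with p_A = exp(lam) and p_B fixed to 1."""
    pa, pb = math.exp(lam), 1.0
    g = nu * math.sqrt(pa * pb)  # tie mass, scaled by decisiveness parameter nu
    z = pa + pb + g
    return pa / z, pb / z, g / z

def fit_davidson(a, b, t):
    """Grid-search MLE for one pair given verdict counts
    (a = A-wins, b = B-wins, t = ties) out of n = a + b + t samples.
    Returns the fitted (P(A wins), P(B wins), P(tie))."""
    lam_grid = [i / 20 for i in range(-80, 81)]   # log-skill ratio in [-4, 4]
    nu_grid = [i / 20 for i in range(0, 101)]     # tie parameter in [0, 5]
    best, best_ll = None, -float("inf")
    for lam in lam_grid:
        for nu in nu_grid:
            pA, pB, pT = davidson_probs(lam, nu)
            if t and pT <= 0.0:
                continue  # observed ties are impossible when nu = 0
            ll = a * math.log(pA) + b * math.log(pB)
            if t:
                ll += t * math.log(pT)
            if ll > best_ll:
                best_ll, best = ll, (pA, pB, pT)
    return best
```

This captures the abstract's polarity/decisiveness distinction: counts like (6, 4, 0) and (3, 2, 5) share the same 3:2 margin among non-ties, but the latter fits a much larger tie mass, so the calibrated verdict is less decisive.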
Primary Area: foundation or frontier models, including LLMs
Submission Number: 21552