Beyond Consensus: Mitigating the Agreeableness Bias in LLM Judge Evaluations

ICLR 2026 Conference Submission 15839 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · Readers: Everyone · CC BY 4.0
Keywords: Large Language Models (LLMs), Feedback Generation, LLM-as-a-Judge, Benchmarking, Ensemble, Calibration, Regression Analysis
Abstract: New Large Language Models (LLMs) become available every few weeks, and modern application developers are confronted with the unenviable task of deciding whether to switch to a new model. While human evaluation remains the gold standard, it is costly and does not scale. The state-of-the-art approach is to use LLMs as evaluators (LLM-as-a-judge), but this suffers from a critical flaw: LLMs exhibit a strong positive bias. We provide {\em empirical} evidence showing that while LLMs can identify valid outputs with high accuracy (i.e., True Positive Rate >96%), they are remarkably poor at identifying invalid ones (i.e., True Negative Rate <25%). This systematic bias, coupled with class imbalance, often leads to inflated reliability scores. While ensemble-based methods like majority voting can help, we show that they are insufficient. We introduce an optimal minority-veto strategy that is resilient to missing data and substantially mitigates this bias. For scenarios requiring even higher precision, we propose a novel regression-based framework that directly models the validator bias using a small set of human-annotated ground truth data. On a challenging code-feedback task over 366 high-school Python programs, our regression approach reduces the maximum absolute error to just 1.4%, a 2x improvement over the best-performing ensemble of 14 state-of-the-art LLMs.
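
The abstract describes two mitigation ideas: a minority-veto ensemble rule and a regression-based calibration of judge bias against a small human-labeled subset. The snippet below is a minimal, illustrative sketch of both; the function names (`minority_veto`, `fit_bias_correction`), the choice of a linear model, and the veto threshold are assumptions for illustration and do not reflect the paper's actual implementation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def minority_veto(judgments, veto_threshold=1):
    """Illustrative minority-veto aggregation: a small number of 'invalid'
    votes (>= veto_threshold) overrides the majority, counteracting the
    judges' positive bias. Missing judgments (None) are ignored."""
    votes = [v for v in judgments if v is not None]
    if not votes:
        return None  # no usable judgments for this item
    negatives = sum(1 for v in votes if v == 0)
    return 0 if negatives >= veto_threshold else 1

def fit_bias_correction(judge_valid_rates, true_valid_rates):
    """Illustrative regression-based calibration: learn a mapping from the
    judge-reported valid rate to the human-annotated valid rate on a small
    labeled subset, then apply it to new, unlabeled batches."""
    X = np.asarray(judge_valid_rates).reshape(-1, 1)
    y = np.asarray(true_valid_rates)
    return LinearRegression().fit(X, y)

# Hypothetical per-item judge votes (1 = valid, 0 = invalid, None = missing)
item_votes = [1, 1, None, 0, 1]
print(minority_veto(item_votes))  # -> 0: a single veto flips the decision

# Hypothetical calibration data: judge-reported vs. human-annotated valid rates
judge_rates = [0.95, 0.90, 0.97, 0.88]
human_rates = [0.78, 0.70, 0.82, 0.66]
calibrator = fit_bias_correction(judge_rates, human_rates)
print(calibrator.predict([[0.93]]))  # bias-corrected estimate for a new batch
```

The veto rule trades some true-positive accuracy for a much better true-negative rate, while the calibration step corrects the aggregate estimate using only a small amount of ground truth; both numbers above are made up for demonstration.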
Primary Area: datasets and benchmarks
Submission Number: 15839