Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines

TMLR Paper8350 Authors

10 Apr 2026 (modified: 06 May 2026) · Under review for TMLR · CC BY 4.0
Abstract: LLM-as-a-Judge has become the dominant paradigm for evaluating language model outputs, yet LLM judges exhibit systematic biases that compromise evaluation reliability. We present a comprehensive empirical study comparing nine debiasing strategies across five judge models from four provider families (Google, Anthropic, OpenAI, Meta), three benchmarks (MT-Bench n=400, LLMBar n=200, custom n=225), and four bias types. Our key findings: (1) Style bias is the dominant bias (0.76–0.92 across all models), far exceeding position bias (≤ 0.04), yet has received minimal research attention. (2) All models show a conciseness preference on expansion pairs, but truncation controls confirm they correctly distinguish quality from length (0.92–1.00 accuracy), suggesting quality-sensitive evaluation rather than a simple length bias. (3) Debiasing is beneficial but model-dependent: the combined budget strategy significantly improves Claude Sonnet 4 by +11.2 pp (p < 0.0001), with directionally positive trends for other models. Only 2 of 20 non-baseline configurations show decreased agreement. We release our evaluation framework, controlled dataset, and all experimental artifacts.
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We thank the three reviewers for their detailed and constructive feedback. This revision addresses every concrete request from each reviewer; a per-reviewer point-by-point response is posted in the corresponding review threads. Major changes are summarized below.

**New empirical evidence**

1. **Round-robin MODEL_ORIGIN dataset (100 new pairs).** Adds Llama and GPT-4o pairings so every judge family has same-family pairs, addressing Reviewer p6d3's RC1 and Reviewer GR4A's Critical 3 (the original 50 Gemini-vs-Claude pairs only support self-preference inference for Gemini and Claude judges).
2. **Position-mirrored STYLE pairs (50 new pairs).** All STYLE bias measurements are now position-averaged across the original (markdown in slot A) and mirrored (markdown in slot B) halves, addressing Reviewer GR4A's Critical 4.
3. **Human-annotation study on STYLE pairs (Appendix F).** Two engaged annotators on a 30-pair subsample. Aggregate human markdown preference is 57%; four of five LLM judges prefer markdown 73–97% on the same pairs (gap of +17 to +40 pp). GPT-4o (53%) is the only judge aligned with humans. Addresses Reviewer p6d3's RC3.
4. **Per-topic style bias analysis (Appendix F.2).** Shows style bias is heterogeneous across question topics, with the strongest bias for technical content (math, factual QA, coding) and weakest for creative writing. Addresses Reviewer Wp6H's concern about content/topic impact.
5. **Quality vs verbosity ablation as a dedicated subsection (Section 4.1.2).** Formalizes the expansion + truncation paired analysis as the explicit ablation requested by Reviewer Wp6H, with three predictive profiles (pure verbosity bias, pure quality sensitivity, indifference) and per-model classification.

**Methodological corrections**

6. **LENGTH bias is now length-aware, not slot-aware.** Reviewer GR4A correctly identified that the original analysis assumed response A was always longer, when in fact A is the longer response in only 34/50 expansion pairs. The corrected length-aware computation flips the headline verbosity finding: most models (Pro, Llama, Flash) show classical verbosity bias (prefer longer); Claude prefers shorter; GPT-4o is essentially neutral. We thank Reviewer GR4A for catching this.
7. **Mixed-effects logistic regression replaces the sign test** (per Reviewer p6d3 RC2). Per-model coefficients with instance random effects are reported in the new Table 5.
8. **LLMBar significance testing.** Bootstrap 95% CIs and McNemar tests with Holm-Bonferroni correction added to Table 2 (per Reviewer p6d3).
9. **Bias score per bias type explicitly defined** in Section 3.5 (per Reviewer GR4A's Critical 1).
10. **Figure 1 now uses signed values** (RdBu colormap, centered at zero) consistent with Section 4.1 (per Reviewer GR4A's Major 1).

**Updated headline findings**

The full experiment matrix was rerun (the original cached results were lost during a hardware migration). Aggregated numbers shifted slightly from the original due to model stochasticity at temperature 0.1; qualitative findings hold and several previously non-significant results are now significant. Five model-strategy pairs now reach $p < 0.05$ on MT-Bench via mixed-effects regression (was 2):

- Claude S8 ($+11.5$ pp, $p < 0.0001$)
- Flash S8 ($+7.5$ pp, $p < 0.0001$)
- Claude S5 ($+7.3$ pp, $p = 0.0009$)
- Flash S1 ($+4.7$ pp, $p = 0.004$)
- Llama S8 ($+4.5$ pp, $p = 0.011$)

The new headline practical finding: **Gemini 2.5 Flash with the Combined Budget strategy achieves the highest agreement of any configuration tested ($71.0\%$, $\kappa = 0.549$, $p < 0.0001$) at approximately $1/15\times$ the per-evaluation cost of the strongest frontier configuration.**

**Other updates**

- Minimum detectable effect at $n = 400$ (approximately 4 to 5 pp) explicitly stated (Appendix D, per Reviewer p6d3).
- Tie handling in McNemar's test documented (Appendix D, per Reviewer GR4A).
- MT-Bench sampling described (Section 3.4: numpy seed 42, no stratification, per Reviewer GR4A).
- S8 vs S1 tie-rate discrepancy explained in Section 4.2 (per Reviewer GR4A).
- Six new 2025/2026 references added (per Reviewer Wp6H).
- Strategy naming uses hybrid form throughout (e.g., "S8 (Combined Budget)") (per Reviewer GR4A).
- Anonymous repository at https://anonymous.4open.science/r/llm-as-judge-2F5F/ (cited in conclusion footnote) provides browseable access to all 9 strategies, the full 375-pair controlled dataset, and per-instance cached results, addressing Reviewer GR4A's Major 4 about artifact availability.
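The difference between the slot-aware and length-aware LENGTH computations (item 6 under the methodological corrections) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the `Judgment` record, the function names, and the choice to drop ties and equal-length pairs are all assumptions made for illustration, and the paper's actual bias score is the one defined in its Section 3.5.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Judgment:
    """Hypothetical record of one pairwise verdict (not the paper's schema)."""
    len_a: int    # length of the response shown in slot A
    len_b: int    # length of the response shown in slot B
    verdict: str  # "A", "B", or "tie"


def slot_aware_rate(judgments: List[Judgment]) -> float:
    """Share of decided verdicts that pick slot A.

    This only measures length preference if A is always the longer response.
    """
    decided = [j for j in judgments if j.verdict in ("A", "B")]
    if not decided:
        return float("nan")
    return sum(j.verdict == "A" for j in decided) / len(decided)


def length_aware_rate(judgments: List[Judgment]) -> float:
    """Share of decided verdicts that pick whichever response is longer.

    Assumption: ties and equal-length pairs are excluded from the denominator.
    """
    decided = [
        j for j in judgments
        if j.verdict in ("A", "B") and j.len_a != j.len_b
    ]
    if not decided:
        return float("nan")
    longer_wins = sum(
        (j.verdict == "A") == (j.len_a > j.len_b) for j in decided
    )
    return longer_wins / len(decided)


# Toy data: a judge that always picks the longer response, with the longer
# response in slot A for only half of the pairs.
toy = [
    Judgment(100, 50, "A"),
    Judgment(40, 90, "B"),
    Judgment(30, 80, "B"),
    Judgment(120, 60, "A"),
]
```

On the toy data the slot-aware rate is 0.5 (apparently neutral) while the length-aware rate is 1.0 (always prefers longer), which shows how a slot-based proxy can hide or distort a length preference when slot A is not consistently the longer response.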
Assigned Action Editor: ~Colin_Raffel1
Submission Number: 8350