Keywords: LLM-as-a-Judge, Large Language Models, Weak Evaluators, Evaluation Suite, Benchmark, Unsupervised Learning
TL;DR: Sage is a human-label-free evaluation suite for LLM-as-a-Judge that tests local and global consistency, aligns with supervised benchmarks, and shows that SOTA judges falter due to situational preferences.
Abstract: LLM-as-a-Judge has been widely adopted as an evaluation method and serves as a source of supervised rewards in model training. However, existing benchmarks for LLM-as-a-Judge rely mainly on human-annotated ground truth, which introduces human bias that undermines assessment reliability and imposes scalability constraints. To overcome these limitations, we introduce Sage, a novel evaluation suite that assesses the quality of LLM judges without requiring any human annotation. Inspired by axioms of rational choice theory, Sage introduces two new lenses for measuring LLM-as-a-Judge: local self-consistency (pair-wise preference stability) and global logical consistency (transitivity across a full set of preferences). We curate a dataset of 650 questions by combining structured benchmark problems with real-world user queries. Our experiments demonstrate both the intrinsic stability of our metrics and their high correlation with supervised benchmarks such as LLMBar and RewardBench2, confirming Sage's reliability as an evaluation suite for LLM-as-a-Judge. Based on Sage, we reveal that current *state-of-the-art* LLMs exhibit significant robustness deficiencies when acting as judges; even the top-performing models, Gemini-2.5-Pro and GPT-5, fail to maintain consistent preferences in nearly a quarter of difficult cases. We attribute this to a new phenomenon called **situational preference**, which explains why explicit rubrics or criteria can help models judge consistently across answer pairs. Our further analysis shows that fine-tuning LLM-as-a-Judge is an unreliable method that further induces human bias, while multi-agent judging and deep reasoning can each enhance performance through different means.
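The two consistency lenses in the abstract can be sketched in code. This is a minimal illustration, not the paper's implementation: it assumes a hypothetical pairwise `judge(a, b)` that returns whichever answer it prefers, and all function names are illustrative.

```python
from itertools import combinations


def local_consistency(pairs, judge):
    """Fraction of answer pairs whose verdict is stable when the
    presentation order is swapped (pair-wise preference stability).
    `judge(a, b)` is an assumed pairwise judge returning the preferred answer."""
    stable = sum(judge(a, b) == judge(b, a) for a, b in pairs)
    return stable / len(pairs)


def count_preference_cycles(items, judge):
    """Count intransitive triads (a > b > c > a) over all answer triples,
    a simple proxy for the global logical-consistency lens."""
    # Cache the judge's verdict for every ordered pair of answers.
    prefer = {(a, b): judge(a, b) for a in items for b in items if a != b}
    cycles = 0
    for a, b, c in combinations(items, 3):
        # A triple is cyclic in one of two rotational directions.
        for x, y, z in ((a, b, c), (a, c, b)):
            if (prefer[(x, y)] == x and prefer[(y, z)] == y
                    and prefer[(z, x)] == z):
                cycles += 1
    return cycles
```

A perfectly rational judge scores `local_consistency == 1.0` and zero cycles; a judge with situational preferences produces order-dependent verdicts or rock-paper-scissors-style cycles, which these two measures surface without any human labels.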
Supplementary Material: zip
Primary Area: generative models
Submission Number: 823