Keywords: Route; Judge; Large Reasoning Models; Benchmark
TL;DR: We introduce RouteJudge, a unified framework that addresses the reliability, cost, and latency challenges of automated judges through adaptive routing across a pool of judges.
Abstract: Large language models (LLMs) and large reasoning models (LRMs) are increasingly adopted as automated judges for pairwise evaluation of model outputs. However, their deployment faces three unresolved challenges: inconsistent reliability, high latency and token costs, and the lack of principled routing strategies. We introduce RouteJudge, the first unified framework for benchmarking and routing automated judges under accuracy–latency–cost trade-offs. Our contributions are threefold. (1) We construct six difficulty-aware datasets spanning reasoning (Math, Logic, Code) and non-reasoning (Knowledge, Roleplay, Writing) tasks, with human-verified gold standards. (2) We present the first benchmark of LRM-as-a-Judge, analyzing how intermediate thinking traces interact with final verdicts and uncovering systematic mismatches such as “good thinking but wrong verdict.” (3) We develop and evaluate both offline and online routing strategies that adaptively assign judges per instance, achieving strong accuracy–efficiency trade-offs. Experiments on 19 models show that LRMs improve reasoning accuracy at higher cost, while difficulty-aware online routing narrows this gap substantially. By unifying benchmarking and routing, RouteJudge establishes the first comprehensive framework for scalable and interpretable evaluation, positioning automated judges as a practical alternative to human experts.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 10774