Keywords: Route; Judge; Large Reasoning Models; Benchmark
TL;DR: We introduce RouteJudge, a unified framework that addresses the reliability, cost, and latency challenges of automated judges through adaptive routing across a pool of judges.
Abstract: Large language models (LLMs) and large reasoning models (LRMs) are increasingly adopted as automated judges for pairwise evaluation of model outputs. However, their deployment faces three unresolved challenges: inconsistent reliability, high latency and token costs, and the lack of principled routing strategies. We introduce RouteJudge, the first unified framework for benchmarking and routing automated judges under accuracy–latency–cost trade-offs. Our contributions are threefold. (1) We construct six difficulty-aware datasets spanning reasoning (Math, Logic, Code) and non-reasoning (Knowledge, Roleplay, Writing) tasks, with human-verified gold standards. (2) We present the first benchmark of LRM-as-a-Judge, analyzing how intermediate thinking traces interact with final verdicts and uncovering systematic mismatches such as “good thinking but wrong verdict.” (3) We develop and evaluate both offline and online routing strategies that adaptively assign judges per instance, achieving strong accuracy–efficiency trade-offs. Experiments on 19 models show that LRMs improve reasoning accuracy at higher cost, while difficulty-aware online routing narrows this gap substantially. By unifying benchmarking and routing, RouteJudge establishes the first comprehensive framework for scalable and interpretable evaluation, positioning automated judges as a practical alternative to human experts.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 10774