Who Routes the Router: Rethinking the Evaluation of LLM Routing Systems

ICLR 2026 Conference Submission 13566 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: LLM router, Evaluation
TL;DR: We present an open, reproducible evaluation framework that addresses key limitations in current router evaluations, such as limited task diversity, imbalanced model pools, and oversimplified metrics.
Abstract: The growing ecosystem of Large Language Models (LLMs) with diverse capabilities and costs has created the need for LLM routing systems that dynamically select the most appropriate model for each query. Evaluating these routing systems is important yet inherently challenging due to the complex interplay of multiple factors: the selection of representative input queries, the composition of the model pool, and the definition of evaluation metrics comprehensive enough to capture what makes a routing decision optimal. Through extensive analysis of existing benchmarks, we identify critical limitations that can lead to incomplete or misleading conclusions about router performance: (1) limited task diversity, (2) imbalanced model pools, and (3) oversimplified evaluation methodologies. To address these limitations, we propose a novel evaluation framework that incorporates diverse task distributions (33,337 queries across 68 categories), a balanced pool of 85 models with complementary strengths, and multi-faceted metrics that reflect real-world deployment scenarios. We implement this framework as an open-source benchmark, enabling researchers to rigorously assess routing strategies under realistic conditions. The code and dataset are shared anonymously at: https://anonymous.4open.science/r/rethinking-routing-evaluation-DE30
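As a rough illustration of what "multi-faceted metrics" can mean for router evaluation, the sketch below scores a routing strategy jointly on average quality, average cost, and worst-case per-category quality, rather than a single aggregate number. This is a hypothetical sketch, not the authors' released API; `Query`, `RoutingResult`, and `evaluate_router` are illustrative names, and per-query outcomes are assumed to be precomputed for every model in the pool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Query:
    text: str
    category: str  # one of the benchmark's task categories

@dataclass
class RoutingResult:
    quality: float  # e.g., a task-specific score in [0, 1]
    cost: float     # e.g., dollar cost of the chosen model's response

# A router is any function mapping a query to a model identifier.
Router = Callable[[Query], str]

def evaluate_router(
    router: Router,
    queries: List[Query],
    outcomes: Dict[str, Dict[str, RoutingResult]],  # outcomes[model][query.text]
) -> Dict[str, float]:
    """Score a routing strategy along several axes at once."""
    qualities: List[float] = []
    costs: List[float] = []
    per_category: Dict[str, List[float]] = {}
    for q in queries:
        choice = router(q)                 # the routing decision under test
        result = outcomes[choice][q.text]  # precomputed quality/cost of that choice
        qualities.append(result.quality)
        costs.append(result.cost)
        per_category.setdefault(q.category, []).append(result.quality)
    return {
        "mean_quality": sum(qualities) / len(qualities),
        "mean_cost": sum(costs) / len(costs),
        # Worst-case category quality exposes routers that win only on
        # over-represented task types.
        "min_category_quality": min(
            sum(v) / len(v) for v in per_category.values()
        ),
    }

# Toy usage with a two-model pool and a router that always picks the cheap model.
if __name__ == "__main__":
    queries = [Query("2+2?", "math"), Query("Translate 'hello'", "translation")]
    outcomes = {
        "small-model": {"2+2?": RoutingResult(1.0, 0.01),
                        "Translate 'hello'": RoutingResult(0.6, 0.01)},
        "large-model": {"2+2?": RoutingResult(1.0, 0.10),
                        "Translate 'hello'": RoutingResult(0.9, 0.10)},
    }
    cheap_router: Router = lambda q: "small-model"
    print(evaluate_router(cheap_router, queries, outcomes))
```

Reporting the minimum per-category quality alongside the means is one way a benchmark can reward routers that perform well across a diverse task distribution instead of only on its most common categories.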
Primary Area: datasets and benchmarks
Submission Number: 13566