Keywords: Evaluation, LLM Routing
TL;DR: We identify problems with current LLM routing evaluation and propose a new evaluation framework to address them.
Abstract: The growing ecosystem of Large Language Models (LLMs) with diverse capabilities and costs has motivated the need for LLM routing systems that dynamically select the most appropriate model for each query.
Evaluating these routing systems is important yet inherently challenging due to the complex interplay of multiple factors: the selection of representative input queries, the composition of the model pool, and the definition of comprehensive evaluation metrics for optimal routing decisions.
Through extensive analysis of existing benchmarks, we identify critical limitations that can lead to incomplete results or misleading conclusions about router performance:
(1) limited task diversity, (2) imbalanced model pools, and (3) oversimplified evaluation methodologies.
To address these limitations, we propose a novel evaluation framework that incorporates diverse task distributions (33,337 queries across 68 categories), a balanced pool of 85 models with complementary strengths, and multi-faceted metrics that reflect real-world deployment scenarios.
We implement this framework as an open-source benchmark, enabling researchers to rigorously assess routing strategies under realistic conditions.
The code is available at: \url{https://github.com/jy-yuan/rethinking-routing-evaluation}
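For readers unfamiliar with routing evaluation, the sketch below illustrates the kind of accuracy-cost accounting such a benchmark typically performs: scoring a router's per-query model choices against an oracle that always picks the cheapest correct model. The data schema and the `evaluate_router` helper are illustrative assumptions for exposition only, not the framework released with this paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class QueryRecord:
    """One benchmark query with per-model outcomes (hypothetical schema)."""
    query: str
    correct: Dict[str, bool]   # model name -> answered correctly?
    cost: Dict[str, float]     # model name -> inference cost (e.g., USD)

def evaluate_router(
    records: List[QueryRecord],
    router: Callable[[str], str],   # maps a query to a model name
) -> Dict[str, float]:
    """Average accuracy and cost of the router's selections, alongside
    an oracle that always picks the cheapest model that answers correctly."""
    acc = cost = oracle_acc = oracle_cost = 0.0
    for r in records:
        chosen = router(r.query)
        acc += float(r.correct[chosen])
        cost += r.cost[chosen]
        correct_models = [m for m, ok in r.correct.items() if ok]
        if correct_models:
            cheapest = min(correct_models, key=lambda m: r.cost[m])
            oracle_acc += 1.0
            oracle_cost += r.cost[cheapest]
    n = len(records)
    return {
        "router_accuracy": acc / n,
        "router_cost": cost / n,
        "oracle_accuracy": oracle_acc / n,
        "oracle_cost": oracle_cost / n,
    }
```

Reporting accuracy alone hides the cost side of the tradeoff, which is one reason the abstract argues for multi-faceted metrics rather than a single score.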
Submission Number: 77