Keywords: Evaluation, LLM Routing
TL;DR: We identify problems with current LLM routing evaluation and propose a new evaluation framework that addresses these limitations.
Abstract: The proliferation of Large Language Models (LLMs), each with different capabilities and costs, has driven the need for LLM routers that intelligently and dynamically select the best model for a given query.
Evaluating these routing systems is important yet inherently challenging due to the complex interplay of multiple factors: the selection of representative input queries, the composition of the model pool, and the definition of comprehensive evaluation metrics for optimal routing decisions.
Through extensive analysis of existing benchmarks, we identify critical limitations that can yield incomplete results or misleading conclusions about router performance:
(1) limited task diversity, (2) imbalanced model pools, and (3) oversimplified evaluation methodologies.
To address these limitations, we propose a novel evaluation framework that incorporates diverse task distributions, a balanced model pool with complementary model strengths, and multi-faceted metrics that reflect real-world deployment scenarios.
We implement this framework as an open-source benchmark; the code and dataset are shared anonymously at \url{https://anonymous.4open.science/r/rethinking-routing-evaluation-DE30}
Submission Number: 77