Towards Fair And Comprehensive Evaluation Of Routers In Collaborative LLM Systems

ACL ARR 2026 January Submission161 Authors

22 Dec 2025 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Large language models, Efficient ML, Query Routing
Abstract: Large language models (LLMs) have achieved remarkable success, but cost and privacy constraints often necessitate deploying smaller models locally while offloading complex queries to cloud-based models. Existing router evaluations are unsystematic: they overlook scenario-specific requirements and out-of-distribution robustness. We propose a principled evaluation framework with three dimensions: router ability, scenario alignment, and cross-domain robustness. Unlike prior work that relies on output probabilities or external embeddings, we utilize internal hidden states, which capture model uncertainty before answer generation. We introduce ProbeDirichlet, a lightweight router that aggregates cross-layer hidden states via input-dependent Dirichlet distributions. Trained on multi-domain data, it generalizes robustly across both in-domain and out-of-distribution scenarios. Our results show that ProbeDirichlet outperforms the best baselines by 16.68% in router ability and 18.86% in high-accuracy scenarios, with strong generalization across heterogeneous tasks and agentic workflows.
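The abstract describes a router that aggregates cross-layer hidden states via input-dependent Dirichlet distributions. The following is a minimal illustrative sketch of that idea, not the authors' implementation: all class and variable names (`ProbeDirichletSketch`, `W_alpha`, `w_probe`) are hypothetical, and the pooling, parameterization, and probe head are assumptions chosen for clarity.

```python
import numpy as np

def softplus(x):
    """Numerically simple softplus; keeps concentrations positive."""
    return np.log1p(np.exp(x))

class ProbeDirichletSketch:
    """Illustrative sketch (not the authors' code): aggregate per-layer
    hidden states of a small local model with input-dependent Dirichlet
    weights, then apply a linear probe that scores whether the query
    should be offloaded to a large cloud model."""

    def __init__(self, n_layers, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Maps a pooled hidden state to one Dirichlet concentration per layer.
        self.W_alpha = rng.normal(scale=0.02, size=(hidden_dim, n_layers))
        # Linear probe on the aggregated cross-layer representation.
        self.w_probe = rng.normal(scale=0.02, size=hidden_dim)

    def route_prob(self, hidden_states):
        """hidden_states: (n_layers, hidden_dim) array for one query.
        Returns an estimated probability of routing to the large model."""
        pooled = hidden_states.mean(axis=0)              # (hidden_dim,)
        # Input-dependent Dirichlet concentrations (strictly positive).
        alpha = softplus(pooled @ self.W_alpha) + 1e-3   # (n_layers,)
        # Use the Dirichlet mean as layer-mixing weights (sums to 1).
        weights = alpha / alpha.sum()                    # (n_layers,)
        aggregated = weights @ hidden_states             # (hidden_dim,)
        logit = aggregated @ self.w_probe
        return 1.0 / (1.0 + np.exp(-logit))              # sigmoid
```

In this reading, the Dirichlet parameterization lets each query weight the model's layers differently before the probe fires, which is one plausible way "input-dependent" aggregation could work; the paper's actual training objective and architecture may differ.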
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Large language models, Efficient ML, Query Routing
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-compute settings (efficiency)
Languages Studied: English
Submission Number: 161