Attention, Please: Single-Head Cross-Attention for Unified LLM Routing

Published: 24 Sept 2025, Last Modified: 24 Sept 2025. NeurIPS 2025 LLM Evaluation Workshop Poster. License: CC BY 4.0
Keywords: LLM router, evaluation, model selection
TL;DR: Lightweight cross-attention router that balances LLM quality and cost, outperforming prior predictive routers.
Abstract: The growing diversity of language models, ranging from lightweight, cost-efficient models to powerful but expensive ones, has made dynamic model selection essential for scalable and cost-effective deployment. We propose a unified LLM routing framework that jointly models query and model embeddings with a single-head cross-attention mechanism. We evaluate the router's decision-making on RouterBench, a large-scale, publicly available dataset that enables evaluation across multiple LLM pools and domains. By capturing fine-grained query-model interactions, our router learns to predict both response quality and generation cost, outperforming existing predictive routers by up to 6.6\% in Average Improvement in Quality (AIQ) and 2.9\% in maximum performance. To better reflect the trade-off between performance and cost, we adopt a new exponential reward function with improved robustness. Our architecture is lightweight, generalizes well across domains, and is more efficient than existing routers.
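The abstract's core idea, a query embedding attending over candidate-model embeddings through a single attention head to predict per-model quality and cost, can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the projection matrices, head dimensions, and the linear quality/cost readouts are assumptions, and the simple linear quality-minus-cost reward stands in for the paper's exponential reward, whose exact form is not given in the abstract.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, n_models = 16, 4  # embedding dim and LLM-pool size (illustrative)

# Hypothetical learned projections and readout heads (random here;
# in the paper these would be trained end to end).
W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
w_quality = rng.standard_normal(d)
w_cost = rng.standard_normal(d)

query_emb = rng.standard_normal(d)               # one query embedding
model_embs = rng.standard_normal((n_models, d))  # one embedding per LLM

# Single-head cross-attention: the query attends over model embeddings.
q = query_emb @ W_q                       # (d,)
k = model_embs @ W_k                      # (n_models, d)
v = model_embs @ W_v                      # (n_models, d)
attn = softmax(k @ q / np.sqrt(d))        # (n_models,) attention weights

# Fine-grained query-model interaction features feed two heads:
# predicted response quality and predicted generation cost per model.
feats = attn[:, None] * v                 # (n_models, d)
pred_quality = feats @ w_quality          # (n_models,)
pred_cost = feats @ w_cost                # (n_models,)

# Route by a quality/cost trade-off (lam is an assumed trade-off knob).
lam = 0.5
reward = pred_quality - lam * pred_cost
best_model = int(np.argmax(reward))
```

The single head keeps the router lightweight: routing a query costs a handful of small matrix products rather than an LLM forward pass, which is consistent with the abstract's efficiency claim.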
Submission Number: 39