Learning to Route LLMs from Bandit Feedback: One Policy, Many Trade-offs

ICLR 2026 Conference Submission20338 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: LLM Routing, Contextual Bandit
Abstract: Efficient use of large language models (LLMs) is critical for deployment at scale: without adaptive routing, systems either overpay for strong models or risk poor performance from weaker ones. Selecting the right LLM for each query is fundamentally an online decision problem: models differ in strengths, prices fluctuate, and users value performance and cost differently. Yet most routers are trained offline with labels for all candidate models, an assumption that breaks in deployment, where only the outcome of the chosen model is observed. We bridge this gap with a bandit-feedback routing approach that trains under the same partial-feedback restriction as deployment, while supporting preference-tunable inference: operators can dial the performance–cost trade-off at test time without retraining. Framed as a contextual bandit over prompt features and a user preference vector, our method simulates an online feedback setting during training and adapts its routing decisions to each new prompt, rather than depending on full-information offline supervision. Comprehensive experiments on RouterBench show that our method consistently outperforms strong offline routers, including GraphRouter and RouterDC, in terms of performance and cost, and generalizes robustly for unseen tasks.
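The setup described in the abstract can be illustrated with a minimal sketch: a contextual bandit router whose context includes a user preference weight, learning a per-arm linear reward model from partial feedback (only the chosen model's outcome is observed). All model names, quality/cost numbers, and the epsilon-greedy strategy below are illustrative assumptions, not the paper's actual method or benchmark values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical candidate models: (success probability, cost) -- made-up numbers.
MODELS = {"small": (0.6, 0.1), "medium": (0.75, 0.4), "large": (0.9, 1.0)}
ARMS = list(MODELS)

def feedback(arm, lam):
    """Bandit feedback for the chosen arm only: noisy performance
    minus the preference-weighted cost (lam = cost sensitivity)."""
    q, c = MODELS[arm]
    perf = rng.binomial(1, q)  # success/failure of the chosen model
    return perf - lam * c

class EpsGreedyRouter:
    """Per-arm ridge regression over the context x; epsilon-greedy exploration."""
    def __init__(self, dim, eps=0.1):
        self.eps = eps
        self.A = {a: np.eye(dim) for a in ARMS}    # per-arm Gram matrices
        self.b = {a: np.zeros(dim) for a in ARMS}  # per-arm reward statistics

    def choose(self, x):
        if rng.random() < self.eps:
            return ARMS[rng.integers(len(ARMS))]
        scores = {a: x @ np.linalg.solve(self.A[a], self.b[a]) for a in ARMS}
        return max(scores, key=scores.get)

    def update(self, arm, x, r):
        # Partial feedback: only the chosen arm's model is updated.
        self.A[arm] += np.outer(x, x)
        self.b[arm] += r * x

router = EpsGreedyRouter(dim=2)
for t in range(3000):
    lam = rng.uniform(0.0, 1.5)          # preference drawn per training query
    x = np.array([1.0, lam])             # toy context: preference features only
    a = router.choose(x)
    router.update(a, x, feedback(a, lam))

# At test time the trade-off is dialled via the preference, with no retraining:
perf_user = router.choose(np.array([1.0, 0.0]))   # performance-only user
cheap_user = router.choose(np.array([1.0, 1.4]))  # strongly cost-sensitive user
```

In this toy setting the expected reward of each arm is exactly linear in the context `[1, lam]`, so the learned per-arm models recover the performance–cost trade-off: a preference of `lam = 0` routes to the strongest model and a large `lam` routes to the cheapest. A real router would replace the toy context with prompt embeddings concatenated with the preference vector.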
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 20338