A Unified Approach to Routing and Cascading for LLMs

Published: 01 May 2025, Last Modified: 18 Jun 2025. ICML 2025 poster. License: CC BY 4.0
TL;DR: We combine cascading and routing into a more powerful model selection method called cascade routing.
Abstract: The availability of a wide range of large language models (LLMs) embedded in various agentic systems has significantly increased the potential of model selection strategies to improve the cost-performance tradeoff. Existing strategies involve either routing, where a single model is chosen per query, or cascading, which sequentially runs increasingly larger models until a satisfactory answer is found. However, current approaches face three key limitations: they (1) lack formal proofs of optimality, (2) fail to identify the conditions under which these strategies are most effective at improving the cost-performance tradeoff, and (3) are unable to combine both paradigms for further improvements. To address these issues, we first derive a novel optimal strategy for cascading and prove the optimality of an existing routing strategy. Further, we propose *cascade routing*, a unified framework that integrates routing and cascading into a theoretically optimal strategy. Through our analysis, we identify good quality estimators as the critical factor for the success of model selection paradigms. Finally, in our experiments, we show that cascade routing consistently outperforms the individual approaches by a large margin, and we analyze quality estimators to determine when routing and/or cascading are useful paradigms for model selection.
Lay Summary: Language models, often called "chatbots", come in many sizes. But always using the largest model wastes money, time, and energy. This raises an important question: can we decide when a question actually requires the full power of a large model, and when a smaller, cheaper model would suffice? Existing solutions either pick a single model upfront ("routing") or step through models from smallest to largest ("cascading"). However, both have limitations: they lack theoretical foundations and can be expensive. In our work, we develop a thorough mathematical understanding of when each strategy works best and where both fall short. Building on this, we introduce cascade routing: a flexible approach that combines the strengths of routing and cascading. Instead of always running models in a fixed sequence or sticking to just one, cascade routing iteratively picks the best model, and can thus skip models, reorder them, or run only as many as needed. It turns out that this method can outperform the existing strategies by up to 14%! This makes AI systems more affordable and unlocks their use in settings where resources are limited. By providing a principled approach to model selection, our work lays the groundwork for smarter deployment of AI at scale.
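The iterative selection loop described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation (see the linked repository for the real code): the function names (`cascade_route`, `estimate_quality`, `run_model`) and the simple quality-minus-weighted-cost scoring rule are assumptions made for illustration only.

```python
# Hypothetical sketch of a cascade-routing loop (not the paper's actual code).
# Given per-model costs and a quality estimator, repeatedly pick the model with
# the best estimated quality-minus-cost tradeoff, run it, and stop once no
# remaining model is expected to beat the best answer obtained so far.

def cascade_route(models, estimate_quality, run_model, lam=1.0):
    """models: list of (name, cost) pairs; lam trades off quality vs. cost.

    estimate_quality(name) -> estimated answer quality for that model.
    run_model(name) -> (answer, observed_quality) after actually querying it.
    """
    remaining = list(models)
    best_answer, best_quality = None, float("-inf")
    while remaining:
        # Score every remaining model by estimated quality minus weighted cost.
        scored = [(estimate_quality(name) - lam * cost, name)
                  for (name, cost) in remaining]
        score, chosen = max(scored)
        # Stop if even the best candidate is not expected to improve on the
        # best answer found so far (a crude, illustrative stopping rule).
        if score <= best_quality:
            break
        remaining = [(n, c) for (n, c) in remaining if n != chosen]
        answer, quality = run_model(chosen)
        if quality > best_quality:
            best_answer, best_quality = answer, quality
    return best_answer
```

Because the loop rescoreds all remaining models each round rather than following a fixed ladder, it can jump straight to a large model, skip intermediate ones, or stop after a single cheap call, which is the flexibility the summary attributes to cascade routing.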
Link To Code: https://github.com/eth-sri/cascade-routing
Primary Area: Deep Learning->Large Language Models
Keywords: large language models, routing, cascading
Submission Number: 12637