Abstract: Large language models (LLMs) are powerful tools but are often expensive to deploy at scale. LLM query routing mitigates this by dynamically assigning queries to models of varying cost and quality to achieve a desired tradeoff. Prior query routing approaches generate only one response from the selected model; because a single response from a small (inexpensive) model is often not good enough to beat one from a large (expensive) model, they end up overusing the large model and missing potential cost savings. However, it is well known that generating multiple responses from a small model and selecting the best can enhance quality while remaining cheaper than a single large-model response. We leverage this idea to propose BEST-Route, a novel routing framework that chooses a model and the number of responses to sample from it based on query difficulty and quality thresholds. Experiments on real-world datasets demonstrate that our method reduces costs by up to 60% with less than a 1% drop in performance.
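The following is a minimal, hypothetical sketch of the routing idea the abstract describes, not the authors' implementation (see the linked repository for that). The models, difficulty predictor, quality estimator, and response scorer below are placeholder stubs introduced only for illustration.

```python
# Hypothetical sketch of best-of-n query routing: prefer several cheap
# small-model samples when they are predicted to meet a quality threshold,
# otherwise fall back to a single large-model response.
import random
from typing import List


def small_model(query: str) -> str:             # placeholder: cheap model call
    return f"[small-model answer to: {query}]"


def large_model(query: str) -> str:             # placeholder: expensive model call
    return f"[large-model answer to: {query}]"


def predict_difficulty(query: str) -> float:    # placeholder: learned difficulty in [0, 1]
    return random.random()


def predicted_best_of_n_quality(query: str, n: int) -> float:
    # Placeholder: estimated quality of the best of n small-model responses;
    # easier queries and larger n give higher predicted quality.
    base = 1.0 - predict_difficulty(query)
    return min(1.0, base + 0.1 * (n - 1))


def score(query: str, response: str) -> float:  # placeholder: response-quality scorer
    return random.random()


def best_route(query: str, quality_threshold: float = 0.7, max_samples: int = 4) -> str:
    """Choose a model and a number of samples for this query."""
    # Find the smallest n for which best-of-n from the small model is
    # predicted to clear the quality threshold.
    for n in range(1, max_samples + 1):
        if predicted_best_of_n_quality(query, n) >= quality_threshold:
            responses: List[str] = [small_model(query) for _ in range(n)]
            return max(responses, key=lambda r: score(query, r))  # keep the best sample
    # Too hard even for best-of-n from the small model: use the large model.
    return large_model(query)


if __name__ == "__main__":
    print(best_route("What is the capital of France?"))
```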
Lay Summary: Imagine you have a group of AI tools for answering questions, ranging from brilliant-but-expensive expert AIs to quicker-and-cheaper assistant AIs. This research introduces a new system, called BEST-Route, that intelligently decides which one to use based on how difficult your question is. For simpler questions, instead of always turning to the costly expert AI, the system asks the cheaper assistant AI to provide several different answers and then selects the best one. This approach often yields a response just as good as the expert's but at a much lower cost. For truly tough questions, the system still relies on the top-tier expert AI to ensure a high-quality answer. Using this method, the researchers were able to reduce operational costs by up to 60% while maintaining nearly the same level of performance, making powerful AI much more affordable.
Link To Code: https://github.com/microsoft/best-route-llm
Primary Area: Deep Learning->Large Language Models
Keywords: Large language models, Efficient ML, Query Routing, Test-time Optimal Compute
Submission Number: 13334