Faster, Cheaper, Just as Good: Cost- and Latency-Constrained Routing for LLMs

Published: 05 Mar 2025, Last Modified: 02 Apr 2025
Venue: SLLM
License: CC BY 4.0
Track: tiny / short paper (up to 4 pages)
Keywords: Large Language Models, model routing, compound AI systems, cost-aware inference, latency constraints, multi-model systems
Abstract: Large Language Models (LLMs) span a broad spectrum of sizes, each presenting distinct trade-offs in cost, latency, and performance that complicate large-scale deployment. Although larger models often provide higher-quality responses to complex prompts, their increased computational overhead and slower inference can degrade user experience in real-time applications. Meanwhile, AI development is moving toward compound AI systems that integrate multiple LLMs of different sizes. In such environments, deciding when to invoke smaller or larger models becomes critical, especially under shifting system loads, since we must balance high output quality, tight cost budgets, and acceptable response times. We propose SCORE, a routing system that maximizes response quality while respecting user-specified cost and latency constraints. For each incoming request, SCORE predicts each candidate model's response quality and length, then selects the option that best meets the current cost, latency, and quality requirements: a less expensive model under lighter load, or a more resource-intensive model for complex prompts. By continually adapting these decisions as requests arrive, SCORE balances system load, enforces budget limits, and maintains user satisfaction through timely, cost-effective, and accurate responses.
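To make the per-request decision rule in the abstract concrete, below is a minimal sketch of constrained quality-maximizing routing. The paper page gives no implementation details, so everything here is an assumption: the `ModelOption` fields, the `predict_quality` and `predict_length` predictors, and the linear per-token cost and latency models are hypothetical stand-ins for whatever SCORE actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ModelOption:
    name: str
    cost_per_token: float     # assumed pricing unit: dollars per generated token
    latency_per_token: float  # assumed: seconds per generated token under current load

def route(prompt: str,
          models: list[ModelOption],
          predict_quality: Callable[[str, ModelOption], float],  # hypothetical predictor -> quality score
          predict_length: Callable[[str, ModelOption], float],   # hypothetical predictor -> expected output tokens
          cost_budget: float,
          latency_budget: float) -> Optional[ModelOption]:
    """Return the model with the highest predicted quality whose expected
    cost and latency both fit the user-specified budgets, or None if no
    candidate satisfies both constraints."""
    best, best_quality = None, float("-inf")
    for model in models:
        tokens = predict_length(prompt, model)
        if tokens * model.cost_per_token > cost_budget:
            continue  # predicted response would exceed the cost budget
        if tokens * model.latency_per_token > latency_budget:
            continue  # predicted response would exceed the latency budget
        quality = predict_quality(prompt, model)
        if quality > best_quality:
            best, best_quality = model, quality
    return best
```

In a real deployment, `latency_per_token` would presumably be refreshed from live load measurements, which is how the load-adaptive behavior described in the abstract would arise: as a model's queue grows, its effective latency rises and requests shift toward less loaded options.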
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 55