Near-Optimal Online Deployment and Routing for Streaming LLMs

Published: 26 Jan 2026, Last Modified: 02 Mar 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: online learning, bandits, LLM routing, staged deployment, streaming model arrivals, regret bounds, budget/capacity constraints
TL;DR: StageRoute periodically redeploys LLMs and routes queries online with cost awareness, tracking a streaming model frontier with near-optimal regret.
Abstract: The rapid pace at which new large language models (LLMs) appear, and older ones become obsolete, forces providers to manage a streaming inventory under a strict concurrency cap and per-query cost budgets. We cast this as an online decision problem that couples *stage-wise deployment* (at fixed maintenance windows) with *per-query routing* among live models. We introduce *StageRoute*, a hierarchical algorithm that (i) optimistically selects up to $M_{\max}$ models for the next stage using reward upper-confidence and cost lower-confidence bounds, and (ii) routes each incoming query by solving a budget- and throughput-constrained bandit subproblem over the deployed set. We prove a regret of $\tilde{\mathcal{O}}(T^{2/3})$ with a matching lower bound, establishing near-optimality, and validate the theory empirically: *StageRoute* tracks a strong oracle under tight budgets across diverse workloads.
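The stage-wise deployment step described above can be sketched as follows. This is an illustrative reconstruction, not the paper's exact rule: the confidence-bonus form, the greedy budget check, and all names (`select_models`, `stats`, `budget`) are assumptions for exposition.

```python
import math

def select_models(stats, t, m_max, budget):
    """Hypothetical sketch of the stage-wise deployment step: pick up to
    m_max models optimistically, scoring each by an upper-confidence
    bound (UCB) on reward and charging a lower-confidence bound (LCB)
    on cost against the budget."""
    scored = []
    for model, (mean_reward, mean_cost, pulls) in stats.items():
        # Standard UCB-style exploration bonus (illustrative choice).
        bonus = math.sqrt(2 * math.log(max(t, 2)) / max(pulls, 1))
        reward_ucb = mean_reward + bonus        # optimism on reward
        cost_lcb = max(mean_cost - bonus, 0.0)  # optimism on cost
        scored.append((model, reward_ucb, cost_lcb))
    # Greedily keep the highest-UCB models whose optimistic costs fit.
    scored.sort(key=lambda x: x[1], reverse=True)
    deployed, spent = [], 0.0
    for model, ucb, lcb in scored:
        if len(deployed) < m_max and spent + lcb <= budget:
            deployed.append(model)
            spent += lcb
    return deployed
```

A model with a high observed cost can still be deployed early on, since its cost LCB is shrunk by the exploration bonus; as pull counts grow, the bounds tighten and the selection converges toward the true reward/cost frontier.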
Primary Area: foundation or frontier models, including LLMs
Submission Number: 1169