Keywords: Large language models, Cloud computing, Resource management, Task scheduling, Quality of service, Latency, Probabilistic modeling, Optimization
Abstract: The demand for large language model (LLM) inference is gradually coming to dominate artificial intelligence workloads, creating an urgent need for cost-efficient inference serving. Prior work focuses largely on single-worker optimization and often overlooks cluster-level coordination across both queries and computing resources. Scheduling requests without accounting for their latency uncertainty can lead to service-level objective (SLO) violations or to overprovisioning, and hence to excessive cost.
In this paper, we present Aladdin, a scheduler that co-adaptively places inference queries and scales computing resources under probabilistic SLO constraints. Aladdin explicitly models request-level uncertainty through stage-wise latency distributions and places queries according to their statistical profiles to maximize per-worker utilization. To improve robustness and cost efficiency, we design a flexible constraint interface that supports distribution-aware tail modeling and risk-adjusted capacity allocation. Experiments show that Aladdin reduces serving cost by up to 71% at the same SLO level compared to standard baselines, which can translate to millions of dollars in annual savings.
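To make the abstract's notion of a probabilistic SLO constraint concrete, the sketch below illustrates one plausible form of such a check; it is not taken from the paper, all names and numbers are hypothetical, and the two serving stages are assumed independent for simplicity. It estimates whether a worker's empirical stage-wise latency profile satisfies a constraint of the form P(total latency <= D) >= 1 - epsilon.

```python
import numpy as np

# Illustrative sketch only (not Aladdin's actual implementation): admit a
# request onto a worker only if the worker's empirical tail latency,
# estimated from per-stage latency samples, still meets a probabilistic
# SLO of the form  P(prefill + decode latency <= slo_ms) >= 1 - epsilon.

def meets_probabilistic_slo(prefill_samples_ms, decode_samples_ms,
                            slo_ms=500.0, epsilon=0.05):
    """Check a (1 - epsilon) tail-latency SLO from stage-wise samples.

    Stages are treated as independent here; a real scheduler would
    model their joint distribution and queueing effects as well.
    """
    rng = np.random.default_rng(0)
    # Approximate the distribution of total latency by resampling
    # each stage's empirical distribution and summing.
    prefill = rng.choice(prefill_samples_ms, size=10_000)
    decode = rng.choice(decode_samples_ms, size=10_000)
    total = prefill + decode
    # The (1 - epsilon) quantile must not exceed the latency target.
    return np.quantile(total, 1.0 - epsilon) <= slo_ms

# Example with made-up per-stage latency profiles for one worker.
prefill_ms = np.array([80, 95, 110, 130, 150, 90, 105])
decode_ms = np.array([200, 240, 260, 310, 220, 280, 250])
print(meets_probabilistic_slo(prefill_ms, decode_ms))
```

Under this kind of check, a scheduler can pack requests onto a worker until the estimated tail quantile approaches the SLO target, which is one way the "risk-adjusted capacity allocation" mentioned above could trade utilization against violation probability.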
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 21519