ATHENA-Serve: An Intelligent Scheduling LLM Serving System via Horizon-Cost Prediction and Hierarchical RL
Keywords: LLM serving systems, Tail-latency SLOs, Bursty traffic scheduling, Resource budgeting
TL;DR: ATHENA-Serve fuses calibrated horizon-to-budget mapping (ORACLE) with hierarchical scheduling (HERA) to deliver robust, low-latency LLM serving under bursty workloads.
Abstract: Online inference serving for large language models (LLMs) is foundational infrastructure for conversational agents, retrieval-augmented generation, and multi-tenant intelligent applications. Its core objective is to meet strict latency SLOs under heterogeneous and bursty workloads. Existing systems fall short in two ways. First, bursty arrivals and long-tailed output lengths drive peak cache pressure and bandwidth contention. Second, FCFS and shortest-job heuristics are brittle under noisy length regression and distribution shift. Together, these effects compound tail-latency violations and head-of-line (HoL) blocking. We present ATHENA-Serve, a deployable, horizon-cost-aware LLM serving scheduler. ATHENA-Serve converts predicted generation horizons into calibrated memory and compute budgets. Rather than forecasting exact trajectories, it senses each request's KV-cache usage patterns and peak-footprint signals. Guided by these budgeted signals, ATHENA-Serve proactively constrains batching and concurrency to smooth memory peaks, while conditioning scheduling decisions on global system signals.
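The idea of turning a predicted generation horizon into a memory budget and then limiting admission against it can be illustrated with a minimal sketch. This is not the ORACLE or HERA implementation, which the abstract does not specify. It assumes a quantile-based inflation of the predicted length, a KV-cache footprint linear in token count, and a greedy admission rule; every name below (`Request`, `calibrated_horizon`, `peak_kv_bytes`, `admit_batch`, `bytes_per_token`) is hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    req_id: str
    prompt_tokens: int
    predicted_horizon: int  # predicted number of output tokens (assumed given)


def calibrated_horizon(predicted: int, residuals: List[int], coverage: float = 0.9) -> int:
    """Inflate a point prediction by an empirical quantile of past errors.

    residuals holds (actual - predicted) output lengths from earlier requests.
    This is an assumed calibration rule, not the paper's.
    """
    if not residuals:
        return predicted
    ordered = sorted(residuals)
    idx = min(len(ordered) - 1, int(coverage * len(ordered)))
    # Never shrink the prediction: only under-prediction is guarded against.
    return predicted + max(0, ordered[idx])


def peak_kv_bytes(req: Request, horizon: int, bytes_per_token: int) -> int:
    """Peak KV-cache footprint, assuming it grows linearly with tokens held."""
    return (req.prompt_tokens + horizon) * bytes_per_token


def admit_batch(queue: List[Request], residuals: List[int],
                bytes_per_token: int, memory_budget: int) -> List[Request]:
    """Greedily admit requests whose budgeted peak footprint still fits.

    Requests that do not fit are skipped rather than blocking later ones,
    so a single long request cannot stall the queue in this sketch.
    """
    admitted: List[Request] = []
    used = 0
    for req in queue:
        horizon = calibrated_horizon(req.predicted_horizon, residuals)
        need = peak_kv_bytes(req, horizon, bytes_per_token)
        if used + need <= memory_budget:
            admitted.append(req)
            used += need
    return admitted


if __name__ == "__main__":
    queue = [
        Request("a", prompt_tokens=100, predicted_horizon=100),
        Request("b", prompt_tokens=1000, predicted_horizon=900),
        Request("c", prompt_tokens=50, predicted_horizon=50),
    ]
    batch = admit_batch(queue, residuals=[100] * 4, bytes_per_token=1, memory_budget=600)
    print([r.req_id for r in batch])
```

Budgeting against an inflated horizon rather than the raw prediction trades some throughput for fewer mid-generation memory overruns; a real scheduler would also need a policy for requests that are skipped repeatedly.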
Primary Area: foundation or frontier models, including LLMs
Submission Number: 10330