ATHENA-Serve: An Intelligent Scheduling LLM Serving System via Horizon-Cost Prediction and Hierarchical RL

Jiamei Liang; Huaming Wu

18 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: LLM serving systems, Tail-latency SLOs, Bursty traffic scheduling, Resource budgeting

TL;DR: ATHENA-Serve fuses calibrated horizon-to-budget mapping (ORACLE) with hierarchical scheduling (HERA) to deliver robust, low-latency LLM serving under bursty workloads.

Abstract: Online inference serving for large language models (LLMs) is foundational infrastructure for conversational agents, retrieval-augmented generation, and multi-tenant intelligent applications. Its core objective is to meet strict latency service-level objectives (SLOs) under heterogeneous and bursty workloads. Existing systems struggle on two fronts. First, bursty arrivals and long-tailed output lengths drive peak cache pressure and bandwidth contention. Second, first-come-first-served (FCFS) and shortest-job heuristics are brittle under noisy length regression and distribution shift. Together, these effects compound tail-latency violations and head-of-line (HoL) blocking. We present ATHENA-Serve, a deployable, horizon–cost–aware LLM serving scheduler. ATHENA-Serve converts predicted generation horizons into calibrated memory and compute budgets. Rather than forecasting exact trajectories, it senses each request's KV-cache usage patterns and peak-footprint signals. Guided by these budgeted signals, ATHENA-Serve proactively constrains batching and concurrency to smooth memory peaks, while conditioning scheduling decisions on global system signals.
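
To make the horizon-to-budget idea concrete, the following is a minimal sketch of budget-constrained batch admission. It is not the paper's ORACLE or HERA component: the `Request` fields, the `safety_pct` inflation factor, the fixed `bytes_per_token` cost, and the greedy skip-ahead admission rule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    prompt_tokens: int
    predicted_horizon: int  # predicted output length in tokens (hypothetical predictor output)


def kv_budget_bytes(req: Request, bytes_per_token: int, safety_pct: int = 120) -> int:
    """Peak KV-cache budget for one request.

    The predicted horizon is inflated by safety_pct percent (an assumed
    calibration margin) and rounded up, then added to the prompt length.
    """
    inflated_horizon = -(-req.predicted_horizon * safety_pct // 100)  # integer ceiling
    return (req.prompt_tokens + inflated_horizon) * bytes_per_token


def admit_batch(queue, capacity_bytes: int, bytes_per_token: int, safety_pct: int = 120):
    """Greedily admit requests whose summed peak budgets fit in capacity.

    A request that does not fit is skipped rather than blocking the ones
    queued behind it, which is one simple way to avoid head-of-line blocking.
    """
    admitted, used = [], 0
    for req in queue:
        need = kv_budget_bytes(req, bytes_per_token, safety_pct)
        if used + need <= capacity_bytes:
            admitted.append(req)
            used += need
    return admitted, used


queue = [
    Request("a", prompt_tokens=100, predicted_horizon=100),
    Request("b", prompt_tokens=50, predicted_horizon=1000),
    Request("c", prompt_tokens=10, predicted_horizon=10),
]
batch, used = admit_batch(queue, capacity_bytes=300, bytes_per_token=1)
print([r.req_id for r in batch], used)
```

Here the long request `b` is deferred because its peak budget alone exceeds the capacity, while `a` and `c` are admitted together. A real scheduler would also need to handle starvation of deferred requests and mispredicted horizons, which this sketch ignores.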

Primary Area: foundation or frontier models, including LLMs

Submission Number: 10330
