Training-Free Self-Scheduling for Efficient LLM Inference Serving

ICLR 2026 Conference Submission 18847 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Large Language Model, Inference Serving, Scheduling Algorithms, Efficient Inference
TL;DR: We introduce a self-scheduling approach for LLM inference that avoids head-of-line blocking and achieves substantial improvements in latency and throughput without extra models or retraining.
Abstract: The ability to deliver fast responses under strict latency requirements is critical for Large Language Model (LLM) inference serving. Most existing systems rely on a first-come-first-served (FCFS) scheduling policy, which often suffers from head-of-line blocking. While a number of solutions have been proposed, they typically require training additional models or auxiliary predictors, such as BERT, to estimate decoding lengths. These approaches limit generalization and necessitate retraining for new domains or distributions. To address these limitations, we propose self-scheduling with the LLM, a novel approach that leverages the reasoning capabilities of the LLM itself without requiring extra training or auxiliary models. We systematically investigate a range of feasible strategies and conduct extensive analyses. Experimental results show that our method achieves up to a 5$\times$ improvement in time to first token (TTFT), a 3$\times$ improvement in time per output token (TPOT), a 6$\times$ reduction in latency, and a 9$\times$ increase in throughput under both general and domain-specific workloads, with negligible overhead. This work offers a lightweight yet intelligent scheduling paradigm, demonstrating both practicality and strong potential for LLM inference serving.
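To make the idea concrete, the sketch below shows one possible form such a self-scheduling loop could take; it is an illustration only, not the authors' implementation, and it assumes a hypothetical llm.generate(prompt, max_tokens) text-completion interface. The serving model is first prompted to estimate its own decoding length, and pending requests are then served shortest-predicted-first instead of in arrival order, which is how head-of-line blocking is avoided without any auxiliary predictor.

import heapq

# Hypothetical prompt asking the model to estimate its own output length (assumption).
LENGTH_PROBE = (
    "Estimate how many tokens your answer to the following request will need. "
    "Reply with a single integer only.\n\nRequest: {prompt}"
)

def predict_length(llm, prompt):
    """Ask the serving model itself to estimate its decoding length."""
    reply = llm.generate(LENGTH_PROBE.format(prompt=prompt), max_tokens=8)
    try:
        return int(reply.strip().split()[0])
    except (ValueError, IndexError):
        return 512  # conservative fallback if the reply is not a plain integer

def self_schedule(llm, prompts):
    """Serve requests shortest-predicted-first rather than FCFS."""
    queue = []
    for arrival, prompt in enumerate(prompts):
        estimate = predict_length(llm, prompt)
        heapq.heappush(queue, (estimate, arrival, prompt))  # arrival index breaks ties FCFS-style
    while queue:
        _, _, prompt = heapq.heappop(queue)
        yield llm.generate(prompt, max_tokens=2048)

In such a setup the length probe is a short, bounded generation, so it could in principle be batched or cached to keep the scheduling overhead small; the specific strategies the paper investigates may differ.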
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 18847