Taming the Titans: A Survey of Efficient LLM Inference Serving

ACL ARR 2025 May Submission 7637 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Large Language Models (LLMs) for generative AI have achieved remarkable progress, evolving into sophisticated and versatile tools widely adopted across domains and applications. However, the substantial memory overhead caused by their vast number of parameters, combined with the high computational demands of the attention mechanism, poses significant challenges to achieving low latency and high throughput in LLM inference serving. This paper presents a comprehensive survey of recent advances in LLM inference optimization, categorized into: (1) fundamental techniques (model placement, request scheduling, decoding prediction, storage management, disaggregation, and multiplexing); (2) specialized architectures (Mixture-of-Experts (MoE), Low-Rank Adaptation (LoRA), and speculative decoding); and (3) scenario-specific optimizations (the long-context problem, retrieval-augmented generation (RAG), augmented LLMs, test-time reasoning, and multimodal integration). Finally, we outline potential research directions to further advance the field of LLM inference serving.
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: inference methods, model architectures, applications, natural language inference
Contribution Types: Surveys
Languages Studied: English
Submission Number: 7637