Abstract: With the rapid advancement of large language models (LLMs) in both academia and industry, their growing size and complexity have introduced significant challenges in computational cost and deployment efficiency. To address these issues, a wide range of inference optimization techniques (including but not limited to model compression) have been proposed to accelerate LLM inference while preserving model performance. This survey provides a comprehensive overview of LLM inference acceleration strategies, analyzing them from multiple perspectives: foundational principles, algorithmic techniques, real-world applications, and open research challenges. We begin by introducing core concepts underlying inference optimization and propose a new taxonomy that categorizes existing approaches into quantization, pruning, distillation, efficient architectures, compilation, and hardware-aware methods. Following the lifecycle of LLM development and deployment, we examine how these techniques interact with model training, fine-tuning, and serving. Furthermore, we highlight key applications of efficient LLMs and discuss emerging trends and unresolved issues in the field. By synthesizing recent advances, this survey aims to offer actionable insights and practical guidance for researchers and practitioners building scalable and efficient LLM systems.
DOI: 10.1109/TNNLS.2025.3628671