Abstract: The advent of Large Language Model (LLM)-based agent systems represents a significant paradigm shift in Artificial Intelligence, enabling unprecedented capabilities in autonomous reasoning, planning, and interaction. However, the transformative potential of these agents is fundamentally constrained by a critical operational challenge: high response latency. This latency, which impairs usability, restricts real-time applicability, and undermines economic viability, is a paramount barrier to widespread adoption in complex, real-world scenarios. This survey provides a comprehensive and structured analysis of the multifaceted latency problem in LLM-based agent systems. We present a structured taxonomy that deconstructs end-to-end latency into its constituent sources, offering a holistic view of the entire agent stack. Our review spans four primary layers of optimization: 1) Core LLM Inference, covering techniques that accelerate the foundational model, including quantization, pruning, efficient attention mechanisms, and speculative decoding; 2) Agent-Level Frameworks, detailing strategies that refine the agent's cognitive loop, such as accelerating planning cycles, optimizing tool use, and implementing latency-aware memory management; 3) System and Infrastructure Level, encompassing optimizations of the underlying deployment environment, including advanced serving systems, hardware acceleration, dynamic resource allocation, and distributed architectures; and 4) Multi-Agent Systems, addressing specialized methods for mitigating communication and coordination overhead in collaborative agent ensembles. We identify key unresolved challenges and outline promising future research directions—including hardware-software co-design, novel agent architectures, and AI-driven system orchestration—to guide the ongoing quest for ultra-low-latency intelligent systems.
DOI: 10.1109/access.2026.3664226