Fast Inference for Augmented Large Language Models

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · License: CC BY 4.0
Keywords: LLM inference and serving, Augmented LLM requests
Abstract: Augmented Large Language Models (LLMs) enhance standalone LLMs by integrating external data sources through API calls. In interactive applications, efficient scheduling is crucial for maintaining low request completion times, which directly impact user engagement. However, these augmentations introduce new scheduling challenges: the size of an augmented request (in tokens) no longer correlates proportionally with its execution time, making traditional size-based scheduling algorithms such as Shortest Job First less effective. Additionally, requests may require different handling during API calls, which must be incorporated into scheduling. This paper presents MARS, a novel inference framework that optimizes augmented-LLM latency by explicitly incorporating system- and application-level considerations into scheduling. MARS introduces a predictive, memory-aware scheduling approach that integrates API handling and request prioritization to minimize completion time. We implement MARS on top of vLLM and evaluate it against baseline LLM inference systems, demonstrating end-to-end latency improvements of 27%-85% and time-to-first-token (TTFT) reductions of 4%-96% over the existing augmented-LLM system, with even greater gains over vLLM. Our implementation is available online.
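To make the abstract's idea of "predictive, memory-aware scheduling" concrete, the sketch below illustrates the general principle: rank requests by predicted remaining completion time (output tokens plus expected API wait) rather than by prompt size, and decide per request whether to retain or discard its KV cache while it waits on an external API call. All names (`Request`, `priority`, `handle_api_call`, `schedule_step`) and the specific heuristics are illustrative assumptions for exposition, not the MARS implementation or the vLLM API.

```python
"""Minimal sketch of a predictive, memory-aware scheduler for augmented LLM
requests. All identifiers and thresholds here are hypothetical, chosen only to
illustrate the idea described in the abstract."""

from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    req_id: str
    predicted_output_tokens: int       # assumed to come from a learned length predictor
    predicted_api_wait_s: float = 0.0  # 0 if the request has no pending API call
    kv_cache_tokens: int = 0           # tokens this request currently holds in the KV cache


def priority(req: Request, tokens_per_second: float) -> float:
    """Score by predicted remaining completion time, not prompt size,
    so a short prompt with a long API wait is not mistakenly favored."""
    decode_time = req.predicted_output_tokens / tokens_per_second
    return decode_time + req.predicted_api_wait_s


def handle_api_call(req: Request, free_kv_tokens: int) -> str:
    """Memory-aware handling while a request waits on an external API:
    keep its KV cache resident if memory is plentiful and the wait is short,
    otherwise release the memory and recompute later (illustrative heuristic)."""
    if req.predicted_api_wait_s < 1.0 and req.kv_cache_tokens < free_kv_tokens:
        return "retain"
    return "discard"


def schedule_step(waiting: List[Request],
                  tokens_per_second: float = 50.0) -> Optional[Request]:
    """Pick the waiting request with the smallest predicted remaining work."""
    if not waiting:
        return None
    return min(waiting, key=lambda r: priority(r, tokens_per_second))


if __name__ == "__main__":
    reqs = [
        Request("short-prompt-long-api", predicted_output_tokens=20,
                predicted_api_wait_s=5.0, kv_cache_tokens=512),
        Request("long-prompt-no-api", predicted_output_tokens=200,
                kv_cache_tokens=2048),
    ]
    nxt = schedule_step(reqs)
    print("run next:", nxt.req_id if nxt else None)
    print("API handling:", handle_api_call(reqs[0], free_kv_tokens=4096))
```

The design choice this sketch highlights is the one the abstract argues for: because request size in tokens no longer predicts execution time once API calls are involved, the ordering key must be a prediction of total remaining work, and memory decisions during API waits must be made explicitly rather than left to a generic scheduler.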
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 17718