Keywords: Agent Serving, LLM Serving, LLM infrastructure, Large language models, inference optimizations
TL;DR: An agent serving framework with a high KV cache hit rate and session-aware model cascading
Abstract: Large Language Model (LLM) agents can execute tasks across various domains by autonomously interacting with environments and refining LLM responses based on feedback.
However, existing model serving systems are not optimized for the unique demands of serving agents. Compared to classic model serving, agent serving exhibits different characteristics:
predictable request patterns, increasing quality requirements, and unique prompt formatting. We identify a key problem in agent serving: LLM serving systems lack session awareness. They neither perform effective KV cache management nor precisely select the cheapest yet competent model for each round.
This leads to a cost-quality tradeoff, and we identify an opportunity to surpass it with an agent serving system.
To this end, we introduce AgServe for AGile AGent SERVing.
AgServe features a session-aware server that boosts KV cache reuse via Estimated-Time-of-Arrival-based eviction and in-place positional embedding calibration, a quality-aware client that performs session-aware model cascading through real-time quality assessment, and a dynamic resource scheduler that maximizes GPU utilization.
With AgServe, agents can select and upgrade models during the session lifetime, achieving similar quality at much lower cost and effectively transcending the tradeoff. Extensive experiments on real testbeds demonstrate that AgServe (1) achieves response quality comparable to GPT-4o at 16.5\% of the cost, and (2) delivers a 1.8$\times$ improvement in quality relative to the tradeoff curve.
Primary Area: Infrastructure (e.g., libraries, improved implementation and scalability, distributed solutions)
Submission Number: 6954