Abstract: Large language models (LLMs) based on transformer architectures have demonstrated exceptional performance across various generative tasks. However, the significant GPU resources required for LLM inference pose financial challenges for large-scale deployment. Context caching has been proposed to improve cost-efficiency by storing intermediate key and value (KV) pairs in cost-effective storage media, which can be reused to accelerate inference when requests share prefixes. While promising in single-instance applications, context caching in distributed LLM serving systems introduces unique challenges. First, context caching decisions across time slots are interdependent, and potential cache misses affect overall system efficiency. Second, request scheduling becomes more complex, leading to load-balancing issues among instances. We address these challenges by formulating an online optimization problem that jointly decides KV cache placement and request scheduling to minimize inference costs across time slots. Given the NP-hard nature of this problem, we propose a framework leveraging regularization, linear relaxation, and randomized rounding techniques. Our solution achieves a competitive ratio close to the offline optimum. Experimental results in a distributed LLM serving system demonstrate significant performance improvements over baseline methods.
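The abstract does not spell out the algorithm, so the following Python sketch is only an illustration of the two ingredients it names: rounding a relaxed (fractional) KV cache placement into 0/1 decisions, and then scheduling requests to instances with load balancing in mind. All names, costs, and capacity parameters (randomized_round, schedule_requests, recompute_cost, hit_cost) are hypothetical and not taken from the paper.

```python
import random

def randomized_round(fractional_placement, capacity):
    """Round fractional placement variables (e.g., from an LP relaxation)
    to 0/1 cache-placement decisions under a per-instance capacity.

    fractional_placement: dict mapping (prefix, instance) -> value in [0, 1]
    capacity: dict mapping instance -> max number of cached prefixes
    Returns the set of (prefix, instance) pairs chosen to hold a KV cache.
    """
    placed = set()
    used = {inst: 0 for inst in capacity}
    # Visit variables in decreasing fractional value; keep each with
    # probability equal to its relaxed value, subject to capacity.
    for (prefix, inst), x in sorted(fractional_placement.items(),
                                    key=lambda kv: -kv[1]):
        if used[inst] < capacity[inst] and random.random() < x:
            placed.add((prefix, inst))
            used[inst] += 1
    return placed

def schedule_requests(requests, placed, capacity,
                      recompute_cost=5.0, hit_cost=1.0):
    """Greedily assign each request to the least-loaded instance,
    preferring instances that already cache the request's prefix."""
    load = {inst: 0.0 for inst in capacity}
    assignment = {}
    for req_id, prefix in requests:
        holders = [inst for (p, inst) in placed if p == prefix]
        candidates = holders if holders else list(capacity)
        inst = min(candidates, key=lambda i: load[i])  # load balancing
        cost = hit_cost if inst in holders else recompute_cost
        load[inst] += cost
        assignment[req_id] = (inst, cost)
    return assignment

if __name__ == "__main__":
    # Toy example: two instances, two shared prefixes, three requests.
    x = {("p1", "gpu0"): 0.9, ("p1", "gpu1"): 0.3, ("p2", "gpu1"): 0.8}
    cap = {"gpu0": 1, "gpu1": 1}
    placement = randomized_round(x, cap)
    reqs = [("r1", "p1"), ("r2", "p1"), ("r3", "p2")]
    print(placement)
    print(schedule_requests(reqs, placement, cap))
```

This sketch handles a single time slot; the paper's online formulation additionally couples decisions across time slots, which the regularization technique it mentions is meant to address.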
External IDs: dblp:conf/infocom/GaoHYL0W25