Token-PD: Portfolio-Optimal KV-Cache Eviction for Multi-Tenant LLM Inference

Published: 04 Jul 2025 · Last Modified: 22 Jul 2025 · KDD 2025 Workshop on Inference Optimization for GenAI · Poster · CC BY 4.0
Keywords: LLM inference, KV cache eviction, online convex optimisation, primal–dual algorithms, memory optimisation, generative AI
Abstract: The memory footprint of key–value (KV) caches has become the primary bottleneck for serving large language models (LLMs) at scale. Existing eviction heuristics optimise each request independently and ignore the fact that hundreds of concurrent conversations must share the same GPU. We cast cache management as an online knapsack problem by viewing every cached token as an “asset” that offers a stochastic future return (expected attention weight) at a fixed memory cost. Building on online primal–dual theory, we develop \textsc{Token-PD}, a regret-bounded algorithm that prices tokens in sub-millisecond time and selects an optimal batch-level cache under a hard SRAM budget. Integrated into the \texttt{vLLM} engine, our method cuts peak memory by up to 60% and increases request throughput by 1.3× on public multi-tenant traces—without degrading perplexity—while adding less than 1 ms of runtime overhead. The approach is model-agnostic and complements kernel-level and quantisation advances.
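For concreteness, the sketch below illustrates the kind of online primal–dual token pricing the abstract describes, applied to a single batch-level memory budget. It is not the paper's implementation: the names (CachedToken, PrimalDualCache, value, cost, price), the multiplicative dual update, and the density-threshold eviction rule are illustrative assumptions consistent with standard online-knapsack primal–dual schemes.

```python
from dataclasses import dataclass, field


@dataclass
class CachedToken:
    token_id: int
    value: float  # estimated expected future attention weight (the "return")
    cost: int     # memory cost of the token's K/V state, in bytes


@dataclass
class PrimalDualCache:
    budget: int                                 # hard memory budget in bytes
    price: float = 1e-6                         # dual variable: price per byte
    used: int = 0
    tokens: dict = field(default_factory=dict)  # token_id -> CachedToken

    def admit(self, tok: CachedToken) -> None:
        """Insert a newly generated token, then restore feasibility if needed."""
        self.tokens[tok.token_id] = tok
        self.used += tok.cost
        self._rebalance()

    def _rebalance(self) -> None:
        # Evict until the batch-level cache fits the budget again.
        while self.used > self.budget and self.tokens:
            # Dual step: raise the per-byte price in proportion to the
            # current overflow (a standard online-knapsack update, assumed here).
            self.price *= 1.0 + (self.used - self.budget) / self.budget
            # Primal step: keep only tokens whose value density covers the price.
            victims = [t for t in self.tokens.values()
                       if t.value / t.cost < self.price]
            if not victims:
                # Guarantee progress: drop the single lowest-density token.
                victims = [min(self.tokens.values(),
                               key=lambda t: t.value / t.cost)]
            for t in victims:
                del self.tokens[t.token_id]
                self.used -= t.cost
```

A serving loop would call admit() once per decoded token and refresh each token's value estimate from observed attention weights; a real integration would also have to batch evictions against the paged KV layout of the engine, which this sketch deliberately ignores.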
Submission Number: 12