Sim-LLM: Optimizing LLM Inference at the Edge through Inter-Task KV Reuse

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: LLMs, KV cache, Task Similarity, Edge Computing
Abstract: The KV cache reduces the computational overhead of *large language model* (LLM) inference by storing previously computed key-value pairs, which facilitates deployment on resource-constrained edge computing nodes such as edge servers. However, as tasks grow in size and complexity, the KV cache itself consumes substantial GPU memory. Existing research mitigates KV cache memory usage through sequence length reduction, task-specific compression, and dynamic eviction policies, but these methods are computationally expensive for resource-constrained edge nodes. To tackle this challenge, this paper presents Sim-LLM, a novel inference optimization mechanism that exploits task similarity to reduce the KV cache memory consumption of LLMs. By caching KVs from processed tasks and reusing them for subsequent similar tasks during inference, Sim-LLM significantly reduces memory consumption while boosting system throughput and increasing the maximum batch size, with minimal accuracy degradation. Evaluated on both A40 and A100 GPUs, Sim-LLM improves system throughput by up to 39.40% and reduces memory usage by up to 34.65% compared to state-of-the-art approaches. Our source code is available at https://github.com/CGCL-codes/SimLLM.
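
The abstract's core idea is inter-task KV reuse: cache the key-value pairs produced for a processed task and, when a new task is sufficiently similar, reuse those KVs instead of recomputing them. The sketch below is a minimal illustration of that idea, not the authors' implementation (see the linked repository); the embedding function, cosine-similarity threshold, cache structure, and the `prefill_and_generate` / `generate(past_key_values=...)` model interface are all assumptions introduced purely for illustration.

```python
# Minimal sketch of similarity-based inter-task KV reuse (illustrative only;
# the real Sim-LLM implementation lives at https://github.com/CGCL-codes/SimLLM).
import numpy as np

SIM_THRESHOLD = 0.9  # assumed cutoff for treating two tasks as "similar"


class KVReuseCache:
    """Stores (task_embedding, kv_tensors) pairs from previously served tasks."""

    def __init__(self):
        self.entries = []  # list of (embedding, kv) tuples

    def lookup(self, query_emb):
        """Return the cached KVs of the most similar past task, if any clears the threshold."""
        best_kv, best_sim = None, SIM_THRESHOLD
        for emb, kv in self.entries:
            sim = float(np.dot(emb, query_emb) /
                        (np.linalg.norm(emb) * np.linalg.norm(query_emb)))
            if sim >= best_sim:
                best_kv, best_sim = kv, sim
        return best_kv

    def insert(self, emb, kv):
        self.entries.append((emb, kv))


def serve(task_text, model, cache, embed):
    """Serve one request, reusing KVs from a similar past task when possible.

    `embed` and the two model methods are hypothetical placeholders standing in
    for a task-embedding step and an LLM runtime that accepts precomputed KVs.
    """
    emb = embed(task_text)
    reused_kv = cache.lookup(emb)
    if reused_kv is not None:
        # Skip (part of) the prefill by starting from cached key-value pairs.
        return model.generate(task_text, past_key_values=reused_kv)
    # No similar task seen yet: run the full prefill and cache its KVs.
    output, kv = model.prefill_and_generate(task_text)
    cache.insert(emb, kv)
    return output
```

In this sketch the memory and throughput benefits come from avoiding redundant KV computation and storage for similar tasks; how similarity is measured and which KV entries are safe to share are exactly the design questions the paper addresses.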
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 22471