Learned Prefix Caching for Efficient LLM Inference

Published: 18 Sept 2025, Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: large language model inference, prefilling, prefix caching, learned caching algorithm
TL;DR: A new learned eviction algorithm that predicts the probability a conversation will continue and uses it to guide LLM prefix cache eviction.
Abstract: Prefix caching is a key technique for reducing Large Language Model (LLM) inference costs. However, the prevalent least-recently-used (LRU) eviction algorithm leaves a large gap to the optimal algorithm. This paper introduces LPC, the first learned method for LLM prefix cache eviction. LPC analyzes conversational content to provide predictive guidance for eviction, determining which conversations are likely to continue. These predictions, combined with last-access timestamps, inform more effective cache management. Extensive evaluations across three real-world datasets demonstrate that LPC reduces the cache size required for equivalent hit ratios by 18-47% and improves LLM prefilling throughput by 11% in an emulated environment.
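The abstract describes an eviction policy that blends a learned continuation signal with recency. The following is a minimal illustrative sketch of that general idea, not the paper's actual LPC implementation: the `continuation_predictor` callable and the blending weight are hypothetical stand-ins for the learned component described in the abstract.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List
import time


@dataclass
class CacheEntry:
    conversation_id: str
    prefix_tokens: List[int]  # stand-in for the cached KV prefix
    last_access: float = field(default_factory=time.monotonic)


class LearnedPrefixCache:
    """Toy prefix cache whose eviction score combines a learned
    continuation probability with recency (hypothetical sketch)."""

    def __init__(self, capacity: int,
                 continuation_predictor: Callable[[str], float],
                 recency_weight: float = 0.5):
        self.capacity = capacity
        self.predictor = continuation_predictor  # assumed learned model
        self.recency_weight = recency_weight
        self.entries: Dict[str, CacheEntry] = {}
        self.texts: Dict[str, str] = {}

    def access(self, conversation_id: str, text: str,
               prefix_tokens: List[int]) -> bool:
        """Insert or refresh a conversation prefix; returns True on a hit."""
        hit = conversation_id in self.entries
        if not hit and len(self.entries) >= self.capacity:
            self._evict()
        self.entries[conversation_id] = CacheEntry(conversation_id, prefix_tokens)
        self.texts[conversation_id] = text
        return hit

    def _evict(self) -> None:
        """Evict the entry with the lowest blended score of
        predicted continuation probability and recency."""
        now = time.monotonic()

        def score(cid: str) -> float:
            entry = self.entries[cid]
            p_continue = self.predictor(self.texts[cid])       # learned signal
            recency = 1.0 / (1.0 + (now - entry.last_access))  # recency signal
            return ((1 - self.recency_weight) * p_continue
                    + self.recency_weight * recency)

        victim = min(self.entries, key=score)
        del self.entries[victim]
        del self.texts[victim]
```

In this sketch the predictor could be any model that maps conversation text to a continuation probability; setting `recency_weight=1.0` recovers a purely recency-based policy akin to LRU.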
Supplementary Material: zip
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 8753