Enhancing Large Language Model Inference Efficiency via Lookahead Cache Filtering

Jie Ou, Yueming Chen, Shuaihong Jiang, Wenhong Tian

Published: 2025, Last Modified: 14 Mar 2026. Venue: ICASSP 2025. License: CC BY-SA 4.0

Abstract: The large Key-Value (KV) cache is a significant challenge in deploying Large Language Models (LLMs). Current research addresses this issue with cache compression techniques, which we find suffer from information loss and the "lost-in-the-middle" problem. We propose Lookahead Cache Filtering (LCF), which retains the full cache in host memory to preserve complete details. LCF uses Important Key-Value Lookahead Prediction by Approximate Sorting and Gather-based Matrix Multiplication on the CPU to filter the important KV cache entries, reducing the overhead of loading the cache to the GPU and improving inference throughput. Finally, a Multi-Scale Pyramid Information Fusion mechanism enhances information fusion to further improve the quality of inference results. Experiments demonstrate that LCF maintains accuracy without introducing additional latency while reducing memory requirements. Our code will be available on GitHub.
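
To make the filtering idea concrete, the following is a minimal sketch of selecting a small subset of cache entries on the host and attending only over that subset. It is an illustration under stated assumptions, not the authors' implementation: the function names `filter_kv` and `attend`, the `budget` parameter, and the use of dot-product scores are all hypothetical, and an exact partial selection (`heapq.nlargest`) stands in for the approximate sorting named in the abstract. The lookahead prediction and the pyramid fusion steps are not modeled.

```python
import heapq
import math


def filter_kv(query, keys, values, budget):
    """Keep the `budget` cache entries whose keys score highest against `query`.

    Hypothetical helper: the scoring rule and selection method are assumptions.
    """
    # Score every cached key on the host (dot product with the query).
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Pick the top-`budget` indices without fully sorting all scores.
    top = heapq.nlargest(budget, range(len(keys)), key=scores.__getitem__)
    top.sort()  # restore the positional order of the kept tokens
    # Gather only the selected rows; only these would be copied to the GPU.
    return top, [keys[i] for i in top], [values[i] for i in top]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query over the given entries."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


# Toy cache with four entries; keep the two most relevant ones.
query = [1.0, 0.0]
keys = [[0.1, 0.9], [2.0, 0.0], [-1.0, 0.5], [1.5, 0.2]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [0.0, 2.0]]
kept, k_sel, v_sel = filter_kv(query, keys, values, budget=2)
output = attend(query, k_sel, v_sel)
```

In this toy setup the full `keys` and `values` lists play the role of the cache held in host memory, while `k_sel` and `v_sel` are the filtered entries that would be transferred for attention. The transfer cost then scales with `budget` rather than with the full sequence length.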

External IDs: dblp:conf/icassp/OuCJT25