FIER: Fine-Grained and Efficient KV Cache Retrieval for Long-context LLM Inference

ACL ARR 2025 May Submission363 Authors

11 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Key-Value (KV) cache reading latency increases significantly with context length, hindering the efficiency of long-context LLM inference. To address this, previous works propose retaining only a small fraction of the KV cache based on token importance. For example, KV eviction uses static heuristics to retain tokens, while KV retrieval dynamically selects query-relevant tokens for more adaptive cache management. However, we observe that important tokens are often sparsely distributed across the long context. This sparsity makes existing page-level KV retrieval inaccurate, as each page may include irrelevant tokens and miss critical ones. In this work, we propose Fier, a **Fi**ne-Grained and **E**fficient KV cache **R**etrieval method. Fier uses 1-bit quantized keys to estimate the importance of each token, enabling efficient and precise retrieval. Experiments show that Fier matches full KV cache performance using only 11\% of the cache budget across various long-context tasks, reducing decoding latency by 1.2$\times$ to 1.5$\times$.
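The abstract describes estimating per-token importance from 1-bit quantized keys and retrieving only the top tokens within a cache budget. The following is a minimal sketch of that general idea, not the authors' implementation; the function names (`quantize_keys_1bit`, `estimate_importance`, `select_tokens`) and the per-channel scaling scheme are assumptions for illustration only.

```python
import torch

def quantize_keys_1bit(keys: torch.Tensor):
    """Quantize cached keys to sign bits plus a per-channel scale.

    keys: (seq_len, head_dim) key cache for one attention head.
    Returns +1/-1 codes and a scale so that codes * scale roughly
    reconstructs the original keys (assumed quantization scheme).
    """
    codes = torch.sign(keys)                      # 1-bit representation
    scale = keys.abs().mean(dim=0, keepdim=True)  # per-channel scale (assumption)
    return codes, scale

def estimate_importance(query: torch.Tensor, codes: torch.Tensor, scale: torch.Tensor):
    """Approximate per-token attention scores from the 1-bit keys."""
    approx_keys = codes * scale          # cheap dequantization
    return approx_keys @ query           # (seq_len,) importance estimates

def select_tokens(scores: torch.Tensor, budget_ratio: float = 0.11):
    """Keep the top tokens within the cache budget (e.g., 11% as in the abstract)."""
    k = max(1, int(scores.numel() * budget_ratio))
    return torch.topk(scores, k).indices

# Usage: pick the token indices whose full-precision KV entries
# should be read for the current decoding step.
if __name__ == "__main__":
    seq_len, head_dim = 4096, 128
    keys = torch.randn(seq_len, head_dim)
    query = torch.randn(head_dim)
    codes, scale = quantize_keys_1bit(keys)
    scores = estimate_importance(query, codes, scale)
    selected = select_tokens(scores, budget_ratio=0.11)
    print(selected.shape)  # roughly 11% of seq_len token indices
```

Because the importance pass touches only 1-bit codes rather than full-precision keys, the selection step stays cheap even at long context lengths; the full-precision KV entries are then read only for the selected tokens.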
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: LLM Efficiency, quantization, NLP in resource-constrained settings
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Approaches to low-compute settings-efficiency
Languages Studied: English
Submission Number: 363