PRKV:Page Restruct KV Cache for High Accuracy and Efficiency LLM Generation

11 Sept 2025 (modified: 11 Feb 2026)Submitted to ICLR 2026EveryoneRevisionsBibTeXCC BY 4.0
Keywords: Long-Context LLM Inference, KV Cache Optimization
TL;DR: Hybrid KV selection for achieving high accuracy
Abstract: As the key-value(KV) cache size scales with context length, accessing large KV cache each step and substantial GPU memory demand challenge us to deploy LLMs with long contexts.Various sparse attention methods have been proposed and offloading-based KV retrieval preserves entire KV cache in CPU memory and dynamically retrieves most relevant KV pairs for each decoding step, which performs higher quality and effectively reduces GPU memory consumption than other line works. However, exiting KV retrieval performs page-level to reduce estimation overhead, which introduces inaccurate KV selection and significant re- trieval overhead. We propose PRKV, a framework that both-optimizes algorithm and system for page-level KV retrieval with KV offloading. On the algorithm side, PRKV introduces hybrid KV selection that combines both static and dynamic KV selection strategies. On the system side, PRKV employs contiguous memory in- dexing and batched transfer optimizations to improve retrieval efficiency. Exper- iments demonstrate that PRKV improve accuracy across various scenarios and models, delivering up to 6.75× speedup compared to SOTA KV retrieval methods.
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 3974
Loading