Keywords: Long-Context LLM Inference, KV Cache Optimization
TL;DR: Hybrid KV selection for achieving high accuracy
Abstract: As the key-value(KV) cache size scales with context length, accessing large KV
cache each step and substantial GPU memory demand challenge us to deploy
LLMs with long contexts.Various sparse attention methods have been proposed
and offloading-based KV retrieval preserves entire KV cache in CPU memory
and dynamically retrieves most relevant KV pairs for each decoding step, which
performs higher quality and effectively reduces GPU memory consumption than
other line works. However, exiting KV retrieval performs page-level to reduce
estimation overhead, which introduces inaccurate KV selection and significant re-
trieval overhead. We propose PRKV, a framework that both-optimizes algorithm
and system for page-level KV retrieval with KV offloading. On the algorithm side,
PRKV introduces hybrid KV selection that combines both static and dynamic KV
selection strategies. On the system side, PRKV employs contiguous memory in-
dexing and batched transfer optimizations to improve retrieval efficiency. Exper-
iments demonstrate that PRKV improve accuracy across various scenarios and
models, delivering up to 6.75× speedup compared to SOTA KV retrieval methods.
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 3974
Loading