Keywords: Retrieval-augmented generation, Positional encoding, KV cache reuse
Abstract: Key-value (KV) caching accelerates inference in large language models (LLMs) by reusing computations from previously generated tokens. Its importance grows further in long-context applications such as retrieval-augmented generation (RAG) and in-context learning (ICL). However, conventional KV caching embeds positional information directly into the cache, limiting its reusability. Existing solutions face a memory-compute trade-off: they either restrict reuse to prefixes or require expensive memory materialization for position adjustment.
We introduce Lazy-Attention, a novel attention mechanism that kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV reuse. By fusing positional adjustment into the attention kernel on the fly, Lazy-Attention resolves the materialization bottleneck, allowing a single physical KV copy to serve multiple logical requests at arbitrary positions.
Leveraging two optimized kernels tailored for prefilling and decoding, Lazy-Attention achieves significant efficiency improvements: under skewed document distributions, it achieves a 1.37$\times$ reduction in time-to-first-token (TTFT) and a 1.40$\times$ increase in inference throughput compared to the state-of-the-art Block Attention, while maintaining comparable output quality.
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 9804
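The idea of deferred positional encoding described in the abstract can be illustrated with a minimal sketch: assuming a RoPE-style positional encoding, the KV cache stores un-rotated keys, and the rotation is applied inside the attention computation at whatever logical positions a given request assigns. The names below (`rope`, `lazy_attention`, `k_pos`) are illustrative and not the paper's actual kernels; the real system fuses this adjustment into GPU attention kernels rather than rotating keys in a separate pass as this reference code does.

```python
# Hypothetical sketch of deferred (lazy) positional encoding: keys are cached
# without any position applied, and RoPE is applied on the fly inside attention,
# so one physical KV copy can be read at arbitrary logical positions.
import torch

def rope(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embedding to x of shape (seq, heads, dim) at the given positions."""
    seq, heads, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)      # (half,)
    angles = positions.float()[:, None] * freqs[None, :]                   # (seq, half)
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]          # (seq, 1, half)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def lazy_attention(q, k_cache, v_cache, q_pos, k_pos):
    """Attention that rotates queries and cached keys at request-specified positions.

    q:        (q_len, heads, dim)   un-rotated queries
    k_cache:  (k_len, heads, dim)   un-rotated cached keys (position-agnostic)
    v_cache:  (k_len, heads, dim)   cached values
    q_pos:    (q_len,)              logical positions of the queries
    k_pos:    (k_len,)              logical positions assigned to the cached keys
    """
    dim = q.shape[-1]
    q_r = rope(q, q_pos)
    k_r = rope(k_cache, k_pos)        # positional adjustment deferred to attention time
    scores = torch.einsum("qhd,khd->hqk", q_r, k_r) / dim ** 0.5
    probs = scores.softmax(dim=-1)
    return torch.einsum("hqk,khd->qhd", probs, v_cache)

# The same physical k_cache/v_cache can serve requests that place the cached
# document at different offsets, simply by passing different k_pos values.
```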