Keywords: Retrieval-augmented generation, Positional encoding, KV cache reuse
Abstract: Key-value (KV) caching accelerates inference in large language models (LLMs) by reusing computations from previously generated tokens. Its importance grows further in long-context applications such as retrieval-augmented generation (RAG) and in-context learning (ICL). However, conventional KV caching embeds positional information directly into the cache, limiting its reusability. Existing solutions face a memory-compute trade-off: they either restrict reuse to prefixes or require expensive memory materialization for position adjustment.
We introduce Lazy-Attention, a novel attention mechanism that kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV reuse. By fusing positional adjustment into the attention kernel on the fly, Lazy-Attention resolves the materialization bottleneck, allowing a single physical KV copy to serve multiple logical requests at arbitrary positions.
Leveraging two optimized kernels tailored for prefilling and decoding, Lazy-Attention achieves significant efficiency improvements: under skewed document distributions, it achieves a 1.37$\times$ reduction in time-to-first-token (TTFT) and a 1.40$\times$ increase in inference throughput compared to the state-of-the-art Block Attention, while maintaining comparable output quality.
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 9804
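The idea of deferred positional encoding described in the abstract can be illustrated with a minimal sketch: assuming a RoPE-style positional encoding, the KV cache stores un-rotated keys, and the rotation is applied inside the attention computation at whatever logical positions a given request assigns. The names below (`rope`, `lazy_attention`, `k_pos`) are illustrative and not the paper's actual kernels; the real system fuses this adjustment into GPU attention kernels rather than rotating keys in a separate pass as this reference code does.

```python
# Hypothetical sketch of deferred (lazy) positional encoding: keys are cached
# without any position applied, and RoPE is applied on the fly inside attention,
# so one physical KV copy can be read at arbitrary logical positions.
import torch

def rope(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embedding to x of shape (seq, heads, dim) at the given positions."""
    seq, heads, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)      # (half,)
    angles = positions.float()[:, None] * freqs[None, :]                   # (seq, half)
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]          # (seq, 1, half)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def lazy_attention(q, k_cache, v_cache, q_pos, k_pos):
    """Attention that rotates queries and cached keys at request-specified positions.

    q:        (q_len, heads, dim)   un-rotated queries
    k_cache:  (k_len, heads, dim)   un-rotated cached keys (position-agnostic)
    v_cache:  (k_len, heads, dim)   cached values
    q_pos:    (q_len,)              logical positions of the queries
    k_pos:    (k_len,)              logical positions assigned to the cached keys
    """
    dim = q.shape[-1]
    q_r = rope(q, q_pos)
    k_r = rope(k_cache, k_pos)        # positional adjustment deferred to attention time
    scores = torch.einsum("qhd,khd->hqk", q_r, k_r) / dim ** 0.5
    probs = scores.softmax(dim=-1)
    return torch.einsum("hqk,khd->qhd", probs, v_cache)

# The same physical k_cache/v_cache can serve requests that place the cached
# document at different offsets, simply by passing different k_pos values.
```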