TurboRAG: Accelerating Retrieval-Augmented Generation with Precomputed KV Caches for Chunked Text

ACL ARR 2025 May Submission6293 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Current Retrieval-Augmented Generation (RAG) systems concatenate and process numerous retrieved document chunks during prefill, which requires a large volume of computation and therefore incurs significant time-to-first-token (TTFT) latency. To reduce both the computation overhead and TTFT, we introduce TurboRAG, a hybrid offline–online paradigm that (i) pre-computes chunk-level key-value (KV) caches offline, (ii) stitches them together at inference time using independent-attention and reordered-RoPE techniques, and (iii) preserves answer quality without changing the model architecture. As a result, online computation of KV caches is eliminated during inference. Our approach is applicable to most existing large language models and their applications without modifying the models or inference systems. Experimental results across a suite of RAG benchmarks demonstrate that TurboRAG reduces TTFT by up to 9.4x compared to conventional RAG systems (8.6x on average), while preserving accuracy comparable to standard RAG systems.
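The reordered-RoPE idea in the abstract relies on a well-known property of rotary position embeddings: rotating a vector for position p and then by an offset d is the same as rotating it once for position p + d. This is what lets chunk KV caches, precomputed with positions starting at 0, be shifted to their final positions in the concatenated prompt without re-running prefill. The sketch below is an illustrative, stdlib-only reconstruction of that property, not the authors' implementation; the function name `rope_rotate` and the toy dimensions are assumptions.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Apply RoPE-style rotation to a single head vector of even length.

    Dimension pair (i, i + half) is rotated by angle pos * base^(-i/half),
    mirroring the standard rotary-embedding formulation.
    """
    half = len(vec) // 2
    out = [0.0] * len(vec)
    for i in range(half):
        theta = pos * base ** (-i / half)
        c, s = math.cos(theta), math.sin(theta)
        x1, x2 = vec[i], vec[i + half]
        out[i] = x1 * c - x2 * s
        out[i + half] = x1 * s + x2 * c
    return out

# Offline: cache a chunk's key at its local position (here, position 3).
raw_key = [0.3, -1.2, 0.7, 2.0]
cached_key = rope_rotate(raw_key, 3)

# Online: the chunk lands at offset 5 in the stitched prompt, so its
# global position is 3 + 5 = 8. Rotating the cached key by the offset
# alone reproduces the key that full prefill would have computed.
shifted = rope_rotate(cached_key, 5)
recomputed = rope_rotate(raw_key, 8)
assert all(abs(a - b) < 1e-9 for a, b in zip(shifted, recomputed))
```

In a real system this re-rotation would be applied to every cached key tensor when chunks are assembled in retrieval order, while queries attend across the stitched caches; the independent-attention part (each chunk's keys computed without attending to other chunks) is an offline design choice that this snippet does not model.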
Paper Type: Long
Research Area: Generation
Research Area Keywords: retrieval-augmented generation
Contribution Types: Approaches for low compute settings-efficiency
Languages Studied: English, Chinese
Submission Number: 6293