Keywords: Large Language Models, Retrieval-Augmented Generation, Real-time Retrieval, Parallel Processing, Draft & Verification, Latency Reduction
Abstract: Retrieval-augmented generation (RAG) leverages external knowledge bases to enhance the quality of answers produced by large language models (LLMs). However, retrieving relevant documents from large-scale databases can be time-consuming, and existing RAG methods primarily focus on improving accuracy while often overlooking latency. In this paper, we introduce \textit{Staged Parallel Speculation (SPS)}, a training-free RAG framework that achieves substantial latency reduction without sacrificing answer quality. Unlike prior approaches that rely on task-specific training or model modifications, SPS is a plug-and-play method that requires no changes to the underlying models. Our framework runs the inference and retrieval systems in parallel during staged retrieval, eliminating frequent pauses in the inference process and significantly accelerating generation. Furthermore, at each retrieval-generation stage, SPS first uses a model to generate multiple candidate answer chunks in parallel and then selects the most reliable output based on self-consistency among the candidates, further improving answer quality. Extensive experiments across multiple benchmark datasets show that SPS consistently surpasses training-free RAG baselines, achieving higher accuracy with 57\% lower latency, while still reaching 96\% of the performance of fine-tuning-based methods, making it a practical choice for deployment in latency-sensitive applications such as agentic systems, enterprise knowledge management, and healthcare support.
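The abstract describes two mechanisms: overlapping retrieval with generation across stages, and self-consistency selection over parallel candidate chunks. The paper's actual retriever, LLM interface, and consistency score are not given here, so the following is a minimal sketch under those assumptions, with `retrieve` and `generate_chunk` as hypothetical stub components and a simple majority vote standing in for the self-consistency criterion:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stubs standing in for a real retriever and a sampled LLM call.
def retrieve(query):
    return f"docs-for:{query}"

def generate_chunk(context, seed):
    return f"chunk({context})"  # deterministic stub; real samples would vary

def select_by_self_consistency(candidates):
    """Majority vote over normalized candidate texts (one possible criterion)."""
    normalized = [c.strip().lower() for c in candidates]
    winner, _ = Counter(normalized).most_common(1)[0]
    return candidates[normalized.index(winner)]

def staged_generation(queries, n_candidates=3):
    """One stage per query: retrieve for the NEXT stage while generating the
    current stage's candidate chunks in parallel, then pick one by consensus."""
    answer = []
    with ThreadPoolExecutor() as pool:
        docs = retrieve(queries[0])
        for i in range(len(queries)):
            # Overlap: kick off the next stage's retrieval before generating.
            next_docs = (pool.submit(retrieve, queries[i + 1])
                         if i + 1 < len(queries) else None)
            # Speculate: sample several candidate chunks concurrently.
            futures = [pool.submit(generate_chunk, docs, s)
                       for s in range(n_candidates)]
            candidates = [f.result() for f in futures]
            answer.append(select_by_self_consistency(candidates))
            if next_docs is not None:
                docs = next_docs.result()  # retrieval finished during generation
    return " ".join(answer)
```

With the stubs above, `staged_generation(["q1", "q2"])` walks two stages, retrieving for the second while the first generates; the key design point is that the retrieval future is submitted before the generation futures, so neither waits on the other.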
Supplementary Material: zip
Primary Area: generative models
Submission Number: 21398