Accelerating Iterative Retrieval-augmented Language Model Serving with Speculation

Zhihao Zhang; Alan Zhu; Lijie Yang; Yihua Xu; Lanting Li; Phitchaya Mangpo Phothilimthana; Zhihao Jia

Accelerating Iterative Retrieval-augmented Language Model Serving with Speculation

Zhihao Zhang, Alan Zhu, Lijie Yang, Yihua Xu, Lanting Li, Phitchaya Mangpo Phothilimthana, Zhihao Jia

Published: 02 May 2024, Last Modified: 25 Jun 2024ICML 2024 PosterEveryoneRevisionsBibTeXCC BY 4.0

Abstract: This paper introduces RaLMSpec, a framework that accelerates iterative retrieval-augmented language model (RaLM) with *speculative retrieval* and *batched verification*. RaLMSpec further introduces several important systems optimizations, including prefetching, optimal speculation stride scheduler, and asynchronous verification. The combination of these techniques allows RaLMSPec to significantly outperform existing systems. For document-level iterative RaLM serving, evaluation over three LLMs on four QA datasets shows that RaLMSpec improves over existing approaches by $1.75$-$2.39\times$, $1.04$-$1.39\times$, and $1.31$-$1.77\times$ when the retriever is an exact dense retriever, approximate dense retriever, and sparse retriever respectively. For token-level iterative RaLM (KNN-LM) serving, RaLMSpec is up to $7.59\times$ and $2.45\times$ faster than existing methods for exact dense and approximate dense retrievers, respectively.

Submission Number: 3351

Loading