READER: Retrieval-Assisted Drafter for Efficient LLM Inference

ICLR 2026 Conference Submission 19394 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Speculative decoding, LLM inference
TL;DR: We propose a lossless, retrieval-assisted speculative decoding method that accelerates large-batch LLM inference by leveraging self-repetitions and statistical search.
Abstract: Autoregressive language models instantiate a factorized likelihood over token sequences, yet their strictly sequential decoding process imposes an intrinsic lower bound on inference latency. This bottleneck has emerged as a central obstacle to the scalable deployment of large-scale generative models. Existing acceleration techniques partially mitigate token-level latency, typically by relying on auxiliary draft models or introducing an additional training phase, but they fail to address the dominant memory and communication costs. We present READER (Retrieval-Assisted Drafter for Efficient LLM Inference), a provably lossless speculative decoding framework that bypasses training an auxiliary draft model. READER formalizes speculative decoding as a stochastic tree-construction problem and exploits the empirical redundancy structure of natural language to generate high-probability candidate continuations. Our method revisits the construction of draft trees, establishing substantial statistical improvements over stochastic draft-tree methods and providing a complexity-theoretic analysis that characterizes the optimality frontier of speculative decoding under bounded computation and memory resources. Beyond the single-sequence regime considered in prior work, we introduce a memory-optimal key-value cache-serving strategy that guarantees amortized sublinear overhead in the batch dimension, allowing READER to scale to realistic inference workloads. Comprehensive experiments demonstrate up to 6.13× wall-clock speedup on single-prompt inference and up to 5.92× on batched inference, consistently surpassing prior speculative decoding baselines while preserving exact output equivalence, with even more pronounced gains in retrieval-augmented generation pipelines. Our results close a key gap between theoretical parallelism limits and practical LLM inference, suggesting a new standard for efficient deployment.
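
The abstract describes the drafting idea only at a high level. As an illustration of the general mechanism (not the paper's algorithm), the sketch below shows a prompt-lookup-style drafter that exploits self-repetitions: the most recent n-gram of the context is matched against earlier occurrences, the tokens that followed it are proposed as a draft, and greedy verification keeps only tokens the target model would have produced anyway, so the output is unchanged. The function names (`propose_draft`, `next_token`) and parameters (`ngram`, `max_draft`) are illustrative assumptions, not names from the paper.

```python
# Minimal sketch (not the authors' implementation) of retrieval-assisted drafting:
# drafts are retrieved from repetitions in the already-generated context and then
# verified against the target model, so the final output matches plain greedy decoding.

from typing import Callable, List


def propose_draft(tokens: List[int], ngram: int = 3, max_draft: int = 8) -> List[int]:
    """Propose a draft continuation by retrieving a repeated n-gram from the context."""
    if len(tokens) < ngram:
        return []
    suffix = tokens[-ngram:]
    # Scan earlier positions for the same n-gram, preferring the most recent match.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if tokens[start:start + ngram] == suffix:
            continuation = tokens[start + ngram:start + ngram + max_draft]
            if continuation:
                return continuation
    return []


def generate(next_token: Callable[[List[int]], int],
             prompt: List[int], max_new: int = 64) -> List[int]:
    """Greedy decoding with draft-and-verify; output is identical to plain greedy decoding."""
    tokens = list(prompt)
    produced = 0
    while produced < max_new:
        draft = propose_draft(tokens)
        if not draft:
            # No retrieval hit: fall back to one ordinary decoding step.
            tokens.append(next_token(tokens))
            produced += 1
            continue
        # Verify draft tokens against the target model's greedy choices.
        # (A real system scores the whole draft in a single batched forward pass;
        # here we call the model per token purely to keep the sketch simple.)
        for d in draft:
            t = next_token(tokens)  # always append the model's own token, never the draft
            tokens.append(t)
            produced += 1
            if t != d or produced >= max_new:
                break
    return tokens
```

READER additionally organizes candidates into draft trees and amortizes key-value cache overhead across the batch dimension; the linear-draft, single-sequence version above is only the simplest instance of the retrieval-assisted drafting idea.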
Primary Area: generative models
Submission Number: 19394