RACER: Retrieval-Augmented Contextual Rapid Speculative Decoding

16 Sept 2025 (modified: 05 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Large Language Models, Speculative Decoding, Inference Acceleration, Training-Free Optimization
TL;DR: RACER is a lightweight, training-free speculative decoding method that combines retrieval anchors with logit cues to accelerate LLM inference by 2.2–2.8× while outperforming prior methods.
Abstract: Autoregressive decoding in Large Language Models (LLMs) generates one token per step, causing high inference latency. Speculative decoding (SD) mitigates this through a guess‑and‑verify strategy, but existing training-free variants face trade‑offs: retrieval‑based drafts break when no exact match exists, while logits‑based drafts lack structural guidance. We propose **RACER** (**R**etrieval‑**A**ugmented **C**ont**e**xtual **R**apid Speculative Decoding), a lightweight and training‑free framework that integrates retrieved exact patterns with logit‑driven future cues. This unification supplies both reliable anchors and flexible extrapolation, yielding richer speculative drafts. Experiments on Spec-Bench, HumanEval, and MGSM demonstrate that RACER consistently accelerates inference, achieving $2.2{\sim}2.8\times$ speedups over autoregressive decoding, and outperforms prior training-free methods, offering a scalable, plug-and-play solution for efficient LLM decoding. Our source code is available at [this anonymous repository](https://anonymous.4open.science/r/racer_anonymous-9464).
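
To make the drafting idea concrete, below is a minimal sketch of how a retrieval anchor (exact suffix match in the prior context) could be combined with a logit-driven cue (here approximated by a cached top-1 next-token table) to build a speculative draft. This is not the authors' implementation; the function names (`retrieval_anchor`, `logit_cue`, `propose_draft`) and parameters (`ngram`, `span`, `steps`) are hypothetical illustrations of the general technique.

```python
# Hedged sketch: retrieval anchor + logit cue for training-free draft construction.
# All names and parameter choices here are assumptions, not the paper's code.
from typing import Dict, List


def retrieval_anchor(context: List[int], ngram: int = 3, span: int = 5) -> List[int]:
    """Prompt-lookup style retrieval: find an earlier exact occurrence of the
    last `ngram` tokens and copy the tokens that followed it as a draft anchor."""
    if len(context) <= ngram:
        return []
    suffix = context[-ngram:]
    # Scan candidate start positions from most recent to oldest,
    # excluding the trailing suffix itself.
    for start in range(len(context) - ngram - 1, -1, -1):
        if context[start:start + ngram] == suffix:
            follow = context[start + ngram:start + ngram + span]
            if follow:
                return follow
    return []  # no exact match found


def logit_cue(top1_cache: Dict[int, int], last_token: int, steps: int = 3) -> List[int]:
    """Logit-driven extrapolation: chain cached top-1 next-token predictions
    (e.g. collected from earlier verification passes) to extend the draft."""
    draft, tok = [], last_token
    for _ in range(steps):
        if tok not in top1_cache:
            break
        tok = top1_cache[tok]
        draft.append(tok)
    return draft


def propose_draft(context: List[int], top1_cache: Dict[int, int]) -> List[int]:
    """Combine both sources: the retrieval anchor supplies a reliable prefix,
    and logit cues extend it (or replace it when no exact match exists)."""
    anchor = retrieval_anchor(context)
    seed = anchor[-1] if anchor else context[-1]
    return anchor + logit_cue(top1_cache, seed)


if __name__ == "__main__":
    # Toy token ids: the suffix [7, 8, 9] also appears earlier, so the anchor
    # copies the tokens that followed it; the cached cues then extend the draft.
    ctx = [1, 7, 8, 9, 4, 5, 6, 7, 8, 9]
    cache = {8: 9, 9: 4}  # hypothetical cached top-1 continuations
    print(propose_draft(ctx, cache))  # -> [4, 5, 6, 7, 8, 9, 4]
```

The draft produced this way is then verified by the target model in a single forward pass, as in standard speculative decoding; only the draft-construction step is sketched above.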
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 6530