ARC-Decode: Risk-Bounded Acceptance for Sampling-Based Speculative Decoding

02 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: LLMs, Speculative decoding, Sampling-based decoding, Inference acceleration, Risk-bounded acceptance
TL;DR: ARC-Decode is a training-free method that accelerates speculative decoding under sampling via entropy-aware pruning and risk-bounded acceptance, increasing accept length to achieve up to 1.6× speedup with negligible quality degradation.
Abstract: As larger language models deliver stronger capabilities, their autoregressive inference becomes increasingly expensive. *Speculative decoding* accelerates generation by letting a fast draft process propose tokens that the target model verifies in parallel. Yet under sampling ($T>0$), observed speedups consistently lag behind those under greedy decoding: verification expends compute on low-value branches, and the *classical lossless verification rule* rejects drafts that would induce only negligible changes in the next-step conditional distribution. The key limitation under sampling is thus **over-rejection of low-risk drafts**, which depresses acceptance rates and limits acceleration. To close this gap, we propose **ARC-Decode** (**A**cceptance with **R**isk **C**ontrol), a training-free method that augments speculative decoding and requires no extra forward passes. Our method enables *relaxed* acceptance while guaranteeing that accepting non–top-1 drafts causes only negligible next-step distributional shifts, as measured by Jensen–Shannon divergence. ARC-Decode combines (i) confidence-based pre-verification filtering that preserves high-probability branches while enforcing prefix closure and leaf safety, and (ii) a risk-bounded acceptance criterion using an analytic upper bound on the next-step distribution shift derived from embedding and logit differences. Integrated into the state-of-the-art EAGLE-3 pipeline, ARC-Decode increases accept length per cycle and reduces verification compute, achieving up to **1.6×** end-to-end speedup over EAGLE-3 under sampling with negligible quality change across benchmarks.
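The acceptance rule sketched in the abstract can be illustrated with a minimal Python snippet. This is not the paper's implementation: the function names (`js_divergence`, `risk_bounded_accept`) and the risk budget `delta` are hypothetical, and ARC-Decode avoids the second target-model distribution this sketch assumes by bounding the shift analytically from embedding and logit differences. The sketch only shows the decision the bound stands in for: accept a non–top-1 draft token when the induced next-step distributional shift, measured by Jensen–Shannon divergence, stays within a small budget.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (in nats) between two discrete distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def risk_bounded_accept(p_next_if_accepted, p_next_lossless, delta):
    """Accept a non-top-1 draft token iff the shift it induces in the
    next-step conditional distribution is below the risk budget delta.

    Illustrative only: ARC-Decode replaces the explicit divergence here
    with an analytic upper bound computed from embedding and logit
    differences, so no extra forward pass is needed."""
    return js_divergence(p_next_if_accepted, p_next_lossless) <= delta
```

With a small budget, an identical next-step distribution is accepted while a disjoint one is rejected, e.g. `risk_bounded_accept([0.5, 0.5], [0.5, 0.5], 0.01)` is `True` and `risk_bounded_accept([1.0, 0.0], [0.0, 1.0], 0.01)` is `False`.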
Primary Area: generative models
Submission Number: 1041