Keywords: Speculative Decoding, Large Language Models, High-Throughput Inference
Abstract: Speculative decoding is a widely adopted method for accelerating autoregressive generation by drafting multiple candidate tokens and verifying them jointly with the target model. While effective in small-batch settings, it has been considered impractical under large-batch inference due to the belief that such regimes are compute-bound. Motivated by recent system-level findings that memory bandwidth, not compute, remains the dominant bottleneck in large-batch inference, we revisit the feasibility of speculative decoding under high-throughput conditions. We introduce \emph{$\gamma$-tolerance}, a latency-based criterion that characterizes when speculative decoding provides tangible speedup, and empirically validate that acceleration remains attainable across practical batch sizes and system configurations. Building on this insight, we derive a revised success condition for speculative decoding and demonstrate that most existing drafter architectures violate it due to poor trade-offs between accuracy and efficiency. To address this, we identify Multi-Token Prediction with Gated LoRA as a promising approach and develop a high-performance implementation. Our system achieves up to $2.37{\times}$ speedup at batch size 256 without requiring long-context prompts or architectural changes to the target model, demonstrating that speculative decoding can be both feasible and effective in large-batch inference.
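Illustrative note: as a rough back-of-envelope sketch (generic notation assumed here, not the paper's exact \emph{$\gamma$-tolerance} definition), speculative decoding with draft length $\gamma$, per-token acceptance rate $\alpha$, drafter step latency $t_{\mathrm{draft}}$, target decoding-step latency $t_{\mathrm{tgt}}$, and verification latency $t_{\mathrm{verify}}(\gamma)$ yields a net speedup only when the expected tokens accepted per draft-verify cycle outweigh that cycle's latency:
\[
  \frac{1-\alpha^{\gamma+1}}{1-\alpha}\, t_{\mathrm{tgt}} \;>\; \gamma\, t_{\mathrm{draft}} + t_{\mathrm{verify}}(\gamma).
\]
At large batch sizes $t_{\mathrm{verify}}(\gamma)$ grows with both the batch size and $\gamma$, which is the kind of latency constraint the abstract's $\gamma$-tolerance criterion is described as capturing.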
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 24970