Keywords: Speculative Decoding, Large Language Models, High-Throughput Inference
Abstract: Speculative decoding is a widely adopted method for accelerating autoregressive generation by drafting multiple candidate tokens and verifying them jointly with the target model. While effective in small-batch settings, it has been considered impractical under large-batch inference due to the belief that such regimes are compute-bound. Motivated by recent system-level findings that memory bandwidth, not compute, remains the dominant bottleneck in large-batch inference, we revisit the feasibility of speculative decoding under high-throughput conditions. We introduce \emph{$\gamma$-tolerance}, a latency-based criterion that characterizes when speculative decoding provides tangible speedup, and empirically validate that acceleration remains attainable across practical batch sizes and system configurations. Building on this insight, we derive a revised success condition for speculative decoding and demonstrate that most existing drafter architectures violate it due to poor trade-offs between accuracy and efficiency. To address this, we identify Multi-Token Prediction with Gated LoRA as a promising approach and develop a high-performance implementation. Our system achieves up to $2.37{\times}$ speedup at batch size 256 without requiring long-context prompts or architectural changes to the target model, demonstrating that speculative decoding can be both feasible and effective in large-batch inference.
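Illustrative note: as a rough back-of-envelope sketch (generic notation assumed here, not the paper's exact \emph{$\gamma$-tolerance} definition), speculative decoding with draft length $\gamma$, per-token acceptance rate $\alpha$, drafter step latency $t_{\mathrm{draft}}$, target decoding-step latency $t_{\mathrm{tgt}}$, and verification latency $t_{\mathrm{verify}}(\gamma)$ yields a net speedup only when the expected tokens accepted per draft-verify cycle outweigh that cycle's latency:
\[
  \frac{1-\alpha^{\gamma+1}}{1-\alpha}\, t_{\mathrm{tgt}} \;>\; \gamma\, t_{\mathrm{draft}} + t_{\mathrm{verify}}(\gamma).
\]
At large batch sizes $t_{\mathrm{verify}}(\gamma)$ grows with both the batch size and $\gamma$, which is the kind of latency constraint the abstract's $\gamma$-tolerance criterion is described as capturing.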
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 24970