Batch Speculative Decoding Done Right

19 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: speculative decoding, batch speculative decoding, llm inference
TL;DR: Correct batch speculative decoding = proper position/mask/KV-cache synchronization for ragged tensors + cross-batch scheduling to eliminate wasted overhead.
Abstract: Speculative decoding speeds up LLM inference by using a small draft model to propose multiple tokens that a target model verifies in parallel. Extending this idea to batches is essential for production serving, but it introduces the ragged tensor problem: sequences in the same batch accept different numbers of draft tokens, which breaks right-alignment and corrupts position IDs, attention masks, and KV-cache state. We show that several existing batch implementations violate output equivalence, the fundamental requirement that speculative decoding produce token sequences identical to those of standard autoregressive generation. These violations stem directly from improper handling of the ragged tensor problem. In response, we (1) characterize the synchronization requirements that guarantee correctness, (2) present a correctness-first batch speculative decoding method, \oursb, which exposes realignment as consuming 40% of overhead, and (3) introduce \oursx, which maintains a sliding pool of sequences and dynamically forms same-length groups to reduce the realignment overhead while preserving per-sequence speculative speedups. On the SpecBench dataset, across Vicuna-7B/68M, Qwen3-8B/0.6B, and GLM-4-9B/0.6B target/draft pairs, our approach achieves up to 3× throughput improvement at batch size 8 compared to batch size 1, with efficient scaling through batch size 8, while maintaining 95% output equivalence. Our method requires no custom kernels and integrates cleanly with existing inference stacks.
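Illustrative sketch (not the authors' implementation): the ragged tensor problem and the same-length grouping idea described in the abstract can be shown with a small pure-Python example. The helper names `realign_right` and `group_by_length`, the padding id, and the token values are all hypothetical. KV-cache realignment, which the abstract also lists as a synchronization requirement, is not modeled here.

```python
from collections import defaultdict

PAD = 0  # hypothetical padding token id, chosen only for illustration


def realign_right(sequences, pad_id=PAD):
    """Left-pad ragged sequences so every row ends at the same column.

    Returns (input_ids, attention_mask, position_ids) as lists of lists.
    """
    width = max(len(s) for s in sequences)
    input_ids, attention_mask, position_ids = [], [], []
    for seq in sequences:
        n_pad = width - len(seq)
        input_ids.append([pad_id] * n_pad + list(seq))
        # Padded slots are masked out so they cannot be attended to.
        attention_mask.append([0] * n_pad + [1] * len(seq))
        # Positions count real tokens only; padded slots get a dummy 0.
        position_ids.append([0] * n_pad + list(range(len(seq))))
    return input_ids, attention_mask, position_ids


def group_by_length(pool):
    """Bucket sequence ids by current length; a same-length group needs no padding."""
    groups = defaultdict(list)
    for seq_id, seq in pool.items():
        groups[len(seq)].append(seq_id)
    return dict(groups)


# Two sequences with 3-token prompts. In one speculative step the first
# accepts 3 draft tokens and the second accepts 1, so the batch is ragged.
batch = [[11, 12, 13, 21, 22, 23], [31, 32, 33, 41]]
ids, mask, pos = realign_right(batch)

# A pool of three sequences: "a" and "c" have equal length and can be
# batched together without any realignment.
pool = {"a": [1, 2, 3], "b": [4, 5], "c": [6, 7, 8]}
groups = group_by_length(pool)
```

In this sketch the shorter row is shifted right by two padded slots, and its mask and position IDs are rebuilt to match; reusing the mask or positions from before the step would be an instance of the corruption the abstract describes. Grouping by length avoids the shift entirely for sequences that happen to share a length.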
Supplementary Material: zip
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 21406