Keywords: Speculative Decoding, Efficient Inference, Distributed Inference, Systems for ML
TL;DR: We introduce Mirror Speculative Decoding, a method that overlaps draft and target execution across GPUs and NPUs to cut inference latency while preserving exact acceptance.
Abstract: Speculative decoding accelerates LLM inference with draft lookahead, but its effectiveness is bottlenecked by autoregressive draft generation: larger drafts improve acceptance yet also increase speculation latency overhead, capping speedup. Existing approaches such as Medusa, Hydra, and EAGLE partially address draft inefficiency, but ultimately either trade acceptance rates for reduced draft latency or preserve acceptance at the cost of added overheads that limit scaling.
Modern SoCs increasingly integrate heterogeneous accelerators, most commonly GPUs and NPUs with complementary throughput and efficiency characteristics, yet existing approaches are accelerator-agnostic and usually place both draft and target on the same type of device, leaving cross-accelerator parallelism unused. We introduce Mirror Speculative Decoding (Mirror-SD), which breaks the latency--acceptance tradeoff by launching branch-complete rollouts from early-exit signals in parallel with the target’s suffix and by explicitly mapping computation across heterogeneous accelerators. In this design, the draft speculates forward token continuations for the target to verify, while the target speculates correction paths for the draft, creating a bidirectional speculative process. To further reduce draft speculation latency overhead while preserving acceptance semantics, we pair Mirror-SD with speculative streaming (SS) so the draft emits multiple tokens per step. This dual strategy of parallel heterogeneous execution and SS pushes speculative decoding closer to its ideal regime of high acceptance with negligible speculation overhead. On SpecBench with server-scale models from 14B to 66B parameters, Mirror-SD consistently delivers realistic end-to-end gains, achieving 2.8$\times$--5.8$\times$ wall-time speedups across diverse tasks, a 30\% average relative improvement over the strongest baseline, EAGLE3. By eliminating serial bottlenecks and exploiting multi-accelerator SoCs, Mirror-SD establishes a practical low-latency regime for large-scale LLM serving. We plan to release code and draft model checkpoints upon acceptance to facilitate reproducibility and further research.
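To make the overlapped execution pattern concrete, below is a minimal, purely illustrative Python sketch, not the paper's implementation: two worker threads stand in for a draft device and a target device, the models are mocked with random choices, and all names (`draft_rollout`, `target_verify`, the fixed acceptance probability, the draft length `k`) are hypothetical placeholders rather than Mirror-SD's actual APIs or acceptance rule. It only shows the scheduling idea: the target verifies the current proposal while the draft already speculates past it, and the draft emits several tokens per step in the spirit of speculative streaming.

```python
# Toy illustration of draft/target overlap in speculative decoding.
# All model calls are mocked; this is a scheduling sketch, not Mirror-SD.
import concurrent.futures as cf
import random

random.seed(0)
VOCAB = list(range(100))


def draft_rollout(prefix, k=4):
    """Mock draft model: emit k speculative tokens in one step (speculative-streaming style)."""
    return [random.choice(VOCAB) for _ in range(k)]


def target_verify(prefix, proposal):
    """Mock target model: accept a prefix of the proposal, then supply one correction token."""
    accepted = []
    for tok in proposal:
        if random.random() < 0.7:      # stand-in for a token-level acceptance test
            accepted.append(tok)
        else:
            break
    correction = random.choice(VOCAB)  # target's own next token after the accepted prefix
    return accepted, correction


def generate(prompt, max_new=32):
    out = list(prompt)
    with cf.ThreadPoolExecutor(max_workers=2) as pool:  # stand-ins for the two accelerators
        proposal = draft_rollout(out)
        while len(out) - len(prompt) < max_new:
            # Overlap: the target verifies the current proposal while the draft
            # already speculates a continuation beyond it.
            verify_f = pool.submit(target_verify, out, proposal)
            next_draft_f = pool.submit(draft_rollout, out + proposal)
            accepted, correction = verify_f.result()
            out += accepted + [correction]
            if len(accepted) == len(proposal):
                proposal = next_draft_f.result()  # full acceptance: reuse the lookahead drafted in parallel
            else:
                proposal = draft_rollout(out)     # rejection: re-draft from the corrected prefix
    return out


if __name__ == "__main__":
    print(generate([1, 2, 3]))
```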
Primary Area: optimization
Submission Number: 24016