Speculative Streaming: Efficient and Scalable Speculative Decoding with Multi-Stream Attention

ACL ARR 2025 May Submission 2289 Authors

19 May 2025 (modified: 03 Jul 2025) · License: CC BY 4.0
Abstract: Speculative decoding is a prominent technique for accelerating LLM inference by leveraging an auxiliary draft model, but its effectiveness is limited by the autoregressive nature of draft generation, where acceptance rates depend on the draft model’s size. Scaling up the draft model improves acceptance rates but also increases speculation latency, limiting the overall speedup. Furthermore, fine-tuning both the draft and target models is often necessary to achieve high acceptance rates, adding complexity to inference systems as the number of downstream tasks grows. Single-model approaches like Medusa generate speculative tokens non-autoregressively but lack dependencies between those tokens, which limits their effectiveness. Alternatives like Hydra and Eagle incorporate token dependencies but rely on dedicated heads, making speculation independent of the base model and limiting the extent to which stronger base models can improve speculation. We introduce a novel speculative decoding method that integrates speculative draft generation directly into the target model using multi-stream attention. This improves acceptance rates by introducing interdependencies between speculative tokens while keeping draft generation non-autoregressive and low-overhead. Unlike prior approaches, our method’s speculation improves naturally as target models scale in size and quality. Furthermore, our approach is both parameter- and FLOP-efficient, requiring over 1000$\times$ fewer additional parameters than Medusa, making it highly suitable for resource-constrained devices. Our method operates in two modes: (1) Lossless mode, a plug-and-play method that preserves the output of any pre-trained model; and (2) Shared mode, which jointly optimizes speedup and downstream performance. We demonstrate a 2–3.5$\times$ speedup across diverse tasks, including summarization, translation, question answering, mathematical reasoning, SQL generation, and retrieval-augmented generation (RAG).
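To make the decoding loop concrete, below is a minimal PyTorch sketch of one speculative-streaming step. It is an illustration under stated assumptions, not the authors' implementation: `model`, its `num_streams` keyword, and the `(main_logits, stream_logits)` return signature are hypothetical stand-ins for a target model augmented with multi-stream attention, where stream $j$ at position $p$ attends to the main sequence and to earlier streams and is assumed to predict the token at position $p + 1 + j$, so draft tokens are interdependent yet produced in a single non-autoregressive forward pass.

```python
import torch

NUM_STREAMS = 3  # speculative draft tokens produced per forward pass (assumed)

@torch.no_grad()
def speculative_streaming_step(model, input_ids, draft_ids):
    """One decode step: verify last step's draft, then emit a new draft.

    Assumes batch size 1 and a hypothetical `model` that, besides main-stream
    logits [1, seq_len, vocab], returns per-position logits for NUM_STREAMS
    speculative streams (shape [1, seq_len, NUM_STREAMS, vocab]).
    """
    ctx_len = input_ids.shape[-1]

    # Single forward pass over the accepted context plus the previous draft.
    main_logits, stream_logits = model(
        torch.cat([input_ids, draft_ids], dim=-1), num_streams=NUM_STREAMS
    )

    # Greedy verification: the target's own next-token predictions at the
    # context boundary and at each draft position.
    target_next = main_logits[:, ctx_len - 1 :].argmax(dim=-1)  # [1, d + 1]
    n_accept = 0
    for j in range(draft_ids.shape[-1]):
        if draft_ids[0, j].item() != target_next[0, j].item():
            break
        n_accept += 1

    # Keep accepted draft tokens plus one "free" token from the target model;
    # this reproduces the target's greedy output exactly (lossless behavior).
    new_ids = torch.cat([input_ids, target_next[:, : n_accept + 1]], dim=-1)

    # New draft: argmax of each stream at the last accepted position. Under
    # the assumed convention, stream j there predicts the j-th token after the
    # "free" token, so the draft continues `new_ids` despite being produced
    # non-autoregressively.
    new_draft = stream_logits[:, ctx_len - 1 + n_accept].argmax(dim=-1)
    return new_ids, new_draft.reshape(1, -1)
```

A decode loop would call this step repeatedly, starting from an empty draft, until an end-of-sequence token is accepted. The greedy check sketched here preserves the target model's greedy output, in the spirit of the lossless mode above; sampling-based decoding would instead use the standard speculative-sampling acceptance test at each draft position.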
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Speculative Decoding, LLM Efficiency, Parameter-Efficient Training, NLP in Resource-Constrained Settings
Contribution Types: Approaches to low-resource settings, Approaches to low-compute settings (efficiency)
Languages Studied: English
Submission Number: 2289