Abstract: Speculative decoding accelerates LLM inference by utilizing otherwise idle computational resources during memory-to-chip data transfer. Current approaches typically rely on smaller causal models to autoregressively sample draft tokens, often enhanced with prefix trees to explore multiple potential drafts. However, these methods face significant performance degradation as batch size increases, due to reduced surplus computational capacity for speculative decoding. To address this limitation, we propose SpecFormer, a novel architecture that integrates unidirectional and bidirectional attention mechanisms. SpecFormer combines the autoregressive model’s ability to extract information from the entire input sequence with the parallel generation benefits of non-autoregressive models. This design eliminates the reliance on large prefix trees and achieves consistent acceleration, even in large-batch scenarios. Through lossless speculative decoding experiments across models of various scales, we demonstrate that SpecFormer sets a new standard for scaling LLM inference with lower training demands and reduced computational costs.
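To make the core idea concrete, below is a minimal sketch of how a mixed unidirectional/bidirectional attention mask could allow draft tokens to be proposed in a single parallel forward pass. The abstract does not specify SpecFormer's exact masking scheme, so the layout, the function name `mixed_attention_mask`, and the parameters `prefix_len` and `num_draft` are illustrative assumptions, not the paper's actual implementation.

```python
import torch


def mixed_attention_mask(prefix_len: int, num_draft: int) -> torch.Tensor:
    """Build a boolean attention mask (True = position may attend).

    Assumed layout (for illustration only): `prefix_len` verified tokens
    followed by `num_draft` draft slots filled in one forward pass.
    - Prefix positions attend causally (unidirectional), as in a standard LM.
    - Draft positions attend to the full prefix and to each other
      (bidirectional), enabling parallel draft generation.
    """
    total = prefix_len + num_draft
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Unidirectional (causal) attention over the verified prefix.
    mask[:prefix_len, :prefix_len] = torch.tril(
        torch.ones(prefix_len, prefix_len)
    ).bool()

    # Draft slots see the entire verified prefix ...
    mask[prefix_len:, :prefix_len] = True
    # ... and attend bidirectionally among themselves.
    mask[prefix_len:, prefix_len:] = True
    return mask


if __name__ == "__main__":
    # Example: 5 verified tokens, 3 draft slots proposed in parallel.
    print(mixed_attention_mask(prefix_len=5, num_draft=3).int())
```

Under this assumed layout, the draft slots need no autoregressive loop and no prefix tree: all candidate continuations are scored in one pass, which is why such a design would retain its speedup even when large batch sizes leave little surplus compute.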
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Efficient/Low-Resource Methods for NLP, Language Modeling
Languages Studied: English
Keywords: Efficient/Low-Resource Methods for NLP, Language Modeling
Submission Number: 5583