Abstract: Speculative decoding accelerates LLM inference by utilizing otherwise idle computational resources during memory-to-chip data transfer. Current approaches typically rely on smaller causal models to autoregressively sample draft tokens, often enhanced with prefix trees to explore multiple potential drafts. However, these methods face significant performance degradation as batch size increases, due to reduced surplus computational capacity for speculative decoding. To address this limitation, we propose SpecFormer, a novel architecture that integrates unidirectional and bidirectional attention mechanisms. SpecFormer combines the autoregressive model’s ability to extract information from the entire input sequence with the parallel generation benefits of non-autoregressive models. This design eliminates the reliance on large prefix trees and achieves consistent acceleration, even in large-batch scenarios. Through lossless speculative decoding experiments across models of various scales, we demonstrate that SpecFormer sets a new standard for scaling LLM inference with lower training demands and reduced computational costs.
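To make the core idea concrete, below is a minimal sketch of how a mixed unidirectional/bidirectional attention mask could allow draft tokens to be proposed in a single parallel forward pass. The abstract does not specify SpecFormer's exact masking scheme, so the layout, the function name `mixed_attention_mask`, and the parameters `prefix_len` and `num_draft` are illustrative assumptions, not the paper's actual implementation.

```python
import torch


def mixed_attention_mask(prefix_len: int, num_draft: int) -> torch.Tensor:
    """Build a boolean attention mask (True = position may attend).

    Assumed layout (for illustration only): `prefix_len` verified tokens
    followed by `num_draft` draft slots filled in one forward pass.
    - Prefix positions attend causally (unidirectional), as in a standard LM.
    - Draft positions attend to the full prefix and to each other
      (bidirectional), enabling parallel draft generation.
    """
    total = prefix_len + num_draft
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Unidirectional (causal) attention over the verified prefix.
    mask[:prefix_len, :prefix_len] = torch.tril(
        torch.ones(prefix_len, prefix_len)
    ).bool()

    # Draft slots see the entire verified prefix ...
    mask[prefix_len:, :prefix_len] = True
    # ... and attend bidirectionally among themselves.
    mask[prefix_len:, prefix_len:] = True
    return mask


if __name__ == "__main__":
    # Example: 5 verified tokens, 3 draft slots proposed in parallel.
    print(mixed_attention_mask(prefix_len=5, num_draft=3).int())
```

Under this assumed layout, the draft slots need no autoregressive loop and no prefix tree: all candidate continuations are scored in one pass, which is why such a design would retain its speedup even when large batch sizes leave little surplus compute.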
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Efficient/Low-Resource Methods for NLP, Language Modeling
Languages Studied: English
Keywords: Efficient/Low-Resource Methods for NLP, Language Modeling
Submission Number: 5583