Abstract: Inference latency stands as a critical bottleneck in the large-scale deployment of Large Language Models (LLMs). Speculative decoding methods have recently shown promise in accelerating inference without compromising the output distribution. However, existing work typically relies on a dualistic draft-verify framework and lacks rigorous theoretical grounding. In this paper, we introduce a novel \emph{polybasic} speculative decoding framework, underpinned by a comprehensive theoretical analysis. Specifically, we prove a fundamental theorem that characterizes the optimal inference time for multi-model speculative decoding systems, shedding light on how to extend beyond the dualistic approach to a more general polybasic paradigm. Through our theoretical investigation of multi-model token generation, we expose and optimize the interplay between model capabilities, acceptance lengths, and overall computational cost. Our framework supports both standalone implementation and integration with existing speculative techniques, leading to accelerated performance in practice. Experimental results across multiple model families demonstrate that our approach yields speedup ratios ranging from $3.31\times$ to $4.01\times$ for LLaMA2-Chat 7B, up to $3.87\times$ for LLaMA3-8B, up to $4.43\times$ for Vicuna-7B, and up to $3.85\times$ for Qwen2-7B---all while preserving the original output distribution. We release our theoretical proofs and implementation code to facilitate further investigation into polybasic speculative decoding.
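For orientation only (this is the textbook dualistic analysis, not the theorem proved in the paper, and the notation is ours rather than the paper's): if a single draft model proposes $\gamma$ tokens per cycle, each accepted with expected rate $\alpha$, and the per-call costs of the draft and target models are $c_q$ and $c_p$, then the expected number of tokens committed per draft-verify cycle and the resulting time per generated token are
\[
\mathbb{E}[n] \;=\; \frac{1-\alpha^{\gamma+1}}{1-\alpha},
\qquad
T(\gamma) \;=\; \frac{\gamma\,c_q + c_p}{\mathbb{E}[n]} \;=\; \frac{(\gamma\,c_q + c_p)\,(1-\alpha)}{1-\alpha^{\gamma+1}}.
\]
The polybasic question is how this trade-off between acceptance and per-call cost behaves once several draft models of different capacities are chained, which is the setting the paper's optimal-inference-time theorem characterizes.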
Lay Summary: Large language models (LLMs) suffer from high inference latency, which hinders real-world deployment. Current speculative decoding methods, which use a draft model to propose tokens and a target model to verify them, are limited by this two-model setup and lack rigorous theoretical guidance, which caps their speedup potential. We propose polybasic speculative decoding, a framework that, guided by a new theoretical analysis, employs multiple interconnected draft models. We derive equations for the optimal inference time, establish conditions under which adding further models is efficient, and prove that speculative sampling stabilizes token acceptance. Experiments across major LLM families (e.g., LLaMA, Vicuna) show speedups of 3.16–4.43× without altering output quality. The theory enables systematic model selection and system design, moving beyond heuristic approaches. This accelerates LLMs for applications such as translation, reasoning, and chatbots while preserving reliability, making high-quality AI more accessible.
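To make the polybasic idea concrete, the sketch below chains toy draft models of increasing capacity in front of a toy target model, with each stage verifying the previous stage's block via the standard speculative-sampling accept/reject test. This is a minimal illustration under our own assumptions, not the paper's released algorithm or code: every function, model, and parameter here is a placeholder, and a faithful implementation would need additional care around intermediate corrections to keep the target distribution exact.

```python
import numpy as np

VOCAB = 8  # toy vocabulary size
rng = np.random.default_rng(0)

def make_model(temperature):
    """Toy stand-in for an LLM: maps a token prefix to a next-token distribution."""
    W = rng.normal(size=(VOCAB, VOCAB))
    def dist(prefix):
        last = prefix[-1] if prefix else 0
        logits = W[last] / temperature
        p = np.exp(logits - logits.max())
        return p / p.sum()
    return dist

def speculative_accept(q, p, token):
    """Standard speculative-sampling test: accept the drafted token with prob min(1, p/q)."""
    return rng.random() < min(1.0, p[token] / max(q[token], 1e-12))

def residual_sample(q, p):
    """After a rejection, sample from the normalized positive part of (p - q)."""
    r = np.maximum(p - q, 0.0)
    s = r.sum()
    return int(rng.choice(VOCAB, p=r / s if s > 0 else p))

def polybasic_step(models, prefix, draft_len=4):
    """One cycle: the smallest model drafts a block, then each larger model in the
    chain verifies (and possibly truncates and corrects) the surviving block in turn.
    Illustrative only; a faithful system must handle intermediate corrections so the
    final (target) distribution stays exact, which this toy omits."""
    block, proposer = [], models[0]
    for _ in range(draft_len):                      # cheapest model drafts a block
        d = proposer(prefix + block)
        block.append(int(rng.choice(VOCAB, p=d)))
    for verifier in models[1:]:                     # successively larger models verify
        kept = []
        for tok in block:
            ctx = prefix + kept
            q, p = proposer(ctx), verifier(ctx)
            if speculative_accept(q, p, tok):
                kept.append(tok)
            else:
                kept.append(residual_sample(q, p))  # correct and stop this pass
                break
        block, proposer = kept, verifier
    return prefix + block

if __name__ == "__main__":
    # Smallest-to-largest chain: two toy "drafts" and one toy "target".
    chain = [make_model(t) for t in (2.0, 1.0, 0.5)]
    seq = [0]
    for _ in range(5):
        seq = polybasic_step(chain, seq)
    print(seq)
```

Ordering the chain from cheapest to most expensive means most rejections are filtered out before the costly target model is ever consulted, which is one intuition for why more than one draft model can help.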
Primary Area: Deep Learning->Large Language Models
Keywords: speculative decoding
Submission Number: 8716