BanditSpec: Adaptive Speculative Decoding via Bandit Algorithms

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: This paper proposes a training-free online learning framework to adaptively choose the configuration of the hyperparameters for speculative decoding as text is being generated.
Abstract: Speculative decoding has emerged as a popular method to accelerate the inference of Large Language Models (LLMs) while retaining their superior text generation performance. Previous methods either adopt a fixed speculative decoding configuration regardless of the prefix tokens, or train draft models in an offline or online manner to align them with the context. This paper proposes a training-free online learning framework to adaptively choose the configuration of the hyperparameters for speculative decoding as text is being generated. We first formulate this hyperparameter selection problem as a Multi-Armed Bandit problem and provide a general speculative decoding framework BanditSpec. Furthermore, two bandit-based hyperparameter selection algorithms, UCBSpec and EXP3Spec, are designed and analyzed in terms of a novel quantity, the stopping time regret. We upper bound this regret under both stochastic and adversarial reward settings. By deriving an information-theoretic impossibility result, it is shown that the regret performance of UCBSpec is optimal up to universal constants. Finally, extensive empirical experiments with LLaMA3 and Qwen2 demonstrate that our algorithms outperform existing methods, and that their throughput is close to that of the oracle best hyperparameter in simulated real-life LLM serving scenarios with diverse input prompts.
Lay Summary: Large Language Models (LLMs), such as ChatGPT, generate text by predicting one word at a time, which can be slow. A method called speculative decoding speeds this up by using a lightweight “draft” model to guess several words ahead, then verifying them with the full model. However, current speculative decoding methods often use a fixed setup for all prompts, regardless of the task — whether it’s writing code or generating stories. This limits their effectiveness. Our work introduces BanditSpec, a smarter and more adaptive speculative decoding framework inspired by multi-armed bandit algorithms — a type of decision-making strategy that balances exploration and exploitation. BanditSpec dynamically learns which decoding configuration works best for each prompt as generation progresses, without requiring extra training. We design two algorithms, UCBSpec and EXP3Spec, that select the best setup in real time. Experiments with popular models like LLaMA3 and Qwen2 show that BanditSpec significantly improves text generation speed and closely matches the performance of the best possible fixed setup (the "oracle"). In summary, BanditSpec makes LLMs faster by learning how to guess better -- on the fly.
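To make the bandit formulation concrete, below is a minimal, hedged sketch (not the paper's actual implementation) of how a UCB-style rule could pick among speculative-decoding configurations, here hypothetical draft lengths, treating the number of drafted tokens accepted by the verifier in each round as the reward. The `accept_prob` values and the simulated accept-until-first-rejection loop are stand-ins for the real draft-and-verify step.

```python
import math
import random

def ucb_spec(arms, rounds, accept_prob, seed=0):
    """Toy UCB selection over speculative-decoding configurations.

    arms        : candidate draft lengths (hypothetical configurations)
    accept_prob : assumed per-token acceptance probability for each arm;
                  in practice the reward would come from the target model's
                  verification step, not a fixed probability.
    Returns how often each arm was played.
    """
    rng = random.Random(seed)
    counts = [0] * len(arms)          # plays per arm
    totals = [0.0] * len(arms)        # cumulative accepted tokens per arm
    for t in range(1, rounds + 1):
        if t <= len(arms):
            i = t - 1                 # play each arm once to initialize
        else:
            # UCB index: empirical mean reward + exploration bonus
            i = max(range(len(arms)),
                    key=lambda a: totals[a] / counts[a]
                    + math.sqrt(2 * math.log(t) / counts[a]))
        # Simulated verification: tokens are accepted until the first rejection.
        accepted = 0
        for _ in range(arms[i]):
            if rng.random() < accept_prob[i]:
                accepted += 1
            else:
                break
        counts[i] += 1
        totals[i] += accepted
    return counts

counts = ucb_spec(arms=[2, 4, 8], rounds=500,
                  accept_prob=[0.9, 0.6, 0.3])
```

Under these assumed acceptance rates, the short draft length yields the highest expected number of accepted tokens per round, so the UCB rule concentrates its plays on that arm without any offline training, mirroring the adaptive, training-free selection BanditSpec performs during generation.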
Link To Code: https://github.com/sail-sg/BanditSpec
Primary Area: Deep Learning->Large Language Models
Keywords: LLM Inference Acceleration, Speculative Decoding, Training-free, Hyperparameter Selection, Bandit Algorithms
Submission Number: 6530