Keywords: Speculative Decoding, Vocabulary Trimming, Inference Efficiency, Draft Model, Large Language Models
Abstract: Speculative decoding accelerates inference for Large Language Models by using a lightweight draft model to propose candidate tokens that are verified in parallel by a larger target model. Prior work shows that draft models dominate speculative decoding latency due to sequential generation and vocabulary-dependent LM head cost. This creates a trade-off: larger draft vocabularies improve coverage and agreement but increase latency, while smaller vocabularies reduce latency at the risk of missing required tokens.
We address this trade-off through vocabulary trimming for draft models, motivated by the observation that domain-specific workloads use only a small fraction of the full vocabulary. We formulate draft vocabulary selection as a constrained optimization problem that balances token coverage and draft latency. Coverage is computed over assistant responses in the training data, while latency is estimated using an architecture-aware FLOPs proxy for the language modeling head. We optimize a utility function with a Tree-structured Parzen Estimator to efficiently explore the coverage–latency Pareto frontier under a minimum coverage constraint.
Experiments show consistent throughput improvements with draft vocabularies reduced by up to 97%. On domain-specific tasks, we achieve up to 16% latency reduction and 20% throughput improvement, and on diverse out-of-distribution benchmarks we achieve throughput gains of up to 6.7%.
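As a rough illustration of the selection problem described in the abstract, the sketch below ranks tokens by their frequency in assistant responses, computes coverage for each candidate vocabulary size, and scores each candidate with a utility that trades coverage against an LM-head cost estimate. Several parts are assumptions rather than the paper's method: the cost proxy `2 * hidden_size * vocab_size` is the standard matrix-multiply FLOP count for one token, not necessarily the architecture-aware proxy used here; the weighted-difference `utility` and its `alpha` parameter are hypothetical; and an exhaustive sweep over sizes stands in for the Tree-structured Parzen Estimator. All function names are illustrative.

```python
from collections import Counter
from itertools import accumulate


def lm_head_flops(hidden_size, vocab_size):
    # Assumed proxy: one (hidden_size x vocab_size) matmul per generated token.
    return 2 * hidden_size * vocab_size


def utility(coverage, flops, full_flops, alpha):
    # Hypothetical utility: reward coverage, penalize relative LM-head cost.
    return alpha * coverage - (1.0 - alpha) * (flops / full_flops)


def select_vocab(responses, hidden_size, full_vocab_size, min_coverage, alpha=0.5):
    """Pick the most frequent tokens that maximize utility under a coverage floor.

    responses: list of token-id lists taken from assistant responses.
    Returns (kept_token_ids, coverage), or None if no size meets min_coverage.
    """
    counts = Counter(tok for resp in responses for tok in resp)
    ranked = counts.most_common()  # (token, count), most frequent first
    total = sum(counts.values())
    # cumulative[k - 1] is the number of occurrences covered by the top-k tokens.
    cumulative = list(accumulate(count for _, count in ranked))
    full_flops = lm_head_flops(hidden_size, full_vocab_size)

    best = None
    # Exhaustive sweep over sizes; a stand-in for a TPE search, not TPE itself.
    for k, covered in enumerate(cumulative, start=1):
        cov = covered / total
        if cov < min_coverage:
            continue  # violates the minimum coverage constraint
        score = utility(cov, lm_head_flops(hidden_size, k), full_flops, alpha)
        if best is None or score > best[0]:
            best = (score, k, cov)

    if best is None:
        return None
    _, k, cov = best
    return [tok for tok, _ in ranked[:k]], cov


if __name__ == "__main__":
    toy_responses = [[1, 2, 2, 3], [2, 2, 1, 4]]
    print(select_vocab(toy_responses, hidden_size=4, full_vocab_size=10,
                       min_coverage=0.7, alpha=0.1))
```

In this sketch a small `alpha` favors aggressive trimming as long as the coverage floor holds, while a large `alpha` keeps more of the vocabulary; a sampler-based search becomes worthwhile once the search space includes more than a single size parameter.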
Paper Type: Long
Research Area: Efficient Methods for NLP
Research Area Keywords: LLM Efficiency, distillation
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 151