Balancing Coverage and Draft Latency in Vocabulary Trimming for Faster Speculative Decoding

ACL ARR 2026 March Submission151 Authors

08 Mar 2026 (modified: 07 Jun 2026) · ACL ARR 2026 March Submission · Readers: Everyone · License: CC BY 4.0

Keywords: Speculative Decoding, Vocabulary Trimming, Inference Efficiency, Draft Model, Large Language Models

Abstract: Speculative decoding accelerates inference for Large Language Models by using a lightweight draft model to propose candidate tokens that are verified in parallel by a larger target model. Prior work shows that draft models dominate speculative decoding latency due to sequential generation and vocabulary-dependent LM head cost. This creates a trade-off: larger draft vocabularies improve coverage and agreement but increase latency, while smaller vocabularies reduce latency at the risk of missing required tokens. We address this trade-off through vocabulary trimming for draft models, motivated by the observation that domain-specific workloads use only a small fraction of the full vocabulary. We formulate draft vocabulary selection as a constrained optimization problem that balances token coverage and draft latency. Coverage is computed over assistant responses in the training data, while latency is estimated using an architecture-aware FLOPs proxy for the language modeling head. We optimize a utility function with a Tree-structured Parzen Estimator to efficiently explore the coverage–latency Pareto frontier under a minimum coverage constraint. Experiments demonstrate consistent throughput improvements while reducing draft vocabularies by up to 97%. On domain-specific tasks, we achieve up to 16% latency reduction and 20% throughput improvement, and up to 6.7% throughput gains on diverse out-of-distribution benchmarks.
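As a rough illustration of the coverage versus latency trade-off described in the abstract, the following minimal sketch keeps the k most frequent tokens from a set of assistant responses, measures token coverage, and scores each candidate size against an LM head cost proxy under a minimum coverage constraint. It rests on several assumptions that are not taken from the paper: the proxy is taken to be 2 · hidden_size · vocab_size, the utility is a simple coverage minus relative cost, and a plain sweep over candidate sizes stands in for the Tree-structured Parzen Estimator. All function names are hypothetical.

```python
from collections import Counter


def trim_vocabulary(token_ids, k):
    """Return the set of the k most frequent token ids in the responses."""
    counts = Counter(token_ids)
    return {tok for tok, _ in counts.most_common(k)}


def coverage(token_ids, kept):
    """Fraction of token occurrences that fall inside the kept vocabulary."""
    if not token_ids:
        return 0.0
    hits = sum(1 for tok in token_ids if tok in kept)
    return hits / len(token_ids)


def lm_head_flops(hidden_size, vocab_size):
    """Assumed cost proxy: multiply-adds of one hidden-to-vocab projection."""
    return 2 * hidden_size * vocab_size


def select_vocab_size(token_ids, hidden_size, full_vocab_size,
                      min_coverage, candidates):
    """Sweep candidate sizes and keep the best one meeting the constraint.

    Returns (k, utility, coverage, relative_cost), or None if no candidate
    reaches min_coverage. The utility below is a hypothetical choice.
    """
    full_cost = lm_head_flops(hidden_size, full_vocab_size)
    best = None
    for k in candidates:
        kept = trim_vocabulary(token_ids, k)
        cov = coverage(token_ids, kept)
        if cov < min_coverage:
            continue  # violates the minimum coverage constraint
        rel_cost = lm_head_flops(hidden_size, len(kept)) / full_cost
        utility = cov - rel_cost
        if best is None or utility > best[1]:
            best = (k, utility, cov, rel_cost)
    return best


if __name__ == "__main__":
    # Toy token stream standing in for tokenized assistant responses.
    tokens = [1] * 50 + [2] * 30 + [3] * 15 + [4] * 5
    print(select_vocab_size(tokens, hidden_size=8, full_vocab_size=100,
                            min_coverage=0.9, candidates=[1, 2, 3, 4]))
```

In practice the sweep would be replaced by a sampler over a much larger range of sizes, and the utility would weight coverage against measured or modeled draft latency rather than this simplified ratio.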

Paper Type: Long

Research Area: Efficient Methods for NLP

Research Area Keywords: LLM Efficiency, distillation

Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency

Languages Studied: English

Submission Number: 151
