Keywords: speculative decoding, knowledge distillation, multi-trajectory supervision, inference acceleration, large language models
Abstract: The efficacy of speculative decoding (SD) is fundamentally constrained by the alignment between the draft and target models. Existing distillation approaches for SD rely on single-trajectory supervision, which induces exposure bias and degrades acceptance rates at inference time. To address this, we introduce \textbf{DistillBeam}, a framework that optimizes draft-target alignment via multi-trajectory distillation. By aggregating supervision from multiple high-probability teacher trajectories, DistillBeam approximates the target model's full structural support, thereby mitigating sequence drift. We further tackle the prohibitive storage overhead of multi-beam distillation by demonstrating that aggressive Top-$K$ truncation ($K=50$) reduces offline storage by 99.9\% without degrading alignment. Extensive evaluation across 20 languages shows that DistillBeam achieves wall-clock speedups of 35--65\% over autoregressive decoding, with particularly strong gains in morphologically rich languages where baseline methods struggle.
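The storage-reduction claim in the abstract can be illustrated with a minimal sketch (not code from the paper): truncating each stored teacher distribution to its Top-$K$ entries keeps $K/V$ of the vocabulary, which for the assumed $K=50$ and a hypothetical vocabulary of 50,000 tokens is exactly the 99.9\% reduction cited.

```python
# Hedged sketch, not the paper's implementation: Top-K truncation of a
# teacher distribution before offline storage. VOCAB_SIZE is an assumed
# illustrative value; K=50 is the truncation level from the abstract.
import heapq
import math
import random

VOCAB_SIZE = 50_000  # hypothetical vocabulary size for illustration
K = 50               # truncation level cited in the abstract


def top_k_truncate(probs, k=K):
    """Keep the k highest-probability (token_id, prob) pairs, renormalized."""
    top = heapq.nlargest(k, enumerate(probs), key=lambda pair: pair[1])
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}


# Toy teacher distribution: random logits passed through a softmax.
random.seed(0)
logits = [random.gauss(0.0, 2.0) for _ in range(VOCAB_SIZE)]
m = max(logits)
exps = [math.exp(x - m) for x in logits]
z = sum(exps)
probs = [e / z for e in exps]

sparse = top_k_truncate(probs)
reduction = 1 - len(sparse) / VOCAB_SIZE
print(f"stored entries: {len(sparse)} / {VOCAB_SIZE}")
print(f"storage reduction: {reduction:.1%}")  # prints 99.9%
```

The renormalization step keeps the truncated distribution a valid probability distribution, so it can still serve as a distillation target; whether renormalizing or zero-filling is used in the actual system is not specified by the abstract.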
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: Efficient/Low-Resource Methods for NLP, Generation, Machine Translation
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: French, Spanish, German, Portuguese, Italian, Chinese, Japanese, Korean, Arabic, Turkish, Hindi, Bengali, Tamil, Urdu, Telugu, Kannada, Malayalam, Marathi, Gujarati, Punjabi
Submission Number: 8506