TAPS: Task Aware Proposal Distributions for Speculative Sampling

Published: 30 May 2026, Last Modified: 30 May 2026 | SPIGM @ ICML | CC BY 4.0
Keywords: Speculative decoding; task-aware proposal distributions; draft model training; acceptance length; confidence routing; merged-tree verification; LLM inference acceleration.
TL;DR: Speculative decoding depends on draft training data: task-matched drafters specialize, mixed data improves robustness, and inference-time composition beats weight averaging.
Abstract: Speculative decoding accelerates autoregressive generation by using a lightweight draft model to propose future tokens that a larger target model verifies in parallel. While recent work has improved draft architectures and verification procedures, the role of the draft training distribution remains less well understood. In this work, we study task-aware proposal distributions for speculative decoding by training lightweight HASS and EAGLE-2 drafters on MathInstruct, ShareGPT, and mixed-data variants, and evaluating them on MT-Bench, GSM8K, MATH-500, and SVAMP. Our detailed analysis focuses on Meta-Llama-3-8B-Instruct, with additional scaling results for Qwen3-1.7B, Qwen3-4B, and Vicuna-13B v1.3. Measured by acceptance length under lossless verification, we find that draft training data induces clear specialization: MathInstruct-trained drafters are strongest on GSM8K and MATH-500, while ShareGPT-trained drafters are strongest on MT-Bench. Mixed-data training improves robustness, but larger mixtures do not uniformly dominate across decoding temperatures. We further compare ways of combining specialized drafters and find that naive checkpoint averaging performs poorly, while confidence routing and merged-tree verification preserve specialization more effectively at inference time. Finally, confidence provides a clearer routing signal than entropy, although entropy remains useful for diagnosing rejected tokens. Our claims concern proposal alignment under lossless verification; speedups are reported as supporting system measurements rather than as the primary optimization target. Overall, speculative decoding quality depends on the match between draft training data and downstream workload, and specialized drafters are better combined at inference time than through naive weight-space averaging.
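To make the distinction between the confidence and entropy signals concrete, the following is a minimal illustrative sketch and not the paper's implementation. It assumes that confidence means the top-1 probability of a drafter's next-token distribution and that routing picks the drafter with the highest confidence; the abstract does not specify either detail. The function names (`softmax`, `confidence`, `entropy`, `route_by_confidence`) and the drafter labels are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence(probs):
    """Assumed definition: probability mass on the most likely token."""
    return max(probs)


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter, less certain distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def route_by_confidence(drafter_logits):
    """Pick the drafter whose next-token distribution is most confident.

    drafter_logits: dict mapping a (hypothetical) drafter name to its logits.
    Returns (drafter_name, proposed_token_id).
    """
    best_name, best_conf, best_token = None, -1.0, -1
    for name, logits in drafter_logits.items():
        probs = softmax(logits)
        conf = confidence(probs)
        if conf > best_conf:
            best_name, best_conf, best_token = name, conf, probs.index(conf)
    return best_name, best_token


# Hypothetical logits from a math-specialized and a chat-specialized drafter.
example = {"math": [4.0, 0.0, 0.0], "chat": [1.0, 1.0, 0.9]}
chosen, token = route_by_confidence(example)
```

In this toy example the math drafter places most of its mass on one token, so it is selected. Whichever drafter is chosen, the target model still verifies the proposed tokens, so under lossless verification routing affects acceptance length rather than output correctness.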
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 27