Keywords: Speculative Decoding, Inference Acceleration
Abstract: The draft-then-verify decoding paradigm, introduced by speculative decoding methods, has demonstrated remarkable performance in alleviating the memory-bound bottleneck and accelerating the inference of Large Language Models (LLMs) while maintaining the quality of generated content. Recent studies show that the intrinsic robustness of LLMs can be exploited in a training-free and architecture-agnostic manner, suggesting that auxiliary models or structural modifications are not strictly necessary for draft generation. However, existing methods fail to fully leverage this robustness, leading to substantial redundant and repeated computation. Building on this insight, we propose Progressive Tree Drafting (PTD), a new inference acceleration strategy. PTD organizes the drafting process into a progressively updated tree structure, where controlled perturbations are injected to guide generation and a stepwise pruning mechanism enables the model to produce coherent yet diverse drafts at manageable computational cost. By efficiently coordinating the drafting and verification stages, PTD achieves up to 2$\times$ decoding speedup across different open-source models and benchmarks. Our code is available at https://anonymous.4open.science/r/PTD-D354.
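Below is a minimal, self-contained sketch of the draft-then-verify loop the abstract describes, using a toy stand-in model so it runs without any LLM. Everything here (`toy_logits`, `draft_tree`, `verify`, and the `depth`, `branch`, `keep`, `noise` parameters) is an illustrative assumption, not PTD's interface; the perturbation (Gaussian logit noise) and pruning rule (keep the top-scoring branches each step) are generic placeholders for the mechanisms the paper defines. See the linked repository for the actual implementation.

```python
import math
import random

# Toy stand-in for an LLM: a deterministic next-token
# "logit" function over a small vocabulary (hypothetical).
VOCAB = list(range(32))

def toy_logits(prefix):
    rng = random.Random(hash(tuple(prefix)) & 0xFFFFFFFF)
    return [rng.gauss(0.0, 1.0) for _ in VOCAB]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def greedy(prefix):
    logits = toy_logits(prefix)
    return max(VOCAB, key=lambda t: logits[t])

def draft_tree(prefix, depth=4, branch=2, keep=3, noise=0.3, seed=0):
    """Grow a draft tree: perturb the logits to diversify branches,
    then prune to the `keep` highest-scoring branches at each step."""
    rng = random.Random(seed)
    beams = [(0.0, [])]  # (cumulative log-prob, drafted tokens)
    for _ in range(depth):
        nxt = []
        for lp, toks in beams:
            # Controlled perturbation: small Gaussian noise on logits.
            noisy = [l + rng.gauss(0.0, noise)
                     for l in toy_logits(prefix + toks)]
            probs = softmax(noisy)
            top = sorted(VOCAB, key=lambda t: -probs[t])[:branch]
            nxt.extend((lp + math.log(probs[t]), toks + [t]) for t in top)
        beams = sorted(nxt, key=lambda b: -b[0])[:keep]  # stepwise pruning
    return [toks for _, toks in beams]

def verify(prefix, drafts):
    """Accept the longest drafted prefix that matches greedy decoding."""
    best = []
    for toks in drafts:
        ok, ctx = [], list(prefix)
        for t in toks:
            if greedy(ctx) != t:
                break
            ok.append(t)
            ctx.append(t)
        if len(ok) > len(best):
            best = ok
    return best

prefix = [1, 2, 3]
accepted = verify(prefix, draft_tree(prefix))
print(f"accepted {len(accepted)} drafted tokens: {accepted}")
```

In this sketch, verification accepts the longest drafted prefix that agrees with greedy decoding of the same model, which is what keeps the output identical to standard decoding while allowing multiple tokens to be accepted per verification pass.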
Primary Area: foundation or frontier models, including LLMs
Submission Number: 16377