DEAGLE: Token Trees with Dynamic Depth Will Further Benefit Speculative Decoding

17 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Speculative Decoding, Large Language Models, Decoding Acceleration, Eagle Decoding
TL;DR: This paper proposes an approach that enables the EAGLE draft model to generate token trees with dynamic depth based on the input context.
Abstract: Large Language Models (LLMs) have shown remarkable capabilities in text generation, but they also suffer from high token-by-token latency due to the nature of autoregressive decoding. Speculative decoding (SD) mitigates this with a draft-then-verify framework, making it possible to generate multiple tokens in a single LLM forward pass. However, existing state-of-the-art SD frameworks typically generate token trees with a fixed depth, which incurs unnecessary computation and yields suboptimal speedup across diverse datasets. In this work, we introduce DEAGLE, a lightweight and training-free extension to EAGLE-3 that enables adaptive-depth speculative decoding through context-aware token-tree monitoring. We provide the first formal proof that draft-model confidence serves as an unbiased estimator of token-level acceptance, generalizing empirical observations from prior EAGLE-2 work to EAGLE-3. Furthermore, we show that the product of draft confidences along a token path, which we call the survival probability, is a good heuristic for full-branch acceptance. Based on this insight, DEAGLE introduces a voting-based early-stopping mechanism that monitors the survival-probability sum of the top-k leaves, the survival momentum, and the expected acceptance length of the whole token tree (estimated via the survival-probability expectation). These factors are jointly used to determine when to stop tree expansion. DEAGLE can be integrated into EAGLE-3 without retraining or architectural changes. Experiments on Vicuna-13B, Llama-3-8B, and Llama-3-70B demonstrate that DEAGLE achieves further speedup over EAGLE-3 and enables more robust acceleration across different datasets and token-tree depths.
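To make the survival-probability heuristic concrete, the following is a minimal Python sketch of the kind of voting-based early-stopping check the abstract describes. It is not the authors' implementation: the function names, thresholds, the two-of-three voting rule, and the simplified expected-acceptance-length estimate are all illustrative assumptions.

import math

# Hypothetical sketch of the survival-probability heuristic; names and
# thresholds are assumptions, not DEAGLE's actual parameters.

def survival_probability(path_confidences):
    """Product of draft-model confidences along one root-to-leaf token path."""
    return math.prod(path_confidences)

def should_stop_expansion(leaf_paths, top_k=8, survival_threshold=0.1,
                          momentum_threshold=0.05, prev_top_k_sum=None):
    """Vote on whether to stop deepening the draft token tree.

    leaf_paths: list of per-path confidence sequences, one per current leaf.
    Returns (stop, top_k_sum) so the caller can track survival momentum
    across successive drafting depths.
    """
    survivals = sorted((survival_probability(p) for p in leaf_paths), reverse=True)
    top_k_sum = sum(survivals[:top_k])

    votes = 0
    # Vote 1: the most promising leaves carry little acceptance mass left to extend.
    if top_k_sum < survival_threshold:
        votes += 1
    # Vote 2: survival momentum -- the drop in top-k mass since the previous depth is small.
    if prev_top_k_sum is not None and (prev_top_k_sum - top_k_sum) < momentum_threshold:
        votes += 1
    # Vote 3: a crude expected-acceptance-length proxy (sum of leaf survivals) has saturated.
    if sum(survivals) < 1.0:
        votes += 1

    # Stop drafting when a majority of the monitored signals agree (assumed rule).
    return votes >= 2, top_k_sum

A caller would invoke should_stop_expansion after each drafting step, passing the previous step's top-k sum so the momentum signal can be computed.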
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 8137