Keywords: speculative decoding, inference acceleration, large language models
Abstract: Speculative decoding (SD) has become a standard technique for accelerating LLM inference without sacrificing output quality. Recent advances have shifted from sequential chain-based drafting to tree-structured generation, where the draft model constructs a tree of candidate tokens to explore multiple possible drafts in parallel. However, existing tree-based SD methods typically build a **fixed-width, fixed-depth** draft tree, which fails to adapt to the varying difficulty of tokens and contexts. As a result, the draft model cannot dynamically adjust the tree structure to stop early on difficult tokens and extend generation for simple ones. To address these challenges, we introduce **TALON**, a ***training-free, budget-driven*** adaptive tree expansion framework that can be plugged into existing tree-based methods. Unlike static methods, **TALON** constructs the draft tree iteratively until a fixed token budget is met, using a hybrid expansion strategy that adaptively allocates the node budget to each layer of the draft tree. This framework naturally shapes the draft tree into a **"deep-and-narrow"** form for deterministic contexts and a **"shallow-and-wide"** form for uncertain branches, effectively optimizing the trade-off between exploration width and generation depth under a given budget. Extensive experiments across 5 models and 6 datasets demonstrate that **TALON** consistently outperforms the state-of-the-art EAGLE-3, achieving up to 5.16× end-to-end speedup over auto-regressive decoding.
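The budget-driven expansion described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `expand_fn` is a hypothetical stand-in for the draft model's top-k proposal step, and the best-first, joint-probability scoring below is one plausible way to realize the stated behavior (deep-and-narrow trees for confident contexts, shallow-and-wide trees for uncertain ones) under a fixed node budget.

```python
import heapq


def build_draft_tree(root_candidates, expand_fn, budget=8, top_k=3):
    """Best-first draft-tree expansion under a fixed node budget.

    `root_candidates` is a list of (token, prob) pairs for the first layer.
    `expand_fn(path)` is a hypothetical stand-in for the draft model:
    given a partial token path, it returns candidate (token, prob)
    continuations. Each node is scored by the joint probability of its
    path, so confident (near-deterministic) contexts keep extending one
    branch, while uncertain contexts spread the budget across siblings.
    """
    tree = []  # accepted nodes: (path, joint_prob)
    # Min-heap on negative joint probability = max-heap on probability.
    frontier = [(-p, (tok,)) for tok, p in root_candidates]
    heapq.heapify(frontier)
    while frontier and len(tree) < budget:
        neg_p, path = heapq.heappop(frontier)
        tree.append((path, -neg_p))
        # Expand the accepted node; children inherit the path's probability.
        for tok, p in expand_fn(path)[:top_k]:
            heapq.heappush(frontier, (neg_p * p, path + (tok,)))
    return tree


# A confident toy draft model yields a deep-and-narrow tree,
# while a 50/50 model yields a shallow-and-wide one.
confident = build_draft_tree(
    [("a", 0.95), ("b", 0.03)],
    lambda path: [("a", 0.95), ("b", 0.03)],
    budget=6, top_k=2,
)
uncertain = build_draft_tree(
    [("a", 0.5), ("b", 0.5)],
    lambda path: [("a", 0.5), ("b", 0.5)],
    budget=6, top_k=2,
)
print(max(len(p) for p, _ in confident))  # deep
print(max(len(p) for p, _ in uncertain))  # shallow
```

In this sketch the budget bounds the total number of draft nodes rather than fixing a width and depth in advance, which is the key distinction the abstract draws against static tree-based SD methods.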
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: LLM efficiency, Language modeling
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Reproduction study, Approaches to low-resource settings, Approaches to low-compute settings - efficiency
Languages Studied: English
Submission Number: 5398