TriSpec: Ternary Speculative Decoding via Lightweight Proxy Verification

ICLR 2026 Conference Submission 5859 Authors

15 Sept 2025 (modified: 21 Nov 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: LLM Acceleration, Speculative Decoding, Proxy verifier
TL;DR: We propose a ternary speculative decoding method that employs a proxy verifier to assist the target model in the verification stage, thereby further accelerating standard speculative decoding.
Abstract: The enhanced reasoning capabilities of Large Language Models (LLMs) have led to longer response sequences, yet inference efficiency remains fundamentally limited by serial, autoregressive generation. Speculative decoding (SD) offers a powerful solution, providing significant speed-ups through lightweight drafting and parallel verification. While existing work has made considerable progress in improving the draft model's generation efficiency and alignment, this paper boosts SD from a new angle: the verification cost. We propose TriSpec, a novel ternary SD framework with proxy verification. At its core, TriSpec introduces a lightweight proxy model that handles the initial verification. The proxy significantly reduces computational cost by approving easily verifiable draft sequences and engaging the full target model only when it encounters uncertain tokens, thus striking an optimal balance between efficiency and quality. TriSpec can be integrated with state-of-the-art SD methods such as EAGLE-3 to further reduce verification costs and achieve greater acceleration. Extensive experiments on the Qwen3 and DeepSeek-R1-Distill-Qwen/LLaMA families show that TriSpec achieves up to 30% speedup over standard SD, with up to 50% fewer target model invocations, while maintaining comparable accuracy.
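The proxy-then-target control flow described in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm: the `Model` interface, the confidence threshold `tau`, the rule "escalate when the proxy is unconfident or disagrees", and the toy models are all assumptions made for illustration, and the target's parallel verification pass is simulated here with a simple loop.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface (an assumption, not from the paper): a model maps a
# token prefix to its greedy next token and a confidence score in [0, 1].
Model = Callable[[Sequence[int]], Tuple[int, float]]


def verify_draft(prefix: Sequence[int], draft: Sequence[int],
                 proxy: Model, target: Model,
                 tau: float = 0.9) -> Tuple[List[int], int]:
    """Verify one drafted block; return (accepted tokens, target invocations)."""
    accepted: List[int] = []
    for i, tok in enumerate(draft):
        proxy_tok, conf = proxy(list(prefix) + accepted)
        if conf >= tau and proxy_tok == tok:
            accepted.append(tok)  # the proxy confidently approves this token
            continue
        # The proxy is unsure or disagrees: hand the rest of the block to the
        # target. A real system would do this in one parallel forward pass;
        # the loop below only simulates that pass and counts it as one call.
        for rest_tok in draft[i:]:
            target_tok, _ = target(list(prefix) + accepted)
            if target_tok != rest_tok:
                accepted.append(target_tok)  # keep the target's correction
                break
            accepted.append(rest_tok)
        return accepted, 1
    return accepted, 0  # the whole block passed without the target


# Toy models (assumptions): the "language" is counting modulo 10.
def toy_target(ctx: Sequence[int]) -> Tuple[int, float]:
    return (ctx[-1] + 1) % 10, 1.0


def toy_proxy(ctx: Sequence[int]) -> Tuple[int, float]:
    # Same prediction as the target, but unconfident right after token 3.
    return (ctx[-1] + 1) % 10, 0.5 if ctx[-1] == 3 else 0.95


if __name__ == "__main__":
    print(verify_draft([0], [1, 2, 3, 4, 9], toy_proxy, toy_target))
    # prints ([1, 2, 3, 4, 5], 1)
```

In this toy run the proxy approves the first three draft tokens on its own, and the target is invoked once for the uncertain tail, where it accepts `4` and replaces the wrong draft token `9` with `5`. A block the proxy approves in full costs zero target invocations, which is the kind of saving the abstract attributes to proxy verification.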
Primary Area: foundation or frontier models, including LLMs
Submission Number: 5859