Hierarchical Speculative Decoding through Training-Free Slim-Verifier

16 Sept 2025 (modified: 13 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Hierarchical Speculative Decoding, Inference Efficiency, Large Language Models, Multi-tier Verification
TL;DR: Three-tier speculative decoding with an intermediate slim-verifier for faster, stable LLM inference.
Abstract: Speculative decoding (SD) reduces the high inference cost of large language models by having a lightweight drafter generate candidate tokens that a large verifier validates in parallel. Current draft-verify methods make binary decisions: accept a token or fully recompute it. We find that this binary verification creates inefficiency: many rejected tokens could be verified correctly by a slim model rather than the full verifier. This motivates our Training-Free Slim-Verifier, which handles tokens requiring moderate verification resources and reduces expensive large-model calls. We propose Hierarchical Verification for Speculative Decoding (HVSD), a three-tier training-free framework built on a skip-layer slim-verifier. Draft tokens are processed hierarchically: high-confidence tokens are accepted directly, medium-confidence tokens are regenerated by the slim-verifier, and uncertain tokens are verified by the full model. Across summarization, translation, reasoning, QA, and coding tasks on the T5 and Gemma model families, HVSD consistently lowers rejection rates (0.1–0.22) and achieves a 10–20% speedup over state-of-the-art SD methods. Compared to decoding without drafting, HVSD provides 2.5–3× acceleration while improving output quality. Our results establish multi-tier SD as a general paradigm for scalable and efficient LLM inference.
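To make the three-tier routing concrete, the following is a minimal Python sketch of the hierarchical decision rule described in the abstract. The thresholds tau_high and tau_low, the verifier call signatures, and the use of drafter confidence as the routing signal are illustrative assumptions; the paper's exact acceptance criteria are not given in the abstract.

```python
# Minimal sketch of HVSD-style three-tier verification (illustrative only).
# tau_high / tau_low and the verifier interfaces are assumptions, not the
# paper's exact criteria.
from typing import Callable, List

def hvsd_verify(
    draft_tokens: List[int],
    draft_confidences: List[float],
    slim_verify: Callable[[List[int], int], int],  # cheap skip-layer slim-verifier
    full_verify: Callable[[List[int], int], int],  # expensive full verifier
    tau_high: float = 0.9,  # hypothetical high-confidence threshold
    tau_low: float = 0.5,   # hypothetical medium-confidence threshold
) -> List[int]:
    """Route each drafted token to one of three verification tiers."""
    output: List[int] = []
    for pos, (token, conf) in enumerate(zip(draft_tokens, draft_confidences)):
        if conf >= tau_high:
            # Tier 1: accept the draft token directly, with no verifier call.
            output.append(token)
        elif conf >= tau_low:
            # Tier 2: regenerate with the slim-verifier instead of the full model.
            output.append(slim_verify(output, pos))
        else:
            # Tier 3: fall back to full-model verification, as in standard SD.
            output.append(full_verify(output, pos))
    return output

# Toy usage with stub verifiers standing in for real model forward passes.
if __name__ == "__main__":
    slim = lambda ctx, pos: 100 + pos
    full = lambda ctx, pos: 200 + pos
    print(hvsd_verify([1, 2, 3], [0.95, 0.7, 0.2], slim, full))  # [1, 101, 202]
```

The intended economics follow from the abstract: because the slim-verifier is a skip-layer variant of the full verifier, tier-2 calls cost a fraction of a full-model forward pass, so diverting medium-confidence tokens there is what reduces expensive large-model calls.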
Primary Area: foundation or frontier models, including LLMs
Submission Number: 7077