HoVer: Holistic Verification for Semantic-Aware Speculative Generation

ICLR 2026 Conference Submission 9587 Authors

17 Sept 2025 (modified: 28 Nov 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Large language models, Draft-and-verify inference, Speculative decoding, Efficient inference, Holistic verification, Error localization
Abstract: We introduce *HoVer*, a semantic-aware speculative generation framework that accelerates large language model (LLM) inference without retraining. *HoVer* employs **Holistic Verification**: a lightweight draft model generates a complete candidate output, and a larger base model then *verifies* it holistically and, if necessary, *revises* from the first detected error. Unlike token-level speculative decoding, which enforces distributional consistency one token at a time, *HoVer* operates at the *semantic* level, amortizing verification cost over longer spans of text. At the core of our design is a prefill-only, single-pass *prefix verification* mechanism that uses a custom attention mask to identify the earliest error across multiple prefixes simultaneously. This makes verification compute-bound with negligible KV-cache overhead and enables continuation from the last safe prefix instead of regenerating from scratch. Across model families, *HoVer* achieves $\sim$1.2$\times$--3.1$\times$ latency reduction with minimal accuracy loss across general and math benchmarks. The approach is orthogonal to token-level speculation and can be combined with it for further gains.
Primary Area: generative models
Submission Number: 9587