HoVer: Holistic Verification for Semantic-Aware Speculative Generation

ICLR 2026 Conference Submission 9587 Authors

17 Sept 2025 (modified: 28 Nov 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Large language models, Draft-and-verify inference, Speculative decoding, Efficient inference, Holistic verification, Error localization
Abstract: We introduce *HoVer*, a semantic-aware speculative generation framework that accelerates large language model (LLM) inference without retraining. *HoVer* employs **Holistic Verification**: a lightweight draft model generates a complete candidate output, and a larger base model then *verifies* it holistically and, if necessary, *revises* from the first detected error. Unlike token-level speculative decoding, which enforces distributional consistency one token at a time, *HoVer* operates at the *semantic* level, amortizing verification cost over longer spans of text. At the core of our design is a prefill-only, single-pass *prefix verification* mechanism that uses a custom attention mask to identify the earliest error across multiple prefixes simultaneously. This makes verification compute-bound with negligible KV-cache overhead and enables continuation from the last safe prefix instead of regenerating from scratch. Across model families, *HoVer* achieves $\sim$1.2$\times$--3.1$\times$ latency reduction with minimal accuracy loss across general and math benchmarks. The approach is orthogonal to token-level speculation and can be combined with it for further gains.
Primary Area: generative models
Submission Number: 9587