HIPPO: Accelerating Video Large Language Models Inference via Holistic-aware Parallel Speculative Decoding

ACL ARR 2026 January Submission5590 Authors

05 Jan 2026 (modified: 20 Mar 2026) · License: CC BY 4.0
Keywords: video large language models, speculative decoding, inference acceleration
Abstract: Speculative decoding (SD) has emerged as a promising approach to accelerate LLM inference without sacrificing output quality. Existing SD methods tailored for video-LLMs primarily focus on pruning redundant visual tokens to mitigate the computational burden of massive visual inputs. However, these methods still fall short of the inference acceleration that SD achieves on text-only LLMs. We observe from extensive experiments that this gap mainly stems from two limitations: (i) their pruning strategies inadequately preserve semantically important visual tokens, degrading draft quality and acceptance rates; and (ii) even with aggressive pruning (e.g., 90% of visual tokens removed), the draft model's remaining inference cost bounds the overall speedup. To address these limitations, we propose HIPPO, a general **h**olist**i**c-aware **p**arallel s**p**eculative dec**o**ding framework. Specifically, HIPPO introduces: (i) a semantic-aware token preservation method, which fuses global attention scores with local visual semantics to retain semantically important tokens even at high pruning ratios; and (ii) a video parallel SD algorithm that decouples and overlaps the draft-generation and target-verification phases. Experiments on four video-LLMs across six benchmarks demonstrate HIPPO's effectiveness, yielding up to a $3.51\times$ speedup over vanilla auto-regressive decoding.
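To make the two ideas in the abstract concrete, here are two minimal illustrative sketches. They are not the authors' implementation; every name, signature, and parameter (`alpha`, `keep_ratio`, `draft_step`, `verify_step`) is a hypothetical placeholder. The first sketches semantic-aware token preservation: a global attention score is fused with a simple local-semantics proxy (here, dissimilarity to the neighboring token) and the top-scoring fraction of visual tokens is kept.

```python
import torch
import torch.nn.functional as F

def preserve_semantic_tokens(visual_tokens, attn_scores, keep_ratio=0.1, alpha=0.5):
    """Hypothetical sketch: fuse global attention with a local semantic
    signal, then keep the top-scoring fraction of visual tokens.

    visual_tokens: (N, D) visual token embeddings
    attn_scores:   (N,)   global attention each token receives
    """
    # Local semantic score: tokens dissimilar to their temporal neighbor
    # carry more unique local information, so reward low similarity.
    normed = F.normalize(visual_tokens, dim=-1)
    neighbor_sim = (normed[:-1] * normed[1:]).sum(-1)            # (N-1,)
    neighbor_sim = torch.cat([neighbor_sim, neighbor_sim[-1:]])  # pad to (N,)
    local_score = 1.0 - neighbor_sim

    # Fuse the two signals; this linear weighting is an assumption,
    # not the paper's fusion rule.
    score = alpha * attn_scores + (1.0 - alpha) * local_score

    k = max(1, int(keep_ratio * visual_tokens.size(0)))
    keep_idx = score.topk(k).indices.sort().values  # keep temporal order
    return visual_tokens[keep_idx], keep_idx
```

The second sketches the overlap behind parallel SD: drafting and verification run concurrently instead of strictly alternating. This toy producer-consumer pipeline shows only the overlap; a real speculative decoder must also feed accepted and rejected prefixes back to the drafter, which HIPPO's actual scheduling presumably handles.

```python
import queue
import threading

def overlapped_speculative_decode(draft_step, verify_step, max_tokens=256):
    """Hypothetical sketch: run the draft model ahead of the target
    verifier through a small buffer so the two phases overlap in time.

    draft_step():      returns the next draft block, or None when done
    verify_step(blk):  returns the list of tokens the target accepts
    """
    drafts = queue.Queue(maxsize=2)  # small buffer enables the overlap

    def drafter():
        while True:
            block = draft_step()
            drafts.put(block)
            if block is None:  # sentinel: drafting finished
                break

    threading.Thread(target=drafter, daemon=True).start()

    output = []
    while len(output) < max_tokens:
        block = drafts.get()
        if block is None:
            break
        # The target verifies this block while the drafter prepares the next.
        output.extend(verify_step(block))
    return output
```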
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: LLM efficiency, Language modeling
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Reproduction study, Approaches to low-resource settings, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 5590