HT-Sparse: Training-Free Query-Guided Head–Token Sparsification for Long-Video Multimodal Inference

16 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: long-video multimodal inference, training-free, query-guided, sparse attention, head–token sparsification
TL;DR: HT-Sparse is a training-free, query-guided method that accelerates long-video multimodal inference with efficient head–token sparsification.
Abstract: Long-video multimodal inference is limited by the quadratic cost of dense attention, cumulative KV-cache growth during decoding, and cross-modal interference, while retraining sparsity-aware variants is often impractical. We present HT-Sparse, a training-free, query-guided hierarchical sparsification method that jointly sparsifies attention heads and visual tokens to reduce both latency and memory without any parameter updates. The method comprises two components executed adaptively across layers: (i) query-conditioned head sparsification, which ranks attention heads via analytically stable saliency statistics and retains the subspaces most informative for the current query; (ii) cross-modal token sparsification, which selects salient visual tokens by query–vision attention, enabling efficient computation and persistent KV-cache savings. We further introduce joint head–token routing in selected layers: top-ranked heads attend to the full visual token set, whereas secondary heads operate only on the reduced (selected) set, preserving semantics while amortizing compute and cache. Across long-video benchmarks, HT-Sparse reduces end-to-end latency and KV-cache memory while achieving equal or higher accuracy, all on the same pretrained model with no fine-tuning. The approach is model-agnostic and plug-in deployable, offering a flexible route to scalable long-video reasoning.
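To make the two components and the routing concrete, below is a minimal PyTorch sketch of one layer's head–token sparsification. All names, the keep ratios, and the specific saliency heuristic (per-head maximum attention mass) are illustrative assumptions for exposition; the abstract does not specify the paper's exact saliency statistic or selection rule, so this is a sketch of the general technique, not the authors' implementation.

```python
import torch

def ht_sparse_attention(q, k, v, vis_idx, head_keep_ratio=0.5, token_keep_ratio=0.25):
    """Illustrative per-layer head-token sparsification (hypothetical names/heuristics).

    q, k, v:   (num_heads, seq_len, head_dim) for one layer and batch element.
    vis_idx:   LongTensor of positions holding visual tokens; the rest are text/query.
    Returns the attention output and the indices of retained visual tokens.
    """
    H, L, D = q.shape
    scale = D ** -0.5
    attn = torch.softmax(q @ k.transpose(-1, -2) * scale, dim=-1)  # (H, L, L)

    # (i) Query-conditioned head saliency: a stand-in statistic scoring how
    # sharply each head's query rows attend; keep the top-ranked heads.
    head_saliency = attn.amax(dim=-1).mean(dim=-1)                 # (H,)
    top_heads = head_saliency.topk(max(1, int(H * head_keep_ratio))).indices
    top_set = set(top_heads.tolist())

    # (ii) Cross-modal token selection: rank visual tokens by the attention
    # mass they receive from text/query positions, averaged over heads.
    txt_mask = torch.ones(L, dtype=torch.bool)
    txt_mask[vis_idx] = False
    q2v = attn[:, txt_mask][:, :, vis_idx].mean(dim=(0, 1))        # (num_vis,)
    kept_vis = vis_idx[q2v.topk(max(1, int(len(vis_idx) * token_keep_ratio))).indices]

    # Joint head-token routing: top heads attend to the full token set;
    # secondary heads attend only to text tokens plus the selected visual tokens.
    reduced_idx = torch.cat([txt_mask.nonzero(as_tuple=True)[0], kept_vis]).sort().values
    out = torch.empty_like(q)
    for h in range(H):
        if h in top_set:
            out[h] = attn[h] @ v[h]
        else:
            kh, vh = k[h, reduced_idx], v[h, reduced_idx]
            ah = torch.softmax(q[h] @ kh.transpose(-1, -2) * scale, dim=-1)
            out[h] = ah @ vh
    return out, kept_vis
```

In a deployment following this sketch, `kept_vis` would also be used to drop the corresponding entries from the KV cache after prefill, which is where the persistent memory savings during decoding would come from.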
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6749