Vision-language models (VLMs) have shown promise in a variety of challenging video comprehension tasks. VLMs typically extract frames from the source video and take the corresponding encoded visual tokens as input. A rapid increase in the number of visual tokens, e.g., when handling lengthy videos, can quickly lead to a long-context dilemma during VLM inference, posing an efficiency challenge for real-world applications. Since the visual tokens may carry substantial redundant and task-irrelevant information along both the spatial and temporal axes, we advocate removing less important visual tokens during the prefilling phase of inference to improve the computation and storage efficiency of VLMs. We first identify an interesting phenomenon, termed \emph{Visual Attention Shrinking (VAS)}, wherein certain visual tokens receive progressively diminishing attention as they pass through the model's layers. This implies that the model itself knows what to attend to and what to discard. With this understanding, we develop a robust algorithm that detects attention shrinking at each layer of the model using states from preceding layers. Based on the detection results, we remove tokens along both the temporal and spatial axes. Our approach requires no parameterized modifications to the original VLM and is compatible with the prevalent KV-cache strategy. In extensive experiments across different VLMs, our approach achieves an average speedup of $1.98\times$ in generating the first response token while using only 47.2% of the visual tokens, without compromising task performance. Moreover, when applied to the large VILA1.5-40B model, our method achieves up to a $4.16\times$ speedup over the vanilla model.
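To make the idea concrete, below is a minimal sketch of layer-wise pruning based on shrinking attention during prefill. It is not the paper's actual algorithm; the function names (`detect_shrinking_tokens`, `prune_visual_tokens`), the monotone-decay window test, and the `window`/`threshold` parameters are illustrative assumptions about how attention received by each visual token might be tracked across layers and used to drop tokens.

```python
import torch

def detect_shrinking_tokens(attn_history, window=3, threshold=0.9):
    """Flag visual tokens whose received attention has shrunk monotonically
    over the last `window` layers.

    attn_history: list of 1-D tensors, one per processed layer; entry l holds
    the mean attention mass each visual token received at layer l (averaged
    over heads and query positions).
    """
    if len(attn_history) < window:
        return torch.zeros_like(attn_history[-1], dtype=torch.bool)
    recent = torch.stack(attn_history[-window:])        # (window, num_visual_tokens)
    ratios = recent[1:] / recent[:-1].clamp_min(1e-8)   # layer-to-layer change
    # A token "shrinks" if its attention decayed at every step in the window.
    return (ratios < threshold).all(dim=0)

def prune_visual_tokens(hidden_states, visual_mask, shrinking):
    """Drop shrinking visual tokens from the sequence at the current layer.

    hidden_states: (seq_len, hidden_dim) activations during prefill.
    visual_mask:   (seq_len,) bool, True where the position is a visual token.
    shrinking:     (num_visual_tokens,) bool from detect_shrinking_tokens.
    """
    keep = torch.ones(hidden_states.size(0), dtype=torch.bool,
                      device=hidden_states.device)
    visual_idx = visual_mask.nonzero(as_tuple=True)[0]
    keep[visual_idx[shrinking]] = False                 # remove shrinking visual tokens
    return hidden_states[keep], visual_mask[keep]
```

Because tokens are dropped before their keys and values are written at subsequent layers, a sketch like this remains compatible with standard KV caching: pruned positions simply never enter the cache for later layers or decoding steps.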