LLM-VTP: LLM-Reasoned Visual Token Pruning for Efficient Multi-Modal Video Understanding

26 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Video Understanding, Token Pruning
TL;DR: Use the LLM to identify visual tokens that are informative for the question tokens and prune the uninformative ones.
Abstract: In this paper, we introduce LLM-VTP, a visual token pruning method designed to improve the efficiency of multi-modal video understanding. Large Language Models (LLMs) have shown promising performance on video tasks thanks to their strong ability to comprehend visual modalities. However, the substantial redundancy in video data poses significant computational challenges for LLMs. To address this, we propose a training-free approach that leverages the inherent reasoning abilities of LLMs to selectively prune visual features conditioned on the question tokens, thereby improving model efficiency. We validate our method on multiple-choice, open-ended, and text-generation benchmarks. The results demonstrate that LLM-VTP can prune 80%-90% of visual tokens while maintaining competitive performance, highlighting its superior effectiveness and efficiency compared to existing pruning methods. The source code will be released to facilitate future research.
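
The abstract describes question-conditioned pruning only at a high level. As an illustration, the sketch below shows one plausible realization under stated assumptions, not the paper's released implementation: each visual token is scored by its scaled dot-product relevance to the question tokens, and only the top fraction is kept. The function name `prune_visual_tokens`, the `keep_ratio` parameter, and the relevance-aggregation scheme are all illustrative assumptions.

```python
# Hypothetical sketch of question-guided visual token pruning.
# NOT the authors' code; names and scoring choices are assumptions.
import torch


def prune_visual_tokens(visual_tokens: torch.Tensor,
                        question_tokens: torch.Tensor,
                        keep_ratio: float = 0.2) -> torch.Tensor:
    """Keep the `keep_ratio` fraction of visual tokens most relevant to the question.

    visual_tokens:   (B, N_v, D) visual token embeddings
    question_tokens: (B, N_q, D) question token embeddings
    returns:         (B, K, D) retained visual tokens, K = round(keep_ratio * N_v)
    """
    d = visual_tokens.size(-1)
    # Scaled dot-product relevance of every visual token to every question token.
    scores = torch.einsum("bvd,bqd->bvq", visual_tokens, question_tokens) / d ** 0.5
    # Normalize over visual tokens, then average over question tokens
    # to obtain a single relevance score per visual token.
    relevance = scores.softmax(dim=1).mean(dim=-1)           # (B, N_v)
    k = max(1, int(round(keep_ratio * visual_tokens.size(1))))
    top_idx = relevance.topk(k, dim=1).indices               # (B, K)
    # Gather the retained tokens, preserving the batch dimension.
    return torch.gather(visual_tokens, 1,
                        top_idx.unsqueeze(-1).expand(-1, -1, d))


if __name__ == "__main__":
    vis = torch.randn(2, 256, 768)   # e.g. 256 visual tokens per video clip
    txt = torch.randn(2, 16, 768)    # 16 question tokens
    kept = prune_visual_tokens(vis, txt, keep_ratio=0.1)
    print(kept.shape)                # torch.Size([2, 26, 768])
```

With `keep_ratio` set around 0.1-0.2 this sketch operates in the 80%-90% pruning regime reported in the abstract, though the paper's actual selection criterion (how the LLM's reasoning is used to score tokens) may differ.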
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6361