Tri-Factor Saliency: A Low-Dimensional Representation for Efficient and Diversity-Aware Video Token Pruning
Keywords: video token compression, video understanding
Abstract: The quadratic computational overhead of self-attention severely limits the application of Large Vision-Language Models (LVLMs) to long-form video. While training-free token pruning offers a promising avenue for acceleration, current methods still struggle to balance token diversity against pruning efficiency. Query-based approaches prune tokens irrelevant to a specific prompt, but in doing so sacrifice the intrinsic diversity of the video content. Conversely, methods that preserve diversity by clustering or matching raw, high-dimensional token features incur prohibitive computational costs, making them impractical for long video inputs.
In this work, we challenge the assumption that preserving diversity necessitates expensive computations in the original high-dimensional feature space. We hypothesize that a low-dimensional yet informative representation engineered for pruning can achieve comparable results with a fraction of the overhead. To validate this, we propose a framework that first projects the original token features into a highly informative 3D "saliency-space." This projection is achieved via our Tri-Factor Saliency (TFS) model, which computes three largely orthogonal sub-features from a local spatio-temporal neighborhood: (1) Dynamic Saliency, which captures the magnitude of movement; (2) Regional Saliency, which identifies coherent objects that stand out from their background; and (3) Focal Saliency, which pinpoints unpredictable, fine-grained details.
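The abstract describes the three factors only at a conceptual level. As a minimal, hypothetical sketch of such a 3D projection, one could compute the factors from a local spatio-temporal neighborhood as follows; the concrete operators here (temporal feature differences, center-surround contrast, and a local-prediction residual) are our illustrative assumptions, not the paper's exact formulation:

```python
# A minimal sketch (not the authors' released code) of projecting patch
# features of shape (T, H, W, C) into a 3D "saliency-space". The operator
# choices below are assumptions consistent with the abstract's descriptions.
import numpy as np

def tri_factor_saliency(tokens: np.ndarray) -> np.ndarray:
    """tokens: (T, H, W, C) patch features -> (T, H, W, 3) saliency."""
    T, H, W, C = tokens.shape

    # (1) Dynamic saliency: magnitude of change between consecutive frames.
    dyn = np.zeros((T, H, W))
    dyn[1:] = np.linalg.norm(tokens[1:] - tokens[:-1], axis=-1)

    # (2) Regional saliency: center-surround contrast, i.e. how far a token
    # deviates from the mean feature of its 8-connected spatial neighborhood.
    pad = np.pad(tokens, ((0, 0), (1, 1), (1, 1), (0, 0)), mode="edge")
    surround = np.zeros_like(tokens)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            surround += pad[:, 1 + dy:1 + dy + H, 1 + dx:1 + dx + W]
    surround /= 8.0
    reg = np.linalg.norm(tokens - surround, axis=-1)

    # (3) Focal saliency: residual of a simple local prediction (average of
    # horizontal neighbors); large residuals mark fine-grained, hard-to-
    # predict detail. np.roll wraps at borders, acceptable for a sketch.
    pred = 0.5 * (np.roll(tokens, 1, axis=2) + np.roll(tokens, -1, axis=2))
    foc = np.linalg.norm(tokens - pred, axis=-1)

    sal = np.stack([dyn, reg, foc], axis=-1)
    # Normalize each factor to [0, 1] so the three axes are commensurate.
    sal -= sal.min(axis=(0, 1, 2), keepdims=True)
    sal /= sal.max(axis=(0, 1, 2), keepdims=True) + 1e-8
    return sal
```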
This low-dimensional representation enables the subsequent entity-aware clustering and diversity-preserving stratified sampling to be performed at minimal computational cost. Our experiments show that this approach can prune up to 75% of tokens while retaining 95% of the original model's performance on video understanding benchmarks. Our work demonstrates that a well-designed, low-dimensional perceptual projection can effectively replace expensive high-dimensional feature matching for video token pruning, charting a new course that achieves both high efficiency and strong diversity preservation.
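To make the sampling step concrete, here is a similarly hedged sketch of diversity-preserving stratified sampling over the 3D saliency-space; the grid binning and per-bin quota rule are illustrative assumptions rather than the paper's exact procedure:

```python
# A minimal sketch of stratified sampling in the 3D saliency-space: tokens
# are binned into a coarse grid over their saliency coordinates, and each
# occupied cell contributes at least one token, so sparse but distinctive
# regions of saliency-space survive pruning.
import numpy as np

def stratified_prune(saliency: np.ndarray, keep_ratio: float = 0.25,
                     bins_per_axis: int = 4, seed: int = 0) -> np.ndarray:
    """saliency: (N, 3) per-token saliency in [0, 1] -> indices to keep."""
    rng = np.random.default_rng(seed)
    n_keep = max(1, int(round(keep_ratio * len(saliency))))

    # Assign each token to a cell of a coarse 3D grid over saliency-space.
    cells = np.minimum((saliency * bins_per_axis).astype(int),
                       bins_per_axis - 1)
    cell_ids = (cells[:, 0] * bins_per_axis**2
                + cells[:, 1] * bins_per_axis + cells[:, 2])

    unique, counts = np.unique(cell_ids, return_counts=True)
    # Proportional allocation with a floor of one token per occupied cell.
    quotas = np.maximum(1, np.round(counts / counts.sum() * n_keep)).astype(int)

    kept = []
    for cid, q in zip(unique, quotas):
        members = np.flatnonzero(cell_ids == cid)
        kept.extend(rng.choice(members, size=min(q, len(members)),
                               replace=False))
    # Rounding may overshoot the budget; trim to it (exact budget matching
    # is omitted for brevity in this sketch).
    return np.asarray(kept[:n_keep])
```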
Primary Area: optimization
Submission Number: 24848