Keywords: Scene-Guided, Motion-Aware, Token Pruning, VideoLLMs
Abstract: Video Large Language Models (VideoLLMs) struggle with the heavy computational cost of processing long or high-resolution videos, due to massive visual token counts and the quadratic complexity of attention. Prior pruning approaches rely mainly on token importance or similarity, largely overlooking video dynamics and the fact that different scenes exhibit distinct redundancy patterns. We introduce MoPrune, a training-free, scene-guided and motion-aware token pruning framework for accelerating VideoLLMs. MoPrune first segments videos into semantically coherent scenes to preserve temporal and motion consistency. Within each scene, it allocates frame retention rates according to intra-scene frame uniqueness. Finally, at the token level, MoPrune retains both visually distinctive and motion-salient tokens via a unified score, preserving informative static details as well as dynamic regions. Extensive experiments across multiple VideoLLMs and public benchmarks demonstrate that MoPrune achieves superior efficiency–performance trade-offs. On LLaVA-OneVision, retaining 25\% of visual tokens matches or slightly exceeds the dense baseline, and retaining 15\% of tokens preserves 99\% of the original performance. MoPrune is fully compatible with hardware-efficient techniques such as Flash Attention.
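To make the pipeline described in the abstract concrete, the sketch below illustrates one plausible, NumPy-only approximation of scene-guided, motion-aware token pruning. It is not the authors' implementation: the function names, the cosine-similarity scene detector, the 0.5/0.5 score weighting, and the thresholds are all hypothetical assumptions, and the per-frame retention-rate allocation step is omitted for brevity.

```python
# Illustrative sketch only (not the paper's code): a minimal approximation of
# scene-guided, motion-aware token pruning. All names, thresholds, and the
# alpha=0.5 weighting are hypothetical assumptions.
import numpy as np


def _normalize(x):
    """Min-max normalize scores to [0, 1]."""
    return (x - x.min()) / (x.max() - x.min() + 1e-8)


def segment_scenes(frame_feats, sim_thresh=0.85):
    """Split frames into scenes by thresholding cosine similarity of
    consecutive frame-level features (hypothetical scene detector)."""
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    sims = (f[1:] * f[:-1]).sum(axis=1)        # similarity of frame t to t-1
    cuts = np.where(sims < sim_thresh)[0] + 1  # a new scene starts after a drop
    return np.split(np.arange(len(frame_feats)), cuts)


def prune_tokens(tokens, keep_ratio=0.25, alpha=0.5):
    """tokens: (T, N, D) visual tokens for the T frames of one scene.
    Keep tokens with a high unified score of distinctiveness and motion."""
    T, N, _ = tokens.shape
    # Distinctiveness: distance of each token from its frame's mean token.
    dist = np.linalg.norm(tokens - tokens.mean(axis=1, keepdims=True), axis=-1)
    # Motion saliency: change at each token position relative to the previous frame.
    motion = np.zeros((T, N))
    motion[1:] = np.linalg.norm(tokens[1:] - tokens[:-1], axis=-1)
    score = alpha * _normalize(dist) + (1 - alpha) * _normalize(motion)
    k = max(1, int(keep_ratio * T * N))
    flat_idx = np.argsort(score.ravel())[-k:]  # indices of retained tokens
    return np.unravel_index(flat_idx, (T, N))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame_feats = rng.normal(size=(32, 256))   # 32 frames, pooled features
    tokens = rng.normal(size=(32, 196, 256))   # 196 visual tokens per frame
    for scene in segment_scenes(frame_feats):
        t_idx, n_idx = prune_tokens(tokens[scene], keep_ratio=0.25)
        print(f"scene frames {scene[0]}-{scene[-1]}: kept {len(t_idx)} tokens")
```

Under these assumptions, the retained (frame, token) indices would then be used to subsample the visual token sequence before it enters the language model, which is where the claimed attention savings would come from.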
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision question answering, video processing
Languages Studied: English
Submission Number: 9865