Keywords: Scene-Guided, Motion-Aware, Token Pruning, VideoLLMs
Abstract: Video Large Language Models (VideoLLMs) struggle with the heavy computational cost of processing long or high-resolution videos, due to massive visual token counts and the quadratic complexity of attention. Prior pruning approaches rely mainly on token importance or similarity, largely overlooking video dynamics and the fact that different scenes exhibit distinct redundancy patterns. We introduce MoPrune, a training-free, scene-guided and motion-aware token pruning framework for accelerating VideoLLMs. MoPrune first segments videos into semantically coherent scenes to preserve temporal and motion consistency. Within each scene, it allocates frame retention rates according to intra-scene frame uniqueness. Finally, at the token level, MoPrune retains both visually distinctive and motion-salient tokens via a unified score, preserving informative static details as well as dynamic regions. Extensive experiments across multiple VideoLLMs and public benchmarks demonstrate that MoPrune achieves superior efficiency–performance trade-offs. On LLaVA-OneVision, retaining 25\% of visual tokens matches or slightly exceeds the dense baseline, and retaining 15\% of tokens preserves 99\% of the original performance. MoPrune is fully compatible with hardware-efficient techniques such as Flash Attention.
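To make the pipeline described in the abstract concrete, the sketch below illustrates one plausible, NumPy-only approximation of scene-guided, motion-aware token pruning. It is not the authors' implementation: the function names, the cosine-similarity scene detector, the 0.5/0.5 score weighting, and the thresholds are all hypothetical assumptions, and the per-frame retention-rate allocation step is omitted for brevity.

```python
# Illustrative sketch only (not the paper's code): a minimal approximation of
# scene-guided, motion-aware token pruning. All names, thresholds, and the
# alpha=0.5 weighting are hypothetical assumptions.
import numpy as np


def _normalize(x):
    """Min-max normalize scores to [0, 1]."""
    return (x - x.min()) / (x.max() - x.min() + 1e-8)


def segment_scenes(frame_feats, sim_thresh=0.85):
    """Split frames into scenes by thresholding cosine similarity of
    consecutive frame-level features (hypothetical scene detector)."""
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    sims = (f[1:] * f[:-1]).sum(axis=1)        # similarity of frame t to t-1
    cuts = np.where(sims < sim_thresh)[0] + 1  # a new scene starts after a drop
    return np.split(np.arange(len(frame_feats)), cuts)


def prune_tokens(tokens, keep_ratio=0.25, alpha=0.5):
    """tokens: (T, N, D) visual tokens for the T frames of one scene.
    Keep tokens with a high unified score of distinctiveness and motion."""
    T, N, _ = tokens.shape
    # Distinctiveness: distance of each token from its frame's mean token.
    dist = np.linalg.norm(tokens - tokens.mean(axis=1, keepdims=True), axis=-1)
    # Motion saliency: change at each token position relative to the previous frame.
    motion = np.zeros((T, N))
    motion[1:] = np.linalg.norm(tokens[1:] - tokens[:-1], axis=-1)
    score = alpha * _normalize(dist) + (1 - alpha) * _normalize(motion)
    k = max(1, int(keep_ratio * T * N))
    flat_idx = np.argsort(score.ravel())[-k:]  # indices of retained tokens
    return np.unravel_index(flat_idx, (T, N))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame_feats = rng.normal(size=(32, 256))   # 32 frames, pooled features
    tokens = rng.normal(size=(32, 196, 256))   # 196 visual tokens per frame
    for scene in segment_scenes(frame_feats):
        t_idx, n_idx = prune_tokens(tokens[scene], keep_ratio=0.25)
        print(f"scene frames {scene[0]}-{scene[-1]}: kept {len(t_idx)} tokens")
```

Under these assumptions, the retained (frame, token) indices would then be used to subsample the visual token sequence before it enters the language model, which is where the claimed attention savings would come from.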
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision question answering, video processing
Languages Studied: English
Submission Number: 9865