Abstract:
Typical video modeling methods, such as LLaVA, represent videos as visual tokens, which are then processed by the LLM backbone for effective video understanding. However, this approach results in a massive number of visual tokens, especially for long videos. A practical alternative is to first extract the relevant visual information from the large visual context before feeding it into the LLM backbone, thereby avoiding substantial computational overhead. In this work, we propose DynTok, a training-free strategy for \textbf{Dyn}amic video \textbf{Tok}en compression. DynTok adaptively splits visual tokens into groups and merges the tokens within each group, achieving high compression in low-information-density regions while retaining essential information. Our method reduces the token count to 44.4% of the original size while maintaining comparable performance. It benefits further from increasing the number of video frames and achieves 65.3% on Video-MME and 72.5% on MLVU. With this simple yet effective compression method, we uncover the redundancy in video token representations and provide insights into designing more efficient video modeling techniques.
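
As a rough illustration of the grouping-and-merging idea described in the abstract, the sketch below compresses a sequence of visual tokens by grouping consecutive tokens and mean-pooling each group. The grouping criterion (cosine similarity between adjacent tokens), the threshold, and the names dyntok_compress and sim_threshold are assumptions for illustration only; the abstract does not specify how groups are formed or how tokens are merged.

import torch
import torch.nn.functional as F

def dyntok_compress(tokens: torch.Tensor, sim_threshold: float = 0.9) -> torch.Tensor:
    """Group consecutive visual tokens and merge each group by mean pooling.

    tokens: (N, D) visual tokens for one video.
    sim_threshold: hypothetical cosine-similarity cutoff; runs of highly
        similar (low-information-density) tokens form long groups and are
        therefore compressed more aggressively.
    """
    if tokens.shape[0] <= 1:
        return tokens

    # Cosine similarity between each token and its predecessor.
    sim = F.cosine_similarity(tokens[1:], tokens[:-1], dim=-1)  # (N-1,)

    # Start a new group whenever similarity drops below the threshold.
    new_group = torch.cat([
        torch.ones(1, dtype=torch.bool, device=tokens.device),
        sim < sim_threshold,
    ])
    group_ids = new_group.long().cumsum(0) - 1  # (N,) group index per token

    # Mean-pool the tokens inside each group.
    num_groups = int(group_ids.max().item()) + 1
    merged = torch.zeros(num_groups, tokens.shape[1], dtype=tokens.dtype, device=tokens.device)
    counts = torch.zeros(num_groups, 1, dtype=tokens.dtype, device=tokens.device)
    merged.index_add_(0, group_ids, tokens)
    counts.index_add_(0, group_ids, torch.ones(tokens.shape[0], 1, dtype=tokens.dtype, device=tokens.device))
    return merged / counts

# Example: 1,000 tokens of dimension 1,024 compressed to a shorter sequence.
video_tokens = torch.randn(1000, 1024)
compressed = dyntok_compress(video_tokens)
print(video_tokens.shape, "->", compressed.shape)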
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: VLM; Video Understanding; Token reduction
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 3568