Abstract:
Typical video modeling methods, such as LLaVA, represent videos as visual tokens, which are then processed by the LLM backbone for effective video understanding. However, this approach results in a massive number of visual tokens, especially for long videos. A practical alternative is to first extract the relevant visual information from the large visual context before feeding it into the LLM backbone, thereby avoiding substantial computational overhead. In this work, we propose DynTok, a training-free strategy for \textbf{Dyn}amic video \textbf{Tok}en compression. DynTok adaptively splits visual tokens into groups and merges the tokens within each group, achieving high compression in low-information-density regions while retaining essential information. Our method reduces the token count to 44.4% of the original size while maintaining comparable performance. It benefits further from increasing the number of video frames and achieves 65.3% on Video-MME and 72.5% on MLVU. With this simple yet effective compression method, we uncover the redundancy in video token representations and provide insights into designing more efficient video modeling techniques.
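
As a rough illustration of the grouping-and-merging idea described in the abstract, the sketch below compresses a sequence of visual tokens by grouping consecutive tokens and mean-pooling each group. The grouping criterion (cosine similarity between adjacent tokens), the threshold, and the names dyntok_compress and sim_threshold are assumptions for illustration only; the abstract does not specify how groups are formed or how tokens are merged.

import torch
import torch.nn.functional as F

def dyntok_compress(tokens: torch.Tensor, sim_threshold: float = 0.9) -> torch.Tensor:
    """Group consecutive visual tokens and merge each group by mean pooling.

    tokens: (N, D) visual tokens for one video.
    sim_threshold: hypothetical cosine-similarity cutoff; runs of highly
        similar (low-information-density) tokens form long groups and are
        therefore compressed more aggressively.
    """
    if tokens.shape[0] <= 1:
        return tokens

    # Cosine similarity between each token and its predecessor.
    sim = F.cosine_similarity(tokens[1:], tokens[:-1], dim=-1)  # (N-1,)

    # Start a new group whenever similarity drops below the threshold.
    new_group = torch.cat([
        torch.ones(1, dtype=torch.bool, device=tokens.device),
        sim < sim_threshold,
    ])
    group_ids = new_group.long().cumsum(0) - 1  # (N,) group index per token

    # Mean-pool the tokens inside each group.
    num_groups = int(group_ids.max().item()) + 1
    merged = torch.zeros(num_groups, tokens.shape[1], dtype=tokens.dtype, device=tokens.device)
    counts = torch.zeros(num_groups, 1, dtype=tokens.dtype, device=tokens.device)
    merged.index_add_(0, group_ids, tokens)
    counts.index_add_(0, group_ids, torch.ones(tokens.shape[0], 1, dtype=tokens.dtype, device=tokens.device))
    return merged / counts

# Example: 1,000 tokens of dimension 1,024 compressed to a shorter sequence.
video_tokens = torch.randn(1000, 1024)
compressed = dyntok_compress(video_tokens)
print(video_tokens.shape, "->", compressed.shape)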
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: VLM; Video Understanding; Token reduction
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 3568