ZoomVLM: A Tuning-Free Framework for Efficient Video Understanding via Adaptive Zooming in Vision-Language Models
Keywords: Vision Language Model, Multi-modal
TL;DR: We propose a tuning-free framework that boosts the efficiency of video vision-language models without sacrificing accuracy by adaptively zooming in on different video parts guided by attention.
Abstract: Recent advances in vision-language models (VLMs) have led to impressive progress in video understanding. However, despite their promising performance, existing state-of-the-art (SOTA) solutions require an excessive number of tokens (e.g., up to 6,272 tokens in the Llava-OneVision model) to represent input videos, creating a non-negligible bottleneck in inference efficiency. Motivated by findings in human perception, where individuals first focus on high-level overviews and then zoom into specific areas for detailed information, we hypothesize that a similar approach can enhance the inference efficiency of VLMs by reducing the number of tokens needed to represent videos. Based on this hypothesis, we propose ZoomVLM, a tuning-free, plug-and-play efficient video processing framework for video VLMs. ZoomVLM first generates an overview of the entire video and then adaptively zooms in and out on different parts based on the content being generated. Our key insight is that the attention distributions in the Large Language Model (LLM) within the VLM can provide sensible guidance on where to focus (by allocating more tokens) and what to discard (by dropping tokens) during inference. Specifically, ZoomVLM integrates two key components: (1) a Video Overview Augmenter, which enables cost-effective high-level understanding by augmenting a downsampled video overview with a few high-resolution keyframes; and (2) an Adaptive Token Adjustment module, which predicts the importance of different video parts for the upcoming generation process and adjusts the number of tokens allocated to each part according to its importance. Extensive experiments and ablation studies across two challenging open-ended video understanding benchmarks and four models validate that ZoomVLM effectively improves inference efficiency, reducing the number of tokens and boosting throughput (measured in generated tokens per second) without degrading achievable accuracy. Specifically, when applied to Llava-Next-Video-7B-DPO, ZoomVLM achieves a 30\% higher token generation rate along with a 0.259 improvement in the Video Detail Description score.
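To make the attention-guided token adjustment idea concrete, the sketch below illustrates one way such a mechanism could work; it is not the authors' implementation. Assumptions: video tokens are grouped into contiguous segments (`segment_ids`), per-token attention weights from the LLM's most recent decoding step are available (`attn_weights`), and the helper names `allocate_segment_budgets` and `select_tokens` are hypothetical.

```python
# Hypothetical sketch of attention-guided token budgeting (not the paper's code).
import torch


def allocate_segment_budgets(attn_weights: torch.Tensor,
                             segment_ids: torch.Tensor,
                             total_budget: int) -> torch.Tensor:
    """Distribute `total_budget` tokens across segments in proportion to the
    attention mass each segment received in the last decoding step."""
    num_segments = int(segment_ids.max().item()) + 1
    seg_mass = torch.zeros(num_segments).scatter_add_(0, segment_ids, attn_weights)
    seg_mass = seg_mass / seg_mass.sum().clamp_min(1e-8)
    budgets = (seg_mass * total_budget).floor().long()
    # Hand any leftover tokens to the most-attended segments.
    leftover = total_budget - int(budgets.sum())
    if leftover > 0:
        budgets[torch.topk(seg_mass, leftover).indices] += 1
    return budgets


def select_tokens(attn_weights: torch.Tensor,
                  segment_ids: torch.Tensor,
                  budgets: torch.Tensor) -> torch.Tensor:
    """Within each segment, keep only the highest-attention tokens up to its budget."""
    keep = []
    for seg, budget in enumerate(budgets.tolist()):
        idx = (segment_ids == seg).nonzero(as_tuple=True)[0]
        if budget > 0 and idx.numel() > 0:
            top = torch.topk(attn_weights[idx], min(budget, idx.numel())).indices
            keep.append(idx[top])
    if not keep:
        return torch.empty(0, dtype=torch.long)
    return torch.cat(keep).sort().values


# Toy usage: 12 video tokens in 3 segments, keep 6 tokens overall.
attn = torch.rand(12)
segs = torch.tensor([0] * 4 + [1] * 4 + [2] * 4)
budgets = allocate_segment_budgets(attn, segs, total_budget=6)
kept_token_indices = select_tokens(attn, segs, budgets)
```

In this toy version, "zooming in" corresponds to a segment receiving a larger share of the token budget, and "zooming out" to having most of its tokens dropped; the actual ZoomVLM components (Video Overview Augmenter and Adaptive Token Adjustment) are described only at the level of the abstract here.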
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12635