VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding

ACL ARR 2025 February Submission3635 Authors

15 Feb 2025 (modified: 09 May 2025) · License: CC BY 4.0
Abstract: Building on advances in language models, Large Multimodal Models (LMMs) have significantly improved video understanding. However, current video LMMs rely on either image or video encoders, each with limitations: image encoders capture rich spatial details but lack temporal context, while video encoders provide temporal understanding but process sparse frames at lower resolutions. To address these limitations, we introduce VideoGPT+, which integrates image and video encoders for detailed spatial understanding and global temporal modeling. The model processes videos in segments and applies adaptive pooling to the extracted features, achieving state-of-the-art results on VCGBench, MVBench, Zero-shot QA, and Video-MME. Additionally, we develop a 112K video-instruction dataset using a novel semi-automatic annotation pipeline, further enhancing performance. To comprehensively evaluate video LMMs, we present VCGBench-Diverse, a benchmark covering 18 diverse video categories, including lifestyle, sports, and surveillance. With 4,354 QA pairs, it assesses dense video captioning, spatiotemporal understanding, and complex reasoning, ensuring a robust evaluation across video types. Our code, dataset, and models will be released publicly.
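To make the segment-wise dual-encoder idea in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' implementation. The encoder classes, feature dimensions, segment length, and pooled grid size are all illustrative assumptions; the actual model uses pretrained image and video backbones and a projector before the LLM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-ins for the paper's encoders; the dummies below only
# illustrate tensor shapes, not real pretrained backbones.
class DummyImageEncoder(nn.Module):
    def forward(self, frames):                       # (B*T, 3, H, W)
        b, _, h, w = frames.shape
        return torch.randn(b, 1024, h // 14, w // 14)  # per-frame patch features

class DummyVideoEncoder(nn.Module):
    def forward(self, clip):                          # (B, T, 3, H, W)
        b, t, _, h, w = clip.shape
        return torch.randn(b, t, 1024, h // 14, w // 14)  # spatiotemporal features

def encode_segment(clip, image_enc, video_enc, pooled_hw=(8, 8)):
    """Encode one video segment with both encoders and adaptively pool the
    spatial feature maps so every segment yields a fixed number of tokens."""
    b, t, c, h, w = clip.shape
    img_feats = image_enc(clip.flatten(0, 1))                     # (B*T, D, h', w')
    img_feats = F.adaptive_avg_pool2d(img_feats, pooled_hw)       # (B*T, D, 8, 8)
    vid_feats = video_enc(clip).flatten(0, 1)                     # (B*T, D, h', w')
    vid_feats = F.adaptive_avg_pool2d(vid_feats, pooled_hw)       # (B*T, D, 8, 8)
    # Flatten each pooled grid into tokens and concatenate the two streams.
    img_tokens = img_feats.flatten(2).transpose(1, 2).reshape(b, -1, img_feats.shape[1])
    vid_tokens = vid_feats.flatten(2).transpose(1, 2).reshape(b, -1, vid_feats.shape[1])
    return torch.cat([img_tokens, vid_tokens], dim=1)             # (B, N_tokens, D)

# Usage: split a 16-frame video into segments of 4 frames and encode each one.
video = torch.randn(1, 16, 3, 224, 224)                           # (B, T, C, H, W)
image_enc, video_enc = DummyImageEncoder(), DummyVideoEncoder()
segment_tokens = [encode_segment(seg, image_enc, video_enc)
                  for seg in video.split(4, dim=1)]
tokens = torch.cat(segment_tokens, dim=1)   # visual tokens, projected to the LLM in the full model
print(tokens.shape)
```

The design choice the sketch mirrors is that adaptive pooling fixes the token count per segment regardless of input resolution, which keeps the combined image and video token sequence within the LLM's context budget.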
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: video-conversation-model, large multi-modal model, multi-modal, video-conversation, image-and-video, phi-3-mini, vision-language, video-chatbot
Contribution Types: Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 3635