MotionBoost: Bootstrapping Image-Language Models with Motion Awareness for Efficient Video Understanding

ACL ARR 2024 April Submission 714 Authors

16 Apr 2024 (modified: 15 May 2024), License: CC BY 4.0
Abstract: We present a fine-tuning framework that improves the motion sensitivity and length adaptability of Vision-Language Pretraining models (VLPs), which are currently constrained to static images or fixed-length video segments by data and computational limits. The framework introduces two components: the Temporal Prompt Sampler (TPS), which uses optical flow to sample video content selectively according to motion, and the Spatial Prompt Solver (SPS), which captures the complex spatial interplay between visual and textual elements. We further propose a self-boost training approach that harmonizes TPS and SPS. We validate the framework through rigorous testing on several challenging video question answering (VideoQA) tasks and a temporal question grounding task, showing marked improvements in performance, efficiency, and generality across a range of VLPs and LLMs.
Paper Type: Short
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Temporal Prior, Motion-Aware, Efficient, Video-Language Understanding
Contribution Types: NLP engineering experiment, Approaches to low-compute settings (efficiency), Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 714